The benefit of the doubt
An experiment is carried out to learn something. Usually that experiment is not the only source of evidence. There may have been earlier experiments with similar goals, and there is usually some prior knowledge about what is tested. The experiment itself provides additional information, but that information never stands on its own. In the context of biological research, the experiment is usually set up to learn about a certain hypothesis or to estimate a certain parameter.
In the case of a hypothesis that needs testing, the outcome of the analysis of the data will contain some statement about how much evidence there is for or against that hypothesis. This statement helps the researcher to judge how compelling the new evidence is in supporting or refuting the hypothesis, which will direct him or her to the next steps of the research. In the case of a parameter estimation, the output will contain the magnitude of the estimated parameter and how precise that estimate is. Here the researcher needs to judge whether the magnitude is sufficiently large, and whether there is enough certainty about that magnitude, to make a decision.
Seldom will the outcome of that experiment be the only consideration in such decisions. As said already, there is usually some form of prior knowledge. There will always be doubts about the quality or representativeness of the experiment or analysis (Did we do everything right in the experiment? Are the experimental conditions representative of the eventual real situation? Is there any indication that the experimental conditions biased the outcome? Was it justified to remove those outliers? Were the assumptions we had to make for the analysis realistic?). Another factor influencing a decision is its consequences. Its impact may be huge or it may be less important. And the consequence of a wrong decision may differ depending on whether we reject something real or accept something wrong.
In other words, experiments never give guaranteed and absolute answers. They pile another piece of information (more or less trusted) on top of other uncertain information. That may or may not influence a researcher to take a particular decision. Experiments and the output of the data analysis are part of a larger knowledge accumulation process. They require careful thinking.
Although this sounds logical, and gradually building knowledge is an intrinsic part of all learning about new things or situations, it is surprising that researchers still hope that the data analysis of a particular experiment will give them a clear-cut answer on what their decision about next steps should be (if a result is significant we do this, if not we do that). The reasons for this may be multiple. One is a poor understanding of the inference allowed by the statistical test they carry out, even when doing rather simple tests. Another is the false sense of comfort and peace of mind provided by firm statistical statements, which are commonly accepted or even required in the scientific world. They think that statements such as 'the effect is not significant at p < 0.05' or 'the correlation of r = 0.2 is small but significant (p < 0.01)' provide the necessary statistical rigor and authority to underline the validity and importance of their findings; that such statements make their message clear and irrefutable.
Besides the fact that this may lead to wrong decisions, the biggest danger of relying on the simplified output of such tests is that it prevents thinking. It blocks the reasoning about the fundamental scientific and research questions, and about the value of the evidence collected in the experiment to answer them.
Many statisticians have made this criticism and phrased it more eloquently than we are able to:
Testing for statistical significance continues today not on its merits as
a methodological tool but on the momentum of tradition. Rather than serving
as a thinker’s tool, it has become for some a clumsy substitute for thought,
subverting what should be a contemplative exercise into an algorithm prone to
error. —K. J. Rothman (1986)
We have all been taught that experimental research progresses via the formulation of a hypothesis which is then either confirmed or rejected. Usually the researcher has an idea that a certain factor causes a difference; she formulates the hypothesis that there would be no difference when applying that factor, and then hopes that the data will eventually reject that hypothesis. This is a rather convoluted approach, but it is certainly pushed in that direction by classical null hypothesis tests and the P-values that come with them. The null hypothesis test calculates how extreme the value of a particular metric (a t-value, an F-value, a correlation...) obtained from the actual data is within the distribution of that metric as it would be derived from similar data sets but in the absence of the difference (or correlation). The P-value indicates that degree of extremity. When it is very extreme, the null hypothesis is unlikely, and hence the researcher considers the existence of the difference proven.
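As a minimal sketch of what such a test computes, the following Python snippet (using numpy and scipy, with made-up data of 15 replicates per group and a standard deviation of 12; all numbers are assumptions for illustration) runs a two-sample t-test and reports the P-value:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical measurements for two treatments (made-up numbers)
control = rng.normal(loc=100, scale=12, size=15)
treated = rng.normal(loc=108, scale=12, size=15)

# How extreme is this t-value within the distribution of t-values we
# would get if both groups came from the same population (no difference)?
t_value, p_value = stats.ttest_ind(control, treated)
print(f"t = {t_value:.2f}, p = {p_value:.3f}")
```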
When the hypothesis is about comparing two treatments, everybody would agree that the two can never be completely equal, even if they differ only in minor details. So proving the null hypothesis wrong is rather a matter of having a sensitive enough trial. As long as the experiment is large enough, almost any hypothesis of no difference can be rejected (the simulation below illustrates this). Likewise, large differences may not be retained as significant because they were measured in an underpowered trial. The power of the trial and the size of the difference come into play for these null hypothesis tests, which makes the outcome of the test more ambiguous than people may think. You either reject the null hypothesis or you fail to reject it, and the latter outcome leaves ample options for next steps. Additionally, you reject the null hypothesis because a metric is very extreme (a common requirement is that it belongs to the 5% most extreme outcomes), but that does not mean the null hypothesis is absolutely impossible.
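To illustrate the point about sample size, here is a small simulation with assumed numbers: a practically negligible true difference of 1 unit on a standard deviation of 12. The rejection rate of the null hypothesis climbs towards 100% as the number of replicates grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# A true difference of only 1 unit on a standard deviation of 12 is
# practically negligible, yet it becomes "significant" almost every
# time once the groups are large enough.
for n in (15, 100, 1000, 10000):
    rejections = sum(
        stats.ttest_ind(rng.normal(100, 12, n), rng.normal(101, 12, n)).pvalue < 0.05
        for _ in range(500)
    )
    print(f"n = {n:>5}: null rejected in {rejections / 500:.0%} of simulated trials")
```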
The good thing about the P-value of such a t-test is that it behaves consistently whether you take samples of size 25 instead of size 15, or whether the noise increases from a standard deviation of 12 to 20. The probability of hitting false significant results (the famous type I error) remains the same as long as you apply the same alpha value, regardless of the variability in the trial or the number of replicates. This is no longer the case when we change our focus from specificity (how good the test is in correctly stating that the two treatments are not different) to sensitivity (how good the test is in correctly stating that the treatments are different).
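That claim can be checked with another hedged sketch using assumed numbers: simulate pairs of groups drawn from the same population and count how often the test declares them different.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Under the null hypothesis (no true difference), the false positive
# rate stays close to alpha = 0.05 for every sample size and noise level.
for n, sd in [(15, 12), (25, 12), (15, 20), (25, 20)]:
    false_pos = sum(
        stats.ttest_ind(rng.normal(100, sd, n), rng.normal(100, sd, n)).pvalue < 0.05
        for _ in range(2000)
    )
    print(f"n = {n}, sd = {sd}: false positive rate = {false_pos / 2000:.3f}")
```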
In contrast to the specificity, the sensitivity or power depends on the sample size and the variability (and obviously on the size of effect one hopes to detect). Increasing the number of samples increases the power; increasing the variability (standard deviation) decreases it. This all has its importance in the right-sizing of experiments, but it also influences the proportion of truly different results to falsely different results in an outcome where many differences are tested.
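A rough impression of how power moves with sample size and variability can be obtained with a normal approximation; the effect of 8 units and the sample sizes and standard deviations below are assumptions for illustration, not values from any real trial.

```python
import numpy as np
from scipy import stats

def approx_power(diff, sd, n, alpha=0.05):
    """Normal-approximation power of a two-sample t-test, n per group."""
    se = sd * np.sqrt(2 / n)               # standard error of the difference
    z_crit = stats.norm.ppf(1 - alpha / 2)
    return stats.norm.sf(z_crit - abs(diff) / se)

# Power rises with sample size and falls with variability (true diff = 8).
for n in (15, 25, 50):
    for sd in (12, 20):
        print(f"n = {n:>2}, sd = {sd}: power ~ {approx_power(8, sd, n):.2f}")
```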
The correct and wrong predictions in statistical tests have their own vocabulary. They are classified according to what prediction is made and whether that prediction matches reality. Two errors can be made: predicting something while nothing is happening (false positive, type I error) or predicting that nothing is happening while in reality there is an effect (false negative, type II error).
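Both error types, and the proportion issue mentioned above, show up in a small simulated screen. The numbers below (1000 comparisons of which 100 carry a real effect of 8 units, sd 12, 15 replicates per group) are made up purely for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

n, sd, alpha = 15, 12, 0.05
false_pos = false_neg = 0
for i in range(1000):
    real_effect = i < 100                  # only the first 100 comparisons are real
    a = rng.normal(100, sd, n)
    b = rng.normal(108 if real_effect else 100, sd, n)
    significant = stats.ttest_ind(a, b).pvalue < alpha
    if significant and not real_effect:
        false_pos += 1                     # type I error
    if not significant and real_effect:
        false_neg += 1                     # type II error

print(f"false positives (type I): {false_pos} out of 900 true nulls")
print(f"false negatives (type II): {false_neg} out of 100 real effects")
```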
Any alternative?
P-values are so deep-rooted that it is almost impossible to avoid them. To keep people from falling into the pitfall of relying too much on them, clearly report effect sizes and give ample descriptive statistics. When interpreting the results of an experiment, one needs a good overview of the effect sizes and their relative importance, and of the distributions and ranges of the observations. Those descriptive statistics can be given in table form but are often more conveniently displayed in graphs.
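As an illustrative sketch (again with made-up data), such reporting could start from something as simple as group summaries plus a standardised effect size such as Cohen's d:

```python
import numpy as np

rng = np.random.default_rng(4)
control = rng.normal(100, 12, 15)
treated = rng.normal(108, 12, 15)

# Descriptive statistics per group plus a standardised effect size
diff = treated.mean() - control.mean()
pooled_sd = np.sqrt((control.var(ddof=1) + treated.var(ddof=1)) / 2)
for name, x in (("control", control), ("treated", treated)):
    print(f"{name}: mean {x.mean():.1f}, sd {x.std(ddof=1):.1f}, "
          f"range {x.min():.1f}-{x.max():.1f}")
print(f"difference: {diff:.1f} units, Cohen's d = {diff / pooled_sd:.2f}")
```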
Another useful addition is to quote the smallest difference that the trial was expected to be able to detect at a given alpha level. This allows the audience to judge whether the absence of significance is due to the absence of an effect or due to a poorly designed experiment.
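Such a minimum detectable difference can be approximated from the design alone. The sketch below uses a normal approximation and assumes a standard deviation of 12, 15 replicates per group and 80% power, all purely for illustration.

```python
import numpy as np
from scipy import stats

def min_detectable_diff(sd, n, alpha=0.05, power=0.8):
    """Smallest true difference a two-group design with n replicates per
    group is expected to detect (normal approximation)."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    return (z_alpha + z_power) * sd * np.sqrt(2 / n)

# Anything smaller than roughly this difference was unlikely to show up.
print(f"minimum detectable difference ~ {min_detectable_diff(12, 15):.1f} units")
```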
Bayesian forms of data analysis also provide richer output that may be helpful to judge better what the data actually tell you. If prior knowledge is available (e.g. on the prevalence of effects), Bayesian methods allow it to be incorporated in the data analysis, which may help to reduce the number of false positives.
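How the prevalence of real effects changes what a 'significant' result means can be shown with nothing more than Bayes' theorem; the alpha, power and prevalence values below are assumed for illustration.

```python
# The chance that a "significant" result reflects a real effect depends
# strongly on the prior probability that an effect exists at all.
alpha, power = 0.05, 0.45
for prevalence in (0.5, 0.1, 0.01):
    p_significant = prevalence * power + (1 - prevalence) * alpha
    p_real_given_significant = prevalence * power / p_significant
    print(f"prevalence {prevalence:>4}: "
          f"P(real effect | significant) = {p_real_given_significant:.2f}")
```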
A commonly given piece of advice is to report estimates with confidence intervals rather than P-values (or stars, or abcd letters) and to show these graphically. Confidence intervals are technically closely related to hypothesis testing, so they share some of the same issues, but they have the advantage of drawing people's attention to the magnitudes of effects and the uncertainty about those estimates. A problem with confidence intervals is that several types of intervals or error bars exist, and people get confused between them and do not know precisely how to interpret them. But that is the topic of another blog...
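As a final hedged sketch with the same made-up data, a 95% confidence interval for the difference between two group means can be computed as follows:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
control = rng.normal(100, 12, 15)
treated = rng.normal(108, 12, 15)

# Pooled-variance 95% confidence interval for the difference in means
diff = treated.mean() - control.mean()
n1, n2 = len(control), len(treated)
pooled_var = ((n1 - 1) * control.var(ddof=1) +
              (n2 - 1) * treated.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
print(f"difference = {diff:.1f}, "
      f"95% CI = [{diff - t_crit * se:.1f}, {diff + t_crit * se:.1f}]")
```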