Stay connected to Children’s Health!
Researchers realize it is possible to miss real effects by not collecting enough data. One might miss a viable medicine or fail to notice an important side effect. So how do we know how much data to collect?
The concept of statistical power provides the answer. The power of a study is the probability that it will distinguish an effect of a certain size from pure luck. A study might easily detect a huge benefit from a medication, but detecting a subtle difference is much less likely.
The power of any hypothesis test is the probability that it will yield a statistically significant outcome (commonly defined as p < 0.05) when the effect it is looking for really exists. The power is affected by three factors:
The size of the effect you’re looking for. A huge effect size is much easier to detect than a tiny one.
The sample size. By collecting more data, one can more easily detect small effects.
Measurement error. Many experiments deal with values that are hard to measure precisely, such as medical studies investigating symptoms of fatigue or depression. Noisier measurements make a real effect harder to distinguish from chance.
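The three factors above can be illustrated with a minimal sketch. It assumes a two-sided, two-sample z-test on group means with a known standard deviation, which is a simplification chosen for illustration; the function name and the example numbers are hypothetical and not taken from any particular study.

```python
from statistics import NormalDist


def power_two_sample_z(effect, sd, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test on means.

    effect      -- true difference between the group means (assumed)
    sd          -- standard deviation of a single measurement (assumed known)
    n_per_group -- number of subjects in each of the two groups
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)      # critical value for the test
    se = sd * (2 / n_per_group) ** 0.5     # standard error of the difference
    shift = effect / se                    # true effect in standard-error units
    # Probability that the test statistic lands in either rejection region.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical baseline: a moderate effect, 30 subjects per group.
baseline = power_two_sample_z(effect=0.5, sd=1.0, n_per_group=30)
# Factor 1: a larger effect is easier to detect.
larger_effect = power_two_sample_z(effect=1.0, sd=1.0, n_per_group=30)
# Factor 2: more data raises the power for the same effect.
larger_sample = power_two_sample_z(effect=0.5, sd=1.0, n_per_group=120)
# Factor 3: noisier measurements lower the power.
noisier = power_two_sample_z(effect=0.5, sd=2.0, n_per_group=30)

for label, value in [("baseline", baseline), ("larger effect", larger_effect),
                     ("larger sample", larger_sample), ("noisier", noisier)]:
    print(f"{label:14s} power = {value:.2f}")
```

Under these assumptions, doubling the effect size and quadrupling the sample size give the same gain in power, because both double the effect measured in standard-error units.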
More data helps distinguish the signal from the noise. But this is easier said than done: many scientists don’t have the resources to conduct studies with adequate statistical power to detect what they’re looking for. They are doomed to fail before they even start.
Consider a trial testing two different medicines, A and B, for the same condition. Investigators want to know which is safer, but serious side effects are rare, so even with 100 patients in each group, only a few in each group will suffer them. The difference between a 3% and a 4% side effect rate is difficult to discern: if four people taking drug A have serious side effects and only three people taking drug B do, researchers cannot say for sure whether the difference is due to the drugs or to chance. If a trial isn’t powerful enough to detect the effect it’s looking for, we say it is underpowered.
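To put rough numbers on this example, the sketch below assumes 100 patients per group and uses a two-proportion z-test with the normal approximation. That approximation is crude for counts this small, so the figures are indicative only; the function names are hypothetical.

```python
from statistics import NormalDist

Z = NormalDist()


def two_proportion_p_value(events_a, events_b, n_per_group):
    """Two-sided p-value of a pooled two-proportion z-test (equal group sizes)."""
    p_a = events_a / n_per_group
    p_b = events_b / n_per_group
    pooled = (events_a + events_b) / (2 * n_per_group)
    se = (pooled * (1 - pooled) * 2 / n_per_group) ** 0.5
    z = (p_a - p_b) / se
    return 2 * (1 - Z.cdf(abs(z)))


def two_proportion_power(rate_a, rate_b, n_per_group, alpha=0.05):
    """Approximate power to detect a true difference between two event rates."""
    z_crit = Z.inv_cdf(1 - alpha / 2)
    pooled = (rate_a + rate_b) / 2
    # Standard error under the null (rates equal) and under the alternative.
    se_null = (2 * pooled * (1 - pooled) / n_per_group) ** 0.5
    se_alt = ((rate_a * (1 - rate_a) + rate_b * (1 - rate_b)) / n_per_group) ** 0.5
    diff = abs(rate_a - rate_b)
    upper = Z.cdf((diff - z_crit * se_null) / se_alt)
    lower = Z.cdf((-diff - z_crit * se_null) / se_alt)
    return upper + lower


# Four serious side effects on drug A versus three on drug B, 100 patients each.
p_value = two_proportion_p_value(4, 3, n_per_group=100)
# Chance of a significant result if the true rates really are 4% and 3%.
power = two_proportion_power(0.04, 0.03, n_per_group=100)

print(f"p-value for 4 vs 3 events: {p_value:.2f}")
print(f"approximate power at n=100 per group: {power:.2f}")
```

With these assumptions the p-value is far above 0.05 and the power is well under 10%, so a trial of this size would almost always miss a true one-percentage-point difference.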
You might think calculations of statistical power are essential for medical trials; a scientist might want to know how many patients are needed to test a new medication, and a quick calculation of statistical power would provide the answer. Scientists are usually satisfied when the statistical power is 0.8 or higher, corresponding to an 80% chance of detecting a real effect of the expected size. (If the true effect is actually larger, the study will have greater power.)
However, few scientists ever perform this calculation, and few journal articles even mention statistical power. In the prestigious journals Science and Nature, fewer than 3% of articles calculate statistical power before starting their study. Indeed, many trials conclude that “there was no statistically significant difference in adverse effects between groups,” without noting that there was insufficient data to detect any but the largest differences. If one of these trials were comparing side effects in two drugs, a doctor might erroneously think the medications are equally safe, when one could very well be much more dangerous than the other.
Maybe this is a problem only for rare side effects or only when a medication has a weak effect? No. In the article “Statistical Power of Negative Randomized Controlled Trials Presented at American Society for Clinical Oncology Annual Meetings,” Bedard reported that only about half of published studies with negative results had enough statistical power to detect even a large difference in their primary outcome variable. Fewer than 10% of these studies explained why their sample sizes were so inadequate. Similar problems have been consistently seen in other fields of medicine, such as emergency medicine and surgery. In neuroscience, the problem is even worse: each individual study collects so little data that the median study has only a 20% chance of detecting the effect it is looking for.
An ethical review board should not approve a trial if it knows the trial is unable to detect the effect it is looking for.
So why are power calculations often forgotten? One reason is the discrepancy between our intuitive feeling about sample sizes and the results of power calculations. It’s easy to think, “Surely these are enough test subjects,” even when the study has abysmal power. For example, suppose investigators are testing a new heart attack treatment protocol and hope to cut the risk of death in half, from 20% to 10%. They might be inclined to think, “If we don’t see a difference when we try this procedure on 50 patients, clearly the benefit is too small to be useful.” But to have 80% power to detect the effect, investigators actually need 400 patients, 200 in each of the control and treatment groups. Perhaps clinicians just do not realize that their seemingly adequate sample sizes are in fact far too small.
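The heart attack example can be checked with a short calculation. This is a sketch using a common normal-approximation formula for comparing two proportions with a two-sided test at p < 0.05; other formulas give slightly different answers. Splitting the 50 patients into two groups of 25 is an assumption made here, and the function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist

Z = NormalDist()


def patients_per_group(rate_control, rate_treated, power=0.8, alpha=0.05):
    """Approximate patients needed per group to detect the given change in rate."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    z_beta = Z.inv_cdf(power)
    pooled = (rate_control + rate_treated) / 2
    null_term = z_alpha * (2 * pooled * (1 - pooled)) ** 0.5
    alt_term = z_beta * (rate_control * (1 - rate_control)
                         + rate_treated * (1 - rate_treated)) ** 0.5
    return ceil((null_term + alt_term) ** 2 / (rate_control - rate_treated) ** 2)


def achieved_power(rate_control, rate_treated, n_per_group, alpha=0.05):
    """Approximate power of the same test for a fixed number of patients per group."""
    z_alpha = Z.inv_cdf(1 - alpha / 2)
    pooled = (rate_control + rate_treated) / 2
    se_null = (2 * pooled * (1 - pooled) / n_per_group) ** 0.5
    se_alt = ((rate_control * (1 - rate_control)
               + rate_treated * (1 - rate_treated)) / n_per_group) ** 0.5
    diff = abs(rate_control - rate_treated)
    return (Z.cdf((diff - z_alpha * se_null) / se_alt)
            + Z.cdf((-diff - z_alpha * se_null) / se_alt))


# Death rate cut from 20% to 10%, with a target of 80% power.
needed = patients_per_group(0.20, 0.10)
# Assumption: the 50 patients are split evenly, 25 per group.
power_with_50 = achieved_power(0.20, 0.10, n_per_group=25)

print(f"patients needed per group: {needed}")
print(f"power with 25 per group:   {power_with_50:.2f}")
```

Under these assumptions the requirement comes out close to 200 patients per group, in line with the figure above, while a 50-patient trial would detect the halving of the death rate well under a quarter of the time.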
The perils of insufficient power do not mean that scientists are lying when they state they detected no significant difference between groups. But it’s misleading to assume these results mean there is no real difference. There may be a difference, even an important one, but the study was too small to notice it.
More useful than a statement that an experiment’s results were statistically insignificant is a confidence interval giving plausible sizes for the effect. Even if the confidence interval includes zero, its width reveals a lot: a narrow interval covering zero indicates that the effect is most likely small, while a wide interval clearly shows that the measurement was not precise enough to draw conclusions. Thinking about results in terms of confidence intervals provides a new way to approach experimental design. Instead of focusing on the power of significance tests, ask, “How much data must I collect to measure the effect to my desired precision?” Even a powerful experiment can nonetheless produce significant results with extremely wide confidence intervals, making its results difficult to interpret.
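The precision question can be sketched the same way. The example below assumes a simple Wald (normal-approximation) confidence interval for a difference between two proportions, with equal group sizes. The observed rates and the target half-width of five percentage points are hypothetical values chosen for illustration.

```python
from math import ceil
from statistics import NormalDist

Z = NormalDist()


def difference_interval(rate_a, rate_b, n_per_group, confidence=0.95):
    """Wald confidence interval for rate_a - rate_b with equal group sizes."""
    z = Z.inv_cdf(1 - (1 - confidence) / 2)
    se = ((rate_a * (1 - rate_a) + rate_b * (1 - rate_b)) / n_per_group) ** 0.5
    diff = rate_a - rate_b
    return diff - z * se, diff + z * se


def n_for_half_width(rate_a, rate_b, half_width, confidence=0.95):
    """Patients per group so the interval extends at most half_width either side."""
    z = Z.inv_cdf(1 - (1 - confidence) / 2)
    variance_sum = rate_a * (1 - rate_a) + rate_b * (1 - rate_b)
    return ceil(z ** 2 * variance_sum / half_width ** 2)


# Hypothetical small study: 20% versus 10% observed with 25 patients per group.
low, high = difference_interval(0.20, 0.10, n_per_group=25)
# Patients per group needed to pin the difference down to +/- 5 percentage points.
n_precise = n_for_half_width(0.20, 0.10, half_width=0.05)

print(f"95% interval with 25 per group: ({low:+.3f}, {high:+.3f})")
print(f"patients per group for +/-0.05: {n_precise}")
```

In this hypothetical case the small study's interval covers zero but is far too wide to show that the effect is small, which is the situation the paragraph above describes.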
Below are some tips for statistical power analysis.