Randomization, independence and pseudo-replication

In a randomized controlled trial, test subjects are assigned to either experimental or control groups randomly, rather than for any systematic reason. A medical trial is not usually considered definitive unless it is a randomized controlled trial. Why? What’s so important about randomization?

Randomization prevents researchers from introducing systematic biases between test groups. Otherwise, they might assign frail patients to a less risky or less demanding treatment or assign wealthier patients to the new treatment because their insurance companies will pay for it. But randomization has no hidden biases, and in a reasonably large trial it makes the groups roughly comparable on average, not only in demographics but also in confounding factors you don’t even know about, so no confounder can systematically favor one group. When you obtain a statistically significant result, the only plausible systematic cause is your medication or intervention; what remains is chance imbalance, which is exactly what the hypothesis test is meant to quantify.

Let me return to a medical example. I want to compare two blood pressure medications, so I recruit 2,000 patients and randomly split them into two groups. Then I administer the medications. After waiting a month for the medication to take effect, I measure each patient’s blood pressure and compare the groups to find which has the lower average blood pressure.
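The random split itself is simple to sketch. The snippet below is only an illustration: the patient identifiers, the seed, and the `randomize` helper are my own assumptions, not part of any particular trial protocol.

```python
import random

def randomize(patient_ids, seed=None):
    """Shuffle the patients and split them into two equal-sized arms."""
    rng = random.Random(seed)      # a fixed seed makes the allocation reproducible
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical patient identifiers 0..1999.
drug_a, drug_b = randomize(range(2000), seed=42)
print(len(drug_a), len(drug_b))  # 1000 1000
```

Because the shuffle ignores everything about the patients, nothing I know or believe about them can leak into the assignment.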

I can do an ordinary hypothesis test and get an ordinary p value; with my sample size of 1,000 patients per group, I will have good statistical power to detect differences between the medications.
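To make "good statistical power" concrete, here is a rough sketch using a normal approximation to the two-sample test. The 2 mmHg true difference and the 15 mmHg standard deviation are made-up numbers chosen for illustration, and `approx_power` is a hypothetical helper, not a library function.

```python
from statistics import NormalDist

def approx_power(n_per_group, true_diff, sd, alpha=0.05):
    """Approximate power of a two-sided, two-sample z test with equal group sizes."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    standard_error = sd * (2 / n_per_group) ** 0.5
    shift = true_diff / standard_error
    # Probability that the test statistic lands in either rejection tail.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

# Assumed values: a true difference of 2 mmHg and a standard deviation of 15 mmHg.
print(approx_power(1000, true_diff=2, sd=15))
print(approx_power(10, true_diff=2, sd=15))
```

Under these assumptions, the power with 1,000 patients per group is high, while with only 10 genuinely independent patients per group it is barely above the 5% significance level. The normal approximation is crude for very small groups, so treat the second figure as a ballpark.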

Now imagine an alternative experimental design. Instead of 1,000 patients per group, I recruit only 10, but I measure each patient’s blood pressure 100 times over the course of a few months. This way I can get a more accurate fix on their individual blood pressures, which may vary from day to day. Or perhaps I’m worried that my sphygmomanometers are not perfectly calibrated, so I measure with a different one each day. I still have 1,000 data points per group but only 10 unique patients. I can perform the same hypothesis tests with the same statistical power since I seem to have the same sample size.

But do I really? A large sample size is supposed to ensure that any differences between groups are a result of my treatment, not genetics or preexisting conditions. But in this new design, I’m not recruiting new patients. I’m just counting the genetics of each existing patient 100 times.

This problem is known as pseudo-replication, and it is quite common. For instance, after testing cells from a culture, a biologist might “replicate” his results by testing more cells from the same culture. Or a neuroscientist might test multiple neurons from the same animal, claiming to have a large sample size of hundreds of neurons from just two rats. A marine biologist might experiment on fish kept in aquariums, forgetting that fish sharing a single aquarium are not independent: their conditions may be affected by one another as well as by the tested treatment. If these experiments are meant to reveal trends in rats or fish in general, their results will be misleading.

You can think of pseudo-replication as collecting data that answers the wrong question. Pseudo-replication can also be caused by taking separate measurements of the same subject over time (autocorrelation), as in my blood pressure experiment. Blood pressure measurements of the same patient from day to day are autocorrelated, as are revenue figures for a corporation from year to year. The mathematical structure of these autocorrelations can be complicated and can vary from patient to patient or from business to business. The unwitting scientist who treats this data as though each measurement is independent of the others will obtain pseudo-replicated, and hence misleading, results.
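A small simulation can show how badly this goes wrong. The sketch below rests on my own simplifying assumptions: each patient has a stable personal baseline (mean 120, standard deviation 10), daily readings scatter around it (standard deviation 5), and the two medications are truly identical. The critical values 1.961 and 2.101 are the two-sided 5% cutoffs of the t distribution with 1,998 and 18 degrees of freedom.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def t_statistic(a, b):
    """Pooled two-sample t statistic for two groups of equal size."""
    n = len(a)
    mean_a, mean_b = mean(a), mean(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (n - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n - 1)
    return (mean_a - mean_b) / ((var_a + var_b) / n) ** 0.5

def simulate_group(rng, n_patients=10, n_measurements=100):
    """One list of readings per patient; there is no treatment effect at all."""
    group = []
    for _ in range(n_patients):
        baseline = rng.gauss(120, 10)  # assumed between-patient variation
        group.append([rng.gauss(baseline, 5) for _ in range(n_measurements)])
    return group

def false_positive_rates(n_experiments=200, seed=0):
    rng = random.Random(seed)
    naive_hits = averaged_hits = 0
    for _ in range(n_experiments):
        a, b = simulate_group(rng), simulate_group(rng)
        # Naive analysis: pretend all 1,000 readings per group are independent.
        pooled_a = [x for patient in a for x in patient]
        pooled_b = [x for patient in b for x in patient]
        naive_hits += abs(t_statistic(pooled_a, pooled_b)) > 1.961
        # Per-patient analysis: one average per patient, 10 data points per group.
        means_a = [mean(patient) for patient in a]
        means_b = [mean(patient) for patient in b]
        averaged_hits += abs(t_statistic(means_a, means_b)) > 2.101
    return naive_hits / n_experiments, averaged_hits / n_experiments

naive_rate, averaged_rate = false_positive_rates()
print("naive false positive rate:   ", naive_rate)
print("averaged false positive rate:", averaged_rate)
```

Since there is no real difference between the groups, a valid test should cry "significant" about 5% of the time. Under these assumptions the per-patient analysis stays near that level, while the naive analysis declares a difference in most of the simulated experiments.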

Careful experimental design can break the dependence between measurements. Alternatively, if you can’t alter your experimental design, statistical analysis can help account for pseudo-replication. Statistical techniques do not magically eliminate dependence between measurements or allow you to obtain good results with poor experimental design. They merely provide ways to quantify dependence so you can correctly interpret your data.

Here are some tips:

• Average the dependent data points. For example, average all the blood pressure measurements taken from a single person and treat the average as a single data point.

This isn’t perfect: if you measured some patients more frequently than others, this fact won’t be reflected in the averaged number. To make your results reflect the level of certainty in your measurements, which increases as you take more of them, you’d perform a weighted analysis, weighting the better-measured patients more strongly.

• Analyze each dependent data point separately. Instead of combining all of a patient’s blood pressure measurements, analyze every patient’s blood pressure from, say, just day five, ignoring all other data points.

• Correct for the dependence by adjusting your p values and confidence intervals. Many procedures exist to estimate the size of the dependence between data points and account for it, including clustered standard errors, repeated measures tests, and hierarchical models.

• Ensure that your statistical analysis really answers your research question. Additional measurements that are highly dependent on previous data do not prove that your results generalize to a wider population—they merely increase your certainty about the specific sample you studied.

• Use statistical methods such as hierarchical models and clustered standard errors to account for a strong dependence between your measurements.

• Design experiments to eliminate hidden sources of correlation between variables. If that’s not possible, record confounding factors so they can be adjusted for statistically. But if you don’t consider the dependence from the beginning, you may discover at the analysis stage that it is too late to collect the data you need.
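The first tip, averaging dependent measurements and optionally weighting the averages, can be sketched in a few lines. The readings below are invented, and weighting each patient by the number of readings is just one simple choice I am assuming here; the most appropriate weights depend on how much patients differ from one another compared with how noisy each reading is.

```python
# Hypothetical readings (mmHg); some patients were measured more often than others.
readings = {
    "patient_1": [120, 124, 122],
    "patient_2": [130],
    "patient_3": [118, 122],
}

def summarize(readings):
    """Collapse each patient's readings to (mean, number of readings)."""
    return {pid: (sum(v) / len(v), len(v)) for pid, v in readings.items()}

def unweighted_mean(summary):
    """Every patient counts equally, however often they were measured."""
    return sum(m for m, _ in summary.values()) / len(summary)

def weighted_mean(summary):
    """Patients with more readings count more (weights = number of readings)."""
    total = sum(n for _, n in summary.values())
    return sum(m * n for m, n in summary.values()) / total

summary = summarize(readings)
print(unweighted_mean(summary))  # 124.0
print(weighted_mean(summary))    # about 122.67
```

Either way, the analysis now has three data points, one per patient, rather than six that merely look independent.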
