Want to catch up with other articles from this series?
- Studying Studies: Part I – relative risk vs. absolute risk
- Studying Studies: Part II – observational epidemiology
- Studying Studies: Part III – the motivation for observational studies
- Studying Studies: Part IV – randomization and confounding
- Studying Studies: Part V – power and significance
- Ask Me Anything #30: How to Read and Understand Scientific Studies
Randomization: the major strength (and limitation) of studies
We left off in Part III discussing the motivation for observational studies and the types of studies employed in observational epidemiology (i.e., retrospective and prospective cohort studies), and some of the major limitations of those types of studies. As we mentioned, observational studies are prone to bias and confounding. A confounder can create a spurious association between an exposure and outcome being observed in an observational study. We introduced this “confounding bias” in Part III of Studying Studies.
Randomization, a method based on chance alone by which study participants are assigned to a treatment group, is a key component in distinguishing cause and effect, and eliminating confounding. By randomly assigning subjects to an intervention or control group, investigators can measure the effect of the intervention without the subjects self-selecting their lot in the experiment, as happens in observational studies. Successful randomization actually controls for bias and confounding by design and before the trial is conducted.
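To make the idea concrete, here is a minimal simulation sketch. Everything in it is an assumption made up for illustration (the 30% smoking rate, the self-selection probabilities, the helper name `smoking_rate`); it only shows how a coin flip tends to balance a trait across groups, while self-selection does not.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical cohort: about 30% of participants smoke (an illustrative confounder).
N = 10_000
smokers = [random.random() < 0.30 for _ in range(N)]

# Self-selection (observational): smokers are assumed likelier to pick the exposure.
self_selected = [random.random() < (0.70 if s else 0.30) for s in smokers]

# Randomization: a coin flip decides, ignoring every participant characteristic.
randomized = [random.random() < 0.50 for _ in range(N)]

def smoking_rate(assigned, in_group):
    """Share of smokers among participants whose assignment equals in_group."""
    group = [s for s, a in zip(smokers, assigned) if a == in_group]
    return sum(group) / len(group)

gap_self = smoking_rate(self_selected, True) - smoking_rate(self_selected, False)
gap_rand = smoking_rate(randomized, True) - smoking_rate(randomized, False)

print(f"smoking-rate gap, self-selected groups: {gap_self:.3f}")
print(f"smoking-rate gap, randomized groups:    {gap_rand:.3f}")
```

Under these assumptions the self-selected groups differ sharply in how many smokers they contain, while the randomized groups differ only by chance. The same logic applies to traits nobody thought to measure, which is the real appeal of randomization.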
Why do some scientists and sciences progress more effectively than others? “Scientists these days,” writes John R. Platt in Science in 1964, “tend to keep a polite fiction that all science is equal.” Platt reduces his answer to a touchstone that he calls “The Question.”
Obviously it should be applied as much to one’s own thinking as to others’. It consists of asking in your own mind, on hearing any scientific explanation or theory put forward, “But sir, what experiment could disprove your hypothesis?”; or, on hearing a scientific experiment described, “But sir, what hypothesis does your experiment disprove?”
“This goes straight to the heart of the matter,” Platt writes. “It forces everyone to refocus on the central question of whether there is or is not a testable scientific step forward.” Randomization is a critical piece of scientific equipment needed in order for a hypothesis to be testable. Randomization prevents investigators from introducing systematic (and often hidden) biases between experimental groups. It can help ensure that each arm has comparable demographics and, perhaps most importantly, randomization minimizes measured (and unmeasured) confounding factors.
Controlling for confounding
“When experimental designs are premature, impractical, or impossible,” notes a 2012 article, “researchers must rely on statistical methods to adjust for potentially confounding effects.” Unlike selection or information bias (also introduced in Part III) “confounding is one type of bias that can be adjusted after data gathering, using statistical models.”
People who drink more coffee may also smoke more cigarettes and drink more alcohol. If investigators want to determine if coffee drinking alone increases mortality risk, and is not just a marker for another factor (a confounder such as smoking), they must use statistical measures to approach the question.
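A toy simulation of the coffee example shows how this can happen. The probabilities are invented for illustration, and coffee is given no effect on mortality by construction, so any association that shows up is produced by smoking alone.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# All probabilities below are made-up, illustrative assumptions.
N = 50_000
people = []
for _ in range(N):
    smoker = random.random() < 0.30
    # Smokers are assumed to be more likely to drink coffee.
    coffee = random.random() < (0.80 if smoker else 0.40)
    # Mortality depends on smoking only; coffee has no effect by construction.
    died = random.random() < (0.20 if smoker else 0.05)
    people.append((smoker, coffee, died))

def mortality(drinks_coffee):
    """Observed death rate among coffee drinkers (True) or abstainers (False)."""
    outcomes = [died for _, coffee, died in people if coffee == drinks_coffee]
    return sum(outcomes) / len(outcomes)

crude_rr = mortality(True) / mortality(False)
print(f"crude relative risk of death for coffee drinkers: {crude_rr:.2f}")
```

With these assumed inputs the crude relative risk lands well above 1, even though coffee does nothing in this simulated world. An analyst who ignored smoking would see coffee looking dangerous.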
How do researchers handle confounding variables? They control for them as best they can, for as many of them as they can anticipate, trying to minimize their possible effect on the response. In experiments involving human subjects, researchers have to battle many confounding variables.
In a meta-analysis of red meat and processed meat consumption and all-cause mortality, for example, the investigators selected nine prospective studies and noted that, “All studies adjusted for age, sex (if applicable), and smoking. Most studies also controlled for physical activity (n = 8 studies), alcohol consumption (n = 8 studies), total energy intake (n = 7 studies), body mass index or body weight (n = 7 studies), and markers of socioeconomic status (n = 5 studies).”
When researchers say they have “controlled” or “adjusted for” confounders, the basic technique is to include measures of these potential confounders as regressors (i.e., covariates: variables that are possibly predictive of the outcome under study) in a regression model, or to stratify the data on these confounders. Researchers collect data on all the known (and previously identified) confounders in their analysis, which typically consists of stratification and multivariate models.
What follows is, at best, a 100,000-foot view of some of these statistical techniques. Despite the fact that I have a degree in applied mathematics (including about a dozen grad school courses on the topic), while I was in the thick of things in the lab I never went anywhere without a textbook on biostatistics. And despite my knowledge base in math, I still needed that book nearly weekly to fully understand the statistical methods employed by the authors of papers I read. Because my textbook is over 20 years old, I’ll refrain from recommending it because I have no idea if a better one isn’t already out there (readers: please feel free to make recommendations in the comment section).
Stratification is a technique used to fix the level of the confounders and produce groups (i.e., subgroups) within which the confounder does not vary. This allows one to evaluate the exposure-outcome association within each stratum of the confounder. Within each stratum, the confounder cannot confound the exposure-outcome association because it does not vary. This works modestly well when the number of perceived confounders is small.
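Here is a small worked sketch of stratification using invented counts (not data from any study). It compares the crude relative risk with the relative risk inside each smoking stratum, then pools the strata with a Mantel-Haenszel summary, one common way of combining stratum-specific estimates. The helper name `relative_risk` and the table layout are my own choices.

```python
# Hypothetical counts, invented for illustration: (deaths, people) by
# smoking stratum and coffee exposure. Coffee has no effect within a stratum.
strata = {
    "smokers":     {"coffee": (2400, 12000), "no_coffee": (600, 3000)},
    "non-smokers": {"coffee": (700, 14000),  "no_coffee": (1050, 21000)},
}

def relative_risk(exposed, unexposed):
    """Risk in the exposed group divided by risk in the unexposed group."""
    (a, n1), (c, n0) = exposed, unexposed
    return (a / n1) / (c / n0)

# Crude analysis: ignore smoking and pool everyone.
pooled = {
    key: tuple(sum(s[key][i] for s in strata.values()) for i in range(2))
    for key in ("coffee", "no_coffee")
}
crude_rr = relative_risk(pooled["coffee"], pooled["no_coffee"])

# Stratified analysis: compare like with like inside each smoking stratum.
stratum_rr = {
    name: relative_risk(s["coffee"], s["no_coffee"]) for name, s in strata.items()
}

# Mantel-Haenszel summary relative risk across strata.
num = den = 0.0
for s in strata.values():
    (a, n1), (c, n0) = s["coffee"], s["no_coffee"]
    total = n1 + n0
    num += a * n0 / total
    den += c * n1 / total
mh_rr = num / den

print(f"crude RR: {crude_rr:.2f}")
for name, rr in stratum_rr.items():
    print(f"RR among {name}: {rr:.2f}")
print(f"Mantel-Haenszel RR: {mh_rr:.2f}")
```

With these counts the crude relative risk is about 1.73, yet it is 1.00 among smokers, 1.00 among non-smokers, and 1.00 after pooling. The whole crude association came from coffee drinkers being more likely to smoke.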
If the number of potential confounders or the level of their grouping is large, multivariate analysis offers the only solution.
Multivariate analyses are a family of statistical models (over a dozen models) based on the principle of multivariate statistics. Such models can handle large numbers of covariates (and also confounders) simultaneously. For example, in a study trying to measure the relation between body mass index and dyspepsia, one could control for many covariates like age, sex, smoking, alcohol, ethnicity, and more, in the same model.
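As a rough sketch of what “controlling for a covariate in the same model” means, the snippet below fits ordinary least squares twice on simulated data: once with coffee alone, once with coffee and smoking together. The data, the variable names, and the tiny solver are all assumptions written for this illustration; a real analysis would use a statistics package and far more care.

```python
import random

random.seed(2)  # fixed seed so the sketch is reproducible

def solve(A, b):
    """Solve the linear system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * p for x, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(X, y):
    """Ordinary least squares coefficients from the normal equations."""
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

# Made-up data: the outcome is driven by smoking; coffee has zero true effect.
N = 5_000
smoker = [1.0 if random.random() < 0.30 else 0.0 for _ in range(N)]
coffee = [random.gauss(2.0, 1.0) + 1.5 * s for s in smoker]  # smokers drink more
score = [10.0 + 5.0 * s + random.gauss(0.0, 2.0) for s in smoker]

# Each row starts with 1.0 so the first coefficient is the intercept.
unadjusted = ols([[1.0, c] for c in coffee], score)
adjusted = ols([[1.0, c, s] for c, s in zip(coffee, smoker)], score)

print(f"coffee coefficient, unadjusted:           {unadjusted[1]:.2f}")
print(f"coffee coefficient, adjusted for smoking: {adjusted[1]:.2f}")
print(f"smoking coefficient in adjusted model:    {adjusted[2]:.2f}")
```

In this simulated setup the unadjusted coffee coefficient is clearly positive, and it shrinks toward zero once smoking enters the model. That only works because smoking was measured, and measured well.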
One of the most common tools in multivariate analyses is linear regression, often abbreviated, ‘regression.’ Regression, in its simplest form, is fitting the best straight line to a dataset. That is, finding the equation (y = mx + b) that best predicts the (linear) relationship of the observed data (y) from the experimental variable (x). With this equation, you can predict what happens outside of your experimental parameters (in theory). Extrapolation using regression is a slippery slope, no pun intended. Just ask the banking industry how well their complex regression models predicted losses when the models were built on datasets gathered only during the era of monotonic increases in home prices.
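A tiny sketch of both halves of that paragraph, fitting the line and then getting burned by extrapolation. The data are hypothetical: they secretly follow a curve (y = x squared), but we only observe x from 0 to 5. The function name `fit_line` is mine.

```python
def fit_line(xs, ys):
    """Least-squares slope m and intercept b for y = m*x + b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx
    b = y_bar - m * x_bar
    return m, b

# Hypothetical observations that secretly follow y = x**2, seen only for x in 0..5.
xs = [0, 1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]

m, b = fit_line(xs, ys)

inside, outside = 3, 20
print(f"fitted line: y = {m:.2f}x + {b:.2f}")
print(f"x = {inside}:  line predicts {m * inside + b:.1f}, truth is {inside ** 2}")
print(f"x = {outside}: line predicts {m * outside + b:.1f}, truth is {outside ** 2}")
```

Inside the observed range the line is off by a little (about 11.7 versus 9 at x = 3). At x = 20 it predicts roughly 97 when the truth is 400. The fit was never wrong about the data it saw; it simply had no information about what happens elsewhere.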
Logistic regression (LR) is a type of regression that can deal with far more complex relationships between data than a linear relationship between continuous variables (e.g., height and number of push-ups you can do in a minute). In many cases we have more than one independent variable and/or our dependent variable is binary while our independent variable is continuous (e.g., does the amount of vodka drunk predict if you will puke or not puke?). And it gets more complicated, of course, as the number of possible outcomes increases. LR is pretty freakin’ cool and when you go down this biostats rabbit hole, as I first did in 1999, you learn about a guy named David Cox who developed LR and, eventually, something called Cox proportional hazards models, which predict survival (or demise). Cox proportional hazards models may be among my favorite models. Really cool stats and nerdy distributions (most people have heard of normal distributions, and maybe even the odd Poisson distribution, but once you get in the Weibull distribution, the fun really starts… #nerdalert).
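For the curious, here is a bare-bones sketch of the vodka example: one continuous predictor, one binary outcome, fitted with Newton-Raphson steps on the log-likelihood. The “true” coefficients, the sample size, and the function names are assumptions I made up so there is something to recover; this is not how any particular study or package does it.

```python
import math
import random

random.seed(3)  # fixed seed so the sketch is reproducible

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical truth: log-odds of puking = -4 + 0.8 * shots.
TRUE_B0, TRUE_B1 = -4.0, 0.8
shots = [random.uniform(0, 10) for _ in range(2000)]
puked = [1 if random.random() < sigmoid(TRUE_B0 + TRUE_B1 * s) else 0 for s in shots]

def fit_logistic(x, y, iterations=25):
    """Fit intercept b0 and slope b1 by Newton-Raphson on the log-likelihood."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = sigmoid(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p            # gradient with respect to b0
            g1 += (yi - p) * xi     # gradient with respect to b1
            h00 += w                # entries of the 2x2 information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: add (information matrix inverse) times gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

b0, b1 = fit_logistic(shots, puked)
print(f"estimated intercept {b0:.2f}, slope {b1:.2f}")
print(f"odds ratio per extra shot: {math.exp(b1):.2f}")
print(f"predicted P(puke) at 2 shots: {sigmoid(b0 + b1 * 2):.2f}")
print(f"predicted P(puke) at 8 shots: {sigmoid(b0 + b1 * 8):.2f}")
```

The slope is on the log-odds scale, so `math.exp(b1)` is the odds ratio for one additional shot. That is the same kind of odds ratio reported in many epidemiology papers, just with more covariates in the model.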
Can all confounders be controlled for?
Confounding can persist, even after adjustment. There is often residual confounding (i.e., confounding that remains after controlling for confounding in the design or analysis of a study), which can be due to a measurement error in the model as well as unmeasured confounding. In many studies, confounders are not adjusted because they were not measured during the process of data gathering.
In some situations, confounder variables are measured with error. How much do you adjust for multiple confounders in the same study and how much crosstalk is there between all of them? These are the kinds of issues we’re dealing with when observing humans (we’re very messy creatures, especially from a scientific measurement perspective). You can’t always eliminate every confounding factor because you don’t know what you’re not looking for. There are almost an infinite number of possibilities that can confound an observation. You can only measure the ones you know and hope you’ve measured them all. One analysis concluded, for example, that the effect sizes of the magnitude frequently reported in observational epidemiologic studies can be generated by residual and/or unmeasured confounding alone.
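Measurement error is easy to demonstrate with a toy simulation. Below, coffee again has no effect on mortality, but smoking status is recorded incorrectly 20% of the time (an arbitrary figure chosen for illustration, as are all the other probabilities). Stratifying on true smoking status removes the association; stratifying on the error-prone record does not.

```python
import random

random.seed(4)  # fixed seed so the sketch is reproducible

# All probabilities are invented for illustration.
N = 100_000
people = []
for _ in range(N):
    smoker = random.random() < 0.30
    coffee = random.random() < (0.80 if smoker else 0.40)
    died = random.random() < (0.20 if smoker else 0.05)  # coffee has no effect
    # Measurement error: recorded smoking status is wrong 20% of the time.
    recorded = smoker if random.random() < 0.80 else not smoker
    people.append((smoker, recorded, coffee, died))

def stratified_rr(stratum_index):
    """Coffee-vs-no-coffee relative risk within each level of the chosen column."""
    result = {}
    for level in (False, True):
        # counts[coffee] = [deaths, people]
        counts = {False: [0, 0], True: [0, 0]}
        for person in people:
            if person[stratum_index] == level:
                coffee, died = person[2], person[3]
                counts[coffee][0] += died
                counts[coffee][1] += 1
        risk = {c: d / n for c, (d, n) in counts.items()}
        result[level] = risk[True] / risk[False]
    return result

rr_true = stratified_rr(0)      # stratify on true smoking status
rr_recorded = stratified_rr(1)  # stratify on the error-prone record

for label, rr in (("true status", rr_true), ("recorded status", rr_recorded)):
    print(f"stratified on {label}: "
          f"RR among non-smokers {rr[False]:.2f}, among smokers {rr[True]:.2f}")
```

Under these assumptions, the relative risks stay noticeably above 1 in both strata of the recorded variable. The analysts would truthfully report that they “adjusted for smoking,” and the result would still be confounded by smoking.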
Another problem emerges when trying to control for multiple confounders. Alex Reinhart, author of Statistics Done Wrong, points out that it’s common to interpret results by saying, “If weight increases by one pound, with all other variables held constant, then heart attack rates increase by X percent.” This could be true, but how do you hold all the other variables constant in practice? “You can quote the numbers from the regression equation, but in the real world the process of gaining a pound of weight also involves other changes,” writes Reinhart. “Nobody ever gains a pound with all other variables held constant, so your regression equation doesn’t translate to reality.”
“Because bias due to confounding is a core limitation of observational research,” write the authors, headed by John Ioannidis, in an assessment of whether authors of observational studies consider confounding bias, “numerous recommendations and statements call for a careful consideration when reporting, discussing, and making conclusions from observational research.” (The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) statement is considered the most widely endorsed guideline for reporting observational research.) In their paper, the authors randomly selected 120 observational studies and found that 57% of them mentioned “confounding” in the abstract or discussion sections (another 17% alluded to it) and there was no mention or allusion at all in 27% of the studies.
The authors concluded that many observational studies lacked satisfactory discussion of confounding bias. “Even when confounding bias is mentioned,” write the authors, “authors are typically confident that it is rather irrelevant to their findings and they rarely call for cautious interpretation.”
Can observational studies establish reliable knowledge on their own? It depends on who you ask
In 2013, authors scoured the NEJM, Lancet, JAMA, and the Annals of Internal Medicine (i.e., the absolute top 0.001% of medical journals) for articles of observational (“nonrandomized”) studies published in 2010 and then classified them based on whether the authors recommended a medical practice, and whether they stated that a randomized trial is needed to validate their recommendation.
What they found was that in 56% of the 298 studies in these high-impact journals, the authors recommended a medical practice based on their results. Of these 167 studies, just 24 (14%) stated that a randomized controlled trial should be done to support their recommendation. The other 143 articles made logical extrapolations to recommend specific medical practices. I still remember exactly where I was sitting when I read this paper in 2013. I almost fell out of my chair.
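If you want to check the arithmetic in that paragraph, it takes a few lines. The counts are the ones quoted above; the variable names are mine.

```python
# Counts quoted in the text above.
total_studies = 298
recommended_practice = 167
called_for_rct = 24

pct_recommending = round(100 * recommended_practice / total_studies)
pct_calling_for_rct = round(100 * called_for_rct / recommended_practice)
no_call_for_rct = recommended_practice - called_for_rct

print(f"{pct_recommending}% of studies recommended a practice")
print(f"{pct_calling_for_rct}% of those called for a randomized trial")
print(f"{no_call_for_rct} recommended a practice without calling for one")
```

The numbers hang together: 56%, 14%, and 143.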
As Ioannidis wrote in 2005, even in well-done observational studies, it’s more likely for a research claim to be false than true, and association doesn’t prove causation. But look at the amount of data that rolls right through these red lights. It’s hard not to come away with the impression of illusory superiority from the authors of observational studies, a bias we haven’t talked about, “whereby a person overestimates [his or her] own qualities and abilities, in relation to the same qualities and abilities of other persons.”
In the case of observational epidemiology, there appears to be both illusory superiority (one of my favorite examples is that more than 90% of people in the US put their own driving skills in the top 50%) and illusion of control, where, more importantly, investigators overestimate the quality of their own research in relation to the inherent limitations of the performance characteristics of their own equipment. Mistakes are made in epidemiology, but not by us. . . Correlation doesn’t imply causation, but our study’s different. . . Yes, yes, healthy-users, confounding, and so on are pervasive threats, but we controlled for those. (There’s something I like to call the Feynman effect that might be at play here. It goes like this. I’m very aware of what I don’t know. I’ve read and appreciate Mistakes Were Made (but not by me). I continually remind myself of Feynman’s first principle: I must not fool myself, and I am the easiest person to fool. Because I’m acutely aware of my own limitations, I can imply causation, and when a mistake is made, it really wasn’t me who made it. Admittedly, I need to make sure I eat my own cooking on this one.)
It’s also problematic that these top-tier journals publish these extrapolations. We all need to be our own intellectual and literal gatekeepers. (We’re trying, but almost assuredly a few pucks are reaching the back of our own nets on a daily basis. Did I mention I grew up in Canada and played goalie as a kid?) For the journal itself, this gatekeeping is more literal and sometimes just means upholding their own standards.
Does lack of correlation imply lack of causation?
Observational epidemiology is essentially a categorical and systematic way of observing things. It is often said that a hypothesis is an educated guess. Observational epidemiology can help us learn more about our guess and it can help create new guesses. But it can do more harm than good if misused. Even if we believe observational epidemiology is the best system available to infer causality, it also is very capable of providing wrong answers. (A 2005 paper by John Ioannidis, as we mentioned, showed that it is more likely for a research claim to be false than true. We also cited, in Part III, an analysis that demonstrated when observational claims were then tested in subsequent randomized trials, those claims had a 0-for-52 success rate.) Here’s the rub with observational epidemiology: cases like smoking and lung cancer, or chimney sweeps and scrotal cancer, are the exceptions, not the rule.
One thing I’d be remiss not to include here (and that I alluded to in the proviso in Part II) is a brief word on the “contrapositive” cases, where I think observational epidemiology is not used enough. Quite simply, if study after study (after study…) fails to show an association, or the association is consistently very weak, isn’t it likely there is no there there? Are the “Bradford Hill Criteria” more useful in reverse? If observational study after observational study shows a lack of the nine criteria (strength, consistency, specificity, temporality, biological gradient, plausibility, coherence, and analogy) between aspartame and cancer, for example, isn’t it pretty likely aspartame doesn’t cause cancer, at least at the doses most people consume? (You may have noticed there are only eight in that list. The missing criterion, and the only one that is exclusively non-observational, is experiment. This is the only one that works for separating causation from association, argues Ioannidis.)
Limitations to epidemiology, and knowledge, in general
The law of unintended consequences is not one to be ignored or downplayed. We are more wrong than right in science, including experimental science. Yet, our human intuitions are often the opposite: we reflexively think we’re more right than wrong. I’m so guilty of this, in fact, that I force myself to read books that remind me of my own intellectual hubris and the ease with which I can fool myself. (I’ve referred to my favorite books on this topic elsewhere.) Kathryn Schulz, the author of Being Wrong, one of my favorites, says that while we know what it feels like to have been wrong in the past, we don’t know what it feels like to be wrong in the moment. “It does feel like something to be wrong,” she says. “It feels like being right.”
Stop and think about science, especially observational epidemiology, for a moment. An association alone can almost never be proven right, and it’s almost equally difficult to prove an association wrong. Being anything less than brutally honest about this, and anything less than hypercritical of ourselves and what we see, detracts from good science. The vast majority of associations that we find, even when they are “significant,” are not causal. We can so very easily fool ourselves, so if we are to err, we should probably err on the side of skepticism and doubt when it comes to most epidemiology.
As we’ve seen, it is virtually impossible for investigators to eliminate bias using observational studies and statistical methods. Randomization is one of the most reliable methods of reducing bias and confounding and therefore one of the best tools we have for scientific progress. Unfortunately, in epidemiology and public health, it often remains on the sidelines for reasons discussed in Part II of this series. Perhaps Vinay Prasad and his colleagues sum it up best:
In conclusion, our empirical evaluation shows that linking observational results to recommendations regarding medical practice is currently very common in highly influential journals. Such recommendations frequently represent logical leaps. As such, if they are correct, they may accelerate the translation of research but, if they are wrong, they may cause considerable harm.
To date, the evidence strongly suggests more of the latter than the former.
In a future installment, we’ll discuss the power of power analysis, the difference between statistical and practical significance, and more.