To make strides in understanding the world around us, we rely heavily on tools that allow us to see more than we can with the naked eye. The tools we use, and how we apply them, shape how we perceive phenomena, for better or worse. This theme is encapsulated in a recent LA Times article about the re-examination and growing scrutiny of functional magnetic resonance imaging, better known by its abbreviation, fMRI. The story highlights limitations not only in the technology itself, but also in how we analyze and interpret the results it provides.
What is fMRI?
The fMRI concept builds on magnetic resonance imaging (MRI) scan technology. MRI is itself an amazing tool which allows us to visualize interior tissues non-invasively and without exposure to ionizing radiation. Patients simply enter the MRI scanner, and we can obtain three-dimensional, detailed anatomical images of soft tissues such as the brain. For more detail on how MRI works, check out the podcast episode with Raj Attariwala during which we discuss MRI (between 1:03:45 and 1:23:45 of our conversation).
fMRI extends the capabilities of MRI to capture – you guessed it – functional changes in the brain caused by neuronal activity. Neural activity is associated with changes in blood oxygenation levels: when neurons become active, local blood flow to those brain regions increases, and oxygen-rich blood displaces oxygen-depleted blood. Because oxygen-rich blood is less magnetic than oxygen-depleted blood, this blood exchange results in a local change in magnetic resonance signal strength. Variations in signal strength can be mapped across the whole brain to show which brain regions are active at a particular time – i.e., a map of brain function in a particular context. In addition to extensive use in neuroscience research, fMRI technology is applied in clinical settings to determine neurological effects of disease or injury, such as cognitive impairments associated with Alzheimer’s Disease or following stroke.
How is fMRI used?
fMRI is frequently used to examine changes in brain activity while a person performs a task or is presented with a stimulus. Investigators analyze a mass of data to detect correlations between brain activation and whatever task or stimulus is given during the scan. For example, let’s say researchers are interested in how the brain responds to visual food cues vs. images of non-food items. A subject is placed in an fMRI scanner and presented with a series of images depicting appetizing food, while the researchers acquire detailed 3D images of the brain. The subject is then shown a series of images unrelated to food, during which the researchers acquire more images. The researchers then compare the two sets of scans to see whether any correlations exist between the observed brain activity and the food vs. non-food stimuli.
The 3D images obtained with fMRI scans are built from small, cube-shaped units of volume called voxels (a portmanteau of “volume” and “pixel”). These units collectively make up a full 3D fMRI scan much like square-shaped pixels make up a 2D image. Each fMRI voxel typically corresponds to a cube of brain tissue about 3 mm on a side, and a single voxel can encompass as many as a million individual neurons. The exact number of voxels per fMRI scan can vary, but generally exceeds 100,000 in an adult human. In most fMRI analyses, every voxel is treated independently to determine whether it is “activated” by a certain stimulus or task. In other words, analysis involves running over 100,000 individual statistical tests and looking for a correlation in each one.
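To get a feel for the scale involved, here is a minimal back-of-the-envelope sketch in Python. The 64 × 64 × 36 acquisition matrix and 3 mm voxel edge are assumed typical values for illustration, not figures from this article:

```python
# Illustrative voxel arithmetic; scan dimensions are typical assumed values.
voxel_edge_mm = 3.0                    # each voxel is a ~3 mm cube
voxel_volume_mm3 = voxel_edge_mm ** 3  # 27 mm^3 of tissue per voxel

# A common whole-brain acquisition: 64 x 64 in-plane matrix, 36 slices.
matrix = (64, 64, 36)
n_voxels = matrix[0] * matrix[1] * matrix[2]

print(n_voxels)  # 147456 voxels -- comfortably over 100,000 tests
```

Even this modest resolution yields well over 100,000 voxels, each of which gets its own statistical test in a standard mass-univariate analysis.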
The Multiple Comparisons Problem
The more tests you run, the greater the probability you’ll get a false positive: a result that incorrectly indicates the presence of a condition when no such condition is truly present. By performing many comparisons, you increase the chances of finding a result that’s statistically “significant” but reflects nothing real. This is aptly referred to in science as the multiple comparisons problem. Consider a drug that is tested for 100 different effects at the same time, as opposed to a single target outcome. Even if the drug did nothing at all, it would be surprising if it showed no “significant” effect on any of the health measures; the more measurements we make at one time, the larger the chance that a random association will be classified as a meaningful result. There are various ways to counteract this problem – the Bonferroni correction, for example, which divides the significance threshold by the number of tests, lowering the probability of false positives (at the expense of increasing the probability of false negatives).
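A quick simulation makes the drug example concrete: run 100 tests on pure noise and “significant” hits still appear, while a Bonferroni-corrected threshold suppresses them. This is an illustrative sketch with made-up numbers, not a reanalysis of any real study:

```python
import numpy as np

rng = np.random.default_rng(0)
n_tests, alpha = 100, 0.05

# The null hypothesis is true for every "effect": p-values are uniform on
# [0, 1], so each test has a 5% chance of a false positive at alpha = 0.05.
p_values = rng.uniform(0, 1, size=n_tests)

uncorrected = int(np.sum(p_values < alpha))           # expect about 5 hits
bonferroni = int(np.sum(p_values < alpha / n_tests))  # expect about 0 hits

print(uncorrected, bonferroni)
```

The Bonferroni threshold here is 0.05 / 100 = 0.0005 per test, which caps the chance of even one false positive across the whole family of tests at roughly 5%.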
As it turns out, most fMRI studies have not been applying any statistical methods to adjust for multiple comparisons. What’s the likelihood of getting a false positive when running >100,000 tests without correcting for multiple comparisons? If you stuck a dead fish in an fMRI and gave it a task, you’d probably see its brain light up. Think I’m exaggerating?
What happens if you put a dead salmon in an fMRI machine?
This paper, a recipient of the 2012 Ig Nobel Prize in neuroscience, is a beautiful example of a fun—and seemingly ludicrous—study that can make an impact in science. (The Ig Nobel Prizes “are intended to celebrate the unusual, honor the imaginative — and spur people’s interest in science, medicine, and technology.”) To illustrate the magnitude of the problem of multiple comparisons using fMRI, the investigators stuck a dead salmon in the machine, presented photographs of human interactions, and asked the salmon to determine what emotion the individual in the photo must have been experiencing. Upon analyzing the results, these investigators found significant increases in fMRI brain activity in the dead salmon when they did not correct for multiple comparisons. “Out of a search volume of 8064 voxels a total of 16 voxels were significant,” the investigators wrote. When they used methods to correct for multiple comparisons, none of the voxels lit up. If that isn’t a strong argument for multiple comparisons correction, I don’t know what is.
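The 16 “significant” voxels are roughly what chance alone predicts. As a back-of-the-envelope check (the uncorrected voxelwise threshold of p < 0.001 is an assumption here – it is a common fMRI convention, not a number quoted above):

```python
search_volume = 8064  # voxels, from the salmon paper
alpha = 0.001         # assumed uncorrected voxelwise threshold

# Under the null, each voxel has probability alpha of crossing threshold.
expected_false_positives = round(search_volume * alpha, 3)
print(expected_false_positives)  # 8.064 -- the same order as the 16 observed
```

In other words, a dead fish “thinking” in a handful of voxels is about what the arithmetic of uncorrected thresholds predicts.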
The “Test-Retest Reliability” Problem
In addition to the multiple comparisons problem, many fMRI studies face another concern: test-retest reliability. Test-retest reliability asks whether two measurement opportunities in the same subject under the same conditions would result in similar scores. fMRI is generally a reliable measure for average brain activity responses to a particular task or stimulus among a group of people, but many recent studies have focused on using it to predict activity patterns at the level of an individual person. This is where fMRI reliability falls short, as any given person experiences fluctuations in exact blood flow patterns and is therefore unlikely to yield similar fMRI images across separate tests. The senior author of a recent meta-analysis of 90 fMRI experiments, Ahmad Hariri, put it this way: “The correlation between one scan and a second is not even fair, it’s poor.” To the author’s credit, he hopes to use what he’s learned from this analysis to amend his own research practices as well as those of others in his field. “[These results are] more relevant to my work than just about anyone else!” Hariri said. “I’m going to throw myself under the bus. This whole sub-branch of fMRI could go extinct if we can’t address this critical limitation.”
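The reliability gap can be illustrated with a toy model: if session-to-session noise is larger than the stable individual “trait,” scan and rescan scores barely correlate, even though nothing is wrong with either measurement. The variance ratio below is an arbitrary illustrative choice, not an estimate from the meta-analysis:

```python
import numpy as np

rng = np.random.default_rng(1)
n_subjects = 90

# Each subject has a stable "trait" level of task activation, but every
# scan adds session-specific noise (arousal, motion, blood-flow state).
trait = rng.normal(0.0, 1.0, n_subjects)
scan1 = trait + rng.normal(0.0, 2.0, n_subjects)  # noise SD twice trait SD
scan2 = trait + rng.normal(0.0, 2.0, n_subjects)

# Individual scores barely correlate across sessions:
test_retest_r = np.corrcoef(scan1, scan2)[0, 1]  # theoretical value 1/(1+4) = 0.2
print(round(test_retest_r, 2))
```

Group averages remain stable under this model (the noise averages out across 90 subjects), which is why fMRI can be reliable for group comparisons while failing at individual prediction.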
The bottom line
The tools we use for observation and experimentation are critical components of the scientific process. To understand the host of natural phenomena that evade our sensory perception, we depend heavily on technological progress to give us tools like fMRI, which allow us to see more than ever before. That said, new tools bring new risks of misuse and misinterpretation of results. As technology opens new doors, we – like Hariri and his colleagues – must constantly evaluate its limitations, even as we look to expand its applications.
This is the butterfly effect, or an example of the rounding-error problem from chaos theory.
Maybe every field of science has an Achilles heel that they willingly ignore in order to maintain productivity, or the semblance of progress. I know mine does.
An easy way to explore this in Excel is to create a spreadsheet with x columns and y rows. Label the columns “Variable 1” through “Variable x” and the rows “Sample 1” through “Sample y,” then fill the cells with random numbers and test for statistical significance. The more columns, the more supposedly significant “Variables” you will find; the more rows (“Samples”), the smaller the error gets, but false positives will still show up. Remember: these are completely random numbers. The same can be done with more precision in R or S, but Excel is easy and sufficient for this.
Mine certainly has it: Nutritional Science and the Food Recall Questionnaire.
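The spreadsheet-of-noise experiment the commenter describes can also be run in a few lines of Python (NumPy stands in for Excel here; the row and column counts are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_variables = 30, 200  # rows ("Samples") and columns ("Variables")

# Fill the whole "spreadsheet" with independent standard-normal noise.
data = rng.normal(size=(n_samples, n_variables))

# One-sample t-statistic for each column against a true mean of zero.
means = data.mean(axis=0)
sems = data.std(axis=0, ddof=1) / np.sqrt(n_samples)
t = means / sems

# |t| > 2.045 is roughly p < 0.05 two-sided with 29 degrees of freedom.
n_significant = int(np.sum(np.abs(t) > 2.045))
print(n_significant)  # expect roughly 0.05 * 200 = 10 "significant" columns
```

Every column is pure noise, yet around 5% of them cross the significance threshold, exactly as the multiple comparisons problem predicts.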
This post, while well-intentioned, is not quite correct. The claim that “most fMRI studies have not been applying any statistical methods to adjust for multiple comparisons” is simply incorrect. The multiple comparisons problem in fMRI was recognized and addressed over 20 years ago, and virtually all fMRI studies today perform multiple comparisons correction – it’s a standard processing step in all major fMRI processing platforms.

On the other hand, the test-retest reliability of fMRI has been an issue – not because the fMRI signal is unreliable, however, but because it’s so rich with information that we have not fully modeled it – and we’re making good progress. On my podcast “OHBM Neurosalience” (link below), I have a great discussion with Ahmad Hariri on this issue, and the final sentiment is much more positive, with clear avenues for traction to be made in improving what fMRI can show.

The bottom line is that before we make any definitive statement about the limitations of fMRI, we need to more fully understand all the information that is in the signal so that we can separate it from the noise. As a simple analogy: if you measure the blood pressure of individuals who are randomly either running or sitting down (and don’t use this information), you would conclude that blood pressure measures are unreliable. In fMRI, even activation-based fMRI, the brains of individuals move between many different states. We are trying to derive stable brain traits, but we still need to account for the states to reduce variability. The actual measures are likely more reliable than we may at first conclude; however, we need to better account for the states that the brain moves between. There’s more to this, but I just wanted to illustrate the point – quite a bit more progress will be made.
Having been in the field for over 30 years, I’m perhaps biased, but my hope and excitement about the future of fMRI – for understanding the brain and for making a clinical impact – is only increasing. https://open.spotify.com/episode/1bkkS7xiUlBGfrfj63yqX4
Just to add: many other strategies have been used to increase fMRI reproducibility, ranging from MRI pulse sequences (e.g., something called “multi-echo EPI,” which separates blood oxygenation effects from other signal sources), to removing effects such as motion, respiration, and cardiac noise, to more advanced machine learning strategies. While fMRI and other neuroimaging methods have clear limits, we have not really found them yet.