To make strides in understanding the world around us, we rely heavily on tools that let us see more than the naked eye can. The tools we use, and how we apply them, shape how we perceive phenomena, for better or worse. This theme is captured in a recent article in the LA Times about the growing scrutiny of functional magnetic resonance imaging, better known by its abbreviation, fMRI. The story highlights limitations not only of the technology itself, but also of how we analyze and interpret the results it provides.
What is fMRI?
The fMRI concept builds on magnetic resonance imaging (MRI) scan technology. MRI is itself an amazing tool which allows us to visualize interior tissues non-invasively and without exposure to ionizing radiation. Patients simply enter the MRI scanner, and we can obtain three-dimensional, detailed anatomical images of soft tissues such as the brain. For more detail on how MRI works, check out the podcast episode with Raj Attariwala during which we discuss MRI (between 1:03:45 and 1:23:45 of our conversation).
fMRI extends the capabilities of MRI to capture – you guessed it – functional changes in the brain caused by neuronal activity. Neural activity is associated with changes in blood oxygenation levels: when neurons become active, local blood flow to those brain regions increases, and oxygen-rich blood displaces oxygen-depleted blood. Because oxygen-rich blood is less magnetic than oxygen-depleted blood, this exchange produces a local change in magnetic resonance signal strength. Variations in signal strength can be mapped across the whole brain to show which regions are active at a particular time – i.e., a map of brain function in a particular context. In addition to extensive use in neuroscience research, fMRI technology is applied in clinical settings to determine neurological effects of disease or injury, such as cognitive impairments associated with Alzheimer’s disease or following stroke.
How is fMRI used?
fMRI is frequently used to examine changes in brain activity while a person performs a task or is presented with a stimulus. Investigators analyze a mass of data to detect correlations between brain activation and whatever task or stimulus is given during the scan. For example, let’s say researchers are interested in how the brain responds to visual food cues vs. images of non-food items. A subject is placed in an fMRI scanner and presented with a series of images depicting appetizing food while the researchers acquire 3D images of the brain. The subject is then presented with a series of images unrelated to food, during which the researchers acquire more images. The researchers then compare the two sets of scans to see whether the observed brain activity differs between the food and non-food stimuli.
The 3D images obtained with fMRI scans are built from small, cube-shaped units of volume called voxels (a portmanteau of “volume” and “pixel”). These units collectively make up a full 3D fMRI scan much like square pixels make up a 2D image. Typically corresponding to a brain volume of about 3 mm³, a single fMRI voxel can encompass as many as a million individual neurons. The exact number of voxels per fMRI scan can vary, but generally exceeds 100,000 in an adult human. In most fMRI analyses, every voxel is treated independently to determine whether it is “activated” by a certain stimulus or task. In other words, analysis involves running over 100,000 individual tests and looking for a correlation in each one of them.
The Multiple Comparisons Problem
The more tests you run, the greater the probability you’ll get a false positive: a result that incorrectly indicates the presence of a condition when no such condition is truly present. By performing many comparisons, you are increasing the chances of finding a result that’s statistically “significant” but actually not there at all. This is aptly referred to in science as the multiple comparisons problem. Consider a drug that is tested for 100 different effects at the same time, as opposed to a single target outcome. It would be surprising if the drug had no significant effect on any of the health measures; the more measurements we make at one time, the larger the chance that a random association will be classified as a meaningful result. There are various ways to counteract this problem – the Bonferroni correction, for example, which statistically lowers the probability of false positives (at the expense of increasing the probability of false negatives).
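To put rough numbers on this, here is a minimal Python sketch. It assumes the tests are independent and that each one uses the conventional significance threshold of 0.05; neither assumption comes from the studies discussed here, and the function names are made up for illustration.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the overall false positive rate at or below alpha."""
    return alpha / num_tests


if __name__ == "__main__":
    # Assumed test counts: a single outcome, ten outcomes, and the 100-effect drug example.
    for num_tests in (1, 10, 100):
        print(
            num_tests,
            round(familywise_error_rate(num_tests), 3),
            bonferroni_threshold(num_tests),
        )
```

Under these assumptions, testing 100 effects at once gives roughly a 99.4% chance of at least one false positive. The Bonferroni threshold shrinks in proportion to the number of tests, which is also why it costs statistical power and raises the rate of false negatives.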
As it turns out, most fMRI studies have not been applying any statistical methods to adjust for multiple comparisons. What’s the likelihood of getting a false positive when running >100,000 tests without correcting for multiple comparisons? If you stuck a dead fish in an fMRI scanner and gave it a task, you’d probably see its brain light up. Think I’m exaggerating?
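A quick simulation gives a feel for the scale. The sketch below is a simplification, not a model of real fMRI data: it treats every voxel as pure, independent noise (real voxels are spatially correlated) and assumes a per-voxel threshold of 0.05. When nothing is truly there, p-values are uniformly distributed, so each one is drawn as a random number between 0 and 1. All names and values are illustrative.

```python
import random


def count_false_positives(num_voxels, threshold, seed=0):
    """Count noise-only voxels whose p-value falls below the threshold.

    Every voxel here is pure noise, so its p-value is drawn uniformly
    from [0, 1) and any "significant" voxel is a false positive.
    """
    rng = random.Random(seed)
    return sum(rng.random() < threshold for _ in range(num_voxels))


NUM_VOXELS = 100_000  # roughly the voxel count quoted above
ALPHA = 0.05          # assumed conventional threshold, not taken from any study

uncorrected = count_false_positives(NUM_VOXELS, ALPHA)
corrected = count_false_positives(NUM_VOXELS, ALPHA / NUM_VOXELS)  # Bonferroni

print("expected by chance:", NUM_VOXELS * ALPHA)
print("uncorrected false positives:", uncorrected)
print("Bonferroni-corrected false positives:", corrected)
```

With these assumed numbers, about 5,000 voxels are expected to pass the uncorrected threshold by chance alone, while the corrected threshold lets through almost none.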
What happens if you put a dead salmon in an fMRI machine?
This paper, a recipient of the 2012 Ig Nobel Prize in neuroscience, is a beautiful example of a fun—and seemingly ludicrous—study that can make an impact in science. (The Ig Nobel Prizes “are intended to celebrate the unusual, honor the imaginative — and spur people’s interest in science, medicine, and technology.”) To illustrate the magnitude of the problem of multiple comparisons using fMRI, the investigators stuck a dead salmon in the machine, presented photographs of human interactions, and asked the salmon to determine what emotion the individual in the photo must have been experiencing. Upon analyzing the results, these investigators found significant increases in fMRI brain activity in the dead salmon when they did not correct for multiple comparisons. “Out of a search volume of 8064 voxels a total of 16 voxels were significant,” the investigators wrote. When they used methods to correct for multiple comparisons, none of the voxels lit up. If that isn’t a strong argument for multiple comparisons correction, I don’t know what is.
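For a sense of whether 16 "active" voxels in a dead fish is plausible by chance, the sketch below compares the quoted figures with the number of noise-only voxels expected to pass a few per-voxel thresholds. The thresholds are hypothetical, since the threshold the investigators used is not quoted here, and the calculation ignores any other criteria their analysis may have applied.

```python
SEARCH_VOLUME = 8064  # voxels searched, as quoted from the paper
SIGNIFICANT = 16      # voxels reported as significant without correction


def expected_chance_hits(num_voxels, per_voxel_threshold):
    """Average number of noise-only voxels that pass the threshold."""
    return num_voxels * per_voxel_threshold


observed_fraction = SIGNIFICANT / SEARCH_VOLUME
print(f"observed fraction significant: {observed_fraction:.4%}")

# Hypothetical per-voxel thresholds, chosen only for illustration.
for threshold in (0.05, 0.01, 0.001):
    hits = expected_chance_hits(SEARCH_VOLUME, threshold)
    print(f"threshold {threshold}: about {hits:.1f} voxels expected by chance")

# Bonferroni-style per-voxel threshold for an assumed overall rate of 0.05.
print(f"corrected per-voxel threshold: {0.05 / SEARCH_VOLUME:.2e}")
```

Even at a strict-sounding per-voxel threshold of 0.001, around eight voxels would be expected to cross the line in a volume this size, so a handful of significant voxels says little on its own.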
The “Test-Retest Reliability” Problem
In addition to the multiple comparisons problem, many fMRI studies face another concern: test-retest reliability. Test-retest reliability asks whether two measurements of the same subject under the same conditions yield similar results. fMRI is generally a reliable measure of average brain activity responses to a particular task or stimulus across a group of people, but many recent studies have focused on using it to predict activity patterns at the level of an individual person. This is where fMRI reliability falls short: any given person experiences fluctuations in exact blood flow patterns and is therefore unlikely to yield similar fMRI images across separate tests. The senior author of a recent meta-analysis of 90 fMRI experiments, Ahmad Hariri, put it this way: “The correlation between one scan and a second is not even fair, it’s poor.” To the author’s credit, he hopes to use what he’s learned from this analysis to amend his own research practices as well as those of others in his field. “[These results are] more relevant to my work than just about anyone else!” Hariri said. “I’m going to throw myself under the bus. This whole sub-branch of fMRI could go extinct if we can’t address this critical limitation.”
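The gap between group-level and individual-level reliability can be illustrated with a toy simulation. This is a sketch under invented assumptions, not a reconstruction of the meta-analysis: each simulated subject has a stable activation level, each scan adds independent noise that is larger than the differences between subjects, and a plain Pearson correlation stands in for whatever reliability statistic a real study would use.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_sessions(num_subjects, trait_sd, noise_sd, seed=0):
    """Return two simulated scan sessions for the same subjects.

    Each subject has a stable activation level (mean 1.0, spread trait_sd);
    each session adds independent scan-to-scan noise (spread noise_sd).
    """
    rng = random.Random(seed)
    traits = [rng.gauss(1.0, trait_sd) for _ in range(num_subjects)]
    scan1 = [t + rng.gauss(0.0, noise_sd) for t in traits]
    scan2 = [t + rng.gauss(0.0, noise_sd) for t in traits]
    return scan1, scan2


# All values below are made up for illustration.
scan1, scan2 = simulate_sessions(num_subjects=2000, trait_sd=1.0, noise_sd=2.0)
print("group mean, scan 1:", round(sum(scan1) / len(scan1), 2))
print("group mean, scan 2:", round(sum(scan2) / len(scan2), 2))
print("test-retest correlation:", round(pearson(scan1, scan2), 2))
```

In this setup the two group averages land close together, because noise averages out across many people, while the subject-by-subject correlation hovers near trait_sd² / (trait_sd² + noise_sd²) = 0.2. A stable group result and a poor individual-level correlation can therefore coexist.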
The bottom line
The tools we use for observation and experimentation are critical components of the scientific process. To understand the host of natural phenomena that evade our senses, we depend heavily on technological progress to give us tools like fMRI, which allow us to see more than ever before. That said, new tools bring new risks of misuse and misinterpretation. As technology opens new doors, we – like Hariri and his colleagues – must constantly evaluate its limitations, even as we look to expand its applications.