When Miguel Ángel del Pozo got the email, he didn’t feel great about it. As the head of a lab at Madrid’s Centro Nacional de Investigaciones Cardiovasculares Carlos III, del Pozo had been lead author on a 2011 article in the journal Cell about how a molecule called caveolin-1 changes the microenvironment around cancer cells. Yet here it was, four years after the actual research, and the Reproducibility Project was calling.
Del Pozo’s paper was one of more than 50, nearly 200 experiments in all, that a research team hoped to replicate—to re-create and see if they reached the same conclusion. In 2013, Brian Nosek and Tim Errington of the Center for Open Science announced their intentions to try this in the field of preclinical cancer biology. “When you get the email that they are going to reproduce your results in a very high-impact journal, there are two sides to that,” del Pozo says. After all, what if you got it wrong? But he felt some responsibility. So del Pozo agreed to play along.
That was critical, because the scientists-for-hire who were redoing the experiments needed help. For any given attempt, they might need clarifications on the protocols or access to specific reagents, antibodies, or cell lines. Sometimes they needed a particular strain of genetically engineered mouse. The asks could get big. Del Pozo is the kind of person who gets 2,000 emails a day. He’s a single dad. He runs a big lab. And the two postdoctoral researchers who’d actually led the research had long since departed to run their own labs. “It was a big effort for us to provide them with all the information they requested. It was a huge amount of detail, even the batch number of the antibody we used. But I took this seriously, and I thought, ‘We have to help,’” del Pozo says. “Also, our reputation is kind of being examined. Nobody wants this to happen, but when it happens, I think it’s better not to hide.”
The Reproducibility Project: Cancer Biology was indeed a massive effort that would eventually involve 200 people and publish more than 50 papers. Now, in two “capstone” articles published today in the journal eLife, Errington and his colleagues report their final results. But they didn’t work out exactly as the Center for Open Science team hoped.
One of the papers is a meta-analysis of their findings, and the other—spoiler here—is on the challenges the team faced in trying to make this happen. Del Pozo was a bit of an exception; more than half of the researchers the team contacted declined to share data or didn’t respond at all. About the same number wouldn’t share their reagents. Some wouldn’t (or couldn’t) clarify their methods sections enough to be useful. In the end, the team could only replicate 50 experiments from 23 papers.
Those experiments didn’t do so well. The COS team used five different ways to look at success or failure—like whether the effect size they saw in the replication was bigger than in the original, or whether that original effect size was within the confidence interval of the replication effect size, or vice versa. Of the five criteria, more than half the effects the group looked at missed three or more targets. One in five missed them all.
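To make the criteria above concrete, here is a minimal sketch of the three comparisons the paragraph names. It is an illustration under stated assumptions, not the COS team's actual analysis: the `Effect` class, the `compare` function, and every number below are hypothetical, and the two remaining criteria are not described here, so they are left out.

```python
from dataclasses import dataclass


@dataclass
class Effect:
    """One reported effect (hypothetical structure, not from the study)."""
    estimate: float  # point estimate of the effect size
    ci_low: float    # lower bound of its confidence interval
    ci_high: float   # upper bound of its confidence interval


def in_interval(value: float, low: float, high: float) -> bool:
    """True if value lies inside the closed interval [low, high]."""
    return low <= value <= high


def compare(original: Effect, replication: Effect) -> dict:
    """Apply the three checks named in the text to one original/replication pair."""
    return {
        # Was the replication effect at least as large as the original?
        "replication_at_least_as_large":
            abs(replication.estimate) >= abs(original.estimate),
        # Does the original estimate fall inside the replication's interval?
        "original_in_replication_ci":
            in_interval(original.estimate, replication.ci_low, replication.ci_high),
        # The "vice versa" check: replication estimate inside the original's interval?
        "replication_in_original_ci":
            in_interval(replication.estimate, original.ci_low, original.ci_high),
    }


# Made-up numbers, chosen only to show a replication that misses every check.
original = Effect(estimate=0.80, ci_low=0.30, ci_high=1.30)
replication = Effect(estimate=0.12, ci_low=-0.20, ci_high=0.44)

results = compare(original, replication)
missed = sum(not passed for passed in results.values())
print(results, "missed:", missed)
```

Counting misses across several criteria, rather than relying on a single pass/fail test, is what lets the team say an effect "missed three or more targets."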
The positive results that the original experiments had shown (a shrunken tumor, improved survival rates, whatever) got less impressive. If a re-do found the positive effect that the original did, the re-do effect size was a median of 85 percent smaller. Studies on animals like mice replicated less often and had smaller effect sizes compared to studies in cells. (A couple of del Pozo’s experiments replicated. One didn’t.) “The findings have challenges for the credibility of preclinical cancer biology,” said Nosek, executive director of the Center for Open Science, at a press conference last week.
You’ve heard this riff before. In 2005, a Stanford physician named John Ioannidis published a now-famous essay called “Why Most Published Research Findings Are False,” laying out the incentives that lead researchers to publish only positive results and even sometimes juke their statistics—enough, even within the bounds of research ethics, to bias articles to “yay” instead of “nope.” (Ioannidis has since become something of an iconoclast about anti-Covid measures.) That led, over the next decade, to a “reproducibility crisis,” with researchers like Errington and Nosek finding failures to replicate in fields like psychology and economics, and hints that the problems were rife in every specialty from astrophysics to zoology.
Preclinical cancer biology was particularly worrisome. It’s the science that happens before a molecule becomes a drug candidate, before human trials, before regulatory approval and doctors writing prescriptions. In 2011, researchers at Bayer reported that only 20 to 25 percent of their internal attempts to replicate preclinical cancer work succeeded; the next year, researchers at the pharmaceutical giant Amgen wrote in Nature that just about 10 percent of the basic, bench-level research they saw was actually reproducible. “The criticism of that work, which was legitimate, was that we never disclosed the papers we couldn’t reproduce. We couldn’t because we’d signed confidentiality agreements with the researchers,” says Glenn Begley, the head of oncology and hematology at Amgen at the time and coauthor of the 2012 Nature paper. The two new COS papers are “significant steps forward, because they did their work prospectively, whereas what I did was a historical review over 10 years,” he says. “What they’ve done is taken papers that have already received attention within the scientific community, and then they’ve set out to try and determine whether or not that work could be independently reproduced. These are really excellent papers. They’re really first class.”
The work took years—all of the papers are from the 2010 to 2013 timeframe. And the results, or their lack, come as something of a double disappointment. “The whole point of why researchers, myself included, get into the preclinical space is you’re hoping to make an impact. If you’re really lucky, you can pop out and make a difference in the world,” Errington says. His psychology replication work helped to discredit some popular, TED-talk level work in the field, incited backlash against what some researchers saw as vigilantism, and forced significant reflection among researchers.
The outcomes here are much less clear. The extensive supplementary materials the replication team handed out helpfully distinguish between “reproducibility” (do the results of an experiment turn out the same if you do it again with the same data and approach?) and “replicability” (can a new, overlapping experiment with new data yield reliably similar results?).
The COS team has tried to be explicit about how messy this all is. If an experiment fails to replicate, that doesn’t mean it’s unreplicable. It could have been a problem with the replication, not the original work. Conversely, an experiment that someone can reproduce or replicate perfectly isn’t necessarily right, and it isn’t necessarily useful or novel.
But the truth is, 100 percent pure replication isn’t really possible. Even with the same cell lines or the same strain of genetically tweaked mice, different people do experiments differently. Maybe the experiments the replication team didn’t have the materials to complete would have replicated better. Maybe the “high-impact” articles from the most prestigious journals were bolder, risk-taking work that’d be less likely to replicate.
Cancer biology has high stakes. It’s supposed to lead to life-saving drugs, after all. The work that didn’t replicate for Errington’s team probably didn’t lead to any dangerous drugs or harm any patients, because Phase 2 and Phase 3 trials tend to sift out the bad seeds. According to the Biotechnology Industry Organization, only 30 percent of drug candidates make it past Phase 2 trials, and just 58 percent make it past Phase 3. (Good for determining safety and efficacy, bad for blowing all that research money and inflating drug costs.) But drug researchers acknowledge, quietly, that most approved drugs don’t work all that well, especially cancer drugs.
Science obviously works, broadly. So why is it so hard to replicate an experiment? “One answer is: Science is hard,” Errington says. “That’s why we fund research and invest billions of dollars just to make sure cancer research can have an impact on people’s lives. Which it does.”
The point of less-than-great outcomes like the cancer project’s is to distinguish between what’s good for science internally and what’s good for science when it reaches civilians. “There are two orthogonal concepts here. One is transparency, and one is validity,” says Shirley Wang, an epidemiologist at Brigham and Women’s Hospital. She’s codirector of the Reproducible Evidence: Practices to Enhance and Achieve Transparency (“Repeat”) Initiative, which has done replication work on 150 studies that used electronic health records as their data. (Wang’s Repeat paper hasn’t been published yet.) “I think the issue is that we want that convergence of both,” she says. “You can’t tell if it’s good quality science unless you can be clear about the methods and reproducibility. But even if you can, that doesn’t mean it was good science.”
The point, then, isn’t to critique specific results. It’s to make science more transparent, which should in turn make the results more replicable, more understandable, maybe even more likely to translate to the clinic. Right now, academic researchers don’t have an incentive to publish work that other researchers can replicate. The incentive is just to publish. “The metric of success in academic research is getting a paper published in a top-tier journal and the number of citations the paper has,” Begley says. “For industry, the metric of success is a drug on the market that works and helps patients. So we at Amgen couldn’t invest in a program that we knew from the beginning didn’t really have legs.”
Wasting less money on doomed drug trials might actually help make drugs cheaper. That’s something the funders of this research clearly care about. The money came from Arnold Ventures (at inception, the Laura and John Arnold Foundation), a longtime funder of reproducibility work. Reproducibility isn’t the only cudgel the Arnolds wield against the drug business; they were also funders and supporters of a Democratic plan in 2019 to lower drug prices, and as of 2018 (according to an article in The Wall Street Journal) they were the main funders of the Institute for Clinical and Economic Review at Harvard Medical School, a small but powerful advocate of a measure called Quality-Adjusted Life Years that calculates a drug’s value as a function of how much life it extends: a very expensive drug should add years onto a lifespan, not months or weeks.
Or look at an even more basic ethical determination: Lab animals get “sacrificed,” killed, as part of these studies. Human volunteers in drug trials potentially suffer and assume risk. If an experiment has no chance of working, that all becomes unjustifiable. It’d be good to know when to stop.
The way to fix all this, then, is to incentivize researchers to share their data and the trickiest parts of their methods, and preregister their protocols and hypotheses for everyone to see—all the things that make experiments easier to replicate. Right now, granting agencies don’t ask for all of that, and neither do all journals. Del Pozo says one of the reasons he had to spend time clarifying his protocols for the replication team is that the journal he published in only let him include seven charts or graphs. That meant his group had to cut information that could’ve been useful. “Science is like this. Many times, the dogma changes,” he says. “It’s not like Ptolemy was lying about the sun going around the Earth. With the tools he had, it was the best he could propose. So I am not afraid.”
Updated 12/7/21 12:55 PM PT: A previous version of this story misstated the amount of time between del Pozo's research and the email he received, and conflated the number of papers versus experiments the project hoped initially to replicate.