Sometimes, basic information is missing because it’s proprietary—an issue especially for industry labs. But it’s more often a sign of the field’s failure to keep up with changing methods, Dodge says. A decade ago, it was more straightforward to see what a researcher changed to improve their results. Neural networks, by comparison, are finicky; getting the best results often involves tuning thousands of little knobs, what Dodge calls a form of “black magic.” Picking the best model often requires a large number of experiments. The magic gets expensive, fast.
Even the big industrial labs, with the resources to design the largest, most complex systems, have signaled alarm. When Facebook attempted to replicate AlphaGo, the system developed by Alphabet’s DeepMind to master the ancient game of Go, the researchers appeared exhausted by the task. The vast computational requirements—millions of experiments running on thousands of devices over days—combined with unavailable code, made the system “very difficult, if not impossible, to reproduce, study, improve upon, and extend,” they wrote in a paper published in May. (The Facebook team ultimately succeeded.)
The AI2 research proposes a solution to that problem. The idea is to provide more data about the experiments that took place. You can still report the best model you obtained after, say, 100 experiments—the result that might be declared “state of the art”—but you also would report the range of performance you would expect if you only had the budget to try it 10 times, or just once.
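The budget-aware reporting described above can be sketched in a few lines. Assuming each experiment's score is an independent draw from the pool of trials you actually ran, the expected best score under a budget of k trials has a simple closed form over the empirical distribution. The function name `expected_best` and the accuracy numbers below are illustrative, not AI2's actual code:

```python
import random

def expected_best(scores, budget):
    """Expected best score from `budget` i.i.d. draws (with replacement)
    out of the observed experiment scores.

    With scores sorted ascending, P(best of k <= s_i) = (i/n)^k, so the
    expectation telescopes over the sorted list."""
    s = sorted(scores)
    n = len(s)
    return sum(
        s[i - 1] * ((i / n) ** budget - ((i - 1) / n) ** budget)
        for i in range(1, n + 1)
    )

# Simulate 100 hyperparameter trials with noisy accuracies.
random.seed(0)
trials = [0.70 + 0.2 * random.random() for _ in range(100)]

# Report not just the best of 100, but what a smaller lab could expect.
for b in (1, 10, 100):
    print(f"budget {b:3d}: expected best accuracy {expected_best(trials, b):.3f}")
```

With a budget of 1 the estimate is just the mean trial score; as the budget grows it climbs toward the single best result, which is the gap the "show your work" proposal asks authors to disclose.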
The point of reproducibility, according to Dodge, isn’t to replicate the results exactly. That would be nearly impossible given the natural randomness in neural networks and variations in hardware and code. Instead, the idea is to offer a road map to reach the same conclusions as the original research, especially when that involves deciding which machine-learning system is best for a particular task.
That could help research become more efficient, Dodge explains. When his team rebuilt some popular machine-learning systems, they found that for some budgets, older methods made more sense than flashier ones. The idea is to help smaller academic labs by outlining how to get the best bang for their buck. A side benefit, he adds, is that the approach could encourage greener research, given that training a large model can generate emissions comparable to those of a car over its lifetime.
Pineau says she’s heartened to see others trying to “open up the models,” but she’s unsure whether most labs would take advantage of those cost-saving benefits. Many researchers would still feel pressure to use more computing power to stay at the cutting edge, and then tackle efficiency later. It’s also tricky to generalize how researchers should report their results, she adds. It’s possible AI2’s “show your work” approach could mask complexities in how researchers select the best models.
Those variations in methods are partly why the NeurIPS reproducibility checklist is voluntary. One stumbling block, especially for industrial labs, is proprietary code and data. If, say, Facebook is doing research with your Instagram photos, there’s an issue with sharing that data publicly. Clinical research involving health data is another sticking point. “We don’t want to move toward cutting off researchers from the community,” she says.
It’s difficult, in other words, to develop reproducibility standards that work without constraining researchers, especially as methods rapidly evolve. But Pineau is optimistic. Another component of the NeurIPS reproducibility effort is a challenge that involves asking other researchers to replicate accepted papers. Compared with other disciplines, like the life sciences, where old methods die hard, the field is more open to putting researchers in those kinds of sensitive situations. “It’s young both in terms of its people and its technology,” she says. “There’s less inertia to fight.”