AI2 Benchmarks Expose Gaps in AI Science Agents

allenai.org | ksl

The Allen Institute for AI published findings from two benchmarks, ScienceWorld and DiscoveryWorld, designed to test whether AI agents can actually do science rather than just answer questions about it. The results are humbling. Models that ace multiple-choice exams initially failed over 90% of ScienceWorld's hands-on tasks, and even current leading models complete only about 20% of DiscoveryWorld's harder challenges, where human scientists with advanced degrees solve 70%. The gap between knowing what a melting point is and figuring out how to measure one turns out to be enormous. As Anthropic, Google DeepMind, and others race to build autonomous research agents, AI2's work is quietly establishing the evaluation infrastructure that will determine whether those claims hold up.
