Nature Study: Human Scientists Still Beat the Best AI Agents on Complex Research Tasks
What Happened
Researchers benchmarked leading AI agents against expert scientists on open-ended tasks requiring experimental design, literature synthesis, and iterative reasoning. Humans won across most task categories; the agents failed most conspicuously on tasks that required recognizing when a hypothesis should be abandoned. The paper tempers widespread claims that frontier models are approaching autonomous research capability.
My Take
This will get spun as a feel-good "humans still matter" result, but that framing buries the real finding. The specific failure mode is sunk-cost blindness: agents don't abandon bad hypotheses. That gap is deeply instructive because it isn't a scaling problem. No amount of compute fixes it; you need a different objective function, one that rewards stopping as well as continuing. Labs claiming "autonomous scientist" capabilities within 12 months are underestimating how load-bearing that gap is. I'd bet the 2027 AI Index reports only marginal progress on this specific axis.
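To make "a different objective function" concrete, here's a minimal sketch of what an explicit abandonment rule could look like: a sequential probability ratio test over experimental outcomes, which commits to dropping a hypothesis once the evidence against it crosses a threshold. This is purely illustrative; the paper doesn't describe the agents' internals, and the function name `sprt_abandon`, the success-rate parameters, and the simulated outcomes are all my assumptions.

```python
import math
import random

def sprt_abandon(observations, p_h1=0.7, p_h0=0.5, alpha=0.05, beta=0.05):
    """Toy stopping rule via a sequential probability ratio test.

    H1: the hypothesis is good (experiments succeed at rate p_h1).
    H0: the hypothesis is bad  (experiments succeed at rate p_h0).
    Returns ("abandon" | "keep" | "undecided", experiments_used).
    """
    upper = math.log((1 - beta) / alpha)   # cross above: keep the hypothesis
    lower = math.log(beta / (1 - alpha))   # cross below: abandon it
    llr = 0.0                              # running log-likelihood ratio
    for i, success in enumerate(observations, start=1):
        # Each experimental outcome shifts the evidence toward H1 or H0.
        if success:
            llr += math.log(p_h1 / p_h0)
        else:
            llr += math.log((1 - p_h1) / (1 - p_h0))
        if llr <= lower:
            return "abandon", i
        if llr >= upper:
            return "keep", i
    return "undecided", len(observations)

if __name__ == "__main__":
    random.seed(0)
    # Simulate a genuinely bad hypothesis: successes arrive at the H0 rate.
    outcomes = [random.random() < 0.5 for _ in range(200)]
    decision, steps = sprt_abandon(outcomes)
    print(decision, "after", steps, "experiments")
```

The design point is that the abandonment threshold lives in the decision rule, not in the model. You can scale the underlying model indefinitely and the missing stopping criterion stays missing, which is why this reads as an objective-function gap rather than a capability gap.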
Read Original Source