Anthropic Launches BioMysteryBench Eval Showing Claude Solves 30% of Expert-Stumping Bio Problems

Details

Anthropic released BioMysteryBench, a new bioinformatics benchmark of 99 real-world problems built on messy biological datasets, designed to test whether AI can find creative solutions to research problems.
Experts were stumped on 23 of the problems; Anthropic's latest Claude models, including Mythos, solved about 30% of those and most of the remaining problems.
The benchmark features method-agnostic evaluation, objective ground-truth answers derived from properties of the data, and a set of "superhuman" questions that human experts could not solve.
Domain experts solved 76 of the 99 tasks; on many tasks Claude matches or exceeds panels of five experts, using a diverse range of strategies.
Capabilities improved rapidly across Claude generations, with the models outperforming humans on some bioinformatics tasks that experts find difficult, such as identifying an organism from crystal structures or detecting viruses via RNA-seq.
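
The headline figures above are consistent with one another, as a quick arithmetic check suggests. This is a minimal sketch that uses only the numbers quoted in this summary; the variable names are illustrative, and because the 30% rate is approximate, the final count is an estimate rather than a reported result.

```python
# Figures quoted in the summary above.
total_problems = 99
expert_solved = 76
approx_rate = 0.30  # "about 30%", so the derived count is only an estimate

# Problems that the domain experts could not solve.
expert_stumped = total_problems - expert_solved

# Rough number of expert-stumping problems solved by Claude (an estimate).
approx_claude_solved = round(approx_rate * expert_stumped)

print(f"Expert-stumping problems: {expert_stumped}")
print(f"Estimated number solved by Claude: {approx_claude_solved}")
```

Under these assumptions, "about 30%" corresponds to roughly 7 of the 23 problems; the exact count is not stated in this summary.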

Impact

Anthropic's BioMysteryBench positions Claude as a leader in AI-driven bioinformatics: its latest models solved about 30% of the tasks that stumped expert panels. For comparison, earlier models such as GPT-4o and Claude 3.5 Sonnet scored 17% or less on the open-answer tasks of BixBench, a similar benchmark from early 2025, although scores on two different benchmarks are not directly comparable. The result advances autonomous scientific discovery and could accelerate biological research by handling noisy, open-ended problems that are beyond human experts. It also pressures competitors to strengthen agentic capabilities in specialized domains as demand grows for reliable scientific tools in biotech.