Can AI Agents Do the Work of Drug Discovery?

Bringing a single drug to patients can take over a decade and billions of dollars. It begins with drug discovery, and much of that early work is computational. AI agents should be good at this kind of work, but how do today's top agents actually measure up? And how might their performance be improved? Scale Labs and Phylo built DrugDiscoveryBench to find out.

DrugDiscoveryBench measures and ranks how well the top AI agents look up molecular structures, calculate chemical properties, and mine patents and genetic databases, all tasks essential to early-stage drug discovery. It gives a grounded view of where agents succeed and where they fail, and identifies two ways to improve them.

For technical readers: full methodology, scoring rules, and per-setting results live on the leaderboard page.

How Today’s Top Agents Score

Even at their best, the strongest agents solved about half of the 82 problems written by drug discovery experts. The leaders, all from OpenAI, Anthropic, and Google, clustered within a few points of one another, while the next-best agents from other companies trailed far behind. A handful of problems stumped every agent we tried. The scores are not fixed, though: with the right changes, agents solved problems they had failed before, and even some weaker performers gained ground.

Where Agents Succeed

Agents do best on short, well-defined work, where a single question has a few clear steps and the answer sits one search or calculation away. That covers a meaningful share of early discovery, like pulling a fact off a known record, running a standard calculation, or comparing a small set of items against each other. The leading agents are reliable about finishing what they're given, and the same agent given more time to think can improve by close to 20 points without any other change.

Where Agents Fall Short

The work that breaks agents is the long, multi-step kind: tasks with seven or eight dependent steps, where the final answer depends on getting each step right and carrying the original question through to the finish. What happens, again and again, is that agents lose the thread of what was asked. Sometimes a qualifier from the original question quietly drops off in the middle. Sometimes the agent takes a shortcut that throws away evidence it was supposed to use. Sometimes it works on the wrong subject entirely and never notices. In each case the agent finishes the task, submits a confident answer, and is wrong.

What Helps Them Improve

Two changes boost agent performance. The first is giving the agent a plan. The second is changing the system the agent runs in.

When agents were given a step-by-step plan from an expert, tasks that had scored zero unaided reached perfect or near-perfect scores. The agent still did all the work: it ran every query, parsed every record, computed every number. The plan told it only which steps to take, in what order, and with which tools, without revealing the answer. That suggests a company’s own scientific expertise and institutional know-how remain the real competitive advantage. The model is important, but how you guide it matters just as much.

The second change involves the agent itself. An agent is made of the same model most users use day to day, plus an added layer called a harness. The harness controls how long the agent keeps working, how it handles tools, and what it does when something goes wrong. Every agent has one, but on drug discovery tasks, not all harnesses perform equally. Some agents gained as much as 17 points simply by switching harnesses. On our leaderboard, each model name appears with its setup in parentheses.
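
For readers who think in code, the three jobs of a harness described above can be pictured with a minimal sketch. Everything here is hypothetical and chosen for illustration: the Harness class, the toy_model function, and the "lookup" tool are not taken from DrugDiscoveryBench or from any vendor's agent, and real harnesses are far more elaborate.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical convention: a "model" is any function that reads the transcript
# so far and returns an action, either ("tool", tool_name, argument) or
# ("answer", answer_text, "").
Action = Tuple[str, str, str]
Model = Callable[[List[str]], Action]


@dataclass
class Harness:
    model: Model
    tools: Dict[str, Callable[[str], str]]
    max_steps: int = 8  # how long the agent keeps working
    transcript: List[str] = field(default_factory=list)

    def run(self, question: str) -> str:
        self.transcript = [f"question: {question}"]
        for _ in range(self.max_steps):
            kind, name, arg = self.model(self.transcript)
            if kind == "answer":
                return name
            try:
                # How it handles tools: look the tool up by name and call it.
                result = self.tools[name](arg)
                self.transcript.append(f"{name}({arg}) -> {result}")
            except Exception as exc:
                # What it does when something goes wrong: record the failure
                # so the model can see it on the next step, then keep going.
                self.transcript.append(f"{name}({arg}) failed: {exc!r}")
        return "no answer within step budget"


def toy_model(transcript: List[str]) -> Action:
    # Scripted stand-in for a real model: call one tool, then answer with
    # whatever the last transcript entry says.
    if len(transcript) == 1:
        return ("tool", "lookup", "aspirin")
    return ("answer", transcript[-1], "")


harness = Harness(model=toy_model, tools={"lookup": lambda name: f"record for {name}"})
print(harness.run("Find the record for aspirin"))
```

In this sketch the model function never changes. Swapping the step budget, the tool table, or the error handling changes what the same model can finish, which is the sense in which a harness switch alone can move a score.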

What DrugDiscoveryBench Reveals

Agents are competent on the science at a high level, but they lose tasks to two kinds of failure: lapses in scientific common sense, and workflows that do not hold together across many steps. What would close that gap is mostly available today, and mostly sits outside the model. That changes what progress in this work looks like.

Much of early discovery is slow, manual work: searching databases, sifting through results, and narrowing down candidates.

If AI can shoulder even part of that, it gets researchers from raw data to promising leads far faster. What stays with scientists is the harder part: deciding what to pursue, what to make of the results, and what to do next.