
OpenAI’s PaperBench: Advancing Agentic Evaluation

April 23, 2025

To push further into the frontier of AI capabilities and build better systems, we need to be able to measure the skills we want those systems to learn. As systems improve, the need for better benchmarks and evaluations grows with them. At Scale, we demonstrate our commitment to this process through the evaluations on our continually updated SEAL leaderboards and through benchmarks like Humanity’s Last Exam (HLE).

While HLE probes the frontiers of expert reasoning, benchmarks like Enigma evaluate model performance on multistep reasoning tasks, and MultiChallenge assesses models across diverse, interdisciplinary challenges. However, understanding how agents integrate capabilities in end-to-end workflows like scientific research replication requires a different kind of evaluation. OpenAI's recently introduced PaperBench tackles exactly this integration challenge, evaluating an AI agent's ability to replicate existing AI research and testing skills foundational to conducting its own research in the future. Here, we'll explore the design of PaperBench, analyze what its results tell us about the state of agentic AI, and consider the implications for future development and safety.

Measuring Agentic Research Skills

PaperBench measures agentic research skills by evaluating an agent’s ability to replicate recent peer-reviewed publications from a top AI conference using only the paper’s text and no original author code. Unlike benchmarks focused on isolated skills like the ability to code or solve math problems, the core value of PaperBench is assessing how multiple complex capabilities are integrated and orchestrated over a long-horizon workflow. Success in this area requires coordinating technical comprehension, planning, coding complex systems, autonomous execution, debugging, and critical analysis. 

The integration of these skills is foundational if autonomous AI agents are going to conduct original research in complex domains. The test is significant in part because of its difficulty: replication is typically challenging and time-consuming even for expert human researchers. That said, replication difficulty varies; it is often more straightforward for well-specified research (such as the likely well-documented ICML papers chosen here) and harder for less self-contained work. Furthermore, in some cases the time-consuming part may be evaluating the reproduction attempt rather than the replication itself.

Inside PaperBench

The foundation of PaperBench is 20 high-impact papers from the 2024 International Conference on Machine Learning (ICML) that cover a dozen ML topics, including deep reinforcement learning, robustness, and probabilistic methods. Evaluations rely on highly detailed rubrics, manually created and co-developed with the original paper authors, though the PaperBench authors note this process was labor-intensive and poses scalability challenges. These rubrics contain 8,316 gradable outcomes and enable detailed scoring and partial credit by assessing particular requirement types such as code development, execution, and result matching. To make sure a model's submitted code is authentic and reproducible, each submission's included reproduce.sh script is executed in a fresh virtual machine.
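To make the partial-credit idea concrete, here is a minimal Python sketch of how a weighted, hierarchical rubric could be scored. The node names, weights, and structure below are illustrative assumptions and are far simpler than the actual PaperBench rubrics.

from dataclasses import dataclass, field

@dataclass
class RubricNode:
    # A node in a hierarchical grading rubric. Leaf nodes represent one
    # gradable outcome (e.g. "the training script runs to completion");
    # internal nodes aggregate their children with weights.
    name: str
    weight: float = 1.0
    passed: bool = False                     # set by a judge for leaf nodes
    children: list = field(default_factory=list)

    def score(self) -> float:
        # Binary for leaves; weighted average of child scores for internal
        # nodes, which is what allows partial credit.
        if not self.children:
            return 1.0 if self.passed else 0.0
        total = sum(child.weight for child in self.children)
        return sum(child.weight * child.score() for child in self.children) / total

# Hypothetical rubric: two requirement types, one only partially satisfied.
rubric = RubricNode("replicate-paper", children=[
    RubricNode("code-development", weight=2.0, children=[
        RubricNode("implements training loop", passed=True),
        RubricNode("implements evaluation", passed=False),
    ]),
    RubricNode("results-match", weight=1.0, passed=True),
])
print(f"Replication score: {rubric.score():.2f}")  # -> 0.67

Leaves are graded pass/fail; each internal node takes the weighted average of its children, so a submission that satisfies some but not all requirements still earns proportional credit.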

The authors make a point, however, to note the limitations of PaperBench. The current 20-paper dataset is limited to a relatively narrow slice of ML research. As with all benchmarks, there's also the risk that future models might inadvertently train on published solutions, potentially inflating scores over time. Additionally, running the full benchmark requires significant compute and API resources, making regular comprehensive evaluations relatively costly for smaller operations.

LLMs as Judges

Evaluations benefit from using LLM judges primarily because they provide automation and scalability, replacing or supplementing resource-intensive human effort. At Scale, we use LLM judges for a number of our evaluations. For example, LLM judges enable consistent comparison against ground truth for structured problems (as in HLE) and precise extraction of specific claims for focused analysis (as in MASK). For complex, open-ended tasks where direct judgment is unreliable (as in MultiChallenge), LLMs can still facilitate reliable evaluation when used within carefully designed frameworks, such as answering simple, human-defined rubric questions.
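In practice, that pattern often amounts to asking the judge one narrow, human-written question at a time. The sketch below shows the general shape of such a setup; call_llm is a hypothetical text-in, text-out wrapper, not an API from any of these benchmarks.

def build_judge_prompt(requirement: str, evidence: str) -> str:
    # Constrain the judge to a single human-written rubric question with a
    # binary verdict, rather than asking for an open-ended quality rating.
    return (
        "You are grading one requirement of a research replication attempt.\n"
        f"Requirement: {requirement}\n"
        f"Relevant files and logs:\n{evidence}\n"
        "Answer with exactly one word, YES or NO: is the requirement satisfied?"
    )

def judge_requirement(requirement: str, evidence: str, call_llm) -> bool:
    # call_llm is a hypothetical text-in, text-out wrapper around whatever
    # judge model is in use.
    answer = call_llm(build_judge_prompt(requirement, evidence))
    return answer.strip().upper().startswith("YES")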

To assess automated judges for PaperBench, the authors introduced JudgeEval, an auxiliary benchmark comparing LLM judge outputs to human expert grades ("gold labels"). Their analysis found that the best-performing setup tested, o3-mini-high with custom scaffolding, achieved an F1 score of 0.83 against human grades. This suggests the setup, though imperfect, is a serviceable stand-in for human judges in this context.
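For reference, F1 here is the harmonic mean of the judge's precision and recall on pass/fail decisions relative to the expert gold labels. A toy calculation with made-up labels looks like this:

# Hypothetical leaf-level decisions: 1 = "requirement satisfied".
gold  = [1, 0, 1, 1, 0, 1, 0, 0]   # expert grader
judge = [1, 0, 1, 0, 0, 1, 1, 0]   # automated LLM judge

tp = sum(1 for g, j in zip(gold, judge) if g == 1 and j == 1)
fp = sum(1 for g, j in zip(gold, judge) if g == 0 and j == 1)
fn = sum(1 for g, j in zip(gold, judge) if g == 1 and j == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.2f}")  # -> 0.75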

PaperBench Findings

As a human baseline, top machine learning PhDs scored 41.4% (best of three attempts) after 48 hours of work on a three-paper subset. The AI agents, in contrast, struggled with long-horizon tasks: they could formulate and write multi-step plans, but they were unable to execute those plans to completion.

Despite these limitations, the models exhibited “non-trivial capabilities.” The top performer was Anthropic’s Claude 3.5 Sonnet (with open-source scaffolding), with an average replication score of 21% (Claude 3.7 Sonnet was not evaluated given Anthropic’s API rate limits). The runner-up was OpenAI’s o1 with a score of 13.2%. All other models tested (GPT-4o, o3-mini, DeepSeek-R1, and Gemini 2.0 Flash) scored under 10%.

On PaperBench Code-Dev, the lighter-weight variant that skips the execution step and assesses only code development, the agents scored much better, with o1 coming in at the front of the pack at 43.4%.

Agentic Progress & Safety

The implications of PaperBench extend far beyond scores and ranking. By pinpointing specific trouble spots, it helps focus research and engineering efforts on improving agent architecture, training techniques, and robustness for reliably executing complex, integrated workflows. Additionally, evaluations like PaperBench are necessary to help understand and navigate the safety landscape of increasingly autonomous AI. The ability to replicate scientific research is a significant milestone towards agents capable of conducting independent research, a capability that could dramatically accelerate AI progress, potentially towards AGI. 

Because the rapid development of such highly autonomous systems presents profound safety challenges, particularly around control and alignment, rigorously measuring progress on foundational capabilities like those demonstrated in research replication becomes essential. Tracking performance via benchmarks like PaperBench helps the AI community anticipate future capabilities, inform risk assessments, develop necessary safety protocols proactively, and guide governance decisions regarding powerful AI systems. This makes such benchmarks integral to the safety and scaling frameworks employed by leading AI labs.

Moving Targets 

Fully functional agentic AI can only be realized if we can reliably measure where we are and pinpoint the work that still needs to be done. PaperBench is a significant step forward for the field of AI evaluation and, like HLE, will have a long shelf life for developers. However, achieving a holistic view of agentic development will require combining insights from a variety of evaluations that cover complementary areas like expert reasoning (HLE) and agentic tool proficiency (ToolComp). Though a diverse portfolio of evaluations is ideal, PaperBench fills a significant gap by assessing how agents integrate complex, overlapping proficiencies, which will be invaluable for builders.

Learn More

PaperBench: Evaluating AI’s Ability to Replicate AI Research

SEAL Leaderboards

Humanity’s Last Exam

 

