SWE-Bench Pro: Raising the Bar for Agentic Coding

AI agents for software engineering are rapidly advancing, but are benchmarks keeping up? With frontier models scoring so highly on SWE-Bench Verified, we wanted to raise the bar and develop a more realistic, contamination-resistant, human-augmented benchmark. SWE-Bench Pro picks up where SWE-Bench Verified leaves off, with more diverse tasks, increased difficulty, and code that models have not yet seen. On SWE-Bench Pro, the same four frontier models lead the pack, but at considerably lower scores.
A Benchmark to Meet Today’s Coding Needs
SWE-Bench Pro is designed to measure how well coding agents handle the software engineering work teams face today. It contains 1,865 total instances (731 public, 858 held-out, and 276 commercial) across 41 repositories (11 public, 12 held-out, and 18 from enterprise startups).
SWE-Bench Pro solves several key challenges in evaluating AI coding agents:
- Data Contamination
  - The Problem: Many benchmarks use code that models have likely seen during training. This makes it difficult to know whether a model is genuinely solving a problem or simply recalling a memorized solution.
  - The Solution: We use code that models haven't been trained on. It comes from two sources: public codebases governed by strong copyleft licenses (e.g., GPL), whose "viral" nature and legal complexities make them highly likely to be excluded from training data, and completely private, commercial codebases from Scale's internal assets.
- Limited Task Diversity
  - The Problem: Current benchmarks fail to capture the full spectrum of real-world software engineering challenges, often focusing on simple utility libraries.
  - The Solution: We source tasks from a diverse portfolio of complex repositories, including consumer-facing applications, B2B services, and developer tools. Each repository contributes 50-100 tasks to ensure models must genuinely understand the code, not just overfit to a single project's style.
- Oversimplified Problems & Unrealistic Difficulty
  - The Problem: Previous benchmarks tend to filter out ambiguous or underspecified issues, which doesn't reflect a real developer's workflow.
  - The Solution: We preserve these challenging tasks. Because a developer's original commit messages are often unstructured or incomplete, we use a human-augmented process to enhance them. Human experts produce a clear problem statement and a list of requirements that specify the expected behavior but not how to implement the solution, preserving the core technical challenge. These tasks require substantial changes, averaging 107.4 lines of code across 4.1 files.
- Unreliable and Irreproducible Testing
  - The Problem: Without a consistent setup, it's hard to know whether a solution works or the environment is just configured incorrectly.
  - The Solution: Problems are harvested via "commit scraping," which captures a broader set of valid issues than scraping pull requests alone. Every task is then evaluated in a reproducible, containerized environment with all dependencies included, from isolated Python virtual environments to module-aware Go environments. Our evaluation automatically verifies that the agent's patch fixes the intended issue ("fail-to-pass") and doesn't break existing functionality ("pass-to-pass"); a minimal sketch of this check follows the list.
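To make the fail-to-pass / pass-to-pass contract concrete, here is a minimal Python sketch of how such a check could be expressed. The `TaskInstance` fields and the `run_tests` callable are illustrative assumptions for this post, not SWE-Bench Pro's actual schema or harness.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical shape of a single benchmark task; the field names are
# illustrative, not SWE-Bench Pro's published schema.
@dataclass
class TaskInstance:
    repo: str                 # e.g. "org/project"
    base_commit: str          # commit the agent starts from
    problem_statement: str    # human-augmented issue description
    requirements: List[str]   # expected behavior, not implementation details
    fail_to_pass: List[str]   # tests that fail before the fix and must pass after
    pass_to_pass: List[str]   # tests that already pass and must keep passing

def evaluate_patch(
    task: TaskInstance,
    run_tests: Callable[[List[str]], Dict[str, bool]],
) -> bool:
    """Return True only if the agent's patch resolves the task.

    `run_tests` is assumed to execute the named tests inside the task's
    containerized environment (with the agent's patch applied) and return
    a mapping of test id -> passed.
    """
    results = run_tests(task.fail_to_pass + task.pass_to_pass)

    # Fail-to-pass: the previously failing tests now pass,
    # i.e. the patch actually fixes the intended issue.
    fixes_issue = all(results.get(t, False) for t in task.fail_to_pass)

    # Pass-to-pass: the previously passing tests still pass,
    # i.e. the patch does not break existing functionality.
    preserves_behavior = all(results.get(t, False) for t in task.pass_to_pass)

    return fixes_issue and preserves_behavior
```

The key design point is that both conditions must hold at once: a patch that makes the target tests pass but breaks an unrelated existing test is still counted as unresolved.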
Results: SWE-Bench Verified vs. SWE-Bench Pro
We ran frontier models on SWE-Bench Pro using the SWE-agent scaffold, and here's what we found (all charts reflect the public dataset):
Massive Performance Drop on SWE-Bench Pro: Every model's performance drops sharply when moving from SWE-Bench Verified to the more challenging SWE-Bench Pro. While most top models score over 70% on SWE-Bench Verified, the best-performing models, OpenAI GPT-5 and Claude Opus 4.1, score only 23.3% and 23.1%, respectively, on SWE-Bench Pro. This highlights the increased difficulty and realism of the new benchmark.
The Private Commercial Subset is Harder: On the private commercial subset of the SWE-Bench Pro leaderboard, performance drops further relative to the public set. Claude Opus 4.1 decreases from 22.7% to 17.8% resolution, and OpenAI GPT-5 falls from 23.1% to 14.9%. This shows that evaluation on private, previously unseen codebases provides a stricter, more realistic measure of generalization.
Significant Performance Gaps Between Models: There is a wide performance disparity among the tested AI models. Frontier models substantially outperform older models like OpenAI GPT-4o (4.9%) and DeepSeek Qwen-3 32B (3.4%). This suggests that the advanced capabilities of the latest models are critical for tackling these complex, real-world software engineering tasks.
Performance Varies by Programming Language: Models show different success rates depending on the programming language. Go and Python tasks generally have higher resolution rates, with some models exceeding 30%. In contrast, performance on JavaScript (JS) and TypeScript (TS) is more varied and often lower, with rates ranging from almost 0% to over 30% depending on the specific model.
Repository-Specific Difficulty: Model performance is heavily influenced by the specific repository the task comes from. Some repositories proved consistently difficult for all models, with resolve rates below 10%. On other repositories, certain models could achieve success rates higher than 50%. This indicates that factors like codebase complexity, problem type, or documentation quality significantly impact an agent's ability to succeed.
Top Models are More Consistent: The highest-performing models, Claude Opus 4.1 and OpenAI GPT-5, not only achieve the highest scores but also demonstrate more stable performance across the different languages and repositories. Smaller models tend to have more "erratic" performance, succeeding moderately on some repositories while failing almost completely on others. This suggests that top models have more robust and generalizable problem-solving skills, a quality that average scores alone don't fully capture.
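To make the reported metrics concrete, here is a small sketch of how a headline resolution rate and the per-language and per-repository breakdowns above could be computed from raw per-task outcomes. The `results.csv` file and its column names are hypothetical, used only for illustration; they are not the benchmark's published artifacts.

```python
import pandas as pd

# Hypothetical per-task results file with one row per (model, task) attempt.
# Assumed columns: model, repo, language, resolved (bool).
df = pd.read_csv("results.csv")

# Overall resolution rate per model (the headline leaderboard number), in %.
overall = df.groupby("model")["resolved"].mean().mul(100).round(1)

# Resolution rate broken down by programming language and by repository,
# which is how the per-language and per-repository comparisons are read.
by_language = (
    df.groupby(["model", "language"])["resolved"].mean().mul(100).round(1).unstack()
)
by_repo = (
    df.groupby(["model", "repo"])["resolved"].mean().mul(100).round(1).unstack()
)

print(overall)
print(by_language)
```

The spread of values within a single row of `by_language` or `by_repo` is what the consistency observation above refers to: more stable models show a narrower spread across columns.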
What SWE-Bench Pro Results Mean
For Developers & Engineering Leaders: Use these results to plan deployments strategically; since an agent's success varies significantly by programming language and repository complexity, you can target the specific teams and codebases where the technology will be most effective. Given that even top agents still struggle with the majority of non-trivial tasks, ensure human oversight and review remain a critical part of your workflow. The only true measure of an agent's utility is its performance on your own internal repositories.
For AI Researchers: SWE-Bench Pro establishes a new, more difficult baseline that measures true generalization over memorization. The performance drop on the commercial codebase subset is a critical finding, demonstrating that current models are less capable of solving novel problems in commercial codebases. Future research must prioritize the key failure modes identified here: navigating large, unfamiliar codebases, executing high-precision edits across multiple files, and overcoming the specific complexities of ecosystems like JavaScript and TypeScript. Progress on old benchmarks is no longer a sufficient measure of advancement.
Learn More
To provide the clearest possible picture of model performance, we will be maintaining two separate leaderboards:
- A public leaderboard showing performance on tasks from the public, copyleft repositories.
- A commercial leaderboard reporting results exclusively on tasks from our private, commercial codebases, serving as a measure of true generalization.
Read the full research paper.
Access the dataset and environments.
View the live leaderboards.