
As frontier models continue to climb academic-style benchmarks, it is increasingly clear that their scores are misaligned with their ability to generate day-to-day value for working professionals. Our new benchmark series, PRBench (Professional Reasoning Bench), addresses this gap with real-world, reasoning-heavy tasks. The first two benchmarks in the series focus on Finance and Law.
PRBench joins a crucial push to evaluate models on real-world performance, moving beyond purely academic tests. This effort includes benchmarks for practical skills like freelance projects (Remote Labor Index), software engineering (SWE-Bench Pro), and tool use (MCP Atlas). While benchmarks like Humanity’s Last Exam test the frontier of in-depth knowledge, PRBench acts as its real-world counterpart, assessing how that reasoning holds up in complex, high-stakes professional scenarios.
These benchmarks are realistic and difficult, authored by a team of 182 domain experts, including professionals with JDs for law and, for finance, CFAs or practitioners with 6+ years of experience. The questions are based on the experts' actual experiences using chat-based assistants and the types of inquiries they commonly receive from clients. The dataset includes (see the sketch after this list):
1,100 realistic, open-ended tasks.
Finance: 600 samples.
Law: 500 samples.
"Hard" subsets comprising the 300 most challenging finance questions and the 250 most challenging law questions.
Tasks from a mix of user types:
Finance: 74% Expert-level questions, 26% Non-Expert.
Law: 53% Expert-level questions, 47% Non-Expert.
Scope: The benchmarks are global and comprehensive.
They cover 13 finance topics and 12 legal topics.
They span 114 countries and dependencies worldwide, as well as 47 US jurisdictions.
Multi-turn conversations: About 30% of the conversations in the dataset are multi-turn (up to 10 turns).
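As a concrete way to read these splits, the minimal sketch below shows how one might slice a flat export of the dataset; the file name and column names (domain, is_hard, user_type, num_turns) are illustrative assumptions, not PRBench's actual schema.

```python
import pandas as pd

# Hypothetical flat export of PRBench; the path and columns are assumptions.
df = pd.read_json("prbench_tasks.jsonl", lines=True)

# Per-domain counts (expected: Finance 600, Law 500).
print(df["domain"].value_counts())

# "Hard" subsets (expected: 300 Finance, 250 Law).
print(df[df["is_hard"]].groupby("domain").size())

# Expert vs. non-expert share within each domain.
print(df.groupby("domain")["user_type"].value_counts(normalize=True))

# Fraction of multi-turn conversations (roughly 30%, up to 10 turns).
print((df["num_turns"] > 1).mean())
```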
To fully capture the professional stakes involved, we went beyond just topic labels. We also analyzed and classified each prompt along two new axes (see the sketch after this list):
Decision Type: This identifies the kind of professional decision the user is trying to make, such as "Planning & Forecasts," "Compliance & Reporting," or "Capital & Funding."
Economic Pathway: This pinpoints the tangible economic consequence the decision affects, like "Risk & Reputation," "Value Creation," or "Claims & Litigation."
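To make these axes concrete, here is a minimal sketch of how such labels could be attached to a prompt record. Only the example values named above are listed, and the record layout and sample prompt are hypothetical rather than PRBench's actual schema.

```python
from dataclasses import dataclass
from enum import Enum

# Only the example labels mentioned in the text are listed;
# the full taxonomies in PRBench are larger.
class DecisionType(Enum):
    PLANNING_AND_FORECASTS = "Planning & Forecasts"
    COMPLIANCE_AND_REPORTING = "Compliance & Reporting"
    CAPITAL_AND_FUNDING = "Capital & Funding"

class EconomicPathway(Enum):
    RISK_AND_REPUTATION = "Risk & Reputation"
    VALUE_CREATION = "Value Creation"
    CLAIMS_AND_LITIGATION = "Claims & Litigation"

@dataclass
class LabeledPrompt:
    """Hypothetical record tying a prompt to the two classification axes."""
    prompt: str
    domain: str                      # "Finance" or "Law"
    decision_type: DecisionType
    economic_pathway: EconomicPathway

# Illustrative example, not an actual dataset item.
example = LabeledPrompt(
    prompt="Should we restate last quarter's filing before the audit?",
    domain="Finance",
    decision_type=DecisionType.COMPLIANCE_AND_REPORTING,
    economic_pathway=EconomicPathway.RISK_AND_REPUTATION,
)
```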
This analysis confirms that the vast majority of questions in our dataset are not just informational or academic queries. Instead, they represent high-stakes, real-world decision scenarios that carry tangible downstream economic impacts. This further reinforces why high scores on academic benchmarks do not translate into professional utility: in these settings, the cost of an inaccurate, poorly reasoned, or opaque answer is simply too high.
Our findings are a clear reality check: while models are proficient at following instructions, they are not yet experts and still struggle with complex professional reasoning.
On the full dataset, the charts show a clear performance hierarchy, with top models like GPT-5 Pro achieving scores of 0.51 in Finance and 0.50 in Legal. A distinct gap separates the top tier (GPT-5 Pro, GPT-5, o3) from a large, competitive "mid-pack" that includes models like Claude 4.5 Sonnet, Kimi K2 Thinking, and Gemini 2.5 Pro. The model rankings are consistent across both Finance and Legal.

On the "Hard" subsets, every model's performance drops significantly, with the top score falling to 0.39 in Finance-Hard and 0.37 in Legal-Hard. While many models can handle standard professional queries, they stumble when faced with true complexity. The gap between the top tier and the rest also widens here; in Legal-Hard, for example, the rest of the field compresses, with several proprietary and open-source models clustering tightly around 0.25–0.28.

While all models are exceptionally strong at "Instruction Following," they universally struggle with more nuanced reasoning. Categories like "Supplemental Insight" and "Handling Uncertainty" are clear weak points for all five models across both Finance and Legal. GPT-5 (High) is a consistent top performer, while Gemini 2.5 Pro and Claude 4.5 Sonnet are also highly competitive.

Across both domains, we found that models frequently struggle with the core, high-stakes reasoning that defines these professions.
Even when models reached a correct conclusion, they often did so through incomplete or opaque reasoning processes. This lack of process transparency and auditability makes their answers unusable in a professional setting where clear logic is absolutely essential.
Models tend to perform better on instruction following and practical utility dimensions, but continue to struggle with domain-specific diligence. In other words, they can follow directions but lack the deep, nuanced judgment that is the hallmark of an expert.
Giving models access to tools like web search or a code interpreter had mixed results. As the paper notes, the questions were designed to be solvable through reasoning alone. We found that while web search was generally useful for some models, it hurt performance for others and did not ultimately raise the top-tier scores. When tools failed, models often got lost, stitching together irrelevant external content or engaging in unnecessary analyses, showing an over-reliance on external sources at the expense of the core reasoning task. This echoes results from VisualToolBench.
PRBench demonstrates that frontier models are not ready for prime time in law and finance. While models are strong at following instructions, they still lack the critical domain knowledge, judgment, and contextual nuance that professional work demands.
For developers, these benchmarks provide a roadmap for closing that gap, moving beyond instruction following toward models with genuine, auditable expert reasoning that professionals can trust and verify. For enterprise leaders, the takeaway is clear: general-purpose models are not yet appropriate for critical, economically consequential decisions.
This work is part of a wider effort to create benchmarks that accurately reflect real-world use cases, moving beyond academic scores to bridge the gap between AI's potential and its reliable, trustworthy application for professionals. To that end, we have also created an interactive visualizer app that makes it easy to explore the data and filter it by multiple criteria.