
Professional Reasoning Benchmark - Finance

Overview

Professional Reasoning Benchmark (PRBench) is the first benchmark to evaluate LLMs on high-stakes professional reasoning in Finance and Law. While academic benchmarks like MMLU and GPQA have become standard metrics for tracking AI progress, they fail to test for real-world professional utility where domain knowledge, nuanced judgment, and contextual understanding are critical.

What sets PRBench apart is its foundation in reality: the benchmark comprises real questions from 182 domain experts, many of which carry direct economic consequences rather than being abstract academic exercises. These questions represent the kinds of decisions that affect financial outcomes, legal liability, and business operations every day in professional settings.

Read the paper here: https://scale.com/research/prbench

Check out some of the data here: https://prbench-explorer.vercel.app/

Dataset Design

PRBench was co-designed with 182 professionals who bring deep domain expertise: JDs for Law; CFAs, Master's degrees, or 6+ years of professional experience for Finance. Each task features an expert-curated rubric containing 10-30 weighted criteria, each scored from -10 to +10, designed to penalize harmful or incorrect advice while rewarding high-quality, safe responses.
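As a rough sketch only (the released data format may differ, and the field names below are hypothetical), a PRBench task can be pictured as a prompt plus a list of weighted rubric criteria:

```python
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    """One expert-written criterion; field names are hypothetical, for illustration only."""
    description: str   # e.g. "Identifies the applicable jurisdiction before giving advice"
    weight: int        # -10 to +10; negative weights penalize harmful or incorrect advice

@dataclass
class PRBenchTask:
    prompt: str                      # single prompt, or the final turn of a multi-turn conversation
    domain: str                      # "Finance" or "Legal"
    rubric: list[RubricCriterion]    # typically 10-30 criteria per task
```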

The benchmark emphasizes realistic context, with approximately 30% of the 1,100 tasks structured as multi-turn conversations that build context progressively, mimicking real-world professional interactions. The global scope is unprecedented, with tasks covering 114 countries and 47 distinct US jurisdictions.

The benchmark offers two evaluation sets:

  • Full Dataset: All 1,100 questions (600 Finance, 500 Legal)

  • Hard Subset: The most challenging 550 questions (300 Finance, 250 Legal) specifically designed to test frontier models

Methodology and Core Metrics

PRBench uses an LLM judge (o4 Mini) to evaluate model responses against each rubric criterion, producing a final weighted score that is clipped to the [0, 1] range for standardization.
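The exact aggregation formula is specified in the paper; the sketch below shows one plausible reading, assuming the judge returns a per-criterion met/not-met verdict and the weighted sum is normalized by the maximum attainable positive weight before clipping. Function and variable names are illustrative, not taken from the released code:

```python
def score_response(weights: list[int], met: list[bool]) -> float:
    """Hypothetical aggregation: weighted sum of judged criteria, normalized and clipped to [0, 1].

    weights: rubric weights in [-10, 10]; negative weights penalize bad advice.
    met:     judge's verdict for each criterion (True = criterion satisfied/triggered).
    """
    max_positive = sum(w for w in weights if w > 0)      # best attainable raw score
    raw = sum(w for w, hit in zip(weights, met) if hit)  # triggered negative weights pull the score down
    if max_positive == 0:
        return 0.0
    return min(1.0, max(0.0, raw / max_positive))        # clip to the [0, 1] range


# Example: three satisfied positive criteria and one triggered penalty -> 10 / 20 = 0.5
print(score_response(weights=[10, 5, 5, -10], met=[True, True, True, True]))
```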

The benchmark's reliability has been validated through two rigorous processes:

  • Rubric Validation: An independent set of experts validated the rubric criteria, achieving 93.9% agreement on their clarity and validity

  • Judge Validation (IRA): The o4 Mini judge demonstrates 80.2% agreement with human experts, on par with the 79.6% agreement between two human experts

Performance is analyzed across 11 rubric categories, including Legal/Financial Accuracy, Process Transparency & Auditability, and Risk & Ethical Disclosure, providing granular insights into model capabilities.

Data Summary


                              Finance                        Legal
  Total Samples               600                            500
  Hard Subset                 300                            250
  User Expertise              74% Expert / 26% Non-Expert    53% Expert / 47% Non-Expert
  Median Criteria per Task    16                             17
  Total Criteria              10,264                         9,092

Key Results and Insights

Current Best Performance (Hard Subset): Even top models fail to exceed a score of 0.40, revealing substantial room for improvement in professional reasoning:

  • Finance (Hard): 1. GPT-5 Pro (0.39), 2. GPT-5 (0.39), 3. o3 (High) (0.35)

  • Legal (Hard): 1. GPT-5 Pro (0.37), 2. GPT-5 (0.36), 3. o3 (High) (0.35)

Systematic Weaknesses: Across all models tested, consistent struggles emerge with Process Transparency & Auditability, Handling Uncertainty, and domain-specific diligence. These gaps are particularly concerning for professional applications where explainability and thoroughness are essential.

Key Finding: Models frequently reach correct conclusions through incomplete or opaque reasoning processes, significantly reducing their practical reliability in professional settings where the reasoning path matters as much as the answer.

Efficiency vs. Performance: Conciseness emerges as a key differentiator among similarly performing models. For instance, in Finance tasks, Kimi K2 Thinking achieves comparable performance to Claude 4.5 Sonnet but with substantially shorter, more efficient responses.

Tools Impact: Enabling web search improves performance for some models, but can actually hurt others that tend to over-rely on external sources rather than applying professional judgment.

Last updated: November 13, 2025

Performance Comparison

  Rank (UB)   Model                             Score
  1           gpt-5                             51.32±0.17
  1           gpt-5-pro                         51.06±0.59
  3           o3-pro                            49.08±0.79
  4           o3                                47.69±0.45
  5           gpt-oss-120b                      43.80±1.96
  5           claude-sonnet-4-5-20250929        43.79±0.40
  5           kimi-k2-thinking                  43.41±0.23
  8           mistral-medium-latest             39.35±0.75
  8           o4-mini                           39.22±0.51
  8           qwen.qwen3-235b-a22b-2507-v1:0    39.14±0.38
  8           gemini-2.5-pro                    38.92±0.48
  8           gemini-2.5-flash                  38.41±0.62
  8           kimi-k2-instruct                  38.34±0.42
  14          claude-opus-4-1-20250805          35.15±1.27
  14          deepseek-v3p1                     35.09±0.15
  15          gpt-4.1                           34.32±0.26
  17          deepseek-r1-0528                  32.67±0.64
  18          gpt-4.1-mini                      30.45±0.42
  19          llama4-maverick-instruct-basic    22.36±0.42

Rank (UB): 1 + the number of models whose lower CI bound exceeds this model’s upper CI bound.
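A short sketch of how this ranking rule can be computed from the reported means and confidence-interval half-widths (the function name is illustrative; the values come from the table above):

```python
def rank_ub(scores: dict[str, tuple[float, float]]) -> dict[str, int]:
    """Rank (UB): 1 + the number of models whose lower CI bound exceeds this model's upper CI bound.

    scores maps model name -> (mean, half_width), e.g. 51.32 ± 0.17 -> (51.32, 0.17).
    """
    ranks = {}
    for name, (mean, ci) in scores.items():
        upper = mean + ci
        better = sum(1 for m, c in scores.values() if m - c > upper)
        ranks[name] = 1 + better
    return ranks


# Values taken from the top of the table above
print(rank_ub({"gpt-5": (51.32, 0.17), "gpt-5-pro": (51.06, 0.59), "o3-pro": (49.08, 0.79)}))
# {'gpt-5': 1, 'gpt-5-pro': 1, 'o3-pro': 3}
```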
