
Afra Feyza Akyürek, Advait Gosai, Chen Bo Calvin Zhang, Vipul Gupta, Jaehwan Jeong,
Anisha Gunjal, Tahseen Rabbani, Maria Mazzone, David Randolph, Mohammad
Mahmoudi Meymand, Gurshaan Chattha, Paula Rodriguez, Diego Mares, Pavit Singh,
Michael Liu, Subodh Chawla, Pete Cline, Lucy Ogaz, Ernesto Hernandez, Zihao Wang, Pavi
Bhatter, Marcos Ayestaran, Bing Liu, and Yunzhong He
See PRBench for Finance here: https://scale.com/leaderboard/prbench-finance
See PRBench for Legal here: https://scale.com/leaderboard/prbench-legal
Explore the data here: https://prbench-explorer.vercel.app/
Frontier model progress is often measured by academic benchmarks, which offer a limited view of performance in real-world professional contexts. This gap is significant, as high-stakes domains like Legal and Finance are common professional use cases yet remain underexplored. Existing evaluations often fail to assess open-ended, economically consequential tasks where practical returns are paramount.
To address this, we introduce Professional Reasoning Bench (PRBench), a realistic, open-ended, and difficult benchmark of real-world problems in Finance and Law. We open-source its 1,100 expert-authored tasks and 19,356 expert-curated criteria, making it, to our knowledge, the largest public rubric-based benchmark covering both the legal and finance domains. We recruited 182 qualified professionals, each holding a JD, a CFA, or 6+ years of relevant experience, who contributed tasks grounded in their actual client work. This process yields significant diversity, with tasks spanning 114 countries and 47 US jurisdictions across the Finance and Legal domains.
Our expert-curated rubrics were validated through a rigorous quality pipeline, including inter-rater agreement analysis and independent expert validation. Subsequent evaluation of 20 leading models reveals substantial room for improvement, with top scores of only 0.39 (Finance) and 0.37 (Legal) on our Hard subsets. We further analyze model performance using the rubric categories provided by our annotators, revealing that even models with similar overall scores can exhibit large performance disparities on specific capability clusters. Combined with hierarchical clustering on rubrics and ablations, our analysis also surfaces common failure modes, including inaccurate judgments, a lack of process transparency, and incomplete reasoning. These findings highlight critical gaps in reliability for professional adoption.