
Manasi Sharma1, Chen Bo Calvin Zhang1, Chaithanya Bandi1, Clinton Wang†, Ankit Aich1, Huy Nghiem2, Tahseen Rabbani3, Ye Htet4, Brian Jang1, Sumana Basu5, Aishwarya Balwani1, Denis Peskoff6, Marcos Ayestaran1, Sean M. Hendryx†, Brad Kenstler1, Bing Liu1
1Scale AI, 2University of Maryland, 3University of Chicago, 4Washington University, St. Louis, 5McGill University, 6University of California, Berkeley
†Work conducted while at Scale AI
Deep Research (DR) is an emerging agent application that leverages large language models (LLMs) to address open-ended queries. It requires the integration of several capabilities, including multi-step reasoning, cross-document synthesis, and the generation of evidence-backed, long-form answers. Evaluating DR remains challenging because responses are lengthy and diverse, admit many valid solutions, and often depend on dynamic information sources. We introduce ResearchRubrics, a standardized benchmark for DR built with over 2,800 hours of human labor that pairs realistic, domain-diverse prompts with more than 2,500 expert-written, fine-grained rubrics to assess factual grounding, reasoning soundness, and clarity. We also propose a new complexity framework for categorizing DR tasks along three axes: conceptual breadth, logical nesting, and exploration. In addition, we develop human- and model-based evaluation protocols that measure rubric adherence for DR agents. We evaluate several state-of-the-art DR systems and find that even leading agents such as Gemini’s DR and OpenAI’s DR achieve under 68% average compliance with our rubrics, primarily due to missed implicit context and inadequate reasoning about retrieved information. Our results highlight the need for robust, scalable assessment of deep research capabilities, to which end we release ResearchRubrics (including all prompts, rubrics, and evaluation code) to facilitate progress toward well-justified research assistants.