
Udari Madhushani Sehwag1, Elaine Lau1†, Haniyeh Ehsani Oskouie2,5, Shayan Shabihi3, Erich Liang4,5, Andrea Toledo1, Guillermo Mangialardi1, Sergio Fonrouge1, Ed-Yeremai Hernández Cardona1, Paula Vergara1, Utkarsh Tyagi1, Chen Bo Calvin Zhang1, Pavi Bhatter1, Nicholas Johnson1, Furong Huang3, Ernesto Gabriel Hernández Montoya1, and Bing Liu1
1Scale AI, 2University of California, Los Angeles, 3University of Maryland, 4Princeton University, 5Human Frontier Collective, Scale AI
†Work done while at Scale AI
Accelerating scientific progress depends on identifying the most promising research directions and allocating resources toward them efficiently. In experimental sciences, this often means predicting which experiments will yield meaningful results before committing to costly physical validation. Although existing benchmarks evaluate AI systems on knowledge recall, simulated environments, or theoretical reasoning, assessing their ability to predict the outcomes of practical experiments remains underexplored. We introduce SciPredict, a benchmark evaluating whether current AI systems can be relied upon to predict experimental outcomes in three key domains: physics, biology, and chemistry. The benchmark comprises 405 questions derived from recently published empirical studies (post-March 2025), spanning 33 subdomains and requiring models to reason about real experimental systems. Unlike most benchmarks, which assess whether AI has reached human-level performance, experimental outcome prediction is a domain where AI systems could substantially exceed human capabilities by integrating vast cross-domain knowledge, processing complex parameter interactions, and identifying non-obvious patterns that individual researchers cannot readily perceive. This raises two critical questions: can models predict experimental outcomes with sufficient accuracy, and can we identify which predictions are trustworthy? Our analysis reveals fundamental limitations on both fronts. First, our evaluations of frontier models show that model accuracy ranges between 14% and 26%, while the accuracy of human domain experts is ≈ 20%. Although some frontier models exceed human performance, model accuracy is still far below what would enable reliable experimental guidance. Second, even within this limited performance, models cannot distinguish reliable predictions from unreliable ones. Models achieve only ≈ 20% accuracy even when they self-report very high confidence in their answers and rate a question as highly feasible (i.e., they judge the outcome to be predictable without running the physical experiment). In contrast, human experts demonstrate strong calibration: their accuracy increases as they become more confident in their answers, and it rises from ≈ 5% on questions they judge infeasible to ≈ 80% on questions they consider feasible to answer without experimentation. Our findings demonstrate that while frontier models are comparable to human experts in raw predictive accuracy, they fundamentally lack the calibration awareness required for reliable deployment in experimental planning. SciPredict establishes a rigorous evaluation framework for experimental outcome prediction and demonstrates that achieving superhuman performance in experimental science requires not just better predictions, but better awareness of prediction reliability.