April 17, 2025

When we evaluated o3 and o4-mini on Humanity’s Last Exam, we noticed that their calibration errors were significantly lower than those of their predecessors. A well-calibrated model is like someone who knows when they are likely to be right or wrong: if a well-calibrated model says it is 70% confident on a set of questions, it should be correct on about 70% of them. Calibration error measures the gap between the model’s stated confidence and its actual accuracy; ideally it is 0%. Every model benchmarked so far has exhibited much higher calibration errors. Is the newer generation of reasoning models from OpenAI truly better calibrated?
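To make the definition concrete, here is a minimal sketch of one standard formulation, expected calibration error (ECE), which bins predictions by stated confidence and compares each bin’s average confidence to its accuracy. The equal-width binning and bin count below are illustrative assumptions; the exact metric used in the evaluation may differ.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence, then take a size-weighted
    average of |mean confidence - accuracy| over the bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Map each confidence in [0, 1] to one of n_bins equal-width bins;
    # a confidence of exactly 1.0 falls into the top bin.
    bin_idx = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of predictions
    return ece

# A model that says "70% confident" and is right ~70% of the time
# scores near 0; one that is right only ~40% of the time does not.
rng = np.random.default_rng(0)
conf = np.full(10_000, 0.7)
print(expected_calibration_error(conf, rng.random(10_000) < 0.7))  # ~0.00
print(expected_calibration_error(conf, rng.random(10_000) < 0.4))  # ~0.30
```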
March 6, 2024

In partnership with the Center for AI Safety, Scale is proud to publish a novel safety evaluation benchmark for large language models: the Weapons of Mass Destruction Proxy (WMDP).