April 17, 2025
How Calibrated Are OpenAI’s o3 and o4-mini? A Deep Dive Using Humanity’s Last Exam
When we evaluated o3 and o4-mini on Humanity’s Last Exam, we noticed that their calibration errors were significantly lower than those of their predecessors. A well-calibrated model is like a person who knows when they are likely to be right or wrong: if a well-calibrated model says it is 70% confident on a set of questions, it should answer about 70% of them correctly. Calibration error measures the gap between a model’s stated confidence and its actual accuracy; ideally it is 0%. Every other model benchmarked so far has exhibited much higher calibration error. Is the newer generation of reasoning models from OpenAI truly better calibrated?
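To make the metric concrete, here is a minimal Python sketch of one standard way to measure this: binned expected calibration error (ECE). The exact metric used in the full evaluation may differ, and the data at the bottom is purely illustrative.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE-style calibration error: bucket predictions by stated
    confidence, then take the bin-size-weighted average of
    |mean confidence - accuracy| across buckets. A perfectly
    calibrated model scores 0."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Map each confidence in [0, 1] to one of n_bins equal-width bins;
    # a confidence of exactly 1.0 falls into the last bin.
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Illustrative data only: a model that says "70% confident" on ten
# questions and answers seven of them correctly is perfectly
# calibrated on this slice, so its error is ~0.
stated = [0.7] * 10
right = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(f"calibration error: {expected_calibration_error(stated, right):.3f}")
```

Intuitively, each bin asks "of the questions where the model claimed roughly this confidence, how often was it actually right?", and the final score averages those gaps weighted by how many predictions fall in each bin.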
Read more
March 7, 2024
Introducing WMDP: Measuring and Mitigating Catastrophic Risk Potential from LLMs
In partnership with the Center for AI Safety, Scale is proud to publish a novel safety evaluation benchmark for large language models: the Weapons of Mass Destruction Proxy (WMDP).
Read more