Scale AI logo

Scale AI logo
  • Enterprise
Book a Demo→
Log In
←Back to Blog

Ziwen Han

1 article

April 17, 2025

Research

How Calibrated Are OpenAI’s o3 and o4-mini? A Deep Dive Using Humanity’s Last Exam

How Calibrated Are OpenAI’s o3 and o4-mini? A Deep Dive Using Humanity’s Last Exam

When we evaluated o3 and o4-mini on Humanity’s Last Exam, we noticed their calibration errors were significantly lower than predecessors. A well-calibrated model is like someone who knows when they are likely to be right or wrong. If a well-calibrated model says it’s 70% confident on a set of questions, it should be correct about 70% of them. Calibration error measures this difference between the model’s stated confidence and its actual accuracy – ideally it’s 0%. All models benchmarked so far have exhibited much higher calibration errors. Are the newer generation of reasoning models from OpenAI truly better calibrated?

Read more

  • Products

    • Scale Data Engine
    • Scale GenAI Platform
    • Scale Donovan
    • Government

      • Public Sector
  • Company

    • About
    • Careers
    • Security
    • Terms
    • Privacy
    • Modern Slavery Statement
  • Resources

    • Blog
    • Contact Us
    • Customers
    • Events
    • Documentation
    • Guides
    • Community
    • Research
  • Guides

    • Data Labeling
    • ML Model Training
    • Diffusion Models
    • Guide to AI for eCommerce
    • Computer Vision Applications
    • Large Language Models
  • Follow Us

Copyright © 2026 Scale AI, Inc. All rights reserved.Terms of Use & Privacy Policy