Scale AI logo

Scale AI logo
  • Enterprise
  • Government
Book a Demo→
Log In
←Back to Blog

Summer Yue

2 articles

April 17, 2025

Research

How Calibrated Are OpenAI’s o3 and o4-mini? A Deep Dive Using Humanity’s Last Exam

How Calibrated Are OpenAI’s o3 and o4-mini? A Deep Dive Using Humanity’s Last Exam

When we evaluated o3 and o4-mini on Humanity’s Last Exam, we noticed their calibration errors were significantly lower than predecessors. A well-calibrated model is like someone who knows when they are likely to be right or wrong. If a well-calibrated model says it’s 70% confident on a set of questions, it should be correct about 70% of them. Calibration error measures this difference between the model’s stated confidence and its actual accuracy – ideally it’s 0%. All models benchmarked so far have exhibited much higher calibration errors. Are the newer generation of reasoning models from OpenAI truly better calibrated?

Read more

March 7, 2024

Engineering

Introducing WMDP: Measuring and Mitigating Catastrophic Risk Potential from LLMs

Introducing WMDP: Measuring and Mitigating Catastrophic Risk Potential from LLMs

In partnership with the Center for AI Safety, Scale is proud to publish a novel safety evaluation benchmark for large language models: the Weapons of Mass Destruction Proxy (WMDP).

Read more

  • Products

    • Scale Data Engine
    • Scale GenAI Platform
    • Scale Donovan
    • Government

      • Public Sector
  • Company

    • About
    • Careers
    • Security
    • Terms
    • Privacy
    • Modern Slavery Statement
  • Resources

    • Blog
    • Contact Us
    • Customers
    • Events
    • Documentation
    • Guides
    • Community
    • Research
  • Guides

    • Data Labeling
    • ML Model Training
    • Diffusion Models
    • Guide to AI for eCommerce
    • Computer Vision Applications
    • Large Language Models
  • Follow Us

Copyright © 2026 Scale AI, Inc. All rights reserved.Terms of Use & Privacy Policy