April 17, 2025

When we evaluated o3 and o4-mini on Humanity’s Last Exam, we noticed that their calibration errors were significantly lower than those of their predecessors. A well-calibrated model is like someone who knows when they are likely to be right or wrong: if a well-calibrated model says it is 70% confident on a set of questions, it should be correct on about 70% of them. Calibration error measures the gap between the model’s stated confidence and its actual accuracy; ideally it is 0%. Every model benchmarked so far has exhibited much higher calibration errors. Is the newer generation of reasoning models from OpenAI truly better calibrated?
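To make the definition concrete, here is a minimal sketch of one standard formulation, expected calibration error (ECE), which bins predictions by stated confidence and compares each bin’s average confidence to its accuracy. The equal-width binning and bin count below are illustrative assumptions; the exact metric used in the evaluation may differ.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence, then take a size-weighted
    average of |mean confidence - accuracy| over the bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Map each confidence in [0, 1] to one of n_bins equal-width bins;
    # a confidence of exactly 1.0 falls into the top bin.
    bin_idx = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight each bin by its share of predictions
    return ece

# A model that says "70% confident" and is right ~70% of the time
# scores near 0; one that is right only ~40% of the time does not.
rng = np.random.default_rng(0)
conf = np.full(10_000, 0.7)
print(expected_calibration_error(conf, rng.random(10_000) < 0.7))  # ~0.00
print(expected_calibration_error(conf, rng.random(10_000) < 0.4))  # ~0.30
```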
March 6, 2024

In partnership with the Center for AI Safety, Scale is proud to publish a novel safety evaluation benchmark for large language models: the Weapons of Mass Destruction Proxy (WMDP).