Ziwen Han

Motivation
When we evaluated o3 and o4-mini on <a href="https://scale.com/leaderboard/humanitys_last_exam">Humanity&rsquo;s Last Exam</a>, we noticed their calibration errors were significantly lower than predecessors. A well-calibrated model is like someone who knows when they are likely to be right or wrong. If a well-calibrated model says it&rsquo;s 70% confident on a set of questions, it should be correct about 70% of the time. Calibration error measures this difference between the model&rsquo;s stated confidence and its actual accuracy &ndash; ideally it&rsquo;s 0%. All models benchmarked so far have exhibited much higher calibration errors. Are the newer generation of reasoning models from OpenAI truly better calibrated?
Results (HLE)
<div dir="ltr" align="left">
<table><colgroup><col width="156"><col width="156"><col width="143"><col width="169"></colgroup>
<tbody>
<tr>
<td>
Model
</td>
<td>
Accuracy (%)
Higher is better
</td>
<td>
Confidence (%)
</td>
<td>
Calibration Error (%)
Lower is better
</td>
</tr>
<tr>
<td>
o3 (high)
</td>
<td style="text-align: left;">
20.3
</td>
<td style="text-align: left;">
55
</td>
<td style="text-align: left;">
34
</td>
</tr>
<tr>
<td>
o4-mini (high)
</td>
<td style="text-align: left;">
18.1
</td>
<td style="text-align: left;">
77
</td>
<td style="text-align: left;">
57
</td>
</tr>
<tr>
<td>
o3-mini (text only, high)
</td>
<td style="text-align: left;">
13.4
</td>
<td style="text-align: left;">
96
</td>
<td style="text-align: left;">
80
</td>
</tr>
<tr>
<td>
o1 (December 2024)
</td>
<td style="text-align: left;">
8.0
</td>
<td style="text-align: left;">
93
</td>
<td style="text-align: left;">
83
</td>
</tr>
</tbody>
</table>
</div>
Setup
We prompt the model to output a confidence score for each response, following the setup from <a href="https://arxiv.org/abs/2411.04368">Wei et al., 2024</a>, then calculate the <a href="https://github.com/hendrycks/outlier-exposure/blob/master/utils/calibration_tools.py">RMS calibration error</a>. The confidence scores and answers are extracted then judged by o3-mini-2025-01-31. Our full evaluation procedure is described on the <a href="https://scale.com/leaderboard/humanitys_last_exam">official HLE leaderboard page</a> and <a href="https://github.com/centerforaisafety/hle">open sourced here</a>.
<pre dir="ltr" style="padding-left: 40px;">Your response should be in the following format: Explanation: {your explanation for your final answer} Exact Answer: {your succinct, final answer} Confidence: {your confidence score between 0% and 100% for your answer}</pre>
HLE Results
The confidence distribution from o3 is nearly uniform, in contrast with o1 which predicts high confidence on most responses. While a broader confidence range is a positive sign, it doesn&rsquo;t imply the scores are meaningful. Calibration requires that the confidence levels match observed accuracies.
<img src="https://img.plasmic.app/img-optimizer/v1/img?src=06cb990b903c190729a8e497453df7c7.jpg&amp;f=webp&amp;q=75" alt="" width="863" height="529">
Random Baseline
One simple baseline is to uniformly and randomly assign a confidence score from 0 to 100 to every answer instead of using the model&rsquo;s own confidence rating, without modifying the answer. The calibration error of this random baseline still differs between models, as different models have different accuracies on HLE. A model would need to outperform this random baseline to show evidence of good calibration. We find o3's calibration error is better than other models, which are significantly worse than random, but o3 is not significantly better than random baseline.
<div dir="ltr" align="left">
<table><colgroup><col width="212"><col width="212"><col width="238"></colgroup>
<tbody>
<tr>
<td>
Model
</td>
<td>
Calibration Error (%) - Model Confidence
</td>
<td>
Calibration Error (%) - Random Baseline
</td>
</tr>
<tr>
<td>
o3 (high)
</td>
<td style="text-align: left;">
34
</td>
<td style="text-align: left;">
36
</td>
</tr>
<tr>
<td>
o4-mini (high)
</td>
<td style="text-align: left;">
57
</td>
<td style="text-align: left;">
39
</td>
</tr>
<tr>
<td>
o3-mini (text only, high)
</td>
<td style="text-align: left;">
80
</td>
<td style="text-align: left;">
42
</td>
</tr>
<tr>
<td>
o1 (December 2024)
</td>
<td style="text-align: left;">
83
</td>
<td style="text-align: left;">
45
</td>
</tr>
</tbody>
</table>
</div>
HLE Calibration Curve
<img src="https://img.plasmic.app/img-optimizer/v1/img?src=6398139229afef842f570f57e9665618.jpg&amp;f=webp&amp;q=75" alt="" width="863" height="842">
We further explore calibration by plotting out the calibration curve. We quantize model confidence into 10 equal-width bins from 0% to 100%, then calculate an accuracy per bin. Error bars are computed using the approximation 1.96 x sqrt(p x (1-p) / n), where p is the accuracy per bin and n is the number of datapoints in the bin.&nbsp;
We see a similar pattern to our earlier findings: o3 has a lower calibration error because its confidence distribution is more uniform. Since most frontier models score low on HLE, simply stating lower confidence on average will lower the calibration error metric. This calibration curve paints a clear picture: there is no strong visual correlation between accuracy and confidence in each bin.
<h2 dir="ltr">GSM8k Exploration</h2>
Are o3 and o4-mini underconfident?
We further tested on <a href="https://huggingface.co/datasets/openai/gsm8k">GSM8k</a>, a dataset of simple math reasoning problems where models achieve nearly 100% accuracy (minus some label/judge errors, i.e. saturated). A necessary but not sufficient condition for a model to be well-calibrated on a high accuracy dataset is giving high confidence on all questions &ndash; models should not only avoid overconfidence on hard tasks, but also avoid underconfidence on easy tasks. Our goal is to see if o3 is broadly underconfident.
GSM8k Results
<div dir="ltr" align="left">
<table><colgroup><col width="156"><col width="156"><col width="145"><col width="167"></colgroup>
<tbody>
<tr>
<td>
Model
</td>
<td>
Accuracy (%)
</td>
<td>
Confidence (%)
</td>
<td>
Calibration Error (%)
</td>
</tr>
<tr>
<td>
o3 (high)
</td>
<td style="text-align: left;">
96.7
</td>
<td style="text-align: left;">
84
</td>
<td style="text-align: left;">
24
</td>
</tr>
<tr>
<td>
o4-mini (high)
</td>
<td style="text-align: left;">
96.9
</td>
<td style="text-align: left;">
99
</td>
<td style="text-align: left;">
3
</td>
</tr>
<tr>
<td>
o3-mini (high)
</td>
<td style="text-align: left;">
96.2
</td>
<td style="text-align: left;">
100
</td>
<td style="text-align: left;">
4
</td>
</tr>
<tr>
<td>
o1 (December 2024)
</td>
<td style="text-align: left;">
96.4
</td>
<td style="text-align: left;">
100
</td>
<td style="text-align: left;">
4
</td>
</tr>
</tbody>
</table>
</div>
We used the same evaluation pipeline from Humanity&rsquo;s Last Exam as described in the previous section. We report results on the train split of GSM8k to have more datapoints for visualization, though we found similar trends on <a href="https://arxiv.org/abs/2405.00332">GSM1k</a> in our exploration. Both GSM8k and GSM1k are saturated, so overfitting is not a concern.
<img src="https://img.plasmic.app/img-optimizer/v1/img?src=e358e5ec9517f8c80c9545384ccf4eb2.jpg&amp;f=webp&amp;q=75" alt="" width="863" height="529">
<img src="https://img.plasmic.app/img-optimizer/v1/img?src=00fc51522dda80cc59f8353e4e4a47d5.jpg&amp;f=webp&amp;q=75" alt="" width="863" height="842">
o1 and o3-mini are omitted from this graph as they tend to predict 100% confidence on all responses. Bins with less than 30 responses are omitted for clarity.
We find o3 is indeed broadly less confident, while o4-mini is better calibrated on this dataset.
Better Confidence Elicitation?
When we manually inspected the outputs from o3, we noticed there was no obvious correlation for when o3 would give a low confidence score. We then hypothesized asking models to explain reasoning for their confidence score might give us a better hint as to why they were giving low confidence on certain questions.
Setup We make a small change to the prompt given to the model, asking for an additional explanation to justify the confidence score.
<pre dir="ltr" style="padding-left: 40px;">Your response should be in the following format: Explanation: {your explanation for your final answer and explanation for your confidence score} Exact Answer: {your succinct, final answer} Confidence: {your confidence score between 0% and 100% for your answer} </pre>
GSM8k Results (Modified Prompt)
<div dir="ltr" align="left">
<table><colgroup><col width="156"><col width="156"><col width="145"><col width="167"></colgroup>
<tbody>
<tr>
<td>
Model
</td>
<td>
Accuracy (%)
</td>
<td>
Confidence (%)
</td>
<td>
Calibration Error (%)
</td>
</tr>
<tr>
<td>
o3 (high)
</td>
<td style="text-align: left;">
96.9
</td>
<td style="text-align: left;">
90
</td>
<td style="text-align: left;">
9
</td>
</tr>
<tr>
<td>
o4-mini (high)
</td>
<td style="text-align: left;">
96.9
</td>
<td style="text-align: left;">
99
</td>
<td style="text-align: left;">
3
</td>
</tr>
<tr>
<td>
o3-mini (high)
</td>
<td style="text-align: left;">
96.3
</td>
<td style="text-align: left;">
100
</td>
<td style="text-align: left;">
4
</td>
</tr>
<tr>
<td>
o1 (December 2024)
</td>
<td style="text-align: left;">
93.3
</td>
<td style="text-align: left;">
100
</td>
<td style="text-align: left;">
4
</td>
</tr>
</tbody>
</table>
<img src="https://img.plasmic.app/img-optimizer/v1/img?src=924eb2954d30ad6b25c5e9db8f16c25e.jpg&amp;f=webp&amp;q=75" alt="" width="863" height="529">
<img src="https://img.plasmic.app/img-optimizer/v1/img?src=d343109fe0a2c14d06bc6fcc92fdf6c1.jpg&amp;f=webp&amp;q=75" alt="" width="863" height="842">
Interestingly, this modified prompt prunes away o3&rsquo;s low confidence scores in the range of [0, 50], reducing the calibration error from 24% down to 9%. Neither accuracy for any model, nor calibration error on other models are affected. Even more interesting, this didn&rsquo;t solve our original problem: o3 still didn&rsquo;t output explanations for their confidence score and didn&rsquo;t address direct instruction to do so. In any case, o3 still remains more underconfident than its predecessors on this dataset.
One final remark is that the confidence distributions of OpenAI&rsquo;s reasoning models do change between GSM8k vs. HLE. This indicates the models are not completely uncalibrated, as their average confidence scores are lower on the harder HLE dataset compared to the easier GSM8k dataset.
Check out the full ranking of models on HLE and to explore other SEAL leaderboards here: <a class="c-link" href="https://scale.com/leaderboard" target="_blank" rel="noopener noreferrer" data-stringify-link="https://scale.com/leaderboard" data-sk="tooltip_parent">https://scale.com/leaderboard</a>
Limitations
Our analysis is conditioned on a fixed prompt for eliciting confidence. We&rsquo;ve already shown that in the case of o3 on GSM8k, changing the prompt changes the calibration of the model. Monte-Carlo methods such as resampling are also shown to be better calibrated on SimpleQA (<a href="https://arxiv.org/abs/2411.04368">Wei et al., 2024</a>), but are prohibitively expensive on challenging reasoning datasets &ndash; 100x the compute to evaluate. Finally, it would be compelling to explore reasoning datasets with difficulty between that of HLE and GSM8k, to paint a broader picture of overall model calibration. We leave a deeper exploration of model calibration across benchmarks to future work.
We thank Miles Turpin and Xiang Deng for their insightful feedback on this blogpost.
&nbsp;
</div>

Are the newer generation of reasoning models from OpenAI truly better calibrated? 

Ziwen Han

How Calibrated Are OpenAI’s o3 and o4-mini? A Deep Dive Using Humanity’s Last Exam