
Today, we add several new models to Showdown:

- OpenAI’s GPT-5;
- Claude Sonnet 4.5 and Haiku 4.5, with and without extended thinking; and
- DeepSeek-R1-0528.
Claude Sonnet 4.5 joins Claude Opus 4.1 and GPT-5 Chat at the top of the leaderboard, followed by Qwen3 and Claude Sonnet 4.5 (extended thinking). To see the full rankings, please visit the Showdown leaderboard.
A surprising finding is that users consistently rank GPT-5 significantly lower than other models. In this blog post, we share our preliminary analysis of GPT-5's ranking on Showdown, where we examine the effect of thinking effort, task type, and evaluation setting.
OpenAI’s GPT-5 series of models consists of multiple variants. Showdown already hosts gpt-5-chat, the “non-reasoning model used in ChatGPT”. In today’s update, we add gpt-5, the reasoning model that “powers maximum performance in ChatGPT”, using the default level of reasoning effort (“medium”). We disambiguate between these variants by referring to the thinking version of GPT-5 as “GPT-5” and the chat-optimized snapshot as “GPT-5 Chat”. Where thinking effort levels are relevant, we include the appropriate suffix (“GPT-5-low”).
We find that Showdown users rank GPT-5 significantly lower than other models—a surprising finding, in light of the model’s high ranking on other industry benchmarks. To corroborate this finding, we conduct follow-up analyses, where we find evidence that:
- As thinking effort increases, GPT-5’s performance on Showdown degrades. We show that users systematically rank GPT-5-low above GPT-5-medium, and GPT-5-medium above GPT-5-high.
- GPT-5 does not show significant performance improvements on coding prompts. We show this by computing separate rankings on battles containing coding and non-coding prompts.
To understand the effect of Showdown’s in situ evaluation setting on GPT-5’s performance, we collected additional data in a separate setting where human annotators were asked to write prompts and rate paired responses (GPT-5 versus opponent) for the express purpose of model evaluation. Even in this setting, GPT-5 frequently loses battles against other language models.
We were surprised by the performance of GPT-5 on Showdown. Nonetheless, these results underscore the importance of maintaining benchmarks that measure user preferences in a real-world chat setting. Our findings demonstrate empirically that performance on model capability benchmarks does not directly translate to performance in a real-world chat environment.
At the same time, we recognize the limitations of our current approach. In particular, gpt-5 was optimized for API use cases, including code and agentic settings. Even so, we found the disparity in performance between gpt-5-chat and gpt-5 surprising, and worth sharing with the community for open discussion.
For additional information on Showdown, please see our technical report. To see live Showdown rankings, please visit the SEAL Showdown website.
Reasoning models use reasoning tokens in addition to input and output tokens, and typically expose parameters controlling the reasoning token budget. For GPT-5, this budget is set at one of three levels: “low”, “medium”, and “high”. Today’s update adds GPT-5-medium, the default setting, to Showdown. To understand the impact of different reasoning token budgets, we additionally collected battles from GPT-5-low and GPT-5-high on a temporary, experimental basis.
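For readers unfamiliar with these controls, the sketch below shows one way to request each effort level via the OpenAI Python SDK’s Responses API. The example prompt is hypothetical, and this is not the harness Showdown uses to collect battles.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical example prompt; Showdown battles come from real user conversations.
prompt = "Explain the difference between a mutex and a semaphore."

# Query GPT-5 at each reasoning effort level ("medium" is the default).
for effort in ["low", "medium", "high"]:
    response = client.responses.create(
        model="gpt-5",
        reasoning={"effort": effort},
        input=prompt,
    )
    print(f"--- gpt-5-{effort} ---")
    print(response.output_text)
```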
Other benchmarks suggest that GPT-5-high should rank highest, followed by GPT-5-medium and GPT-5-low. Surprisingly, we found that on Showdown, GPT-5’s ranking declines as reasoning effort increases. This suggests that in chat environments, given a fixed model, users disprefer variants with higher reasoning effort. We report rankings both with and without style control and find that this ordering holds in both cases. For further discussion of the rankings of thinking and non-thinking model variants, see our technical report.
Rankings incorporating Showdown leaderboard models and GPT-5 models, with and without style control. GPT-5 models are bolded. Notably, GPT-5-low ranks above GPT-5-medium, which ranks above GPT-5-high.
| Model | With style control | Without style control |
|---|---|---|
| claude-sonnet-4-5-20250929 | 1 | 1 |
| claude-opus-4-1-20250805 | 1 | 4 |
| gpt-5-chat | 1 | 3 |
| claude-sonnet-4-20250514 | 4 | 8 |
| claude-opus-4-20250514 | 4 | 8 |
| qwen3-235b-a22b-2507-v1 | 4 | 1 |
| claude-sonnet-4-5-20250929 (Thinking) | 4 | 1 |
| claude-opus-4-1-20250805 (Thinking) | 6 | 6 |
| gpt-4.1-2025-04-14 | 8 | 11 |
| claude-haiku-4-5-20251001 | 8 | 9 |
| claude-opus-4-20250514 (Thinking) | 9 | 10 |
| o3-2025-04-16-medium | 11 | 17 |
| claude-sonnet-4-20250514 (Thinking) | 11 | 11 |
| gemini-2.5-pro-preview-06-05 | 12 | 1 |
| claude-haiku-4-5-20251001 (Thinking) | 12 | 8 |
| gpt-5-2025-08-07-low | 14 | 18 |
| gemini-2.5-flash-preview-05-20 | 17 | 8 |
| o4-mini-2025-04-16-medium | 17 | 20 |
| llama4-maverick-instruct-basic | 18 | 20 |
| gpt-5-2025-08-07-medium | 19 | 18 |
| deepseek-r1-0528 | 21 | 13 |
| gpt-5-2025-08-07-high | 21 | 20 |
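For intuition on how pairwise battles turn into rankings, here is a minimal Bradley-Terry sketch fit with logistic regression. It is illustrative only: it assumes a simple `(model_a, model_b, winner)` battle format and omits ties, style-control covariates, and confidence intervals, all of which the actual Showdown pipeline handles differently (see our technical report).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bradley_terry_ranking(battles, models):
    """Rank models from pairwise battles with a Bradley-Terry fit.

    battles: list of (model_a, model_b, winner) tuples, winner in {"a", "b"}.
    Returns the model names sorted from strongest to weakest.
    """
    idx = {m: i for i, m in enumerate(models)}
    X, y = [], []
    for model_a, model_b, winner in battles:
        row = np.zeros(len(models))
        row[idx[model_a]] = 1.0   # +1 for the model shown as response A
        row[idx[model_b]] = -1.0  # -1 for the model shown as response B
        X.append(row)
        y.append(1 if winner == "a" else 0)
    # Logistic regression on score differences recovers Bradley-Terry strengths.
    clf = LogisticRegression(fit_intercept=False, C=1e6)
    clf.fit(np.array(X), np.array(y))
    scores = clf.coef_[0]
    return sorted(models, key=lambda m: -scores[idx[m]])

# Hypothetical toy battles: model_x beats model_y in two of three battles.
battles = [
    ("model_x", "model_y", "a"),
    ("model_y", "model_x", "b"),
    ("model_x", "model_y", "b"),
]
print(bradley_terry_ranking(battles, ["model_x", "model_y"]))  # ['model_x', 'model_y']
```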
According to the release notes for the GPT-5 API, GPT-5 is optimized for coding and agentic use cases. To measure whether GPT-5’s underperformance is explained by Showdown’s topic distribution, we classify battle prompts as coding or non-coding. We include some examples of coding and non-coding battle prompts below.
| Coding prompts | Non-coding prompts |
|---|---|
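As a sketch of how battle prompts might be partitioned with an LLM classifier, one could do something like the following. The classifier model, instructions, and labels here are illustrative assumptions, not Showdown’s actual configuration.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical classification instructions; Showdown's actual prompt may differ.
CLASSIFIER_INSTRUCTIONS = (
    "You label chat prompts. Reply with exactly one word: 'coding' if the "
    "prompt is primarily about writing, debugging, reviewing, or explaining "
    "code, otherwise 'non-coding'."
)

def classify_battle_prompt(battle_prompt: str) -> str:
    """Label a battle prompt as 'coding' or 'non-coding' with an LLM judge."""
    response = client.responses.create(
        model="gpt-4.1-mini",  # hypothetical choice of classifier model
        instructions=CLASSIFIER_INSTRUCTIONS,
        input=battle_prompt,
    )
    label = response.output_text.strip().lower()
    return label if label in {"coding", "non-coding"} else "non-coding"

# Hypothetical battle prompt.
print(classify_battle_prompt("Why does my Python list comprehension raise a NameError?"))
```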
Within Showdown’s chat setting, we fail to find compelling evidence that GPT-5’s performance improves on coding prompts. GPT-5-low performs slightly better on coding prompts, while GPT-5-medium and GPT-5-high perform slightly worse. An important limitation of these findings is that coding prompts within Showdown may not reflect GPT-5’s target use case in API or agentic settings.
Rankings for Showdown and GPT-5 models, where the 'code' and 'non-code' rankings were computed on data partitions (determined by an LLM classifier on battle prompts), and the 'combined' ranking was computed on all data.
| Model | Code | Non-code | Combined |
|---|---|---|---|
| claude-opus-4-1-20250805 | 1 | 1 | 1 |
| gpt-5-chat | 4 | 1 | 1 |
| claude-sonnet-4-20250514 | 1 | 3 | 3 |
| qwen3-235b-a22b-2507-v1 | 6 | 3 | 3 |
| claude-opus-4-20250514 | 1 | 3 | 3 |
| claude-opus-4-1-20250805 (Thinking) | 1 | 5 | 5 |
| gpt-4.1-2025-04-14 | 5 | 6 | 5 |
| claude-opus-4-20250514 (Thinking) | 2 | 8 | 8 |
| o3-2025-04-16-medium | 9 | 8 | 9 |
| claude-sonnet-4-20250514 (Thinking) | 4 | 9 | 9 |
| gemini-2.5-pro-preview-06-05 | 6 | 9 | 9 |
| gpt-5-2025-08-07-low | 8 | 10 | 11 |
| gemini-2.5-flash-preview-05-20 | 12 | 13 | 13 |
| o4-mini-2025-04-16-medium | 9 | 13 | 13 |
| llama4-maverick-instruct-basic | 10 | 15 | 14 |
| gpt-5-2025-08-07-medium | 16 | 14 | 15 |
| deepseek-r1-0528 | 6 | 17 | 17 |
| gpt-5-2025-08-07-high | 18 | 17 | 17 |
One of the key differences between Showdown and other evaluation settings is its in situ design, where battles are mined between turns in natural human-LLM conversations. We hypothesized that this setting could explain GPT-5’s performance. To study this hypothesis, we ran an experiment in which we collected human preferences in an adversarial setting.
In this experiment, we tasked human annotators with writing prompts for the express purpose of surfacing performance differences between GPT-5 and other models hosted on Showdown. In this controlled setting, we found that GPT-5 maintained a low win rate against most opponent models (N=846). Strikingly, this suggests that even when annotators are explicitly asked to vote on the basis of model capability, they disprefer GPT-5 to other models.
Results from the adversarial-setting experiment. We report GPT-5’s win rate against other models hosted on Showdown, excluding ties. With the exception of o4-mini, GPT-5 loses against all other models in the majority of the battles collected in this experiment.
| Opponent | Win Rate | N |
|---|---|---|
| gemini-2.5-flash-preview-05-20 | 0.25 | 79 |
| claude-opus-4-1-20250805 | 0.26 | 82 |
| gemini-2.5-pro-preview-06-05 | 0.28 | 74 |
| claude-opus-4-1-20250805 (Thinking) | 0.31 | 84 |
| gpt-4.1-2025-04-14 | 0.31 | 80 |
| claude-sonnet-4-20250514 (Thinking) | 0.34 | 79 |
| claude-opus-4-20250514 | 0.35 | 81 |
| claude-sonnet-4-20250514 | 0.35 | 78 |
| claude-opus-4-20250514 (Thinking) | 0.35 | 79 |
| o3-2025-04-16-medium | 0.45 | 69 |
| llama4-maverick-instruct-basic | 0.46 | 78 |
| o4-mini-2025-04-16-medium | 0.53 | 55 |
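For clarity, the win rates above are simply wins divided by decisive (non-tie) battles against each opponent. A minimal sketch, with hypothetical battle outcomes:

```python
from collections import Counter

def win_rate_excluding_ties(outcomes):
    """GPT-5's win rate against one opponent, with tied battles excluded.

    outcomes: iterable of "win", "loss", or "tie", from GPT-5's perspective
    (a hypothetical record format, not Showdown's actual schema).
    """
    counts = Counter(outcomes)
    decisive = counts["win"] + counts["loss"]
    return counts["win"] / decisive if decisive else float("nan")

# Hypothetical record against one opponent: 2 wins, 5 losses, 1 tie.
outcomes = ["win", "loss", "loss", "tie", "loss", "win", "loss", "loss"]
print(round(win_rate_excluding_ties(outcomes), 2))  # 0.29
```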
We were surprised by GPT-5's performance on Showdown, which departs significantly from the model's high ranking on capability benchmarks and its broader reception by users. Our analysis consistently shows that users disprefer GPT-5's reasoning variants, with performance degrading as reasoning effort increases. We corroborated this finding in a separate, adversarial setting, suggesting the disparity is not an artifact of Showdown’s in situ evaluation setting. While we recognize that GPT-5 was optimized for API-driven, agentic use cases, our results demonstrate empirically that performance on capability benchmarks does not directly translate to user preferences.
We share these early findings to foster an open discussion with the community. We will continue to share our findings from Showdown in subsequent research blog posts.