
Today, we add several new models to Showdown:

- OpenAI’s GPT-5;
- Claude Sonnet 4.5 and Haiku 4.5, with and without extended thinking; and
- DeepSeek-R1-0528.
Claude Sonnet 4.5 joins Claude Opus 4.1 and GPT-5 Chat at the top of the leaderboard, followed by Qwen3 and Claude Sonnet 4.5 (extended thinking). To see the full rankings, please visit the Showdown leaderboard.
A surprising finding is that users consistently rank GPT-5 significantly lower than other models. In this blog post, we share our preliminary analysis of GPT-5's ranking on Showdown, where we examine the effect of thinking effort, task type, and evaluation setting.
OpenAI’s GPT-5 series of models consists of multiple variants. Showdown already hosts gpt-5-chat, the “non-reasoning model used in ChatGPT”. In today’s update, we add gpt-5, the reasoning model that “powers maximum performance in ChatGPT”, using the default level of reasoning effort (“medium”). We disambiguate between these variants by referring to the thinking version of GPT-5 as “GPT-5” and the chat-optimized snapshot as “GPT-5 Chat”. Where thinking effort levels are relevant, we include the appropriate suffix (“GPT-5-low”).
We find that Showdown users rank GPT-5 significantly lower than other models—a surprising finding, in light of the model’s high ranking on other industry benchmarks. To corroborate this finding, we conduct follow-up analyses, where we find evidence that:
- As thinking effort increases, GPT-5’s performance on Showdown degrades. We show that users systematically rank GPT-5-low above GPT-5-medium, and GPT-5-medium above GPT-5-high.
- GPT-5 does not show significant performance improvements on coding prompts. We show this by computing separate rankings on battles containing coding and non-coding prompts.
To understand the effect of Showdown’s in situ evaluation setting on GPT-5’s performance, we collected additional data in a separate setting where human annotators were asked to write prompts and rate paired responses (GPT-5 versus opponent) for the express purpose of model evaluation. Even in this setting, GPT-5 frequently loses battles against other language models.
We were surprised by the performance of GPT-5 on Showdown. Nonetheless, these results underscore the importance of maintaining benchmarks that measure user preferences in a real-world chat setting. Our findings demonstrate empirically that performance on model capability benchmarks does not directly translate to performance in a real-world chat environment.
At the same time, we recognize the limitations of our current approach. In particular, gpt-5 was optimized for API use cases, including code and agentic settings. Even so, we found the disparity in performance between gpt-5-chat and gpt-5 surprising, and worth sharing with the community for open discussion.
For additional information on Showdown, please see our technical report. To see live Showdown rankings, please visit the SEAL Showdown website.
Reasoning models use reasoning tokens in addition to input and output tokens, and typically expose parameters controlling the reasoning token budget. For GPT-5, this budget is set at one of three levels: “low”, “medium”, and “high”. Today’s update adds GPT-5-medium, the default setting, to Showdown. To understand the impact of different reasoning token budgets, we additionally collected battles from GPT-5-low and GPT-5-high on a temporary, experimental basis.
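For readers unfamiliar with these controls, the sketch below shows one way to request each effort level via the OpenAI Python SDK’s Responses API. The example prompt is hypothetical, and this is not the harness Showdown uses to collect battles.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical example prompt; Showdown battles come from real user conversations.
prompt = "Explain the difference between a mutex and a semaphore."

# Query GPT-5 at each reasoning effort level ("medium" is the default).
for effort in ["low", "medium", "high"]:
    response = client.responses.create(
        model="gpt-5",
        reasoning={"effort": effort},
        input=prompt,
    )
    print(f"--- gpt-5-{effort} ---")
    print(response.output_text)
```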
Other benchmarks suggest that GPT-5-high should rank highest, followed by GPT-5-medium and GPT-5-low. Surprisingly, we found that on Showdown, GPT-5’s ranking declines as reasoning effort increases. This suggests that in chat environments, given a fixed model, users disprefer variants with higher reasoning effort. We report rankings both with and without style control and find that this ordering holds in both cases. For further discussion of the rankings of thinking and non-thinking model variants, see our technical report.
Rankings incorporating Showdown leaderboard models and GPT-5 models, with and without style control. GPT-5 models are bolded. Notably, GPT-5-low ranks above GPT-5-medium, which ranks above GPT-5-high.
| Model | With style control | Without style control |
|---|---|---|
| claude-sonnet-4-5-20250929 | 1 | 1 |
| claude-opus-4-1-20250805 | 1 | 4 |
| gpt-5-chat | 1 | 3 |
| claude-sonnet-4-20250514 | 4 | 8 |
| claude-opus-4-20250514 | 4 | 8 |
| qwen3-235b-a22b-2507-v1 | 4 | 1 |
| claude-sonnet-4-5-20250929 (Thinking) | 4 | 1 |
| claude-opus-4-1-20250805 (Thinking) | 6 | 6 |
| gpt-4.1-2025-04-14 | 8 | 11 |
| claude-haiku-4-5-20251001 | 8 | 9 |
| claude-opus-4-20250514 (Thinking) | 9 | 10 |
| o3-2025-04-16-medium | 11 | 17 |
| claude-sonnet-4-20250514 (Thinking) | 11 | 11 |
| gemini-2.5-pro-preview-06-05 | 12 | 1 |
| claude-haiku-4-5-20251001 (Thinking) | 12 | 8 |
| gpt-5-2025-08-07-low | 14 | 18 |
| gemini-2.5-flash-preview-05-20 | 17 | 8 |
| o4-mini-2025-04-16-medium | 17 | 20 |
| llama4-maverick-instruct-basic | 18 | 20 |
| gpt-5-2025-08-07-medium | 19 | 18 |
| deepseek-r1-0528 | 21 | 13 |
| gpt-5-2025-08-07-high | 21 | 20 |
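For intuition on how pairwise battles turn into rankings, here is a minimal Bradley-Terry sketch fit with logistic regression. It is illustrative only: it assumes a simple `(model_a, model_b, winner)` battle format and omits ties, style-control covariates, and confidence intervals, all of which the actual Showdown pipeline handles differently (see our technical report).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bradley_terry_ranking(battles, models):
    """Rank models from pairwise battles with a Bradley-Terry fit.

    battles: list of (model_a, model_b, winner) tuples, winner in {"a", "b"}.
    Returns the model names sorted from strongest to weakest.
    """
    idx = {m: i for i, m in enumerate(models)}
    X, y = [], []
    for model_a, model_b, winner in battles:
        row = np.zeros(len(models))
        row[idx[model_a]] = 1.0   # +1 for the model shown as response A
        row[idx[model_b]] = -1.0  # -1 for the model shown as response B
        X.append(row)
        y.append(1 if winner == "a" else 0)
    # Logistic regression on score differences recovers Bradley-Terry strengths.
    clf = LogisticRegression(fit_intercept=False, C=1e6)
    clf.fit(np.array(X), np.array(y))
    scores = clf.coef_[0]
    return sorted(models, key=lambda m: -scores[idx[m]])

# Hypothetical toy battles: model_x beats model_y in two of three battles.
battles = [
    ("model_x", "model_y", "a"),
    ("model_y", "model_x", "b"),
    ("model_x", "model_y", "b"),
]
print(bradley_terry_ranking(battles, ["model_x", "model_y"]))  # ['model_x', 'model_y']
```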
According to the release notes for the GPT-5 API, GPT-5 is optimized for coding and agentic use cases. To measure whether GPT-5’s underperformance is explained by Showdown’s topic distribution, we classify battle prompts as coding or non-coding. We include some examples of coding and non-coding battle prompts below.
| Coding prompts | Non-coding prompts |
|---|---|
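As a sketch of how battle prompts might be partitioned with an LLM classifier, one could do something like the following. The classifier model, instructions, and labels here are illustrative assumptions, not Showdown’s actual configuration.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical classification instructions; Showdown's actual prompt may differ.
CLASSIFIER_INSTRUCTIONS = (
    "You label chat prompts. Reply with exactly one word: 'coding' if the "
    "prompt is primarily about writing, debugging, reviewing, or explaining "
    "code, otherwise 'non-coding'."
)

def classify_battle_prompt(battle_prompt: str) -> str:
    """Label a battle prompt as 'coding' or 'non-coding' with an LLM judge."""
    response = client.responses.create(
        model="gpt-4.1-mini",  # hypothetical choice of classifier model
        instructions=CLASSIFIER_INSTRUCTIONS,
        input=battle_prompt,
    )
    label = response.output_text.strip().lower()
    return label if label in {"coding", "non-coding"} else "non-coding"

# Hypothetical battle prompt.
print(classify_battle_prompt("Why does my Python list comprehension raise a NameError?"))
```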
Within Showdown’s chat setting, we fail to find compelling evidence that GPT-5’s performance improves on coding prompts. GPT-5-low performs slightly better on coding prompts, while GPT-5-medium and GPT-5-high perform slightly worse. An important limitation of these findings is that coding prompts within Showdown may not reflect GPT-5’s target use case in API or agentic settings.
Rankings for Showdown and GPT-5 models, where the 'code' and 'non-code' rankings were computed on data partitions (determined by an LLM classifier on battle prompts), and the 'combined' ranking was computed on all data.
| Model | Code | Non-code | Combined |
|---|---|---|---|
| claude-opus-4-1-20250805 | 1 | 1 | 1 |
| gpt-5-chat | 4 | 1 | 1 |
| claude-sonnet-4-20250514 | 1 | 3 | 3 |
| qwen3-235b-a22b-2507-v1 | 6 | 3 | 3 |
| claude-opus-4-20250514 | 1 | 3 | 3 |
| claude-opus-4-1-20250805 (Thinking) | 1 | 5 | 5 |
| gpt-4.1-2025-04-14 | 5 | 6 | 5 |
| claude-opus-4-20250514 (Thinking) | 2 | 8 | 8 |
| o3-2025-04-16-medium | 9 | 8 | 9 |
| claude-sonnet-4-20250514 (Thinking) | 4 | 9 | 9 |
| gemini-2.5-pro-preview-06-05 | 6 | 9 | 9 |
| gpt-5-2025-08-07-low | 8 | 10 | 11 |
| gemini-2.5-flash-preview-05-20 | 12 | 13 | 13 |
| o4-mini-2025-04-16-medium | 9 | 13 | 13 |
| llama4-maverick-instruct-basic | 10 | 15 | 14 |
| gpt-5-2025-08-07-medium | 16 | 14 | 15 |
| deepseek-r1-0528 | 6 | 17 | 17 |
| gpt-5-2025-08-07-high | 18 | 17 | 17 |
One of the key differences between Showdown and other evaluation settings is its in situ design, where battles are mined between turns in natural human-LLM conversations. We hypothesized that this setting could explain GPT-5’s performance. To study this hypothesis, we ran an experiment in which we collected human preferences in an adversarial setting.
In this experiment, we tasked human annotators with writing prompts for the express purpose of surfacing performance differences between GPT-5 and other models hosted on Showdown. In this controlled setting, we found that GPT-5 maintained a low win rate against most opponent models (N=846). Strikingly, this suggests that even when annotators are explicitly asked to vote on the basis of model capability, they disprefer GPT-5 to other models.
Results from the adversarial-setting experiment. We report GPT-5’s win rate against other models hosted on Showdown, excluding ties. With the exception of o4-mini, GPT-5 loses against all other models in the majority of the battles collected in this experiment.
| Opponent | Win Rate | N |
|---|---|---|
| gemini-2.5-flash-preview-05-20 | 0.25 | 79 |
| claude-opus-4-1-20250805 | 0.26 | 82 |
| gemini-2.5-pro-preview-06-05 | 0.28 | 74 |
| claude-opus-4-1-20250805 (Thinking) | 0.31 | 84 |
| gpt-4.1-2025-04-14 | 0.31 | 80 |
| claude-sonnet-4-20250514 (Thinking) | 0.34 | 79 |
| claude-opus-4-20250514 | 0.35 | 81 |
| claude-sonnet-4-20250514 | 0.35 | 78 |
| claude-opus-4-20250514 (Thinking) | 0.35 | 79 |
| o3-2025-04-16-medium | 0.45 | 69 |
| llama4-maverick-instruct-basic | 0.46 | 78 |
| o4-mini-2025-04-16-medium | 0.53 | 55 |
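For clarity, the win rates above are simply wins divided by decisive (non-tie) battles against each opponent. A minimal sketch, with hypothetical battle outcomes:

```python
from collections import Counter

def win_rate_excluding_ties(outcomes):
    """GPT-5's win rate against one opponent, with tied battles excluded.

    outcomes: iterable of "win", "loss", or "tie", from GPT-5's perspective
    (a hypothetical record format, not Showdown's actual schema).
    """
    counts = Counter(outcomes)
    decisive = counts["win"] + counts["loss"]
    return counts["win"] / decisive if decisive else float("nan")

# Hypothetical record against one opponent: 2 wins, 5 losses, 1 tie.
outcomes = ["win", "loss", "loss", "tie", "loss", "win", "loss", "loss"]
print(round(win_rate_excluding_ties(outcomes), 2))  # 0.29
```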
We were surprised by GPT-5's performance on Showdown, which departs significantly from the model's high ranking on capability benchmarks and its broader reception by users. Our analysis consistently shows that users disprefer GPT-5's reasoning variants, with performance degrading as reasoning effort increases. We corroborated this finding in a separate, adversarial setting, suggesting the disparity is not an artifact of Showdown’s in situ evaluation setting. While we recognize that GPT-5 was optimized for API-driven, agentic use cases, our results demonstrate empirically that performance on capability benchmarks does not directly translate to user preferences.
We share these early findings to foster an open discussion with the community. We will continue to share our findings from Showdown in subsequent research blog posts.