by David Lee, Jaehwan Jeong, Zihao Wang, Bing Liu, Janie Gu, Hatem Aldawaghreh, Adam Dimanshteyn and Jonathan Lim

Today, we add several new models to Showdown:
Claude Sonnet 4.5 joins Claude Opus 4.1 and GPT-5 Chat at the top of the leaderboard, followed by Qwen3 and Claude Sonnet 4.5 (extended thinking). To see the full rankings, please visit the Showdown leaderboard.
A surprising finding is that users consistently rank GPT-5 significantly lower than other models. In this blog post, we share a preliminary analysis of GPT-5's ranking on Showdown, examining the effects of thinking effort, task type, and evaluation setting.
OpenAI’s GPT-5 series of models consists of multiple variants. Showdown already hosts gpt-5-chat, the “non-reasoning model used in ChatGPT”. In today’s update, we add gpt-5, the reasoning model that “powers maximum performance in ChatGPT”, using the default level of reasoning effort (“medium”). We disambiguate these variants by referring to the thinking version of GPT-5 as “GPT-5” and the chat-optimized snapshot as “GPT-5 Chat”. Where thinking effort levels are relevant, we append the appropriate suffix (e.g., “GPT-5-low”).
We find that Showdown users rank GPT-5 significantly lower than other models, a surprising result given the model’s high rankings on other industry benchmarks. To corroborate this finding, we conduct follow-up analyses, where we find evidence that:
- GPT-5’s ranking declines as reasoning effort increases;
- GPT-5’s underperformance is not explained by Showdown’s distribution of coding versus non-coding prompts;
- the dispreference persists even in a controlled, adversarial evaluation setting.
We were surprised by the performance of GPT-5 on Showdown. Nonetheless, these results underscore the importance of maintaining benchmarks that measure user preferences in a real-world chat setting, and demonstrate empirically that performance on model capability benchmarks does not directly translate to performance in that environment.
At the same time, we recognize the limitations of our current approach; for GPT-5 in particular, gpt-5 was optimized for API use cases, including code and agentic settings. Even so, we found the disparity in performance between gpt-5-chat and gpt-5 surprising, and worth sharing with the community for open discussion.
For additional information on Showdown, please see our technical report. To see live Showdown rankings, please visit the SEAL Showdown website.
Reasoning models use reasoning tokens in addition to input and output tokens, and typically expose parameters controlling the reasoning token budget. For GPT-5, this budget is set at one of three levels: “low”, “medium”, and “high”. Today’s update adds GPT-5-medium, the default setting, to Showdown. To understand the impact of different reasoning token budgets, we also collected battles from GPT-5-low and GPT-5-high on a temporary, experimental basis.
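Each effort level corresponds to a single request parameter. As a rough sketch (assuming OpenAI's Chat Completions `reasoning_effort` parameter; `build_request` is our own illustrative helper, not Showdown's code):

```python
# Illustrative sketch only: how the same prompt can be queried at the three
# reasoning-effort levels via a `reasoning_effort` request field.
def build_request(prompt: str, effort: str) -> dict:
    if effort not in ("low", "medium", "high"):
        raise ValueError(f"unknown reasoning effort: {effort}")
    return {
        "model": "gpt-5",
        "reasoning_effort": effort,  # controls the reasoning token budget
        "messages": [{"role": "user", "content": prompt}],
    }

# The same battle prompt, prepared at each effort level:
requests = [build_request("Explain CRDTs briefly.", effort)
            for effort in ("low", "medium", "high")]
```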
Other benchmarks suggest that GPT-5-high should rank highest, followed by GPT-5-medium and GPT-5-low. Surprisingly, we found that on Showdown, GPT-5’s ranking declines as reasoning effort increases. This suggests that in chat environments, users disprefer higher-reasoning-effort variants of the same model. We computed rankings both with and without style control and found that this result holds in both cases. For further discussion of the rankings of thinking and non-thinking model variants, see our technical report.
Rankings incorporating Showdown leaderboard models and GPT-5 models, with and without style control. GPT-5 models are bolded. Notably, GPT-5-low ranks above GPT-5-medium, which ranks above GPT-5-high.
| Model | With style control | Without style control |
| --- | --- | --- |
| claude-sonnet-4-5-20250929 | 1 | 1 |
| claude-opus-4-1-20250805 | 1 | 4 |
| gpt-5-chat | 1 | 3 |
| claude-sonnet-4-20250514 | 4 | 8 |
| claude-opus-4-20250514 | 4 | 8 |
| qwen3-235b-a22b-2507-v1 | 4 | 1 |
| claude-sonnet-4-5-20250929 (Thinking) | 4 | 1 |
| claude-opus-4-1-20250805 (Thinking) | 6 | 6 |
| gpt-4.1-2025-04-14 | 8 | 11 |
| claude-haiku-4-5-20251001 | 8 | 9 |
| claude-opus-4-20250514 (Thinking) | 9 | 10 |
| o3-2025-04-16-medium | 11 | 17 |
| claude-sonnet-4-20250514 (Thinking) | 11 | 11 |
| gemini-2.5-pro-preview-06-05 | 12 | 1 |
| claude-haiku-4-5-20251001 (Thinking) | 12 | 8 |
| **gpt-5-2025-08-07-low** | 14 | 18 |
| gemini-2.5-flash-preview-05-20 | 17 | 8 |
| o4-mini-2025-04-16-medium | 17 | 20 |
| llama4-maverick-instruct-basic | 18 | 20 |
| **gpt-5-2025-08-07-medium** | 19 | 18 |
| deepseek-r1-0528 | 21 | 13 |
| **gpt-5-2025-08-07-high** | 21 | 20 |
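Rankings like those above are typically derived from pairwise battle outcomes with a Bradley-Terry-style model. The following is a minimal sketch on made-up toy data, not Showdown's production pipeline (which additionally applies style control, as described in our technical report):

```python
# A minimal Bradley-Terry fit on toy pairwise battles, illustrating how
# rankings can be derived from battle outcomes. Sketch only; data is made up.
from collections import defaultdict

def bradley_terry(battles, iters=200):
    """battles: list of (winner, loser) pairs. Returns a strength per model."""
    models = {m for pair in battles for m in pair}
    wins = defaultdict(float)         # total wins per model
    pair_counts = defaultdict(float)  # number of battles per unordered pair
    for winner, loser in battles:
        wins[winner] += 1.0
        pair_counts[frozenset((winner, loser))] += 1.0
    strengths = {m: 1.0 for m in models}
    for _ in range(iters):  # minorization-maximization updates
        updated = {}
        for i in models:
            denom = sum(n / (strengths[i] + strengths[j])
                        for pair, n in pair_counts.items() if i in pair
                        for j in pair - {i})
            updated[i] = wins[i] / denom if denom else strengths[i]
        total = sum(updated.values())  # normalize to keep scores bounded
        strengths = {m: v * len(models) / total for m, v in updated.items()}
    return strengths

# Toy data: A beats B in 7/10 battles, B beats C in 6/10 battles.
battles = ([("A", "B")] * 7 + [("B", "A")] * 3
           + [("B", "C")] * 6 + [("C", "B")] * 4)
scores = bradley_terry(battles)
ranking = sorted(scores, key=scores.get, reverse=True)  # ["A", "B", "C"]
```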
According to the release notes for the GPT-5 API, GPT-5 is optimized for coding and agentic use cases. To measure whether GPT-5’s underperformance is explained by Showdown’s topic distribution, we classify battle prompts as coding or non-coding. We include some examples of coding and non-coding battle prompts below.
Coding prompts
Non-coding prompts
Within Showdown’s chat setting, we fail to find compelling evidence that GPT-5 performance improves on coding prompts. GPT-5-low has slightly improved performance on coding prompts, while GPT-5-medium and -high show a slight degradation in performance. An important limitation of these findings is that coding prompts within Showdown may not reflect GPT-5’s target use case in API or agentic settings.
Rankings for Showdown and GPT-5 models, where the 'code' and 'non-code' rankings were computed on data partitions (determined by an LLM classifier on battle prompts), and the 'combined' ranking was computed on all data.
| Model | Code | Non-code | Combined |
| --- | --- | --- | --- |
| claude-opus-4-1-20250805 | 1 | 1 | 1 |
| gpt-5-chat | 4 | 1 | 1 |
| claude-sonnet-4-20250514 | 1 | 3 | 3 |
| qwen3-235b-a22b-2507-v1 | 6 | 3 | 3 |
| claude-opus-4-20250514 | 1 | 3 | 3 |
| claude-opus-4-1-20250805 (Thinking) | 1 | 5 | 5 |
| gpt-4.1-2025-04-14 | 5 | 6 | 5 |
| claude-opus-4-20250514 (Thinking) | 2 | 8 | 8 |
| o3-2025-04-16-medium | 9 | 8 | 9 |
| claude-sonnet-4-20250514 (Thinking) | 4 | 9 | 9 |
| gemini-2.5-pro-preview-06-05 | 6 | 9 | 9 |
| gpt-5-2025-08-07-low | 8 | 10 | 11 |
| gemini-2.5-flash-preview-05-20 | 12 | 13 | 13 |
| o4-mini-2025-04-16-medium | 9 | 13 | 13 |
| llama4-maverick-instruct-basic | 10 | 15 | 14 |
| gpt-5-2025-08-07-medium | 16 | 14 | 15 |
| deepseek-r1-0528 | 6 | 17 | 17 |
| gpt-5-2025-08-07-high | 18 | 17 | 17 |
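The partition behind the table above is produced by an LLM classifier over battle prompts. As a purely illustrative stand-in (all names here are our own invention, not Showdown's code), a simple keyword heuristic shows the shape of that step:

```python
# Illustrative sketch of partitioning battle prompts into coding vs.
# non-coding. Showdown uses an LLM classifier for this; the keyword
# heuristic below is only a stand-in to show the data flow.
CODE_MARKERS = ("def ", "import ", "function", "select ", "compile",
                "traceback", "regex", "```")

def is_coding_prompt(prompt: str) -> bool:
    lowered = prompt.lower()
    return any(marker in lowered for marker in CODE_MARKERS)

def partition_battles(battles):
    """Split battle records into (code, non_code) lists by prompt text."""
    code, non_code = [], []
    for battle in battles:
        (code if is_coding_prompt(battle["prompt"]) else non_code).append(battle)
    return code, non_code
```

Per-partition rankings are then computed on each list separately, and the combined ranking on their union.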
One of the key differences between Showdown and other evaluation settings is Showdown’s in-flow evaluation setting, where battles are mined between turns in natural human-LLM conversations. We hypothesized that this setting could explain GPT-5’s performance. To study this hypothesis, we ran an experiment where we attempted to collect human preferences in an adversarial setting.
In this experiment, we tasked human annotators with writing prompts for the express purpose of surfacing performance differences between GPT-5 and other models hosted on Showdown. In this controlled setting, we found that GPT-5 maintained a low win rate against most opponent models (N=846). Strikingly, this result suggests that even when annotators are explicitly asked to vote on the basis of model capability, they disprefer GPT-5 to other models.
Results from the adversarial-setting experiment. We report GPT-5’s win rate against other models hosted on Showdown, excluding ties. With the exception of o4-mini, GPT-5 loses against all other models in the majority of the battles collected in this experiment.
| Opponent | Win Rate | N |
| --- | --- | --- |
| gemini-2.5-flash-preview-05-20 | 0.25 | 79 |
| claude-opus-4-1-20250805 | 0.26 | 82 |
| gemini-2.5-pro-preview-06-05 | 0.28 | 74 |
| claude-opus-4-1-20250805 (Thinking) | 0.31 | 84 |
| gpt-4.1-2025-04-14 | 0.31 | 80 |
| claude-sonnet-4-20250514 (Thinking) | 0.34 | 79 |
| claude-opus-4-20250514 | 0.35 | 81 |
| claude-sonnet-4-20250514 | 0.35 | 78 |
| claude-opus-4-20250514 (Thinking) | 0.35 | 79 |
| o3-2025-04-16-medium | 0.45 | 69 |
| llama4-maverick-instruct-basic | 0.46 | 78 |
| o4-mini-2025-04-16-medium | 0.53 | 55 |
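The win rates above follow a simple rule: ties are dropped, and wins are divided by the remaining battle count. A minimal sketch, with made-up vote counts rather than the experiment's raw data:

```python
# Sketch of a per-opponent win-rate computation with ties excluded.
# The vote counts below are illustrative, not the experiment's raw data.
from collections import defaultdict

def win_rates(votes):
    """votes: (opponent, outcome) pairs, outcome in {'win', 'loss', 'tie'}.
    Returns {opponent: (win_rate, n)} with ties excluded from both counts."""
    tally = defaultdict(lambda: {"win": 0, "loss": 0})
    for opponent, outcome in votes:
        if outcome == "tie":  # ties are excluded entirely
            continue
        tally[opponent][outcome] += 1
    return {opp: (c["win"] / (c["win"] + c["loss"]), c["win"] + c["loss"])
            for opp, c in tally.items()}

# Illustrative: 29 wins, 26 losses, 5 ties against a single opponent.
votes = ([("o4-mini-2025-04-16-medium", "win")] * 29
         + [("o4-mini-2025-04-16-medium", "loss")] * 26
         + [("o4-mini-2025-04-16-medium", "tie")] * 5)
rates = win_rates(votes)  # {'o4-mini-2025-04-16-medium': (0.527..., 55)}
```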
We were surprised by the performance of GPT-5 on Showdown, which represents a significant departure from the model's high ranking on capability benchmarks and its reception by users. Our analysis consistently shows that users disprefer GPT-5's reasoning variants, with performance degrading as reasoning effort increases. We corroborated this finding in a separate, adversarial setting, suggesting this disparity is not an artifact of Showdown’s in-flow evaluation setting. While we recognize that GPT-5 was optimized for API-driven, agentic use cases, our results demonstrate empirically that performance on capability benchmarks does not directly translate to user preferences.
We share these early findings to foster an open discussion with the community. We will continue to share our findings from Showdown in subsequent research blog posts.