
Showdown Leaderboard - LLMs


Real people. Real conversations. Real rankings.

Showdown ranks AI models based on how they perform in real-world use, not in synthetic tests or lab settings. Votes are blind, optional, and organic, so rankings reflect authentic preferences.

Methodology & Technical Report →
Prompts: real conversation prompts compared across models through pairwise votes.
Users: from 80+ countries and 70+ languages, spanning all backgrounds and professions.
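
The scores on the board below are Elo-style ratings fit from these pairwise votes. As a rough illustration of the idea (not Showdown's actual procedure, which the methodology report documents), here is a minimal online Elo update in Python; the K factor, the 1000-point starting value, and the (model_a, model_b, winner) battle format are all assumptions made for this sketch:

```python
from collections import defaultdict

K = 4  # update step size; assumed for illustration

def expected(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def rate(battles):
    """battles: list of (model_a, model_b, winner) tuples; winner is a model name or 'tie'."""
    ratings = defaultdict(lambda: 1000.0)  # assumed starting rating
    for a, b, winner in battles:
        e_a = expected(ratings[a], ratings[b])
        s_a = 1.0 if winner == a else 0.5 if winner == "tie" else 0.0
        ratings[a] += K * (s_a - e_a)  # winner gains what the loser gives up
        ratings[b] += K * (e_a - s_a)  # zero-sum update
    return dict(ratings)

# Three hypothetical blind votes between two models from the board:
print(rate([
    ("gpt-5.2-chat-latest", "gemini-3-flash", "gpt-5.2-chat-latest"),
    ("gpt-5.2-chat-latest", "gemini-3-flash", "tie"),
    ("gpt-5.2-chat-latest", "gemini-3-flash", "gemini-3-flash"),
]))
```

One model below (llama4-maverick-instruct-basic) sits at exactly 1000.00, which is consistent with ratings being anchored to a fixed baseline, a common choice for arena-style leaderboards; such systems also often fit a Bradley-Terry model over all votes at once rather than running an online update like this one.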

SEAL Leaderboard - LLMs

| Rank | Model | Votes | Score | CI (-/+) |
|------|-------|-------|-------|----------|
| 1 | gpt-5.2-chat-latest | 6986 | 1148.46 | -6.71 / +4.36 |
| 2 | gemini-3-flash | 6194 | 1135.60 | -6.46 / +4.53 |
| 2 | claude-opus-4-5-20251101 (Thinking) | 5384 | 1132.04 | -4.67 / +4.85 |
| 2 | claude-opus-4-5-20251101 | 8062 | 1130.41 | -4.27 / +4.95 |
| 2 | gemini-2.5-pro | 12231 | 1126.17 | -4.85 / +5.18 |
| 2 | gemini-3-pro-preview | 9236 | 1124.16 | -4.67 / +6.22 |
| 5 | claude-sonnet-4-5-20250929 | 13120 | 1117.21 | -3.34 / +4.99 |
| 7 | claude-sonnet-4-5-20250929 (Thinking) | 13081 | 1110.94 | -3.68 / +4.43 |
| 8 | gpt-5-chat | 11529 | 1105.83 | -2.97 / +5.18 |
| 8 | qwen3-235b-a22b-2507-v1 | 12602 | 1105.38 | -4.08 / +4.43 |
| 9 | gpt-5.1-2025-11-13-medium | 7953 | 1097.82 | -5.37 / +5.06 |
| 11 | gpt-5.2-2025-12-11-medium | 6799 | 1091.55 | -6.69 / +5.63 |
| 11 | kimi-k2-thinking | 9487 | 1088.96 | -4.40 / +3.96 |
| 12 | claude-opus-4-1-20250805 | 13895 | 1085.05 | -3.06 / +2.71 |
| 14 | deepseek-v3p2 | 8124 | 1077.81 | -5.32 / +4.76 |
| 14 | claude-opus-4-1-20250805 (Thinking) | 12273 | 1077.13 | -2.92 / +5.63 |
| 15 | claude-haiku-4-5-20251001 | 5842 | 1072.32 | -5.14 / +6.27 |
| 16 | claude-sonnet-4-20250514 | 21758 | 1070.18 | -2.88 / +3.35 |
| 16 | gemini-2.5-flash | 11787 | 1069.24 | -5.33 / +3.93 |
| 16 | claude-haiku-4-5-20251001 (Thinking) | 5734 | 1068.32 | -6.10 / +5.75 |
| 19 | claude-sonnet-4-20250514 (Thinking) | 13836 | 1061.98 | -3.87 / +3.65 |
| 22 | deepseek-r1-0528 | 10901 | 1048.97 | -4.66 / +4.06 |
| 22 | o3-2025-04-16-medium* | 21650 | 1043.71 | -3.34 / +2.78 |
| 24 | gpt-5-2025-08-07-medium* | 16606 | 1012.51 | -3.79 / +4.05 |
| 25 | llama4-maverick-instruct-basic | 12587 | 1000.00 | -3.87 / +4.48 |
| 25 | o4-mini-2025-04-16-medium* | 22231 | 998.21 | -2.48 / +3.58 |
* This model’s API does not consistently return Markdown-formatted responses. Since raw outputs are used in head-to-head comparisons, this may affect its ranking.
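
The -x / +y values beside each score read as confidence bounds on the rating, and the repeated ranks (e.g., five models sharing rank 2) are consistent with rank ties being assigned when those intervals overlap. A common way to produce such bounds on arena leaderboards (plausibly what these columns are, though the methodology report is the authority) is to bootstrap the battle set: resample battles with replacement, refit the ratings on each resample, and report percentile offsets from the point estimate. A sketch reusing the hypothetical rate() function above:

```python
import random

def bootstrap_ci(battles, fit, model, n_boot=200, alpha=0.05):
    """fit: callable mapping a battle list to {model: rating}, e.g. rate() above."""
    point = fit(battles)[model]
    samples = []
    for _ in range(n_boot):
        resample = random.choices(battles, k=len(battles))  # sample with replacement
        samples.append(fit(resample).get(model, point))
    samples.sort()
    lo = samples[int((alpha / 2) * n_boot)]
    hi = samples[int((1 - alpha / 2) * n_boot) - 1]
    return lo - point, hi - point  # offsets, reported like the "-2.97 / +5.18" columns
```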

Performance Comparison Across Language Models

[Interactive charts: Win Rate vs. Each Model · Battle Count vs. Each Model · Confidence Intervals · Average Win Rate · Prompt Distribution]
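
The first two panels aggregate the same battle records into per-pair statistics: how often each model beats each other model, and how many times each pair has been compared. A minimal tally under the same hypothetical record format assumed earlier:

```python
from collections import Counter

def pairwise_tables(battles):
    """Return ({(a, b): win rate of a over b}, {(a, b): battle count})."""
    wins, counts = Counter(), Counter()
    for a, b, winner in battles:
        counts[(a, b)] += 1
        counts[(b, a)] += 1
        if winner == a:
            wins[(a, b)] += 1
        elif winner == b:
            wins[(b, a)] += 1
        # ties count as battles for both orderings but as wins for neither
    win_rate = {pair: wins[pair] / counts[pair] for pair in counts}
    return win_rate, counts
```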

Copyright 2025 Scale Inc. All rights reserved.
