Showdown Leaderboard - LLMs

Real people. Real conversations. Real rankings.

Showdown ranks AI models based on how they perform in real-world use, not synthetic tests or lab settings. Votes are blind, optional, and organic, so rankings reflect authentic preferences.

Methodology & Technical Report
Prompts: real conversation prompts compared across models through pairwise votes.
Users: from 80+ countries and 70+ languages, spanning all backgrounds and professions.

SEAL Leaderboard - LLMs

RANK  MODEL                                 VOTES  SCORE
1     gpt-5-chat                             8227  1106.65 (-5.61 / +5.12)
1     claude-opus-4-1-20250805              10748  1105.29 (-5.12 / +3.94)
3     claude-sonnet-4-20250514              12352  1084.08 (-4.29 / +3.89)
3     claude-opus-4-20250514                10965  1077.62 (-3.84 / +4.38)
5     claude-opus-4-1-20250805 (Thinking)    9527  1069.21 (-6.27 / +3.84)
5     gpt-4.1-2025-04-14                    12670  1065.49 (-3.98 / +3.47)
7     gemini-2.5-pro-preview-06-05          10874  1047.16 (-4.17 / +5.40)
7     claude-opus-4-20250514 (Thinking)     10707  1046.89 (-5.18 / +4.29)
7     claude-sonnet-4-20250514 (Thinking)   12457  1043.08 (-3.75 / +3.37)
10    o3-2025-04-16-medium*                 14607  1021.44 (-3.67 / +4.40)
10    gemini-2.5-flash-preview-05-20        13134  1019.13 (-5.00 / +4.40)
12    llama4-maverick-instruct-basic        13413  1000.00 (-4.95 / +3.82)
13    o4-mini-2025-04-16-medium*            13727   989.19 (-3.87 / +4.20)

* This model’s API does not consistently return Markdown-formatted responses. Since raw outputs are used in head-to-head comparisons, this may affect its ranking.
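
The tied ranks in the table (1, 1, 3, 3, 5, ...) are consistent with a simple interval-based rule: a model's rank is one plus the number of models whose lower bound lies above that model's upper bound. The page does not state this rule, so the sketch below is an assumption that happens to reproduce the published RANK column from the SCORE values and their -/+ offsets. The names `Entry` and `interval_ranks` are hypothetical.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    score: float
    minus: float  # offset from the score down to the lower bound
    plus: float   # offset from the score up to the upper bound


def interval_ranks(entries):
    """Assumed rule: rank = 1 + number of models whose lower bound
    is strictly above this model's upper bound."""
    ranks = {}
    for e in entries:
        upper = e.score + e.plus
        clearly_better = sum(1 for o in entries if o.score - o.minus > upper)
        ranks[e.model] = 1 + clearly_better
    return ranks


# Scores and offsets copied from the leaderboard table above.
ROWS = [
    Entry("gpt-5-chat", 1106.65, 5.61, 5.12),
    Entry("claude-opus-4-1-20250805", 1105.29, 5.12, 3.94),
    Entry("claude-sonnet-4-20250514", 1084.08, 4.29, 3.89),
    Entry("claude-opus-4-20250514", 1077.62, 3.84, 4.38),
    Entry("claude-opus-4-1-20250805 (Thinking)", 1069.21, 6.27, 3.84),
    Entry("gpt-4.1-2025-04-14", 1065.49, 3.98, 3.47),
    Entry("gemini-2.5-pro-preview-06-05", 1047.16, 4.17, 5.40),
    Entry("claude-opus-4-20250514 (Thinking)", 1046.89, 5.18, 4.29),
    Entry("claude-sonnet-4-20250514 (Thinking)", 1043.08, 3.75, 3.37),
    Entry("o3-2025-04-16-medium", 1021.44, 3.67, 4.40),
    Entry("gemini-2.5-flash-preview-05-20", 1019.13, 5.00, 4.40),
    Entry("llama4-maverick-instruct-basic", 1000.00, 4.95, 3.82),
    Entry("o4-mini-2025-04-16-medium", 989.19, 3.87, 4.20),
]

if __name__ == "__main__":
    for model, rank in interval_ranks(ROWS).items():
        print(f"{rank:>2}  {model}")
```

Under this rule two models share a rank whenever their intervals overlap enough that neither is clearly ahead, which is why gpt-5-chat and claude-opus-4-1-20250805 both appear at rank 1.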

Performance Comparison Across Language Models

[Charts: Win Rate vs. Each Model; Battle Count vs. Each Model; Confidence Intervals; Average Win Rate; Prompt Distribution]
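
As a rough illustration of how per-opponent win rates and battle counts can be tallied from pairwise votes, here is a minimal sketch. It assumes each vote is recorded as a `(winner, loser)` pair and ignores ties or skipped votes; the page does not describe Showdown's actual data format or computation. The function name `pairwise_stats` and the model names in the toy data are hypothetical.

```python
from collections import defaultdict


def pairwise_stats(votes):
    """Tally win rate and battle count for every ordered (model, opponent) pair.

    votes: iterable of (winner, loser) tuples (assumed format, no ties).
    """
    wins = defaultdict(int)
    battles = defaultdict(int)
    for winner, loser in votes:
        wins[(winner, loser)] += 1
        # A battle counts for both orderings of the pair.
        battles[(winner, loser)] += 1
        battles[(loser, winner)] += 1
    win_rate = {pair: wins[pair] / n for pair, n in battles.items()}
    return win_rate, dict(battles)


# Toy votes, made up purely for illustration.
TOY_VOTES = [
    ("model-a", "model-b"),
    ("model-a", "model-b"),
    ("model-b", "model-a"),
    ("model-a", "model-c"),
]

if __name__ == "__main__":
    rates, counts = pairwise_stats(TOY_VOTES)
    for pair in sorted(rates):
        print(pair, f"win rate {rates[pair]:.2f} over {counts[pair]} battles")
```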