Leaderboards

Date: 2024-05-13

Scale’s proprietary, private evaluation datasets can’t be gamed, ensuring unbiased and uncontaminated results.

Evolving Competition

We periodically update leaderboards with new datasets and models, fostering a dynamic, contest-like environment.

Expert Evaluations

Our evaluations are performed by thoroughly vetted experts using domain-specific methodologies, ensuring the highest quality and credibility.

Coding

| Rank | Model | Score | 95% Confidence |
|---|---|---|---|
| 1 | Claude 3.5 Sonnet* | 1165 | +33/-33 |
| 2 | GPT-4 Turbo Preview | 1131 | +24/-27 |
| 3 | GPT-4o | 1125 | +28/-27 |
| 4 | Llama 3.1 405B Instruct | 1123 | +33/-30 |
| 5 | Gemini 1.5 Pro (May 2024) | 1085 | +28/-28 |
| 6 | Claude 3 Opus | 1045 | +24/-24 |
| 7 | Gemini 1.5 Flash Preview | 1024 | +29/-25 |
| 7 | Gemini 1.5 Pro (April 2024) | 984 | +29/-27 |
| 9 | Claude 3 Sonnet | 970 | +27/-28 |
| 10 | Llama 3 70B Instruct | 967 | +22/-23 |
| 11 | Mistral Large | 908 | +24/-25 |
| 12 | Gemini 1.0 Pro | 781 | +29/-29 |
| 13 | CodeLlama 34B Instruct | 690 | +33/-35 |

*Potential contamination warning: Claude 3.5 Sonnet was evaluated six weeks after the Claude 3 models, which could have allowed Anthropic to access the prompt set through API logs. However, Anthropic's policy states that it does not train on this data.
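When reading the Coding scores, it can help to check whether two models' 95% confidence intervals overlap before treating a rank difference as meaningful. The sketch below does this for the top five Coding entries. It assumes that a value such as "+33/-33" means the interval runs from score minus 33 to score plus 33. The names `Entry` and `intervals_overlap` are illustrative and not part of any Scale tooling. Interval overlap is only a rough heuristic, not a formal significance test.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    model: str
    score: float
    plus: float   # distance from the score to the upper bound (assumed meaning of "+x")
    minus: float  # distance from the score to the lower bound (assumed meaning of "-y")

    @property
    def low(self) -> float:
        return self.score - self.minus

    @property
    def high(self) -> float:
        return self.score + self.plus


def intervals_overlap(a: Entry, b: Entry) -> bool:
    """Return True if the two closed intervals share at least one point."""
    return a.low <= b.high and b.low <= a.high


# Top five rows of the Coding leaderboard, copied from the table above.
coding_top = [
    Entry("Claude 3.5 Sonnet", 1165, 33, 33),
    Entry("GPT-4 Turbo Preview", 1131, 24, 27),
    Entry("GPT-4o", 1125, 28, 27),
    Entry("Llama 3.1 405B Instruct", 1123, 33, 30),
    Entry("Gemini 1.5 Pro (May 2024)", 1085, 28, 28),
]

leader = coding_top[0]
# Models whose interval overlaps the leader's interval.
not_separated = [e.model for e in coding_top[1:] if intervals_overlap(leader, e)]
print(not_separated)
```

Under these assumptions, the leader's interval (1132 to 1198) overlaps those of the models ranked 2 through 4 but not that of Gemini 1.5 Pro (May 2024), whose upper bound is 1113.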

Instruction Following

| Rank | Model | Score | 95% Confidence |
|---|---|---|---|
| 1 | Llama 3.1 405B Instruct | 90.35 | +1.53/-1.52 |
| 2 | Claude 3.5 Sonnet* | 90.17 | +1.54/-1.54 |
| 3 | GPT-4o | 88.77 | +1.37/-1.37 |
| 4 | GPT-4 Turbo Preview | 87.95 | +1.30/-1.30 |
| 5 | Llama 3 70B Instruct | 85.65 | +1.67/-1.67 |
| 6 | Gemini 1.5 Pro (May 2024) | 85.20 | +1.67/-1.70 |
| 7 | Mistral Large | 85.11 | +1.52/-1.52 |
| 8 | Claude 3 Opus | 84.78 | +1.52/-1.52 |
| 9 | Claude 3 Sonnet | 83.26 | +1.78/-1.79 |
| 10 | Gemini 1.5 Pro (April 2024) | 82.90 | +2.02/-2.02 |
| 11 | Gemini 1.5 Flash Preview | 82.34 | +1.79/-1.80 |
| 12 | Gemini 1.0 Pro | 73.65 | +2.32/-2.32 |
| 13 | CodeLlama 34B Instruct | 66.78 | +2.16/-2.15 |

Math

| Rank | Model | Score | 95% Confidence |
|---|---|---|---|
| 1 | Claude 3.5 Sonnet* | 96.60 | +1.02/-1.02 |
| 2 | Llama 3.1 405B Instruct | 95.60 | +1.16/-1.16 |
| 3 | Claude 3 Opus | 95.19 | +1.21/-1.21 |
| 4 | GPT-4 Turbo Preview | 95.10 | +1.22/-1.21 |
| 5 | GPT-4o | 94.85 | +1.25/-1.24 |
| 6 | Claude 3 Sonnet | 93.28 | +1.41/-1.42 |
| 7 | Gemini 1.5 Pro (May 2024) | 92.28 | +1.51/-1.50 |
| 8 | Gemini 1.5 Pro (April 2024) | 90.54 | +1.65/-1.65 |
| 9 | Llama 3 70B Instruct | 90.12 | +1.69/-1.68 |
| 9 | Gemini 1.5 Flash Preview | 90.12 | +1.69/-1.68 |
| 11 | Mistral Large | 87.47 | +1.87/-1.87 |
| 12 | Gemini 1.0 Pro | 79.83 | +2.27/-2.26 |
| 13 | CodeLlama 34B Instruct | 37.51 | +2.73/-2.73 |

Spanish

| Rank | Model | Score | 95% Confidence |
|---|---|---|---|
| 1 | GPT-4o | 1139 | +26/-26 |
| 2 | Gemini 1.5 Pro (May 2024) | 1127 | +28/-24 |
| 3 | GPT-4 Turbo Preview | 1090 | +24/-22 |
| 4 | Gemini 1.5 Pro (April 2024) | 1058 | +27/-27 |
| 5 | Gemini 1.5 Flash Preview | 1029 | +28/-27 |
| 6 | Llama 3.1 405B Instruct | 965 | +31/-33 |
| 7 | Claude 3 Opus | 957 | +23/-24 |
| 8 | Llama 3 70B Instruct | 937 | +27/-23 |
| 9 | Mistral Large | 901 | +24/-24 |
| 10 | Claude 3 Sonnet | 899 | +25/-26 |
| 11 | Gemini 1.0 Pro | 898 | +26/-25 |

