Leaderboards
Date: 2024-05-13
Expert-Driven Private Evaluations
Private Datasets
Scale’s proprietary, private evaluation datasets can’t be gamed, ensuring unbiased and uncontaminated results.
Evolving Competition
We periodically update leaderboards with new datasets and models, fostering a dynamic, contest-like environment.
Expert Evaluations
Our evaluations are performed by thoroughly vetted experts using domain-specific methodologies, ensuring the highest quality and credibility.
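Coding
Model | Score | 95% Confidence
(name missing) | 1165 | +33/-33
(name missing) | 1131 | +24/-27
GPT-4o (3rd) | 1125 | +28/-27
(name missing) | 1123 | +33/-30
(name missing) | 1085 | +28/-28
(name missing) | 1045 | +24/-24
(name missing) | 1024 | +29/-25
(name missing) | 984 | +29/-27
(name missing) | 970 | +27/-28
(name missing) | 967 | +22/-23
(name missing) | 908 | +24/-25
(name missing) | 781 | +29/-29
(name missing) | 690 | +33/-35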
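The 95% confidence column can be turned into explicit score ranges to compare neighboring rows. The sketch below is illustrative only: it assumes that a value such as +33/-33 gives the upper and lower offsets from the score, and the `Entry` type and `intervals_overlap` helper are hypothetical names, not part of Scale's methodology. It uses three rows from the Coding table.

```python
from typing import NamedTuple, Tuple


class Entry(NamedTuple):
    """One leaderboard row (hypothetical helper type, not Scale's API)."""
    score: float
    plus: float   # assumed: upper offset of the 95% interval
    minus: float  # assumed: lower offset of the 95% interval

    def interval(self) -> Tuple[float, float]:
        # Convert "score +plus/-minus" into (low, high) bounds.
        return (self.score - self.minus, self.score + self.plus)


def intervals_overlap(a: Entry, b: Entry) -> bool:
    """Return True when the two 95% intervals share at least one value."""
    a_lo, a_hi = a.interval()
    b_lo, b_hi = b.interval()
    return a_lo <= b_hi and b_lo <= a_hi


# Rows copied from the Coding table: first, second, and last place.
first = Entry(1165, 33, 33)
second = Entry(1131, 24, 27)
last = Entry(690, 33, 35)

print(first.interval())                  # (1132, 1198)
print(second.interval())                 # (1104, 1155)
print(intervals_overlap(first, second))  # True
print(intervals_overlap(first, last))    # False
```

Overlapping intervals, as with the top two Coding rows here, suggest that the ordering of those rows may not be statistically meaningful. Overlap is only a rough heuristic and does not replace a proper significance test.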
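Instruction Following
Model | Score | 95% Confidence
(name missing) | 90.35 | +1.53/-1.52
(name missing) | 90.17 | +1.54/-1.54
GPT-4o (3rd) | 88.77 | +1.37/-1.37
(name missing) | 87.95 | +1.30/-1.30
(name missing) | 85.65 | +1.67/-1.67
(name missing) | 85.20 | +1.67/-1.70
(name missing) | 85.11 | +1.52/-1.52
(name missing) | 84.78 | +1.52/-1.52
(name missing) | 83.26 | +1.78/-1.79
(name missing) | 82.90 | +2.02/-2.02
(name missing) | 82.34 | +1.79/-1.80
(name missing) | 73.65 | +2.32/-2.32
(name missing) | 66.78 | +2.16/-2.15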
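Math
Model | Score | 95% Confidence
(name missing) | 96.60 | +1.02/-1.02
(name missing) | 95.60 | +1.16/-1.16
(name missing) | 95.19 | +1.21/-1.21
(name missing) | 95.10 | +1.22/-1.21
(name missing) | 94.85 | +1.25/-1.24
(name missing) | 93.28 | +1.41/-1.42
(name missing) | 92.28 | +1.51/-1.50
(name missing) | 90.54 | +1.65/-1.65
(name missing) | 90.12 | +1.69/-1.68
(name missing) | 90.12 | +1.69/-1.68
(name missing) | 87.47 | +1.87/-1.87
(name missing) | 79.83 | +2.27/-2.26
(name missing) | 37.51 | +2.73/-2.73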
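Spanish
Model | Score | 95% Confidence
GPT-4o (1st) | 1139 | +26/-26
(name missing) | 1127 | +28/-24
(name missing) | 1090 | +24/-22
(name missing) | 1058 | +27/-27
(name missing) | 1029 | +28/-27
(name missing) | 965 | +31/-33
(name missing) | 957 | +23/-24
(name missing) | 937 | +27/-23
(name missing) | 901 | +24/-24
(name missing) | 899 | +25/-26
(name missing) | 898 | +26/-25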
If you’d like to add your model to this leaderboard or a future version, please contact seal@scale.com. To ensure leaderboard integrity, a model can be featured only the FIRST TIME its organization encounters the prompts.