Leaderboards

Private Datasets

Scale’s proprietary evaluation datasets are kept private, so models cannot be trained on or tuned to them, which keeps results unbiased and uncontaminated.

Evolving Competition

We periodically update leaderboards with new datasets and models, fostering a dynamic, contest-like environment.

Expert Evaluations

Our evaluations are performed by thoroughly vetted experts using domain-specific methodologies, ensuring high quality and credibility.

Coding

Model    Score    95% Confidence
         1155     +21/-24
         1144     +31/-32
         1112     +27/-32
         1071     +28/-25
         1057     +25/-30
         1010     +25/-26
         1010     +22/-24
         997      +26/-25
         930      +23/-28
         804      +38/-29
         716      +24/-30
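
Each Coding entry pairs a score with an asymmetric 95% confidence interval. The sketch below shows one way to read those numbers. It assumes that "+x/-y" means the interval runs from score - y to score + x, which the page does not state explicitly. The helper names `interval` and `overlaps` are hypothetical, and an overlap check is only a rough reading aid, not a formal significance test or Scale's own method.

```python
def interval(score, plus, minus):
    """Return (low, high) for a score reported as 'score +plus/-minus'.

    Assumption: the bounds are score - minus and score + plus.
    """
    return (score - minus, score + plus)


def overlaps(a, b):
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Top two and bottom two Coding rows: (score, plus, minus)
first = interval(1155, 21, 24)        # (1131, 1176)
second = interval(1144, 31, 32)       # (1112, 1175)
second_last = interval(804, 38, 29)   # (775, 842)
last = interval(716, 24, 30)          # (686, 740)

print(overlaps(first, second))        # True: the top two are not clearly separated
print(overlaps(second_last, last))    # False: the bottom two are clearly separated
```

Under these assumptions, a gap between adjacent scores matters only when it exceeds the combined interval widths, so the 11-point gap at the top is within noise while the 88-point gap at the bottom is not.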
+24/-30

Math

Model    Score    95% Confidence
         95.19    +1.21/-1.21
         95.10    +1.22/-1.21
         94.85    +1.25/-1.24
         93.28    +1.41/-1.42
         92.28    +1.51/-1.50
         90.54    +1.65/-1.65
         90.12    +1.69/-1.68
         90.12    +1.69/-1.68
         87.47    +1.87/-1.87
         79.83    +2.27/-2.26
         37.51    +2.73/-2.73

Instruction Following

Model    Score    95% Confidence
         88.57    +1.53/-1.56
         87.64    +1.46/-1.44
         85.55    +1.85/-1.85
         85.34    +1.66/-1.64
         84.82    +1.68/-1.72
         84.51    +1.89/-1.91
         83.18    +1.82/-1.78
         82.90    +2.00/-2.00
         82.36    +2.04/-1.96
         73.57    +2.32/-2.37
         66.77    +2.12/-2.17

Spanish

Model    Score    95% Confidence
         1139     +36/-28
         1129     +25/-25
         1088     +28/-32
         1054     +25/-25
         1023     +32/-23
         941      +26/-26
         934      +25/-25
         896      +19/-33
         896      +25/-24
         895      +25/-23

If you’d like to add your model to this leaderboard or a future version, please contact seal@scale.com. To ensure leaderboard integrity, a model can be featured only the FIRST TIME its organization encounters the prompts.