Rank  Model                             Score  95% Confidence Interval
1     Claude 3.5 Sonnet (June 2024)     96.60  +1.02 / -1.02
2     GPT-4o (August 2024)              95.68  +1.15 / -1.15
3     Llama 3.1 405B Instruct           95.60  +1.16 / -1.16
4     Claude 3 Opus                     95.19  +1.21 / -1.21
5     GPT-4 Turbo Preview               95.10  +1.22 / -1.22
6     GPT-4o (May 2024)                 94.85  +1.25 / -1.25
7     Gemini 1.5 Pro (August 27, 2024)  94.69  +1.27 / -1.27
8     Mistral Large 2                   93.94  +1.35 / -1.35
9     Claude 3 Sonnet                   93.28  +1.41 / -1.41
10    Gemini 1.5 Pro (May 2024)         92.28  +1.51 / -1.51
11    Gemini 1.5 Pro (April 2024)       90.54  +1.65 / -1.65
12    Llama 3 70B Instruct              90.12  +1.69 / -1.69
13    Gemini 1.5 Flash                  90.12  +1.69 / -1.69
14    Mistral Large                     87.47  +1.87 / -1.87
15    Gemini 1.0 Pro                    79.83  +2.27 / -2.27
16    CodeLlama 34B Instruct            37.51  +2.73 / -2.73
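
Many adjacent scores in the table sit within each other's confidence ranges, so small rank differences may not be meaningful. The following is a minimal sketch of that arithmetic, assuming the "+x / -x" column is a symmetric half-width around the score. The function names (`interval`, `intervals_overlap`, `adjacent_overlaps`) are hypothetical, only a subset of rows is copied in, and overlapping intervals are a rough heuristic rather than a formal significance test.

```python
# Scores and 95% half-widths copied from a subset of the table rows.
# Assumption: each "+x / -x" entry is a symmetric half-width around the score.
ROWS = [
    ("Claude 3.5 Sonnet (June 2024)", 96.60, 1.02),
    ("GPT-4o (August 2024)", 95.68, 1.15),
    ("Llama 3.1 405B Instruct", 95.60, 1.16),
    ("Mistral Large", 87.47, 1.87),
    ("Gemini 1.0 Pro", 79.83, 2.27),
    ("CodeLlama 34B Instruct", 37.51, 2.73),
]


def interval(score, half_width):
    """Return the (low, high) bounds implied by a score and its half-width."""
    return (score - half_width, score + half_width)


def intervals_overlap(row_a, row_b):
    """True if the two rows' confidence intervals share at least one point."""
    low_a, high_a = interval(row_a[1], row_a[2])
    low_b, high_b = interval(row_b[1], row_b[2])
    return low_a <= high_b and low_b <= high_a


def adjacent_overlaps(rows):
    """Compare each row with the next one in the list."""
    return [
        (a[0], b[0], intervals_overlap(a, b))
        for a, b in zip(rows, rows[1:])
    ]


if __name__ == "__main__":
    for name_a, name_b, overlap in adjacent_overlaps(ROWS):
        verdict = "overlap" if overlap else "do not overlap"
        print(f"{name_a} vs {name_b}: intervals {verdict}")
```

Under this reading, the top three entries overlap one another, while the gaps below Mistral Large in this subset are larger than the stated ranges.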