Frontier Leaderboards
Instruction Following
Deprecated (as of January 2025)
Last updated: March 20, 2025
Performance Comparison

Rank  Model                                           Score ±95% CI
1     o1 (December 2024)                              91.96 ±1.60
2     DeepSeek R1                                     87.75 ±1.91
3     o1-preview                                      86.58 ±1.58
4     Gemini 2.0 Flash Experimental (December 2024)   86.58 ±1.83
5     Claude 3.5 Sonnet (June 2024)                   85.96 ±1.39
6     GPT-4o (May 2024)                               85.29 ±1.42
7     Llama 3.1 405B Instruct                         84.85 ±1.40
8     Gemini 1.5 Pro (August 27, 2024)                84.17 ±1.65
9     GPT-4 Turbo Preview                             83.19 ±1.31
10    Mistral Large 2                                 82.81 ±1.66
11    GPT-4o (November 2024)                          82.52 ±2.10
12    DeepSeek V3                                     82.34 ±2.08
13    Llama 3.2 90B Vision Instruct                   82.07 ±1.74
14    Llama 3 70B Instruct                            81.17 ±1.77
15    GPT-4o (August 2024)                            80.17 ±1.70
16    Claude 3 Opus                                   80.12 ±1.54
17    Mistral Large                                   79.89 ±1.67
18    GPT-4 (November 2024)                           79.50 ±1.92
19    Gemini 1.5 Pro (May 2024)                       79.37 ±1.70