Frontier Leaderboards
© 2025 Scale AI. All rights reserved.
Instruction Following
Deprecated (as of January 2025)
Last updated: March 20, 2025
Performance Comparison

Rank  Model                                          Score
1     o1 (December 2024)                             91.96 ± 1.60
2     DeepSeek R1                                    87.75 ± 1.91
3     o1-preview                                     86.58 ± 1.58
4     Gemini 2.0 Flash Experimental (December 2024)  86.58 ± 1.83
5     Claude 3.5 Sonnet (June 2024)                  85.96 ± 1.39
6     GPT-4o (May 2024)                              85.29 ± 1.42
7     Llama 3.1 405B Instruct                        84.85 ± 1.40
8     Gemini 1.5 Pro (August 27, 2024)               84.17 ± 1.65
9     GPT-4 Turbo Preview                            83.19 ± 1.31
10    Mistral Large 2                                82.81 ± 1.66
11    GPT-4o (November 2024)                         82.52 ± 2.10
12    DeepSeek V3                                    82.34 ± 2.08
13    Llama 3.2 90B Vision Instruct                  82.07 ± 1.74
14    Llama 3 70B Instruct                           81.17 ± 1.77
15    GPT-4o (August 2024)                           80.17 ± 1.70
16    Claude 3 Opus                                  80.12 ± 1.54
17    Mistral Large                                  79.89 ± 1.67
18    GPT-4 (November 2024)                          79.50 ± 1.92
19    Gemini 1.5 Pro (May 2024)                      79.37 ± 1.70