Instruction Following

Deprecated (as of January 2025)

Last updated: March 20, 2025

Performance Comparison

Rank  Model                                          Score
1     o1 (December 2024)                             91.96 ±1.60
2     DeepSeek R1                                    87.75 ±1.91
3     o1-preview                                     86.58 ±1.58
4     Gemini 2.0 Flash Experimental (December 2024)  86.58 ±1.83
5     Claude 3.5 Sonnet (June 2024)                  85.96 ±1.39
6     GPT-4o (May 2024)                              85.29 ±1.42
7     Llama 3.1 405B Instruct                        84.85 ±1.40
8     Gemini 1.5 Pro (August 27, 2024)               84.17 ±1.65
9     GPT-4 Turbo Preview                            83.19 ±1.31
10    Mistral Large 2                                82.81 ±1.66
11    GPT-4o (November 2024)                         82.52 ±2.10
12    DeepSeek V3                                    82.34 ±2.08
13    Llama 3.2 90B Vision Instruct                  82.07 ±1.74
14    Llama 3 70B Instruct                           81.17 ±1.77
15    GPT-4o (August 2024)                           80.17 ±1.70
16    Claude 3 Opus                                  80.12 ±1.54
17    Mistral Large                                  79.89 ±1.67
18    GPT-4 (November 2024)                          79.50 ±1.92
19    Gemini 1.5 Pro (May 2024)                      79.37 ±1.70
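
Many adjacent scores above differ by less than their ± values, so the rank order alone can overstate the gaps. The following is a minimal sketch of how one might check which neighboring entries have overlapping ranges. It assumes each ± value is the half-width of a symmetric interval around the score, which the page does not state explicitly. The helper names (`interval`, `overlaps`, `overlapping_neighbors`) are hypothetical, and only the top five rows are copied in for brevity.

```python
# Top five rows copied from the leaderboard: (model, score, plus_minus).
# Assumption: plus_minus is the half-width of a symmetric interval
# around the score; the source page does not define it.
rows = [
    ("o1 (December 2024)", 91.96, 1.60),
    ("DeepSeek R1", 87.75, 1.91),
    ("o1-preview", 86.58, 1.58),
    ("Gemini 2.0 Flash Experimental (December 2024)", 86.58, 1.83),
    ("Claude 3.5 Sonnet (June 2024)", 85.96, 1.39),
]


def interval(score, half_width):
    """Return the (low, high) range implied by score ± half_width."""
    return (score - half_width, score + half_width)


def overlaps(a, b):
    """True if the ranges of two leaderboard rows intersect."""
    lo_a, hi_a = interval(a[1], a[2])
    lo_b, hi_b = interval(b[1], b[2])
    return lo_a <= hi_b and lo_b <= hi_a


def overlapping_neighbors(entries):
    """List (higher-ranked, lower-ranked) name pairs whose ranges intersect."""
    return [(a[0], b[0]) for a, b in zip(entries, entries[1:]) if overlaps(a, b)]


if __name__ == "__main__":
    for upper, lower in overlapping_neighbors(rows):
        print(f"{upper} and {lower} have overlapping ranges")
```

Under that assumption, the only adjacent pair among these five rows whose ranges do not overlap is o1 (December 2024) and DeepSeek R1. Overlapping ranges are a rough heuristic rather than a formal significance test.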