MASK

Last updated: April 10, 2025

Performance Comparison

Rank  Model                                           Score
1     Claude Sonnet 4 (Thinking) (NEW)                95.33±2.29
2                                                     89.27±2.01
2     Claude Opus 4 (Thinking) (NEW)                  87.87±3.76
3                                                     84.47±2.35
3                                                     82.60±2.77
4     Claude 3.7 Sonnet (Thinking) (February 2025)    82.13±1.25
5                                                     80.28±0.62
6     Claude 3 Opus                                   79.00±1.31
6                                                     78.60±2.28
10                                                    72.93±2.25
10    Claude 3.5 Sonnet (October 2024)                72.33±2.45
10    Claude 3.7 Sonnet (February 2025)               72.27±3.31
13    o1-Pro                                          61.60±0.86
13    Llama 3.1 405B Instruct                         61.40±1.99
13                                                    61.40±1.79
13    GPT 4o (November 2024)                          60.07±2.07
13    GPT 4.5 Preview                                 56.93±4.02
13                                                    56.40±4.98
14    o1 (December 2024)                              59.27±1.25
14    Deepseek R1 (Jan 2025)                          57.32±2.58
15    Gemini 2.5 Pro Experimental (March 2025)        55.93±3.49
18    Llama 3.2 90B Vision Instruct                   54.07±2.24
18                                                    53.07±4.45
18    Deepseek R1 (May 2025) (NEW)                    53.00±4.20
18    Llama 3.3 70B Instruct                          51.93±4.98
19    o3 mini (Low)                                   49.73±3.23
19                                                    49.13±4.28
21                                                    51.13±1.03
21                                                    50.00±2.17
23    Llama 4 Maverick                                49.73±1.60
23    Gemini 2.0 Flash Thinking (January 2025)        49.53±0.76
23    Gemini 2.0 Flash                                49.07±2.01
23    o3 mini (Medium)                                48.93±1.25
23    Gemini 2.0 Pro Experimental (February 2025)     48.67±2.29
24    Mistral Large 2411                              47.53±1.74
24    o3 mini (High)                                  46.80±2.58
34    Deepseek V3 (March 2025)                        44.53±1.74
34    Mistral Medium 3                                42.60±3.26