MASK

Last updated: April 10, 2025

Performance Comparison

Rank  Model                                         Score
1     Claude Sonnet 4 (Thinking)                    95.33±2.29
2     (name missing)                                89.27±2.01
2     Claude Opus 4 (Thinking)                      87.87±3.76
3     (name missing)                                84.47±2.35
3     (name missing)                                82.60±2.77
3     (name missing)                                82.50±1.60
4     Claude 3.7 Sonnet (Thinking) (February 2025)  82.13±1.25
6     (name missing)                                80.28±0.62
7     Claude 3 Opus                                 79.00±1.31
7     (name missing)                                78.60±2.28
11    (name missing)                                72.93±2.25
11    Claude 3.5 Sonnet (October 2024)              72.33±2.45
11    Claude 3.7 Sonnet (February 2025)             72.27±3.31
14    o1-Pro                                        61.60±0.86
14    Llama 3.1 405B Instruct                       61.40±1.99
14    (name missing)                                61.40±1.79
14    GPT 4o (November 2024)                        60.07±2.07
14    GPT 4.5 Preview                               56.93±4.02
14    (name missing)                                56.40±4.98
15    o1 (December 2024)                            59.27±1.25
15    DeepSeek R1 (Jan 2025)                        57.32±2.58
15    (name missing)                                55.67±4.51
16    (name missing)                                56.50±3.00
16    Gemini 2.5 Pro Experimental (March 2025)      55.93±3.49
19    Llama 3.2 90B Vision Instruct                 54.07±2.24
19    (name missing)                                53.07±4.45
19    DeepSeek-R1-0528                              53.00±4.20
19    Llama 3.3 70B Instruct                        51.93±4.98
21    o3 mini (Low)                                 49.73±3.23
21    (name missing)                                49.13±4.28
23    (name missing)                                51.13±1.03
23    (name missing)                                50.00±2.17
25    Llama 4 Maverick                              49.73±1.60
26    Gemini 2.0 Flash Thinking (January 2025)      49.53±0.76
26    Gemini 2.0 Flash                              49.07±2.01
26    o3 mini (Medium)                              48.93±1.25
26    Gemini 2.0 Pro Experimental (February 2025)   48.67±2.29
27    Mistral Large 2411                            47.53±1.74
27    o3 mini (High)                                46.80±2.58
37    DeepSeek V3 (March 2025)                      44.53±1.74
37    Mistral Medium 3                              42.60±3.26