MASK

Last updated: April 10, 2025

Performance Comparison

Rank  Model                                         Score
1     Claude 3.7 Sonnet (Thinking) (February 2025)  82.13 ±1.25
2     Claude 3 Opus                                 79.00 ±1.31
3     Claude 3.5 Sonnet (October 2024)              72.33 ±2.45
3     Claude 3.7 Sonnet (February 2025)             72.27 ±3.31
5     o1-Pro                                        61.60 ±0.86
5     Llama 3.1 405B Instruct                       61.40 ±1.99
5                                                   61.40 ±1.80
5     GPT 4o (November 2024)                        60.07 ±2.07
5     GPT 4.5 Preview                               56.93 ±4.02
6     o1 (December 2024)                            59.27 ±1.25
6     DeepSeek R1                                   57.32 ±2.58
7     Gemini 2.5 Pro Experimental (March 2025)      55.93 ±3.49
10    Llama 3.2 90B Vision Instruct                 54.07 ±2.24
10    Llama 3.3 70B Instruct                        51.93 ±4.98
11    o3 mini (Low)                                 49.73 ±3.23
13                                                  51.13 ±1.03
13                                                  50.00 ±2.20
14    Llama 4 Maverick                              49.73 ±1.60
14    Gemini 2.0 Flash Thinking (January 2025)      49.53 ±0.76
14    Gemini 2.0 Flash                              49.07 ±2.01
14    o3 mini (Medium)                              48.93 ±1.25
14    Gemini 2.0 Pro Experimental (February 2025)   48.67 ±2.29
14    Mistral Large 2411                            47.53 ±1.74
14    o3 mini (High)                                46.80 ±2.58
22    DeepSeek V3 (March 2025)                      44.53 ±1.74