MASK
Performance Comparison
Claude Sonnet 4 (Thinking)
95.33±2.29
89.27±2.01
Claude Opus 4 (Thinking)
87.87±3.76
84.47±2.35
82.60±2.77
Claude 3.7 Sonnet (Thinking) (February 2025)
82.13±1.25
80.28±0.62
Claude 3 Opus
79.00±1.31
78.60±2.28
72.93±2.25
Claude 3.5 Sonnet (October 2024)
72.33±2.45
Claude 3.7 Sonnet (February 2025)
72.27±3.31
o1-Pro
61.60±0.86
Llama 3.1 405B Instruct
61.40±1.99
61.40±1.79
GPT 4o (November 2024)
60.07±2.07
GPT 4.5 Preview
56.93±4.02
56.40±4.98
o1 (December 2024)
59.27±1.25
DeepSeek R1 (January 2025)
57.32±2.58
Gemini 2.5 Pro Experimental (March 2025)
55.93±3.49
Llama 3.2 90B Vision Instruct
54.07±2.24
53.07±4.45
DeepSeek R1 (May 2025)
53.00±4.20
Llama 3.3 70B Instruct
51.93±4.98
o3 mini (Low)
49.73±3.23
49.13±4.28
51.13±1.03
50.00±2.17
Llama 4 Maverick
49.73±1.60
Gemini 2.0 Flash Thinking (January 2025)
49.53±0.76
Gemini 2.0 Flash
49.07±2.01
o3 mini (Medium)
48.93±1.25
Gemini 2.0 Pro Experimental (February 2025)
48.67±2.29
Mistral Large 2411
47.53±1.74
o3 mini (High)
46.80±2.58
DeepSeek V3 (March 2025)
44.53±1.74
Mistral Medium 3
42.60±3.26