MASK
Performance Comparison
Claude 3.7 Sonnet (Thinking) (February 2025): 82.13 ±1.25
Claude 3 Opus: 79.00 ±1.31
Claude 3.5 Sonnet (October 2024): 72.33 ±2.45
Claude 3.7 Sonnet (February 2025): 72.27 ±3.31
o1-Pro: 61.60 ±0.86
Llama 3.1 405B Instruct: 61.40 ±1.99
61.40 ±1.80
GPT 4o (November 2024): 60.07 ±2.07
GPT 4.5 Preview: 56.93 ±4.02
o1 (December 2024): 59.27 ±1.25
Deepseek R1: 57.32 ±2.58
Gemini 2.5 Pro Experimental (March 2025): 55.93 ±3.49
Llama 3.2 90B Vision Instruct: 54.07 ±2.24
Llama 3.3 70B Instruct: 51.93 ±4.98
o3 mini (Low): 49.73 ±3.23
51.13 ±1.03
50.00 ±2.20
Llama 4 Maverick: 49.73 ±1.60
Gemini 2.0 Flash Thinking (January 2025): 49.53 ±0.76
Gemini 2.0 Flash: 49.07 ±2.01
o3 mini (Medium): 48.93 ±1.25
Gemini 2.0 Pro Experimental (February 2025): 48.67 ±2.29
Mistral Large 2411: 47.53 ±1.74
o3 mini (High): 46.80 ±2.58
Deepseek V3 (March 2025): 44.53 ±1.74