MASK
Performance Comparison
84.50±2.30
82.60±2.80
Claude 3.7 Sonnet (Thinking) (February 2025): 82.13±1.25
Claude 3 Opus: 79.00±1.31
78.60±2.30
72.90±2.30
Claude 3.5 Sonnet (October 2024): 72.33±2.45
Claude 3.7 Sonnet (February 2025): 72.27±3.31
o1-Pro: 61.60±0.86
Llama 3.1 405B Instruct: 61.40±1.99
61.40±1.80
GPT 4o (November 2024): 60.07±2.07
GPT 4.5 Preview: 56.93±4.02
56.40±4.98
o1 (December 2024): 59.27±1.25
DeepSeek R1: 57.32±2.58
Gemini 2.5 Pro Experimental (March 2025): 55.93±3.49
Llama 3.2 90B Vision Instruct: 54.07±2.24
53.10±4.50
Llama 3.3 70B Instruct: 51.93±4.98
o3 mini (Low): 49.73±3.23
51.13±1.03
50.00±2.20
Llama 4 Maverick: 49.73±1.60
Gemini 2.0 Flash Thinking (January 2025): 49.53±0.76
Gemini 2.0 Flash: 49.07±2.01
o3 mini (Medium): 48.93±1.25
Gemini 2.0 Pro Experimental (February 2025): 48.67±2.29
Mistral Large 2411: 47.53±1.74
o3 mini (High): 46.80±2.58
DeepSeek V3 (March 2025): 44.53±1.74