VISTA
Visual Language Understanding
Performance Comparison
54.65±1.46
51.79±0.63
51.66±1.08
50.78±0.57
50.07±1.14
49.59±0.66
49.15±0.36
47.32±1.78
48.23±0.70
46.97±1.29
46.96±0.95
45.50±1.20
45.34±0.91
45.49±0.21
45.25±0.40
43.53±1.24
43.25±1.26
43.21±0.52
43.02±1.14
42.11±1.39
41.14±0.58
39.95±0.80
39.85±0.71
Claude 3.5 Sonnet (October 2024)
38.72±0.51
Claude 3.5 Sonnet (June 2024)
38.37±0.70
38.33±0.55
ChatGPT-4o-latest (November 2024)
37.99±0.48
Gemini 1.5 Pro
37.07±1.34
GPT-4o (August 2024)
34.94±0.23
34.59±1.12
Gemini 1.5 Flash 002
34.03±1.41
Pixtral Large (November 2024)
33.89±0.69
32.69±1.40
Qwen2-VL-72B-Instruct
28.56±1.37
Claude 3 Opus
27.82±0.55
26.55±0.35
Nova Pro
26.27±0.61
Pixtral 12B (September 2024)
25.97±0.74
Nova Lite
25.50±0.77
Llama 3.2 90B Vision Instruct
24.61±0.80
Llama 3.2 11B Vision-Instruct
20.47±0.15
Phi 3.5 Vision-Instruct
15.18±0.81