Agentic Tool Use (Chat)
Performance Comparison
Model                                                  Score
o3-mini (high)                                         63.45 ±5.52
                                                       62.43 ±6.76
o3-mini (medium)                                       62.42 ±6.76
                                                       61.42 ±6.79
DeepSeek-R1                                            60.91 ±6.81
o1 (December 2024)                                     60.41 ±6.82
DeepSeek-V3 (December 2024)                            58.55 ±6.87
Gemini 2.0 Pro Experimental (February 2025)            57.86 ±6.89
Gemini 2.0 Flash Thinking Experimental (January 2025)  57.36 ±6.91
GPT-4o (August 2024)                                   56.85 ±6.92
GPT-4.5 Preview (February 2025)                        56.34 ±6.92
Claude 3.7 Sonnet (February 2025)                      56.25 ±6.98
Claude 3.5 Sonnet (June 2024)                          56.06 ±6.91
Claude 3.7 Sonnet Thinking (February 2025)             55.32 ±6.94
o1-preview                                             55.10 ±6.96
                                                       54.31 ±6.95
Gemini 2.0 Flash Experimental (December 2024)          53.29 ±6.96
GPT-4 Turbo Preview                                    53.03 ±6.95
Gemini 1.5 Pro (August 27, 2024)                       51.27 ±6.98
Gemini 2.0 Flash Lite Preview (February 2025)          50.25 ±6.98
GPT-4o (May 2024)                                      49.50 ±6.96
Nova Pro                                               49.23 ±6.98
Claude 3 Opus                                          48.49 ±6.96
Gemini 2.0 Flash                                       48.22 ±6.97
Nova Lite                                              41.66 ±5.69
Nova Micro                                             40.97 ±5.67
Claude 3 Sonnet                                        40.40 ±6.84
Mistral Large 2                                        40.40 ±6.84
Llama 3.1 405B Instruct                                40.10 ±6.84
GPT-4                                                  37.88 ±6.78
Gemini 1.5 Pro (May 2024)                              35.50 ±6.57
Llama 3.1 70B Instruct                                 33.50 ±6.59
GPT-4o mini                                            32.83 ±6.54
Command R+                                             20.20 ±5.59
Llama 3.1 8B Instruct                                  6.09 ±3.34