Agentic Tool Use (Chat)
Performance Comparison
Model                                                  Score
o3-mini (high)                                         63.45±5.52
                                                       62.43±6.76
o3-mini (medium)                                       62.42±6.76
                                                       61.42±6.79
DeepSeek-R1                                            60.91±6.81
o1 (December 2024)                                     60.41±6.82
DeepSeek-V3 (December 2024)                            58.55±6.87
Gemini 2.0 Pro Experimental (February 2025)            57.86±6.89
Gemini 2.0 Flash Thinking Experimental (January 2025)  57.36±6.91
GPT-4o (August 2024)                                   56.85±6.92
GPT-4.5 Preview (February 2025)                        56.34±6.92
Claude 3.7 Sonnet (February 2025)                      56.25±6.98
Claude 3.5 Sonnet (June 2024)                          56.06±6.91
Claude 3.7 Sonnet Thinking (February 2025)             55.32±6.94
o1-preview                                             55.10±6.96
                                                       54.31±6.95
Gemini 2.0 Flash Experimental (December 2024)          53.29±6.96
GPT-4 Turbo Preview                                    53.03±6.95
Gemini 1.5 Pro (August 27, 2024)                       51.27±6.98
Gemini 2.0 Flash Lite Preview (February 2025)          50.25±6.98
GPT-4o (May 2024)                                      49.50±6.96
Nova Pro                                               49.23±6.98
Claude 3 Opus                                          48.49±6.96
Gemini 2.0 Flash                                       48.22±6.97
Nova Lite                                              41.66±5.69
Nova Micro                                             40.97±5.67
Claude 3 Sonnet                                        40.40±6.84
Mistral Large 2                                        40.40±6.84
Llama 3.1 405B Instruct                                40.10±6.84
GPT-4                                                  37.88±6.78
Gemini 1.5 Pro (May 2024)                              35.50±6.57
Llama 3.1 70B Instruct                                 33.50±6.59
GPT-4o mini                                            32.83±6.54
Command R+                                             20.20±5.59
Llama 3.1 8B Instruct                                  6.09±3.34