Agentic Tool Use (Chat)
| Rank (UB) | Model | Score | 95% CI |
| --- | --- | --- | --- |
| 1 | o3-mini (high) | 63.45 | +5.52 / -5.52 |
| 1 | Gemini 2.5 Pro Experimental (March 2025) | 62.43 | +6.76 / -6.76 |
| 1 | o3-mini (medium) | 62.42 | +6.76 / -6.76 |
| 1 | DeepSeek R1 | 60.91 | +6.81 / -6.81 |
| 1 | o1 (December 2024) | 60.41 | +6.82 / -6.82 |
| 1 | DeepSeek V3 | 58.55 | +6.87 / -6.87 |
| 1 | Gemini 2.0 Pro Experimental (February 2025) | 57.86 | +6.89 / -6.89 |
| 1 | Gemini 2.0 Flash Thinking Experimental (January 2025) | 57.36 | +6.91 / -6.91 |
| 1 | GPT-4o (August 2024) | 56.85 | +6.92 / -6.92 |
| 1 | GPT-4.5 Preview (February 2025) | 56.34 | +6.92 / -6.92 |
| 1 | Claude 3.7 Sonnet (February 2025) | 56.25 | +6.98 / -6.98 |
| 1 | Claude 3.5 Sonnet (June 2024) | 56.06 | +6.91 / -6.91 |
| 1 | Claude 3.7 Sonnet Thinking (February 2025) | 55.32 | +6.94 / -6.94 |
| 1 | o1-preview | 55.10 | +6.96 / -6.96 |
| 1 | Gemini 2.0 Flash Experimental (December 2024) | 53.29 | +6.96 / -6.96 |
| 1 | GPT-4 Turbo Preview | 53.03 | +6.95 / -6.95 |
| 1 | Gemini 1.5 Pro (August 27, 2024) | 51.27 | +6.98 / -6.98 |
| 2 | Gemini 2.0 Flash Lite Preview (February 2025) | 50.25 | +6.98 / -6.98 |
| 2 | GPT-4o (May 2024) | 49.50 | +6.96 / -6.96 |
| 2 | Nova Pro | 49.23 | +6.98 / -6.98 |
| 4 | Claude 3 Opus | 48.49 | +6.96 / -6.96 |
| 4 | Gemini 2.0 Flash | 48.22 | +6.97 / -6.97 |
| 15 | Nova Lite | 41.66 | +5.69 / -5.69 |
| 15 | Nova Micro | 40.97 | +5.67 / -5.67 |
| 15 | Claude 3 Sonnet | 40.40 | +6.84 / -6.84 |
| 15 | Mistral Large 2 | 40.40 | +6.84 / -6.84 |
| 15 | Llama 3.1 405B Instruct | 40.10 | +6.84 / -6.84 |
| 17 | GPT-4 | 37.88 | +6.78 / -6.78 |
| 21 | Gemini 1.5 Pro (May 2024) | 35.50 | +6.57 / -6.57 |
| 23 | Llama 3.1 70B Instruct | 33.50 | +6.59 / -6.59 |
| 23 | GPT-4o mini | 32.83 | +6.54 / -6.54 |
| 32 | Command R+ | 20.20 | +5.59 / -5.59 |
| 33 | Llama 3.1 8B Instruct | 6.09 | +3.34 / -3.34 |
Rank (UB): 1 + the number of models whose lower CI bound exceeds this model’s upper CI bound.
o1-pro: coming soon.
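To make the Rank (UB) rule above concrete, here is a minimal sketch of how it can be computed from the table's scores and symmetric 95% CI half-widths. The `Entry` class and `rank_ub` function are illustrative names, not the leaderboard's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One leaderboard row: a model, its score, and its symmetric 95% CI half-width."""
    model: str
    score: float
    ci: float

    @property
    def upper(self) -> float:
        return self.score + self.ci

    @property
    def lower(self) -> float:
        return self.score - self.ci


def rank_ub(entries: list[Entry]) -> dict[str, int]:
    """Rank (UB) = 1 + the number of models whose lower CI bound exceeds this model's upper CI bound."""
    return {
        e.model: 1 + sum(1 for other in entries if other.lower > e.upper)
        for e in entries
    }


# Example with the top three rows of the table:
entries = [
    Entry("o3-mini (high)", 63.45, 5.52),
    Entry("Gemini 2.5 Pro Experimental (March 2025)", 62.43, 6.76),
    Entry("o3-mini (medium)", 62.42, 6.76),
]
print(rank_ub(entries))  # all rank 1: no model's lower bound exceeds another's upper bound
```

Because the rule only counts models that are separated beyond overlapping confidence intervals, many models with overlapping CIs share the same rank, which is why the table has long runs of rank 1.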