Agentic Tool Use (Enterprise)
This leaderboard will be updated shortly with DeepSeek-R1 and DeepSeek-V3.
| Rank* | Model | Score | 95% Confidence |
|---|---|---|---|
| 1 | o1 (December 2024) | 70.14 | +5.32 / -5.32 |
| 1 | o1-preview | 66.43 | +5.47 / -5.47 |
| 1 | Claude 3.7 Sonnet Thinking (February 2025) | 65.27 | +5.49 / -5.49 |
| 1 | o3-mini (high) | 65.27 | +5.52 / -5.52 |
| 1 | Claude 3.7 Sonnet (February 2025) | 64.93 | +5.51 / -5.51 |
| 1 | o3-mini (medium) | 64.93 | +5.51 / -5.51 |
| 2 | GPT-4o (May 2024) | 64.58 | +5.52 / -5.52 |
| 2 | GPT-4.5 Preview (February 2025) | 63.76 | +5.56 / -5.56 |
| 2 | Gemini 2.0 Flash Thinking Experimental (January 2025) | 63.19 | +5.57 / -5.57 |
| 2 | Gemini 2.0 Pro Experimental (February 2025) | 61.45 | +5.62 / -5.62 |
| 3 | GPT-4 Turbo Preview | 60.76 | +5.64 / -5.64 |
| 3 | Gemini 1.5 Pro (August 27, 2024) | 60.28 | +5.66 / -5.66 |
| 3 | Gemini 2.0 Flash Experimental (December 2024) | 60.07 | +5.65 / -5.65 |
| 3 | GPT-4o (August 2024) | 59.93 | +5.67 / -5.67 |
| 5 | Claude 3.5 Sonnet (June 2024) | 59.38 | +5.67 / -5.67 |
| 11 | Gemini 2.0 Flash | 55.60 | +5.73 / -5.73 |
| 14 | Claude 3 Sonnet | 54.17 | +5.78 / -5.78 |
| 15 | Gemini 2.0 Flash Lite Preview (February 2025) | 53.82 | +5.75 / -5.75 |
| 16 | Claude 3 Opus | 52.78 | +5.77 / -5.77 |
| 16 | GPT-4o mini | 51.74 | +5.77 / -5.77 |
| 16 | GPT-4 | 51.39 | +5.77 / -5.77 |
| 16 | Mistral Large 2 | 50.35 | +5.78 / -5.78 |
| 16 | Llama 3.1 405B Instruct | 50.35 | +5.78 / -5.78 |
| 19 | Nova Pro | 47.04 | +5.77 / -5.77 |
| 25 | Gemini 1.5 Pro (May 2024) | 40.42 | +5.68 / -5.68 |
| 25 | Llama 3.1 70B Instruct | 37.23 | +5.60 / -5.60 |
| 25 | Nova Lite | 36.53 | +6.72 / -6.72 |
| 28 | Command R+ | 30.21 | +5.30 / -5.30 |
| 28 | Nova Micro | 28.93 | +6.33 / -6.33 |
| 30 | Llama 3.1 8B Instruct | 17.42 | +4.39 / -4.39 |

* Ranking is based on Rank(UB)