Agentic Tool Use (Enterprise)
| Rank* | Model | Score | 95% CI |
|---|---|---|---|
| 1 | o1 (December 2024) | 70.14 | +5.32 / -5.32 |
| 1 | o1-preview | 66.43 | +5.47 / -5.47 |
| 1 | DeepSeek R1 | 65.27 | +5.49 / -5.49 |
| 1 | Claude 3.7 Sonnet Thinking (February 2025) | 65.27 | +5.49 / -5.49 |
| 1 | o3-mini (high) | 65.27 | +5.52 / -5.52 |
| 1 | Claude 3.7 Sonnet (February 2025) | 64.93 | +5.51 / -5.51 |
| 1 | o3-mini (medium) | 64.93 | +5.51 / -5.51 |
| 2 | GPT-4o (May 2024) | 64.58 | +5.52 / -5.52 |
| 2 | GPT-4.5 Preview (February 2025) | 63.76 | +5.56 / -5.56 |
| 2 | Gemini 2.0 Flash Thinking Experimental (January 2025) | 63.19 | +5.57 / -5.57 |
| 2 | DeepSeek V3 | 62.50 | +5.59 / -5.59 |
| 2 | Gemini 2.0 Pro Experimental (February 2025) | 61.45 | +5.62 / -5.62 |
| 3 | GPT-4 Turbo Preview | 60.76 | +5.64 / -5.64 |
| 3 | Gemini 1.5 Pro (August 27, 2024) | 60.28 | +5.66 / -5.66 |
| 3 | Gemini 2.0 Flash Experimental (December 2024) | 60.07 | +5.65 / -5.65 |
| 3 | GPT-4o (August 2024) | 59.93 | +5.67 / -5.67 |
| 5 | Claude 3.5 Sonnet (June 2024) | 59.38 | +5.67 / -5.67 |
| 11 | Gemini 2.0 Flash | 55.60 | +5.73 / -5.73 |
| 14 | Claude 3 Sonnet | 54.17 | +5.78 / -5.78 |
| 16 | Gemini 2.0 Flash Lite Preview (February 2025) | 53.82 | +5.75 / -5.75 |
| 17 | Claude 3 Opus | 52.78 | +5.77 / -5.77 |
| 17 | GPT-4o mini | 51.74 | +5.77 / -5.77 |
| 17 | GPT-4 | 51.39 | +5.77 / -5.77 |
| 18 | Mistral Large 2 | 50.35 | +5.78 / -5.78 |
| 18 | Llama 3.1 405B Instruct | 50.35 | +5.78 / -5.78 |
| 21 | Nova Pro | 47.04 | +5.77 / -5.77 |
| 27 | Gemini 1.5 Pro (May 2024) | 40.42 | +5.68 / -5.68 |
| 27 | Llama 3.1 70B Instruct | 37.23 | +5.60 / -5.60 |
| 27 | Nova Lite | 36.53 | +6.72 / -6.72 |
| 30 | Command R+ | 30.21 | +5.30 / -5.30 |
| 30 | Nova Micro | 28.93 | +6.33 / -6.33 |
| 32 | Llama 3.1 8B Instruct | 17.42 | +4.39 / -4.39 |
* Ranking is based on Rank(UB)
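The page does not define Rank(UB) here. On leaderboards that report confidence intervals, the term commonly denotes a CI-aware rank: a model's rank is one plus the number of models whose CI lower bound exceeds its CI upper bound, so statistically indistinguishable models share a rank. The sketch below illustrates that assumed definition; the `Entry` type and `rank_ub` helper are illustrative only, and because the published ranks may be computed over a larger model set, the sample output is not expected to reproduce the ranks in the table above.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    model: str
    score: float
    ci: float  # symmetric 95% CI half-width (the "+x / -x" column)

def rank_ub(entries: list[Entry]) -> dict[str, int]:
    """Assumed Rank(UB): 1 + number of models whose CI lower bound
    exceeds this model's CI upper bound (i.e., models that are
    statistically better at the 95% level)."""
    ranks = {}
    for e in entries:
        upper = e.score + e.ci
        better = sum(1 for other in entries if other.score - other.ci > upper)
        ranks[e.model] = 1 + better
    return ranks

# A few rows from the table above, for illustration only.
sample = [
    Entry("o1 (December 2024)", 70.14, 5.32),
    Entry("Claude 3.5 Sonnet (June 2024)", 59.38, 5.67),
    Entry("Nova Pro", 47.04, 5.77),
    Entry("Llama 3.1 8B Instruct", 17.42, 4.39),
]

print(rank_ub(sample))
# Within this small sample: o1 [64.82, 75.46] and Claude 3.5 Sonnet [53.71, 65.05]
# overlap, so both get rank 1; Nova Pro's upper bound (52.81) falls below both of
# their lower bounds, so it gets rank 3, and Llama 3.1 8B Instruct gets rank 4.
```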