Agentic Tool Use (Enterprise)
| Rank (UB) | Model | Score | 95% CI |
|---|---|---|---|
| 1 | o1 (December 2024) | 70.14 | +5.32 / -5.32 |
| 1 | Gemini 2.5 Pro Experimental (March 2025) | 68.75 | +5.35 / -5.35 |
| 1 | o1-preview | 66.43 | +5.47 / -5.47 |
| 1 | DeepSeek R1 | 65.27 | +5.49 / -5.49 |
| 1 | Claude 3.7 Sonnet Thinking (February 2025) | 65.27 | +5.49 / -5.49 |
| 1 | o3-mini (high) | 65.27 | +5.52 / -5.52 |
| 1 | Claude 3.7 Sonnet (February 2025) | 64.93 | +5.51 / -5.51 |
| 1 | o3-mini (medium) | 64.93 | +5.51 / -5.51 |
| 1 | GPT-4o (May 2024) | 64.58 | +5.52 / -5.52 |
| 1 | GPT-4.5 Preview (February 2025) | 63.76 | +5.56 / -5.56 |
| 1 | Gemini 2.0 Flash Thinking Experimental (January 2025) | 63.19 | +5.57 / -5.57 |
| 1 | DeepSeek V3 | 62.50 | +5.59 / -5.59 |
| 1 | Gemini 2.0 Pro Experimental (February 2025) | 61.45 | +5.62 / -5.62 |
| 1 | GPT-4 Turbo Preview | 60.76 | +5.64 / -5.64 |
| 1 | Gemini 1.5 Pro (August 27, 2024) | 60.28 | +5.66 / -5.66 |
| 1 | Gemini 2.0 Flash Experimental (December 2024) | 60.07 | +5.65 / -5.65 |
| 1 | GPT-4o (August 2024) | 59.93 | +5.67 / -5.67 |
| 1 | Claude 3.5 Sonnet (June 2024) | 59.38 | +5.67 / -5.67 |
| 3 | Gemini 2.0 Flash | 55.60 | +5.73 / -5.73 |
| 4 | Claude 3 Sonnet | 54.17 | +5.78 / -5.78 |
| 7 | Gemini 2.0 Flash Lite Preview (February 2025) | 53.82 | +5.75 / -5.75 |
| 10 | Claude 3 Opus | 52.78 | +5.77 / -5.77 |
| 12 | GPT-4o mini | 51.74 | +5.77 / -5.77 |
| 12 | GPT-4 | 51.39 | +5.77 / -5.77 |
| 13 | Mistral Large 2 | 50.35 | +5.78 / -5.78 |
| 13 | Llama 3.1 405B Instruct | 50.35 | +5.78 / -5.78 |
| 19 | Nova Pro | 47.04 | +5.77 / -5.77 |
| 23 | Gemini 1.5 Pro (May 2024) | 40.42 | +5.68 / -5.68 |
| 27 | Llama 3.1 70B Instruct | 37.23 | +5.60 / -5.60 |
| 27 | Nova Lite | 36.53 | +6.72 / -6.72 |
| 28 | Command R+ | 30.21 | +5.30 / -5.30 |
| 28 | Nova Micro | 28.93 | +6.33 / -6.33 |
| 33 | Llama 3.1 8B Instruct | 17.42 | +4.39 / -4.39 |
Rank (UB): 1 + the number of models whose lower CI bound exceeds this model’s upper CI bound.
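The Rank (UB) definition above can be sketched in a few lines of Python. This is only an illustration of the stated rule, not the leaderboard's actual implementation: the names `Entry` and `rank_ub` are hypothetical, the CI is assumed to be symmetric (as it is for every row listed), and the sample uses just five rows taken from the leaderboard.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """One leaderboard row (hypothetical structure, for illustration only)."""
    model: str
    score: float
    ci: float  # half-width of the 95% CI, assumed symmetric


def rank_ub(entries: list[Entry]) -> dict[str, int]:
    """Rank (UB) = 1 + number of models whose lower CI bound
    exceeds this model's upper CI bound."""
    ranks = {}
    for e in entries:
        upper = e.score + e.ci
        # Count models that are better even in the most favorable case for e.
        clearly_better = sum(1 for o in entries if o.score - o.ci > upper)
        ranks[e.model] = 1 + clearly_better
    return ranks


# Five rows copied from the leaderboard.
sample = [
    Entry("o1 (December 2024)", 70.14, 5.32),
    Entry("Gemini 2.5 Pro Experimental (March 2025)", 68.75, 5.35),
    Entry("o1-preview", 66.43, 5.47),
    Entry("Gemini 2.0 Flash", 55.60, 5.73),
    Entry("Claude 3 Sonnet", 54.17, 5.78),
]

ranks = rank_ub(sample)
for model, rank in ranks.items():
    print(rank, model)
```

For Gemini 2.0 Flash the upper bound is 55.60 + 5.73 = 61.33, and only o1 (lower bound 64.82) and Gemini 2.5 Pro Experimental (lower bound 63.40) clear it, which gives rank 3. This also shows why so many models share rank 1: their intervals overlap with the top entry, so no model is separated from them by the rule.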
o1-pro coming soon.