Agentic Tool Use (Enterprise)
Performance Comparison
| Model | Score |
| --- | --- |
| o1 (December 2024) | 70.14 ±5.32 |
| | 68.75 ±5.35 |
| | 67.01 ±5.43 |
| o1-preview | 66.43 ±5.47 |
| DeepSeek-R1 | 65.27 ±5.49 |
| o3-mini (high) | 65.27 ±5.52 |
| Claude 3.7 Sonnet Thinking (February 2025) | 65.27 ±5.49 |
| Claude 3.7 Sonnet (February 2025) | 64.93 ±5.51 |
| o3-mini (medium) | 64.93 ±5.51 |
| GPT-4o (May 2024) | 64.58 ±5.52 |
| | 64.23 ±5.53 |
| GPT-4.5 Preview (February 2025) | 63.76 ±5.56 |
| Gemini 2.0 Flash Thinking Experimental (January 2025) | 63.19 ±5.57 |
| DeepSeek-V3 (December 2024) | 62.50 ±5.59 |
| Gemini 2.0 Pro Experimental (February 2025) | 61.45 ±5.62 |
| GPT-4 Turbo Preview | 60.76 ±5.64 |
| Gemini 1.5 Pro (August 27, 2024) | 60.28 ±5.66 |
| Gemini 2.0 Flash Experimental (December 2024) | 60.07 ±5.65 |
| GPT-4o (August 2024) | 59.93 ±5.67 |
| Claude 3.5 Sonnet (June 2024) | 59.38 ±5.67 |
| Gemini 2.0 Flash | 55.60 ±5.73 |
| Claude 3 Sonnet | 54.17 ±5.78 |
| Gemini 2.0 Flash Lite Preview (February 2025) | 53.82 ±5.75 |
| Claude 3 Opus | 52.78 ±5.77 |
| GPT-4o mini | 51.74 ±5.77 |
| GPT-4 | 51.39 ±5.77 |
| Llama 3.1 405B Instruct | 50.35 ±5.78 |
| Mistral Large 2 | 50.35 ±5.78 |
| Nova Pro | 47.04 ±5.77 |
| Gemini 1.5 Pro (May 2024) | 40.42 ±5.68 |
| Llama 3.1 70B Instruct | 37.23 ±5.60 |
| Nova Lite | 36.53 ±6.72 |
| Command R+ | 30.21 ±5.30 |
| Nova Micro | 28.93 ±6.33 |
| Llama 3.1 8B Instruct | 17.42 ±4.39 |
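
The source does not state how the ± values were computed (interval level, method, or task count). If they are read as symmetric uncertainty half-widths around each score, a quick check shows whether the gap between two entries exceeds their combined margins. The sketch below is a minimal illustration under that assumption; the helper name and the two example entries (copied from the table above) are only for demonstration.

```python
# Rough overlap check for the "score ± half-width" pairs reported above.
# Assumption (not stated in the source): each ± value is a symmetric
# uncertainty half-width around the score, so two results are treated as
# indistinguishable here when their intervals overlap.

def intervals_overlap(score_a: float, half_a: float,
                      score_b: float, half_b: float) -> bool:
    """True if [a - ha, a + ha] and [b - hb, b + hb] share any point."""
    return abs(score_a - score_b) <= half_a + half_b

# Two entries copied from the table above.
o1_dec_2024 = (70.14, 5.32)
deepseek_r1 = (65.27, 5.49)

print(intervals_overlap(*o1_dec_2024, *deepseek_r1))  # True: the 4.87-point gap is within 5.32 + 5.49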