Agentic Tool Use (Enterprise)
Performance Comparison
o1 (December 2024)
70.14±5.32
68.75±5.35
67.01±5.43
o1-preview
66.43±5.47
DeepSeek-R1
65.27±5.49
o3-mini (high)
65.27±5.52
Claude 3.7 Sonnet Thinking (February 2025)
65.27±5.49
Claude 3.7 Sonnet (February 2025)
64.93±5.51
o3-mini (medium)
64.93±5.51
GPT-4o (May 2024)
64.58±5.52
64.23±5.53
GPT-4.5 Preview (February 2025)
63.76±5.56
Gemini 2.0 Flash Thinking Experimental (January 2025)
63.19±5.57
DeepSeek-V3 (December 2024)
62.50±5.59
Gemini 2.0 Pro Experimental (February 2025)
61.45±5.62
GPT-4 Turbo Preview
60.76±5.64
Gemini 1.5 Pro (August 27, 2024)
60.28±5.66
Gemini 2.0 Flash Experimental (December 2024)
60.07±5.65
GPT-4o (August 2024)
59.93±5.67
Claude 3.5 Sonnet (June 2024)
59.38±5.67
Gemini 2.0 Flash
55.60±5.73
Claude 3 Sonnet
54.17±5.78
Gemini 2.0 Flash Lite Preview (February 2025)
53.82±5.75
Claude 3 Opus
52.78±5.77
GPT-4o mini
51.74±5.77
GPT-4
51.39±5.77
Llama 3.1 405B Instruct
50.35±5.78
Mistral Large 2
50.35±5.78
Nova Pro
47.04±5.77
Gemini 1.5 Pro (May 2024)
40.42±5.68
Llama 3.1 70B Instruct
37.23±5.60
Nova Lite
36.53±6.72
Command R+
30.21±5.30
Nova Micro
28.93±6.33
Llama 3.1 8B Instruct
17.42±4.39