Agentic Tool Use (Enterprise)

Last updated: April 10, 2025

Performance Comparison

Rank  Score        Model
   1  70.14 ±5.32  o1 (December 2024)
   1  68.75 ±5.35  (model name not given)
   1  67.01 ±5.43  (model name not given)
   1  66.43 ±5.47  o1-preview
   1  65.27 ±5.49  DeepSeek-R1
   1  65.27 ±5.52  o3-mini (high)
   1  65.27 ±5.49  Claude 3.7 Sonnet Thinking (February 2025)
   1  64.93 ±5.51  Claude 3.7 Sonnet (February 2025)
   1  64.93 ±5.51  o3-mini (medium)
   1  64.58 ±5.52  GPT-4o (May 2024)
   1  64.23 ±5.53  (model name not given)
   1  63.76 ±5.56  GPT-4.5 Preview (February 2025)
   1  63.19 ±5.57  Gemini 2.0 Flash Thinking Experimental (January 2025)
   1  62.50 ±5.59  DeepSeek-V3 (December 2024)
   1  61.45 ±5.62  Gemini 2.0 Pro Experimental (February 2025)
   1  60.76 ±5.64  GPT-4 Turbo Preview
   1  60.28 ±5.66  Gemini 1.5 Pro (August 27, 2024)
   1  60.07 ±5.65  Gemini 2.0 Flash Experimental (December 2024)
   1  59.93 ±5.67  GPT-4o (August 2024)
   1  59.38 ±5.67  Claude 3.5 Sonnet (June 2024)
   4  55.60 ±5.73  Gemini 2.0 Flash
   5  54.17 ±5.78  Claude 3 Sonnet
   8  53.82 ±5.75  Gemini 2.0 Flash Lite Preview (February 2025)
  12  52.78 ±5.77  Claude 3 Opus
  14  51.74 ±5.77  GPT-4o mini
  14  51.39 ±5.77  GPT-4
  15  50.35 ±5.78  Llama 3.1 405B Instruct
  15  50.35 ±5.78  Mistral Large 2
  21  47.04 ±5.77  Nova Pro
  25  40.42 ±5.68  Gemini 1.5 Pro (May 2024)
  29  37.23 ±5.60  Llama 3.1 70B Instruct
  29  36.53 ±6.72  Nova Lite
  30  30.21 ±5.30  Command R+
  30  28.93 ±6.33  Nova Micro
  35  17.42 ±4.39  Llama 3.1 8B Instruct