Agentic Tool Use (Enterprise)

Last updated: April 10, 2025

Performance Comparison

Rank  Model                                                  Score
1     o1 (December 2024)                                     70.14±5.32
1     (name missing)                                         68.75±5.35
1     (name missing)                                         67.01±5.43
1     o1-preview                                             66.43±5.47
1     DeepSeek-R1                                            65.27±5.49
1     o3-mini (high)                                         65.27±5.52
1     Claude 3.7 Sonnet Thinking (February 2025)             65.27±5.49
1     Claude 3.7 Sonnet (February 2025)                      64.93±5.51
1     o3-mini (medium)                                       64.93±5.51
1     GPT-4o (May 2024)                                      64.58±5.52
1     (name missing)                                         64.23±5.53
1     GPT-4.5 Preview (February 2025)                        63.76±5.56
1     Gemini 2.0 Flash Thinking Experimental (January 2025)  63.19±5.57
1     DeepSeek-V3 (December 2024)                            62.50±5.59
1     Gemini 2.0 Pro Experimental (February 2025)            61.45±5.62
1     GPT-4 Turbo Preview                                    60.76±5.64
1     Gemini 1.5 Pro (August 27, 2024)                       60.28±5.66
1     Gemini 2.0 Flash Experimental (December 2024)          60.07±5.65
1     GPT-4o (August 2024)                                   59.93±5.67
1     Claude 3.5 Sonnet (June 2024)                          59.38±5.67
4     Gemini 2.0 Flash                                       55.60±5.73
5     Claude 3 Sonnet                                        54.17±5.78
8     Gemini 2.0 Flash Lite Preview (February 2025)          53.82±5.75
12    Claude 3 Opus                                          52.78±5.77
14    GPT-4o mini                                            51.74±5.77
14    GPT-4                                                  51.39±5.77
15    Llama 3.1 405B Instruct                                50.35±5.78
15    Mistral Large 2                                        50.35±5.78
21    Nova Pro                                               47.04±5.77
25    Gemini 1.5 Pro (May 2024)                              40.42±5.68
29    Llama 3.1 70B Instruct                                 37.23±5.60
29    Nova Lite                                              36.53±6.72
30    Command R+                                             30.21±5.30
30    Nova Micro                                             28.93±6.33
35    Llama 3.1 8B Instruct                                  17.42±4.39