
Agentic Tool Use (Chat)

Last updated: April 10, 2025

Performance Comparison

| Rank | Model | Score ±CI |
|---|---|---|
| 1 | o3-mini (high) | 63.45 ±5.52 |
| 1 | | 62.43 ±6.76 |
| 1 | o3-mini (medium) | 62.42 ±6.76 |
| 1 | | 61.42 ±6.79 |
| 1 | DeepSeek-R1 | 60.91 ±6.81 |
| 1 | o1 (December 2024) | 60.41 ±6.82 |
| 1 | DeepSeek-V3 (December 2024) | 58.55 ±6.87 |
| 1 | Gemini 2.0 Pro Experimental (February 2025) | 57.86 ±6.89 |
| 1 | Gemini 2.0 Flash Thinking Experimental (January 2025) | 57.36 ±6.91 |
| 1 | GPT-4o (August 2024) | 56.85 ±6.92 |
| 1 | GPT-4.5 Preview (February 2025) | 56.34 ±6.92 |
| 1 | Claude 3.7 Sonnet (February 2025) | 56.25 ±6.98 |
| 1 | Claude 3.5 Sonnet (June 2024) | 56.06 ±6.91 |
| 1 | Claude 3.7 Sonnet Thinking (February 2025) | 55.32 ±6.94 |
| 1 | o1-preview | 55.10 ±6.96 |
| 1 | | 54.31 ±6.95 |
| 1 | Gemini 2.0 Flash Experimental (December 2024) | 53.29 ±6.96 |
| 1 | GPT-4 Turbo Preview | 53.03 ±6.95 |
| 1 | Gemini 1.5 Pro (August 27, 2024) | 51.27 ±6.98 |
| 2 | Gemini 2.0 Flash Lite Preview (February 2025) | 50.25 ±6.98 |
| 2 | GPT-4o (May 2024) | 49.50 ±6.96 |
| 2 | Nova Pro | 49.23 ±6.98 |
| 4 | Claude 3 Opus | 48.49 ±6.96 |
| 4 | Gemini 2.0 Flash | 48.22 ±6.97 |
| 17 | Nova Lite | 41.66 ±5.69 |
| 17 | Nova Micro | 40.97 ±5.67 |
| 17 | Claude 3 Sonnet | 40.40 ±6.84 |
| 17 | Mistral Large 2 | 40.40 ±6.84 |
| 17 | Llama 3.1 405B Instruct | 40.10 ±6.84 |
| 19 | GPT-4 | 37.88 ±6.78 |
| 23 | Gemini 1.5 Pro (May 2024) | 35.50 ±6.57 |
| 25 | Llama 3.1 70B Instruct | 33.50 ±6.59 |
| 25 | GPT-4o mini | 32.83 ±6.54 |
| 34 | Command R+ | 20.20 ±5.59 |
| 35 | Llama 3.1 8B Instruct | 6.09 ±3.34 |