
Agentic Tool Use (Chat)

Last updated: April 10, 2025

Performance Comparison

Rank  Score       Model
1     63.45±5.52  o3-mini (high)
1     62.43±6.76
1     62.42±6.76  o3-mini (medium)
1     61.42±6.79
1     60.91±6.81  DeepSeek-R1
1     60.41±6.82  o1 (December 2024)
1     58.55±6.87  DeepSeek-V3 (December 2024)
1     57.86±6.89  Gemini 2.0 Pro Experimental (February 2025)
1     57.36±6.91  Gemini 2.0 Flash Thinking Experimental (January 2025)
1     56.85±6.92  GPT-4o (August 2024)
1     56.34±6.92  GPT-4.5 Preview (February 2025)
1     56.25±6.98  Claude 3.7 Sonnet (February 2025)
1     56.06±6.91  Claude 3.5 Sonnet (June 2024)
1     55.32±6.94  Claude 3.7 Sonnet Thinking (February 2025)
1     55.10±6.96  o1-preview
1     54.31±6.95
1     53.29±6.96  Gemini 2.0 Flash Experimental (December 2024)
1     53.03±6.95  GPT-4 Turbo Preview
1     51.27±6.98  Gemini 1.5 Pro (August 27, 2024)
2     50.25±6.98  Gemini 2.0 Flash Lite Preview (February 2025)
2     49.50±6.96  GPT-4o (May 2024)
2     49.23±6.98  Nova Pro
4     48.49±6.96  Claude 3 Opus
4     48.22±6.97  Gemini 2.0 Flash
17    41.66±5.69  Nova Lite
17    40.97±5.67  Nova Micro
17    40.40±6.84  Claude 3 Sonnet
17    40.40±6.84  Mistral Large 2
17    40.10±6.84  Llama 3.1 405B Instruct
19    37.88±6.78  GPT-4
23    35.50±6.57  Gemini 1.5 Pro (May 2024)
25    33.50±6.59  Llama 3.1 70B Instruct
25    32.83±6.54  GPT-4o mini
34    20.20±5.59  Command R+
35    6.09±3.34   Llama 3.1 8B Instruct
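
The page does not say how the rank values are derived from the scores. One reading that fits the listed numbers is that each ± value is the half-width of an interval around the score, and a model's rank is one plus the number of models whose lower bound lies strictly above that model's upper bound. The sketch below checks that reading against the listing. The rule itself and the function name `ranks_from_intervals` are assumptions made for illustration, not something stated in the source.

```python
# Rows as listed: (listed_rank, score, margin, model).
# None marks rows whose model name does not appear in the listing.
ROWS = [
    (1, 63.45, 5.52, "o3-mini (high)"),
    (1, 62.43, 6.76, None),
    (1, 62.42, 6.76, "o3-mini (medium)"),
    (1, 61.42, 6.79, None),
    (1, 60.91, 6.81, "DeepSeek-R1"),
    (1, 60.41, 6.82, "o1 (December 2024)"),
    (1, 58.55, 6.87, "DeepSeek-V3 (December 2024)"),
    (1, 57.86, 6.89, "Gemini 2.0 Pro Experimental (February 2025)"),
    (1, 57.36, 6.91, "Gemini 2.0 Flash Thinking Experimental (January 2025)"),
    (1, 56.85, 6.92, "GPT-4o (August 2024)"),
    (1, 56.34, 6.92, "GPT-4.5 Preview (February 2025)"),
    (1, 56.25, 6.98, "Claude 3.7 Sonnet (February 2025)"),
    (1, 56.06, 6.91, "Claude 3.5 Sonnet (June 2024)"),
    (1, 55.32, 6.94, "Claude 3.7 Sonnet Thinking (February 2025)"),
    (1, 55.10, 6.96, "o1-preview"),
    (1, 54.31, 6.95, None),
    (1, 53.29, 6.96, "Gemini 2.0 Flash Experimental (December 2024)"),
    (1, 53.03, 6.95, "GPT-4 Turbo Preview"),
    (1, 51.27, 6.98, "Gemini 1.5 Pro (August 27, 2024)"),
    (2, 50.25, 6.98, "Gemini 2.0 Flash Lite Preview (February 2025)"),
    (2, 49.50, 6.96, "GPT-4o (May 2024)"),
    (2, 49.23, 6.98, "Nova Pro"),
    (4, 48.49, 6.96, "Claude 3 Opus"),
    (4, 48.22, 6.97, "Gemini 2.0 Flash"),
    (17, 41.66, 5.69, "Nova Lite"),
    (17, 40.97, 5.67, "Nova Micro"),
    (17, 40.40, 6.84, "Claude 3 Sonnet"),
    (17, 40.40, 6.84, "Mistral Large 2"),
    (17, 40.10, 6.84, "Llama 3.1 405B Instruct"),
    (19, 37.88, 6.78, "GPT-4"),
    (23, 35.50, 6.57, "Gemini 1.5 Pro (May 2024)"),
    (25, 33.50, 6.59, "Llama 3.1 70B Instruct"),
    (25, 32.83, 6.54, "GPT-4o mini"),
    (34, 20.20, 5.59, "Command R+"),
    (35, 6.09, 3.34, "Llama 3.1 8B Instruct"),
]


def ranks_from_intervals(rows):
    """Assumed rule: rank = 1 + number of rows whose lower bound
    (score - margin) is strictly above this row's upper bound."""
    # Round to two decimals so float noise cannot flip a comparison.
    lowers = [round(score - margin, 2) for _, score, margin, _ in rows]
    return [
        1 + sum(lo > round(score + margin, 2) for lo in lowers)
        for _, score, margin, _ in rows
    ]


computed = ranks_from_intervals(ROWS)
listed = [rank for rank, _, _, _ in ROWS]
print(computed == listed)  # True for the rows listed here
```

Under this reading, the long run of rank-1 entries is expected: every one of those intervals overlaps the interval of the top entry, so none can be separated from it. The agreement with the listing is suggestive only and does not confirm the method used to produce the leaderboard.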