2025 Scale AI. All rights reserved.

Math

Last updated: July 23, 2025

Performance Comparison

1

Claude 3.5 Sonnet (June 2024)

96.60±1.02

2

GPT-4o (August 2024)

95.68±1.15

3

Llama 3.1 405B Instruct

95.60±1.16

4

Claude 3 Opus

95.19±1.21

5

GPT-4 Turbo Preview

95.10±1.22

6

GPT-4o (May 2024)

94.85±1.25

7

Gemini 1.5 Pro (August 2024)

94.69±1.27

8

Mistral Large 2

93.94±1.35

9

Claude 3 Sonnet

93.28±1.41

10

Gemini 1.5 Pro (May 2024)

92.28±1.51

11

Gemini 1.5 Pro (April 2024)

90.54±1.65

12

Llama 3 70B Instruct

90.12±1.69

12

Gemini 1.5 Flash

90.12±1.69

14

Mistral Large

87.47±1.87

15

Gemini 1.0 Pro

79.83±2.27

16

CodeLlama 34B Instruct

37.51±2.73