
EnigmaEval

Introduction

The advanced reasoning capabilities of Large Language Models (LLMs) have created a significant evaluation challenge: as these models increasingly saturate traditional benchmarks, new approaches are needed to assess their capabilities and limitations effectively.

We introduce EnigmaEval, a benchmark derived from puzzle hunts—a repository of sophisticated problems from the global puzzle-solving community. Puzzle-solving offers unique challenges that combine multiple domains of knowledge with sophisticated reasoning requirements. Unlike conventional evaluation tasks with explicit instructions, puzzles demand creative problem-solving and the ability to synthesize information across diverse fields—from mathematical and logical reasoning to cultural knowledge and linguistic manipulation.

Existing puzzle-based benchmarks typically focus on narrow domains such as sudoku or crosswords. Similarly, established reasoning and knowledge tests such as MATH, MMLU, and GPQA, though rigorous, operate within well-defined problem spaces. This reveals a critical gap in our ability to evaluate LLMs' creative reasoning capabilities on complex, unstructured challenges. EnigmaEval addresses this gap by incorporating both original multimodal puzzles and human transcriptions, enabling comprehensive evaluation of both reasoning capabilities and multimodal processing abilities in AI systems.

State-of-the-art models struggle significantly with these puzzles, achieving only modest success rates on simpler problems and failing entirely on more challenging ones. EnigmaEval joins Humanity's Last Exam (HLE) in establishing a new class of extremely challenging benchmarks that expose current models' limitations.

The puzzles and solutions are private to maintain the benchmark's integrity and respect puzzle authors' preferences not to distribute their puzzles widely. We will keep the leaderboard updated, and anyone interested in access to this dataset should fill out our request form.

See the linked full paper for comprehensive results.

Dataset Summary

EnigmaEval comprises 1184 puzzles collected from eight diverse sources (see Table 1), divided into a normal split, spanning beginner-friendly events through advanced competitions, and a hard split of puzzles that require five or more complex steps, offer minimal intermediate verification, and thematically hide intermediate answers. All sources feature rich multimodal content, combining text with visual elements such as grids, pictures, diagrams, and their meaningful arrangements.

| Split (%) | Puzzle Source | Description |
|---|---|---|
| Normal (80%) | PuzzledPint | Monthly beginner-friendly puzzle-solving event typically consisting of around seven puzzles including a meta-puzzle. Together, these puzzles should be solvable in under two hours by a team of inexperienced puzzlers who are socializing, drinking, and eating in a pub or restaurant. |
| | CS50x Puzzle Day | Annual puzzle set designed for small teams of beginner problem-solvers. |
| | Puzzle Potluck | Public puzzle hunt hosted four times, designed to be accessible while more competitive than PuzzledPint. |
| | Cryptic Crosswords | Collection of crosswords by Mark Halpin featuring unique mechanics and non-standard layouts. Final answers require combining filled grid elements. |
| | CRUMS | A short puzzle hunt hosted in 2020. |
| Hard (20%) | MIT Mystery Hunt | Massive annual event at MIT with hundreds of puzzles. Our benchmark includes selected puzzles from various hunts. |
| | Labor Day Extravaganza | Annual online hunt by Mark Halpin with multiple puzzles and a meta-puzzle, designed for experienced solvers over several days. |
| | Grandmaster Puzzles | A small collection of puzzle hunt-style puzzles published in a blog. |

Table 1. Overview of puzzle sources in EnigmaEval.

Dataset Collection

We collected puzzles from their original online archives in PDF and HTML formats across the sources in Table 1, filtering them through several criteria: (1) we excluded complex meta-puzzles, except for 77 cases where titles and prior answers were sufficient for independent solving; (2) we removed puzzles requiring audio/video or interactive elements, due to current model limitations; and (3) we included only content with explicit author consent or Creative Commons licensing.

Alongside each puzzle, we collected solution documents and created standardized text-image transcriptions through human annotation, enabling evaluation of models on both original formats and transcribed versions. This dual approach helps distinguish between reasoning failures and document parsing challenges. The transcription process, which proved too complex for automation, involved removing source identifiers, preserving complex layouts, and ensuring accurate text extraction, while solutions were manually validated and tagged by answer type. All transcriptions underwent rigorous human review for quality assurance.
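For concreteness, a record in such a dual-format dataset might look roughly like the sketch below. The field names and types are illustrative assumptions, not EnigmaEval's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PuzzleRecord:
    # Illustrative schema only; field names are assumptions, not the dataset's actual format.
    puzzle_id: str
    source: str                  # e.g. "PuzzledPint", "MIT Mystery Hunt"
    split: str                   # "normal" or "hard"
    original_document: str       # path to the original PDF/HTML rendering
    transcription_text: str      # human-written text transcription, source identifiers removed
    transcription_images: List[str] = field(default_factory=list)  # image files referenced by the transcription
    solution: str = ""           # ground-truth final answer, manually validated
    answer_type: str = ""        # tag assigned during manual validation
```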

Evaluation Methodology

We evaluate models by comparing their answers to ground-truth solutions through string matching.

The models generate responses using format-specific system prompt templates that require both a step-by-step solution and a final answer in a standardized format, enabling consistent answer extraction.
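As a rough illustration of this pipeline, the sketch below extracts a final answer from a delimited line in the model's response and compares it to the ground truth after normalization. The `ANSWER:` delimiter and the normalization rules are assumptions for illustration, not the exact prompt format or matching rules used.

```python
import re
import string

def extract_final_answer(response: str) -> str:
    """Pull the final answer from a response ending with a line like 'ANSWER: FOO BAR'.
    The 'ANSWER:' delimiter is an assumed convention, not the exact template used."""
    matches = re.findall(r"ANSWER:\s*(.+)", response, flags=re.IGNORECASE)
    return matches[-1].strip() if matches else ""

def normalize(text: str) -> str:
    """Normalize for string matching: uppercase, keep only letters and digits."""
    text = text.upper()
    return "".join(ch for ch in text if ch in string.ascii_uppercase + string.digits)

def is_correct(response: str, ground_truth: str) -> bool:
    return normalize(extract_final_answer(response)) == normalize(ground_truth)

# Example
print(is_correct("step-by-step reasoning...\nANSWER: Silent Night", "SILENTNIGHT"))  # True
```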

Metrics

Model performance is measured using accuracy (pass@1), with standard deviations over 3 evaluation runs. We extend the evaluation to meta-puzzles, which require synthesizing solutions from multiple component puzzles: we provide the model with the correct component solutions, allowing us to isolate its meta-reasoning capabilities from its performance on individual puzzles. We also compare performance on our standardized human transcriptions against performance on the original PDF format.
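As a minimal sketch of this metric, assuming each run yields per-puzzle correctness flags, the mean and spread of accuracy across runs could be computed as follows; the exact estimator used for the reported standard deviations is an assumption here.

```python
from statistics import mean, pstdev
from typing import List, Tuple

def pass_at_1(runs: List[List[bool]]) -> Tuple[float, float]:
    """Mean and standard deviation of accuracy (%) across independent evaluation runs.
    Each inner list holds per-puzzle correctness for one run."""
    per_run_accuracy = [100.0 * sum(run) / len(run) for run in runs]
    return mean(per_run_accuracy), pstdev(per_run_accuracy)

# Example: 3 runs over the same 4 puzzles (toy numbers, not real results)
runs = [
    [True, False, False, False],
    [True, False, False, True],
    [False, False, False, True],
]
acc, sd = pass_at_1(runs)
print(f"{acc:.2f} ± {sd:.2f}")  # 33.33 ± 11.79
```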

Acknowledgments

We would like to thank all the authors of the puzzles that made this benchmark possible. We are especially grateful to Mark Halpin for giving us permission to include his large collection of puzzles including Labor Day Extravaganzas, cryptic crosswords, and MIT Mystery Hunt puzzles. Thank you as well to Dave Shukan, Seth Bisen-Hersh, Brian Chen (betaveros), Evan Chen, Jeck Lim and Sami Casanova for giving us permission to include their puzzles from the MIT Mystery Hunt. Thank you to Rajeev Nayak, Bradley Wu, Curtis Liu, Darren Yin, Julz Huang, Lindsey Shi, and Stephanie Chang for creating Puzzle Potluck and for giving us permission to include it in this benchmark. Thank you to the many organizers and puzzle writers who make PuzzledPint possible and for generously making the puzzles available under a Creative Commons license. Thank you to David Malan, Meta and the CS50x staff for writing the CS50x puzzles and making them available under a Creative Commons license. We are grateful to Zach Barnett, Alex Walker, and Sara Walker for creating CRUMS and making it available under a Creative Commons license. Thank you to Thomas Snyder (drsudoku) for writing many of the Grandmaster Puzzles and making them available under a Creative Commons license.

Appendix

Accuracy of frontier models on the normal split (N=949) and the hard split (N=235). Means and standard deviations over 3 evaluation runs.

| Model Name | Normal (%, mean ± stdev) | Hard (%) |
|---|---|---|
| o1 (December 2024) | 7.05 ± 0.58 | 0.0 |
| Claude 3.7 Sonnet Thinking (February 2025) | 5.27 ± 0.49 | 0.0 |
| GPT-4.5 Preview (February 2025) | 3.95 ± 0.30 | 0.0 |
| Claude 3.7 Sonnet (February 2025) | 2.78 ± 0.66 | 0.14 ± 0.33 |
| Gemini 2.0 Flash Thinking (January 2025) | 1.38 ± 0.18 | 0.0 |
| Claude 3.5 Sonnet (October 2024) | 1.14 ± 0.18 | 0.0 |
| Pixtral Large (November 2024) | 1.05 ± 0.21 | 0.0 |
| Claude 3 Opus | 1.03 ± 0.05 | 0.0 |
| GPT-4o (November 2024) | 1.00 ± 0.13 | 0.0 |
| Gemini 2.0 Pro Experimental (February 2025) | 0.87 ± 0.47 | 0.0 |
| Gemini 2.0 Flash (February 2025) | 0.79 ± 0.26 | 0.0 |
| Llama 3.2 90B Vision Instruct | 0.48 ± 0.06 | 0.0 |

Last updated: July 23, 2025

Performance Comparison

| Rank (UB) | Model | Score (%) ± CI |
|---|---|---|
| 1 | | 13.09 ± 1.92 |
| 1 | | 11.91 ± 1.85 |
| 2 | | 9.21 ± 1.65 |
| 3 | | 6.81 ± 0.83 |
| 4 | | 6.14 ± 1.37 |
| 4 | o1 (December 2024) | 5.65 ± 1.32 |
| 4 | | 5.57 ± 1.31 |
| 4 | | 5.57 ± 1.31 |
| 5 | | 4.23 ± 1.17 |
| 9 | | 4.14 ± 1.16 |
| 9 | | 3.21 ± 1.00 |
| 9 | | 3.18 ± 1.02 |
| 9 | | 3.12 ± 0.99 |
| 9 | | 2.70 ± 0.92 |
| 9 | | 2.36 ± 0.87 |
| 9 | | 2.26 ± 0.86 |
| 10 | | 2.20 ± 0.84 |
| 10 | | 2.17 ± 0.84 |
| 15 | Gemini 2.0 Flash Thinking (January 2025) | 1.10 ± 0.60 |
| 16 | Claude 3.5 Sonnet (October 2024) | 0.91 ± 0.55 |
| 17 | Pixtral Large (November 2024) | 0.84 ± 0.53 |
| 19 | Claude 3 Opus | 0.82 ± 0.45 |
| 19 | GPT-4o (November 2024) | 0.80 ± 0.44 |
| 19 | | 0.69 ± 0.48 |
| 19 | | 0.63 ± 0.45 |
| 19 | | 0.58 ± 0.43 |
| 19 | Llama 3.2 90B Vision Instruct | 0.38 ± 0.35 |

Rank (UB): 1 + the number of models whose lower CI bound exceeds this model's upper CI bound.
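For illustration, the rank rule above could be computed as in the following sketch, assuming each model is summarized by a (mean, CI half-width) pair; how the confidence intervals themselves are constructed is not specified here.

```python
from typing import List, Tuple

def rank_ub(scores: List[Tuple[float, float]]) -> List[int]:
    """Rank (UB) for each model: 1 + the number of models whose lower CI bound
    exceeds this model's upper CI bound. Scores are (mean, ci_half_width) pairs."""
    ranks = []
    for mean_i, ci_i in scores:
        upper_i = mean_i + ci_i
        better = sum(1 for mean_j, ci_j in scores if mean_j - ci_j > upper_i)
        ranks.append(1 + better)
    return ranks

# Toy example (not the leaderboard's actual numbers)
print(rank_ub([(13.0, 1.9), (11.9, 1.8), (9.2, 1.6), (5.6, 1.3)]))  # [1, 1, 2, 4]
```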