Humanity's Last Exam (Text Only)
Models evaluated on text-only HLE questions
Data Sample
Here is a representation of a Roman inscription, originally found on a tombstone. Provide a translation for the Palmyrene script.
A transliteration of the text is provided: RGYNᵓ BT ḤRY BR ᶜTᵓ ḤBL
Henry T
Merton College, Oxford
We report metrics on additional text-only models, evaluated on text-only HLE questions, representing 86% of the dataset. See the multimodal benchmark here.
Update April 3, 2025
HLE has been finalized to 2,500 questions. The previous version of the leaderboard is now under the “Legacy” section and will be referred to as “HLE-preview”. All current model performance on this version of HLE is similar to the previous version.
Changes
- We removed all errors correctly flagged as part of our community feedback bug bounty program. This program ended on March 21, 2025.
- Searchable questions were removed by the following procedure: a question was flagged as potentially searchable if a model with search tools answered it correctly but answered incorrectly without search. Each potentially searchable question was then manually audited, and any that were easily found via web search were removed. We used GPT-4o mini/GPT-4o search and Perplexity Sonar models in this procedure (a sketch of the automated flagging step follows this list).
- A backup pool of high-quality questions was used to replace a portion of the removed questions.
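A minimal sketch of the automated flagging step in Python, assuming a placeholder run_and_judge helper that runs a model on a question and judges its answer (the model identifiers are illustrative; the subsequent manual audit is not shown):

def is_potentially_searchable(question, run_and_judge):
    """Flag `question` if any model answers it correctly with search tools
    but incorrectly without them. `run_and_judge(question, model, use_search)`
    is a placeholder for running the model and judging its answer."""
    # Illustrative identifiers for the search-capable models used in this step
    search_models = ["gpt-4o-mini-search", "gpt-4o-search", "perplexity-sonar"]
    for model in search_models:
        correct_with_search = run_and_judge(question, model, use_search=True)
        correct_without_search = run_and_judge(question, model, use_search=False)
        if correct_with_search and not correct_without_search:
            return True  # forwarded to a manual web-search audit before removal
    return False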
Introduction
AI capability is evaluated with benchmarks, yet as AI progress accelerates, benchmarks quickly become saturated and lose their utility as a measurement tool. Performing well on formerly frontier benchmarks such as MMLU and GPQA is no longer a strong signal of progress, as frontier models reach or exceed human-level performance on them.
In partnership with the Center for AI Safety, we address the problem of benchmark saturation by creating Humanity’s Last Exam (HLE): 2,500 of the toughest, subject-diverse, multi-modal questions, designed to be the last academic exam of its kind for AI. HLE is designed to test both depth of reasoning (e.g., world-class mathematical problems) and breadth of knowledge across its subject domains, providing a precise measurement of model capability. Current frontier models achieve low accuracy on HLE and systematically exhibit uncalibrated overconfidence in their answers.
We publicly release Humanity’s Last Exam for the research community to better understand model capabilities. Evaluation is low-cost, as questions are precise and unambiguous with closed-ended answers provided – allowing for automatic evaluation. To combat the serious problem of training data contamination and benchmark hacking, we have an additional held-out private set of HLE questions to periodically measure overfitting to the public dataset. More research on overfitting can be found here.
High accuracy on HLE would demonstrate AI has achieved expert-level performance on closed-ended cutting-edge scientific knowledge, but it would not alone suggest autonomous research capabilities or “artificial general intelligence.”
See the linked full paper and dataset.
Methodology
- Count the number of models that are statistically significantly better than the target model, i.e., models whose lower confidence-interval bound exceeds the target model’s upper bound.
- Add 1 to this count to determine the model’s rank.
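The resulting rank can be computed directly from the reported accuracies and confidence-interval half-widths. A minimal sketch in Python (function and variable names are illustrative, not from the official evaluation harness):

def rank_ub(models):
    """models: list of (name, accuracy, ci_half_width) tuples."""
    ranks = {}
    for name, acc, ci in models:
        upper = acc + ci
        # A model is "statistically significantly better" if its lower CI bound
        # exceeds this model's upper CI bound.
        better = sum(1 for other, a, c in models if other != name and a - c > upper)
        ranks[name] = 1 + better
    return ranks

# Illustrative values in the style of the leaderboard (accuracy %, CI half-width)
print(rank_ub([("model_a", 20.57, 1.71), ("model_b", 18.38, 1.64), ("model_c", 2.32, 0.64)]))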
Dataset Summary
Humanity’s Last Exam includes questions on dozens of subjects spanning mathematics, the humanities, and the natural sciences. We provide a high-level visualization of the distribution of benchmark categories, though each summarized category contains many subjects.
The benchmark is multimodal, with 14% of questions requiring comprehension of a diagram or figure to answer. In addition, 24% of the questions are multiple choice.
Dataset Design
Humanity’s Last Exam is a collaborative effort with questions from nearly 1,000 subject-expert contributors, affiliated with over 500 institutions across 50 countries – composed mostly of professors, researchers, and graduate degree holders. Participants competed for a $500,000 USD prize pool – $5,000 USD for each of the top 50 questions and $500 USD for each of the next 500 questions, along with the opportunity for optional co-authorship if any of their questions were accepted into the final dataset. This structure incentivizes top questions from subject experts all around the world. More details can be found in our original announcement: https://scale.com/blog/humanitys-last-exam.
Submission: To be considered for human review, a submitted question must stump several frontier LLMs (for exact-match questions) or hold all LLMs to at most random-chance accuracy (for multiple-choice questions). This ensures questions meet the necessary difficulty bar for the current generation of models; we further verify that they are sufficiently difficult during human review. In total, we received over 70,000 submissions, with 13,000 passing this difficulty bar and being forwarded to human review.
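A minimal sketch of this difficulty gate, assuming per-model correctness has already been judged (the function name and question-type labels are illustrative, and the sketch treats “stump” as requiring every frontier model to be wrong):

def passes_difficulty_bar(question_type, model_correct, num_choices=None):
    """model_correct: list of booleans, one per frontier LLM tested on the question."""
    accuracy = sum(model_correct) / len(model_correct)
    if question_type == "exact_match":
        # Exact-match questions must stump every frontier model
        return accuracy == 0.0
    # Multiple-choice questions: no better than random chance across all models
    return accuracy <= 1.0 / num_choices

# Example: a 5-choice question answered correctly by 1 of 6 frontier models
print(passes_difficulty_bar("multiple_choice", [True, False, False, False, False, False], num_choices=5))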
Human Review: We train experts sourced from Scale’s Outlier platform to review questions. All of the human reviewers have a graduate degree in their field. Reviewers score questions against a standardized rubric and provide feedback to help question creators iterate on their questions. A primary round is used to shortlist the best questions. A secondary review, with both organizers and expert reviewers, approves or rejects questions for the final dataset – resulting in 2,700 public questions and an additional set of private questions of equal quality and difficulty. Subsequent community feedback and removal of searchable questions resulted in a finalized dataset of 2,500 questions.
Metrics
We report accuracy on the public questions of Humanity’s Last Exam, and we also use each model’s own stated confidence to derive an RMS calibration error, computed with the implementation from Hendrycks et al., 2022 using the default hyperparameters provided; for brevity, details are reported in our paper. Models are ranked on the leaderboard by accuracy; however, we want to emphasize calibration error as an important metric in our paper.
A well-calibrated model should exhibit an average confidence similar to its accuracy on a benchmark, e.g., 50% accuracy paired with 50% confidence. As of our initial publication, we observe systematically high calibration errors (greater than 80%) paired with low accuracy (less than 10%), which is strong evidence of confabulation/hallucination in all measured models.
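For reference, a minimal sketch of a binned RMS calibration error in Python; it uses simple equal-mass bins as an assumption and is not the exact Hendrycks et al., 2022 implementation referenced above:

import numpy as np

def rms_calibration_error(confidences, correct, num_bins=10):
    """confidences: stated confidences in [0, 1]; correct: 1 if the answer was judged
    correct, else 0. Bins predictions by confidence and returns the root-mean-square
    gap between mean confidence and accuracy per bin, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(confidences)
    bins = np.array_split(order, num_bins)  # equal-mass bins (an assumption)
    total, n = 0.0, len(confidences)
    for b in bins:
        if len(b) == 0:
            continue
        gap = confidences[b].mean() - correct[b].mean()
        total += (len(b) / n) * gap ** 2
    return float(np.sqrt(total))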
Details on our evaluation methodology are given below. At this time, we do not report any model performance metrics on the private held-out set.
Evaluation Methodology
Evaluation is automatic. Each model on the leaderboard is evaluated on all public questions of Humanity’s Last Exam with temperature 0.0 when configurable, unless stated otherwise. Models are prompted to give a final answer and an estimate of confidence using the system prompts below (passed as a user prompt when a system prompt is not configurable), depending on question type, following the setup from Wei et al., 2024.
System Prompt
Your response should be in the following format:
Explanation: {your explanation for your final answer}
Exact Answer: {your succinct, final answer}
Confidence: {your confidence score between 0% and 100% for your answer}
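For illustration, a minimal sketch of a single evaluation call using the OpenAI Python SDK; the model identifier and question text are placeholders, and provider-specific details (such as whether a system prompt or temperature is configurable) vary:

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "Your response should be in the following format:\n"
    "Explanation: {your explanation for your final answer}\n"
    "Exact Answer: {your succinct, final answer}\n"
    "Confidence: {your confidence score between 0% and 100% for your answer}"
)

question_text = "<an HLE question>"  # placeholder

response = client.chat.completions.create(
    model="<model-under-evaluation>",  # placeholder identifier
    temperature=0.0,                   # applied when the model exposes temperature
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question_text},
    ],
)
print(response.choices[0].message.content)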
As HLE uses closed-form answers, we use o3-mini-2025-01-31 as an automatic extractor and judge to compare each model response against the ground-truth answer. We employ structured decoding to extract a JSON object using the following prompt. We note that small differences could arise from different judge models and prompts on edge cases (e.g., acceptable precision); we therefore encourage documenting the prompts and models used for evaluation on HLE. We document ours for this evaluation below.
Judge Prompt
Judge whether the following [response] to [question] is correct or not based on the precise and unambiguous [correct_answer] below.

[question]: {question}

[response]: {response}

Your judgement must be in the format and criteria specified below:

extracted_final_answer: The final exact answer extracted from the [response]. Put the extracted answer as 'None' if there is no exact, final answer to extract from the response.

[correct_answer]: {correct_answer}

reasoning: Explain why the extracted_final_answer is correct or incorrect based on [correct_answer], focusing only on if there are meaningful differences between [correct_answer] and the extracted_final_answer. Do not comment on any background to the problem, do not attempt to solve the problem, do not argue for any answer different than [correct_answer], focus only on whether the answers match.

correct: Answer 'yes' if extracted_final_answer matches the [correct_answer] given above, or is within a small margin of error for numerical problems. Answer 'no' otherwise, i.e. if there is any inconsistency, ambiguity, non-equivalency, or if the extracted answer is incorrect.

confidence: The extracted confidence score between 0% and 100% from [response]. Put 100 if there is no confidence score available.
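One way to implement the structured extraction is to parse the judge output into a typed schema with the OpenAI Python SDK; the class below mirrors the fields of the prompt above, but the exact parsing mechanism in the official harness may differ:

from openai import OpenAI
from pydantic import BaseModel

class Judgement(BaseModel):
    extracted_final_answer: str
    reasoning: str
    correct: str      # "yes" or "no"
    confidence: int   # 0-100; 100 when no confidence score is available

client = OpenAI()
judge_prompt = "<the judge prompt above, with {question}, {response}, and {correct_answer} filled in>"

completion = client.beta.chat.completions.parse(
    model="o3-mini-2025-01-31",
    messages=[{"role": "user", "content": judge_prompt}],
    response_format=Judgement,
)
judgement = completion.choices[0].message.parsed
is_correct = (judgement.correct == "yes")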
Acknowledgements
Humanity’s Last Exam was a global collaborative effort developed in partnership with the Center for AI Safety. We extend our deepest gratitude to all participating question contributors and expert reviewers involved in creating and refining the Humanity’s Last Exam dataset.
Scale AI Team: Ziwen Han, Josephina Hu, † Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, Michael Choi, Anish Agrawal, Arnav Chopra, William Qian, Luis Esquivel, Caton Lu, Monica Mishra, Summer Yue, Alexandr Wang
Performance Comparison
[Leaderboard chart: per-model text-only accuracy (%) with confidence intervals and calibration error (CE).]
- Rank (UB): 1 + the number of models whose lower CI bound exceeds this model’s upper CI bound.
- CE: Calibration Error