A Careful Examination of Large Language Model Performance on Grade School Arithmetic
Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Charlotte Zhuang, Dylan Slack, Qin Lyu, Sean Hendryx, Russell Kaplan, Michele (Mike) Lunati†, Summer Yue†
This study addresses concerns about dataset contamination in LLMs by introducing Grade School Math 1000 (GSM1k), a benchmark designed to match the style and complexity of the widely used GSM8k benchmark for mathematical reasoning. Several model families, including Phi and Mistral, show accuracy drops of up to 13% on GSM1k relative to their GSM8k scores, indicating systematic overfitting, while frontier models such as GPT and Claude show minimal signs of overfitting. A positive correlation (Spearman's r² = 0.32) was found between a model's likelihood of generating examples from GSM8k and its performance gap between GSM8k and GSM1k, suggesting that some models have partially memorized GSM8k.
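To make the correlation statistic concrete, the following is a minimal sketch of how such an analysis could be run with SciPy. The arrays are illustrative placeholders (one entry per evaluated model), not the paper's actual measurements.

```python
from scipy.stats import spearmanr

# Illustrative per-model measurements (placeholder values, not real data):
# how likely each model is to generate GSM8k test examples, and how much
# its accuracy drops from GSM8k to GSM1k.
gen_likelihood = [-1.9, -1.4, -1.2, -1.7, -1.1]  # mean per-char log-likelihood
accuracy_gap = [0.01, 0.06, 0.13, 0.03, 0.09]    # GSM8k acc minus GSM1k acc

# Spearman's rank correlation; the paper reports r^2 = 0.32 on its data.
r, p_value = spearmanr(gen_likelihood, accuracy_gap)
print(f"Spearman r = {r:.2f}, r^2 = {r * r:.2f}, p = {p_value:.3f}")
```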
The study evaluated LLMs on both GSM8k and GSM1k under a standardized setup for fair comparison: open-source models were run through EleutherAI's LM Evaluation Harness, closed-source models were queried via the LiteLLM library, and all models received identical prompts containing five examples from the GSM8k train set. Notably, several lesser-known models near the top of the Hugging Face OpenLLM Leaderboard performed significantly worse on GSM1k, suggesting they may have over-optimized for GSM8k, in line with Goodhart's law. The full evaluation code will be released for reproducibility.
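As a rough illustration of this two-path setup, the sketch below combines the harness's Python entry point with a LiteLLM call. The model names, task name, and prompt contents are assumptions for illustration, not the paper's actual evaluation code.

```python
import lm_eval                   # EleutherAI LM Evaluation Harness (v0.4+)
from litellm import completion   # unified client for closed-source model APIs

# Open-source path: 5-shot evaluation through the harness. (The gsm8k task
# ships with the harness; a GSM1k task would be registered analogously.)
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=mistralai/Mistral-7B-v0.1",  # example model
    tasks=["gsm8k"],
    num_fewshot=5,
)
print(results["results"]["gsm8k"])

# Closed-source path: the same fixed 5-shot prompt sent through LiteLLM.
five_shot_prefix = "..."  # five worked GSM8k train examples (elided here)
question = "Question: Natalia sold clips to 48 of her friends in April, ..."
response = completion(
    model="gpt-4",  # example closed-source model
    messages=[{"role": "user", "content": five_shot_prefix + question}],
)
print(response.choices[0].message.content)
```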
Analysis
- Some Model Families Systematically Overfit. Certain model families, notably Phi and Mistral, consistently score better on GSM8k than on GSM1k across almost all model versions and scales. Other families, including Yi, Xwin, Gemma, and CodeLlama, exhibit the same pattern to a lesser degree.
- Many Models Show No Signs of Overfitting. A large number of models, frontier models in particular, perform similarly on GSM8k and GSM1k. Two possible explanations are that (1) these models possess reasoning abilities strong enough to generalize to new problems even if they have seen GSM8k data, and (2) frontier model builders may be more careful about data contamination.
- Overfit Models Are Still Capable of Reasoning. The fact that a model is overfit does not mean it is poor at reasoning, merely that it is not as good as the benchmarks would indicate. In fact, many of the most overfit models remain capable of reasoning about and solving novel problems.
- Data Contamination Is Likely Not the Full Explanation for Overfitting. The positive but modest correlation between a model's likelihood of generating GSM8k examples and its GSM8k-GSM1k gap suggests that partial memorization of the test set explains some, but not all, of the observed overfitting. Overfitting can also arise through indirect means, such as model builders collecting training data similar in nature to the benchmark or selecting final checkpoints based on benchmark performance, even if the model itself never saw the GSM8k dataset during training. Conversely, the reverse also holds: small amounts of data contamination do not necessarily lead to overfitting. One way such memorization could be probed is sketched below.
BibTeX Citation
@inproceedings{zhang2024carefulexaminationlargelanguage,
  title={A Careful Examination of Large Language Model Performance on Grade School Arithmetic},
  author={Hugh Zhang and Jeff Da and Dean Lee and Vaughn Robinson and Catherine Wu and Will Song and Tiffany Zhao and Pranav Raja and Charlotte Zhuang and Dylan Slack and Qin Lyu and Sean Hendryx and Russell Kaplan and Michele Lunati and Summer Yue},
  booktitle={Advances in Neural Information Processing Systems},
  year={2024}
}