
Search-Time Data Contamination
Ziwen Han, Meher Mankikar, Julian Michael, and Zifan Wang
Data contamination traditionally refers to the leakage of evaluation data into model training data, resulting in overfitting to supposedly held-out test sets and compromising test validity. We identify an analogous issue—search-time contamination (STC)—in evaluating search-based LLM agents, which use tools to gather information from online sources when answering user queries. STC occurs when the retrieval step surfaces a source containing the test question (or a near-duplicate) alongside its answer, enabling agents to copy rather than genuinely infer or reason, undermining benchmark integrity.
We find that HuggingFace, an online platform hosting evaluation datasets, appears among retrieved sources in search-based agent logs. Consequently, agents often explicitly acknowledge discovering question-answer pairs from HuggingFace within their reasoning chains. On three commonly used capability benchmarks—Humanity’s Last Exam (HLE), SimpleQA, and GPQA—we demonstrate that for approximately 3% of questions, search-based agents directly find the datasets with ground truth labels on HuggingFace.
As a result, accuracy on the contaminated subset shows non-trivial gains on HLE and SimpleQA. While a 3% contamination rate may matter most for frontier benchmarks such as HLE (where a 1% shift can change the overall ranking), the broader concern is that evaluation results for search-based agents cannot be trusted to the same degree as results for models evaluated without online access. When millions of evaluation queries target the same benchmark, even small, repeated leaks can accelerate the benchmark’s obsolescence, shortening its intended lifecycle.
After blocking HuggingFace, we observe an accuracy drop of approximately 15% on the contaminated subset. We further show through ablation experiments that publicly accessible evaluation datasets on HuggingFace may not be the sole source of STC. We conclude by proposing best practices for benchmark design and result reporting to mitigate this novel form of leakage and ensure trustworthy evaluation of search-based LLM agents. To facilitate auditing of evaluation results, we also publicly release the complete logs from our experiments.
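As a rough sketch of the kind of audit the released logs make possible, the snippet below flags queries whose retrieved sources point at HuggingFace dataset pages. The log format, field names (e.g. retrieved_urls, question_id), and file path are hypothetical illustrations, not the schema of the paper's released artifacts.

```python
import json
from urllib.parse import urlparse

# Hypothetical log format: one JSON object per line, each containing a
# question id and the URLs the search agent retrieved for that question.
LOG_PATH = "agent_logs.jsonl"

# Hosts that serve benchmark datasets and therefore signal possible
# search-time contamination (STC) when they appear in retrieval results.
DATASET_HOSTS = {"huggingface.co", "datasets-server.huggingface.co"}


def is_dataset_url(url: str) -> bool:
    """Return True if the URL points at a HuggingFace dataset page."""
    parsed = urlparse(url)
    return parsed.netloc in DATASET_HOSTS and "/datasets/" in parsed.path


def flag_contaminated(log_path: str) -> list[str]:
    """Collect question ids whose retrieval step surfaced a dataset page."""
    flagged = []
    with open(log_path) as f:
        for line in f:
            record = json.loads(line)
            urls = record.get("retrieved_urls", [])  # hypothetical field name
            if any(is_dataset_url(u) for u in urls):
                flagged.append(record["question_id"])
    return flagged


if __name__ == "__main__":
    contaminated = flag_contaminated(LOG_PATH)
    print(f"{len(contaminated)} potentially contaminated questions")
```

A URL-level check of this kind only flags questions where the retrieval step touched a dataset host; confirming that the agent actually copied the leaked answer still requires inspecting the reasoning chain in the corresponding log entry.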