
TutorBench: A Benchmark To Assess Tutoring Capabilities Of Large Language Models
Rakshith S Srinivasa, Zora Che, Chen Bo Calvin Zhang, Diego Mares, Ernesto Hernandez, Jayeon Park, Dean Lee, Guillermo Mangialardi, Charmaine Ng, Ed-Yeremai Hernandez Cardona, Anisha Gunjal, Yunzhong He, Bing Liu, Chen Xing
As students increasingly adopt large language models (LLMs) as learning aids, it is essential to rigorously evaluate their tutoring capabilities. In this paper, we introduce TUTORBENCH, a dataset and evaluation benchmark designed to assess the core tutoring skills of LLMs. The dataset comprises 1,490 prompts curated by human experts, focused on high-school and AP-level curricula. TUTORBENCH consists of examples drawn from three common tutoring tasks: (i) generating adaptive explanations tailored to a student’s confusion, (ii) providing actionable feedback on a student’s work, and (iii) promoting active learning through effective hint generation. TUTORBENCH is accompanied by a reliable, fine-grained automatic evaluation method that uses an LLM judge and sample-specific rubrics. We evaluate 16 frontier LLMs on TUTORBENCH and present a detailed analysis of their performance and behavior. Our results show that no frontier LLM achieves a score above 56%, leaving substantial room for improvement. We find that LLMs still fall short of exhibiting the full range of tutoring skills needed to guide, diagnose, and support students’ reasoning effectively: all frontier models achieve a pass rate below 60% on the rubric criteria related to these skills. We also find that different model families exhibit distinct strengths and limitations; for example, the Claude models outperform others in supporting active learning but trail behind in the other two use cases. By releasing TUTORBENCH and an accompanying leaderboard, we provide a comprehensive and unsaturated benchmark to guide the development of the next generation of AI tutors.
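To make the evaluation setup concrete, the sketch below illustrates one possible way an LLM judge could score a tutor response against sample-specific rubric criteria and report a pass rate. It is a minimal illustration only: the criterion texts, the `judge_llm` callable, and the weighted aggregation are assumptions for exposition, not TUTORBENCH's actual prompts or implementation.

```python
# Hypothetical sketch of rubric-based LLM-judge scoring; criterion texts,
# the judge interface, and the aggregation are assumptions, not the paper's code.
from dataclasses import dataclass

@dataclass
class RubricCriterion:
    description: str   # e.g. "The hint guides the student without revealing the final answer."
    weight: float = 1.0

def judge_criterion(judge_llm, tutor_response: str, criterion: RubricCriterion) -> bool:
    """Ask an LLM judge whether the tutor response satisfies one rubric criterion."""
    prompt = (
        "You are grading a tutoring response against a single criterion.\n"
        f"Criterion: {criterion.description}\n"
        f"Response: {tutor_response}\n"
        "Answer strictly YES or NO."
    )
    verdict = judge_llm(prompt)  # assumed callable that returns the judge's text output
    return verdict.strip().upper().startswith("YES")

def rubric_score(judge_llm, tutor_response: str, rubric: list[RubricCriterion]) -> float:
    """Weighted fraction of sample-specific rubric criteria that the response passes."""
    total = sum(c.weight for c in rubric)
    passed = sum(
        c.weight for c in rubric if judge_criterion(judge_llm, tutor_response, c)
    )
    return passed / total if total else 0.0
```

Under this kind of scheme, a model's benchmark score would be obtained by averaging `rubric_score` over all prompts, and the per-skill pass rates reported in the paper would correspond to restricting the average to criteria tagged with a given tutoring skill.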