April 20, 2026

Frontier agents ace tasks with complete specs, then crash to 4% when key details are missing. They never ask for help. HiL-Bench is the first benchmark that tests whether they know when to.
July 28, 2025

Scale AI is one of the world’s leading suppliers of high-quality foundation model training data. In recent months we’ve seen a marked increase in demand for PhD-level, multimodal reasoning data across many domains, including math, coding, science, and the humanities. Because such expert-level data keeps getting harder to produce and verify, Scale AI invests in frontier research on quality control (QC) with LLM agents, also known as “autoraters.”