by Mohamed Elfeki, Tu Trinh, Kelvin Luu, Guangze Luo and Matthew Siegel

Frontier agents solve up to 89% of complex tasks when given complete information. Remove a few critical details, the kind routinely missing from real-world specs, and performance drops to as low as 4%. The agents don't error out. They don't flag uncertainty. They fill in the blanks with confident assumptions, produce plausible wrong output, and move on. This is a dominant failure mode in production agent deployments, and no benchmark has measured it until now.
HiL-Bench (Human-in-Loop Benchmark) is designed to test one thing: does an agent know when to ask for help? It takes well-defined tasks from established benchmarks (SWE-Bench Pro for software engineering, BIRD for text-to-SQL), then introduces realistic information gaps: missing details, ambiguous requirements, contradictory specifications. These gaps aren’t visible from the prompt alone. They surface as the agent explores the codebase, inspects the database, or tries to execute.
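To make the construction concrete, here is a hedged sketch of what a gapped task might look like. The field names (`instruction`, `oracle_answers`) and the example task are illustrative only; HiL-Bench's actual task format differs in detail. The key idea is that the removed specifics are held out as oracle answers that a simulated user reveals only when asked directly.

```python
# Hypothetical illustration of injecting an information gap into an
# otherwise complete task spec. Field names are assumptions, not
# HiL-Bench's real schema.
complete_task = {
    "instruction": (
        "Add pagination to /api/users: return 25 items per page, "
        "sorted by created_at descending."
    ),
    "oracle_answers": {},  # nothing held back; the spec is complete
}

gapped_task = {
    # Page size and sort order removed: the agent must either ask or guess.
    "instruction": "Add pagination to /api/users.",
    # Held-out details, revealed only in response to a direct question.
    "oracle_answers": {
        "page_size": "25 items per page",
        "sort_order": "created_at descending",
    },
}
```

Crucially, nothing in `gapped_task["instruction"]` signals that details are missing; the gap only becomes apparent once the agent starts implementing.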
Current benchmarks can’t see this. They supply complete specifications and reward confident execution. An agent that would silently guess past a missing requirement and get lucky can score identically to one that would have asked. But in production, luck runs out. Wrong assumptions compound: a misread requirement becomes a wrong implementation, which becomes a failed deployment. The agent never flags it because it never recognized the gap in the first place.
With full information, frontier models solve 75% to 89% of tasks. Give them a tool to ask for clarification but let them decide when to use it, and performance stays low, peaking at 38% on SQL and 12% on SWE. The mechanism to ask exists, but not the judgment to use it. The obvious assumption is that agents simply don't ask. The reality is stranger. We analyzed over 3,600 failure traces and found that each model family breaks in its own way.
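For readers unfamiliar with agent tool use, the "tool to ask for clarification" is simply one more function in the agent's tool set, described with a JSON schema like any other. Below is a minimal sketch of what such a tool definition could look like; the name `ask_user` and the description text are assumptions, not HiL-Bench's actual interface.

```python
# Hypothetical sketch of a clarification tool exposed to an agent,
# in the JSON-schema style commonly used for tool/function calling.
# HiL-Bench's real tool definition may differ.
ask_user_tool = {
    "name": "ask_user",
    "description": (
        "Ask the human a clarifying question. Use only when the task "
        "cannot be completed without information that is absent from "
        "the prompt, codebase, or database."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "question": {
                "type": "string",
                "description": "A specific, answerable question.",
            }
        },
        "required": ["question"],
    },
}
```

The benchmark's point is that providing this definition is trivial; the hard part, which the scores above expose, is the agent's judgment about when invoking it is warranted.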

GPT models execute confidently on wrong beliefs. They don't detect the gap; they build on top of it. Claude detects uncertainty but doesn't resolve it. In 45% of its alignment failures, Claude explicitly recognized it was stuck, then submitted an answer anyway. It was the only model to verbalize the problem. Gemini asks more often, but usually too broadly: it identifies gaps, but its questions are too vague to extract the right information.
One model needs to learn to detect uncertainty. Another needs to learn to act on it. The third needs to learn to ask precisely. The help-seeking gap is a training problem, not a capability problem.
Using reinforcement learning, we trained a 32B model to improve its help-seeking behavior. Ask-F1 (our metric for help-seeking quality) improved by 28 percentage points on SQL and 17 on SWE. Task pass rates rose in lockstep.
The most important finding: a model trained exclusively on SQL tasks showed improved help-seeking on software engineering tasks it had never seen. The model didn't learn domain-specific patterns for when to ask; it learned to detect unresolvable uncertainty and act on it.
No matter how capable models become, there will always be context locked in someone's head, buried in tribal knowledge, or sitting in decisions that were never documented. No model can infer its way past that. The defining skill of a reliable agent is selective escalation: knowing when to stop, ask the right question, and keep going. HiL-Bench is the first benchmark designed to measure whether your agent can.