When AI Agents Ask, Attackers Can Answer

When AI agents encounter ambiguous instructions, the right move is to pause and ask for clarification. The problem with this, however, is that asking for clarification opens a new path for malicious instructions to enter the agent's context. The instruction can come from a document or channel the user consults to answer, or from a system intermediating between the user and the agent. What matters is this: the agents often treat clarification responses as task-relevant input, regardless of the source.

To address this problem, researchers at Scale Labs developed a new benchmark called ASPI (Ambiguous-State Prompt Injection). The team found that the same agents that pass current security evaluations are vulnerable to prompt injection attacks the moment they ask a user a question. For several models, attacks succeeded 10x more often.

Most Frontier Models Are Vulnerable

Prompt injection research focuses on malicious content that arrives through the agent's tools, hidden inside emails, documents, or webpages the agent retrieves. Agents are trained to be cautious of that channel, but our researchers found that for most models, the opening that occurs when an agent seeks clarification from the user is a serious vulnerability.

To measure this problem, ASPI tests 728 task-attack scenarios across four domains (workspace, messaging, travel, banking). Built on top of AgentDojo, an established benchmark for prompt injection in tool-using agents, it runs each scenario twice: once with a complete instruction, once with a detail removed so the agent has to ask.

The pattern holds across nearly every model tested. Claude Opus 4.7 was the only model that stayed robust in both settings. GPT-5.5 and GPT-5.4 showed smaller but real gaps. Most other models showed dramatic jumps:

Attack Success Rate (ASR) by model: execution-state tool calls vs. clarification-state ask_user calls. Error bars from the original chart are not shown.

Proprietary frontier models start nearly immune to standard prompt injection and become vulnerable only under clarification. Open-source models start from much weaker baselines and stay weaker across both settings. DeepSeek V3.2 sits at 65% attack success before clarification even enters the picture. For these models, clarification widens an already large gap.

Defenses Help, But Don't Close the Gap

ASPI also tested two lightweight defenses:

A prompt guard that filters suspicious content from user and tool messages
A tool filter that restricts the actions available to the agent before it acts

Both defenses helped, but neither solved the problem:

Model	No defense	Prompt guard	Tool filter
Gemini 3 Flash	35.7%	27.0%	23.9%
Gemini 3.1 Pro	24.3%	12.3%	19.3%

This is meaningful progress, but filters aren’t a total fix because the answer the agent needs and the instruction it should ignore often arrive in the same message. One can't reliably be removed without breaking the other. The clarification turn is a unique security surface; building agents that ask questions safely means designing for that moment and testing it directly.

Failure Can Look Like Success

The most important failure mode in ASPI is an agent that completes the user's task correctly while also carrying out the attacker's instruction, with no sign in the task output that anything went wrong. Many agent benchmarks measure whether the task was completed, but ASPI shows that task completion can coexist with successful attacks. An agent can succeed on every task-completion check while failing on security.
Agent evaluations need to measure utility and security together. Failing at a task is not as bad an outcome as completing the task while quietly executing a harmful task it should otherwise refuse. For teams evaluating agents before deployment, this is the gap that gets missed. Standard evaluations do not catch it, so models can clear current security evaluations and still reach production with the vulnerability live.

The Next Phase of Agent Security

Agents are becoming useful because they can handle incomplete, messy, real-world work. The same messiness creates new attack surfaces. The goal here is to recognize that asking is now a security boundary, and to evaluate it as one. It is necessary to treat responses to those questions as potentially untrusted, regardless of which channel they arrive through.

Utility and security should be scored together, because an agent that completes the task while quietly executing an attacker's instruction is one of the worst outcomes a deployment can produce. The variation across models shows this is solvable. The next step is making clarification-time security a standard part of how agents are tested before they're deployed.

When AI Agents Ask, Attackers Can Answer

Most Frontier Models Are Vulnerable

Defenses Help, But Don't Close the Gap

Failure Can Look Like Success

The Next Phase of Agent Security

Ready to break through your data bottleneck?