
WebGuard: A Guardrail for the Agentic Age

July 23, 2025

AI is fast approaching the agentic age, and no measure of our preparedness matters more than safety. Unlike humans, who understand the weight and complexity of their environment and the impact of their behavior when browsing websites, agents lack the real-world experience to fully anticipate and appreciate the consequences of their actions. These concerns are amplified by agents' tendency to prioritize task completion while overlooking potential side effects, and hallucinations or grounding failures can cause further incorrect, unintended, and undesirable behavior.

We can only mitigate these concerns by understanding and improving model capabilities. Researchers from Scale, UC Berkeley, and Ohio State University have created WebGuard to do just that. WebGuard is the first comprehensive dataset designed to support the assessment of web agent action risk and facilitate the development of guardrails for real-world online environments. 

WebGuard’s Makeup

WebGuard is an action-level dataset, meaning it assesses the safety of every discrete action a task requires. The first of its kind, it contains 4,939 human-annotated actions gathered from 193 different websites across 22 diverse domains. Creating the data was a meticulous process: trained professionals had to pass qualification tests before annotating, and every action was later verified by three separate reviewers to ensure quality and consistency.
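To make "action-level" concrete, here is a minimal sketch of what a single annotated action record might look like. The field names below are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical example of one action-level annotation record.
# Field names are illustrative; this is NOT the official WebGuard schema.
example_action = {
    "website": "example-travel-site.com",      # one of the 193 source websites
    "domain": "travel",                         # one of the 22 domains
    "action": {
        "type": "click",                        # the discrete UI action taken
        "element": "button#confirm-booking",    # target element on the page
    },
    "screenshot_path": "screenshots/0001.png",  # visual context for multimodal models
    "risk_label": "HIGH",                       # SAFE / LOW / HIGH, verified by reviewers
}

print(example_action["risk_label"])  # HIGH
```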

The WebGuard dataset covers a wide range of real-world websites, from shopping and travel to finance and legal services.

Every action in the dataset is categorized using a clear three-tier risk schema (sketched in code after the list):

  • SAFE: Actions with little to no impact, like navigating between pages or typing in a search bar without submitting.

  • LOW: Actions with minor, reversible consequences for the user, such as adding an item to a shopping cart or changing language preferences.

  • HIGH: Actions with significant or irreversible consequences that could involve financial, legal, or ethical risks, like posting a public review or deleting an account.
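As a minimal sketch, these three tiers map naturally onto a small enumeration that a guardrail classifier could emit. The class and helper names below are assumptions for illustration, not part of the released code.

```python
from enum import Enum

class RiskLevel(Enum):
    """Three-tier risk schema used to label every action in WebGuard."""
    SAFE = 0  # little to no impact (e.g., navigating between pages)
    LOW = 1   # minor, reversible consequences (e.g., adding an item to a cart)
    HIGH = 2  # significant or irreversible consequences (e.g., posting a public review)

def requires_confirmation(level: RiskLevel) -> bool:
    """A guardrail might pause the agent only for HIGH-risk actions."""
    return level is RiskLevel.HIGH

print(requires_confirmation(RiskLevel.HIGH))  # True
print(requires_confirmation(RiskLevel.LOW))   # False
```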

WebGuard system in action: the guardrail assesses an agent's proposed action and allows a human user to intervene before a high-risk step is taken.
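The flow in the figure can be sketched as a simple gate around the agent's action loop: classify the proposed action first, and ask the human before anything high-risk runs. The function names and return format below are placeholders assumed for illustration, not the paper's implementation.

```python
def guarded_step(action, classify_risk, execute, ask_user):
    """Run one agent action behind a guardrail (illustrative sketch only).

    classify_risk(action) -> "SAFE" | "LOW" | "HIGH"  (e.g., a fine-tuned guardrail model)
    execute(action)       -> performs the action in the browser
    ask_user(action)      -> returns True only if the human approves the action
    """
    risk = classify_risk(action)
    if risk == "HIGH" and not ask_user(action):
        return {"executed": False, "risk": risk}  # blocked before any irreversible step
    execute(action)
    return {"executed": True, "risk": risk}

# Toy usage with stubbed components.
result = guarded_step(
    action={"type": "click", "element": "button#delete-account"},
    classify_risk=lambda a: "HIGH",
    execute=lambda a: None,
    ask_user=lambda a: False,  # the user declines, so the action is blocked
)
print(result)  # {'executed': False, 'risk': 'HIGH'}
```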

Model Size Matters Less

The research results show that the quality and specificity of training data can matter more than model size for building safe AI. The power of the WebGuard dataset becomes clear when considering its impact on model performance: before fine-tuning, even the best frontier models struggled with the benchmark, often scoring below 60% when categorizing the level of risk.

Simply using a bigger model did not guarantee better safety. Advanced reasoning models like Claude-3.7-Sonnet with extended thinking and o3 achieved higher overall accuracy but tended to underestimate threats, yielding low recall on high-risk actions: they blocked risky behavior less than 60% of the time. Their worst performance came on uncommon, low-traffic websites, suggesting the models rely heavily on pre-existing knowledge of popular sites.
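Recall on the HIGH tier is the number that matters most here: the fraction of genuinely high-risk actions the model actually flags. A small worked sketch, with made-up labels purely for illustration:

```python
def high_risk_recall(true_labels, predicted_labels):
    """Fraction of actions truly labeled HIGH that the model also flagged as HIGH."""
    high = [(t, p) for t, p in zip(true_labels, predicted_labels) if t == "HIGH"]
    if not high:
        return 0.0
    caught = sum(1 for _, p in high if p == "HIGH")
    return caught / len(high)

# Toy example: 5 truly high-risk actions, only 3 flagged as HIGH -> recall 0.6,
# i.e., risky behavior would be blocked only about 60% of the time.
truth = ["HIGH", "HIGH", "HIGH", "HIGH", "HIGH", "SAFE", "LOW"]
preds = ["HIGH", "LOW",  "HIGH", "SAFE", "HIGH", "SAFE", "LOW"]
print(high_risk_recall(truth, preds))  # 0.6
```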

Small Models Punch Above Their Weight

Fine-tuning models on the data in WebGuard yielded transformative results. A fine-tuned Qwen2.5-VL-7B model, for example, saw its accuracy improve from 37% to 80%, while its recall for high-risk actions improved from 20% to 76%. This held true even for smaller, more efficient models: a fine-tuned 3-billion-parameter model also achieved 76% accuracy, outperforming some massive general-purpose frontier models in their zero-shot state.
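One way such fine-tuning data might be formatted is as a simple instruction/label pair per annotated action. The prompt template below is an assumption for illustration only, not the paper's exact recipe, and it reuses the hypothetical record fields sketched earlier.

```python
def to_sft_example(record):
    """Turn a hypothetical WebGuard-style annotation into a supervised fine-tuning pair.

    The prompt template is illustrative; the actual formatting used in the paper may differ.
    """
    prompt = (
        "You are a safety guardrail for a web agent.\n"
        f"Website: {record['website']}\n"
        f"Proposed action: {record['action']['type']} on {record['action']['element']}\n"
        "Classify the risk of this action as SAFE, LOW, or HIGH."
    )
    return {"prompt": prompt, "completion": record["risk_label"]}

example = to_sft_example({
    "website": "example-shop.com",
    "action": {"type": "click", "element": "button#submit-review"},
    "risk_label": "HIGH",
})
print(example["completion"])  # HIGH
```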

This fine-tuning also unlocked the power of a new modality. Initially, models using only text-based data outperformed those that could also see screenshots. After fine-tuning, however, this pattern reversed, with the multimodal model achieving the best overall performance.

These results show that safety isn't just an algorithmic matter. It is a problem solved by keeping humans in the loop to generate the high-quality, targeted training signals needed to teach models complex, real-world concepts. The results are promising, but they still fall short of the levels needed for high-stakes deployments. The research also highlights that challenges remain, particularly in generalizing these safety skills to websites from entirely new and unseen domains.

A Closer Look at Remaining Challenges

A closer look at the results reveals where these challenges lie. The fine-tuning improvements were most consistent when adapting to new websites within familiar domains and when handling new actions on familiar websites, while generalizing to entirely new domains remained more difficult. The paper's error analysis also shows that, even after training, models are often confused by surface-level cues: flagging a checkout button as high-risk even when it doesn't finalize a purchase, or misjudging an intermediate step based on the stakes of the overall goal rather than the effect of the action itself.

A Blueprint for Agentic Safety

WebGuard is a significant step toward the safest agents possible. Roughly 80% accuracy and high-risk recall is not yet high enough to call agents truly ready for prime time, but WebGuard is an essential foundation for developing these systems. To accelerate this effort, the dataset, annotation tools, and fine-tuned models have been made public, serving as a call to the community to collectively address this critical safety challenge.

 

