by Zainab Doctor, Bert Herring and Dani Gorman

Across the pharmaceutical industry, executives are under mounting pressure to demonstrate tangible returns on their AI investments. The patient-facing health chatbot has become one of the most attractive early deployments. The logic is straightforward: the use case is well-defined, the starting content base already exists (post-launch marketing materials), the regulatory surface area appears manageable compared to clinical or manufacturing applications, and the value proposition of giving patients faster, more consistent access to medication information is easy to articulate.
However, what appears simple obscures significant risk. A health chatbot is not a search interface or an FAQ replacement. It is a conversational AI system operating in a regulated environment, interacting with patients in real time, often during moments of vulnerability or uncertainty. It must meet technical, clinical, regulatory, and ethical standards, yet in most current deployments, the testing and governance frameworks do not reflect that reality.
Scale's Enterprise Red Team conducted a pilot analysis of a consumer-facing health chatbot live in production at a major pharmaceutical company, simulating real-world user interactions over a focused two-week testing window to surface potential vulnerabilities. The findings illustrate what is at stake when an agent moves to production without investing adequately in continuous safety evaluation.
Standard quality assurance processes are designed to verify that a model performs its intended function. Adversarial red teaming is designed to discover how a model behaves when users deviate from intended use, which is the norm in a public-facing consumer product.
Scale's evaluation identified two distinct categories of failure:
The first category encompasses failures within the chatbot's designed scope: cases where the model attempted to respond to a legitimate health question and produced output that was clinically incorrect, inconsistent, or inappropriate given the user's stated circumstances.
The second category, in some respects, is more concerning; it reflects failures that are entirely outside the chatbot's intended purpose, yet occurred in a live system.
These findings emerged from a production deployment by a leading global pharmaceutical company with the resources and organizational sophistication to build responsibly. Understanding why these failures occur is essential to addressing them.
The most common cause is a mismatch between how the system was built and tested and the range of inputs it will realistically receive. Pharma organizations typically develop health chatbots by fine-tuning or prompting a foundation model on proprietary content, conducting QA to confirm the model answers intended questions accurately, and deploying with a set of policy guidelines in the system prompt. Each of these steps addresses functional performance, but none of them systematically surface harmful behavior under real-world user conditions, adversarial or innocent.
A second contributing factor is the assumption that a chatbot built for a specific purpose will only be used for that purpose. This is not supported by how real-world users actually interact with conversational AI. Users bring crises, confusion, and malicious intent to any accessible interface. A health chatbot from a trusted pharmaceutical brand carries an implicit authority that can make it a particularly attractive target for misuse. We have also found that well-intentioned users can easily stumble into harm.
A third factor is underestimating regulatory complexity. The distinction between providing health information and medical advice has direct implications for Software as a Medical Device classification. This requires balancing the regulatory obligation to present risk information alongside benefit claims. Dosing guidance, off-label use, and adverse event information all carry specific compliance obligations that must be embedded in the system's design, not addressed after the fact through disclaimers.
The following requirements apply to any organization building or operating a consumer health chatbot.
Define scope with precision, not generality. A general instruction to "discuss only approved medication information" is not operationally sufficient. The content policy that governs the system must specify, at a granular level, which question types the model may answer, which require redirection to a healthcare professional, and which must result in a refusal. That policy should be developed collaboratively between medical affairs, regulatory, legal, and product teams, and translated into testable behavioral requirements, not just system prompt language.
Treat refusal as a safety-critical behavior. The most serious failures in our evaluation were failures to refuse. Refusal for content that could facilitate harm to self or others requires specific technical implementation, verification testing, and ongoing monitoring. Safe messaging guidelines for mental health and crisis scenarios should be implemented as a baseline. Refusal and other appropriate behaviors must be maintained through multiple turns of a conversation as well. Most users will not drop the conversation after a single refusal; they may also initiate an ask about a forbidden topic late in a conversation that previously had no red flags. This is difficult to do comprehensively with automated testing alone.
Invest in adversarial safety evaluation as a distinct discipline. Functional QA and adversarial red teaming serve different purposes and neither substitutes for the other. An adversarial evaluation should be conducted by a team whose mandate is to identify harmful outputs, not to confirm correct ones. This applies at pre-launch and on a continuous basis post-deployment. Model behavior may change and even regress through updates and governance frameworks must reflect that.
Establish clinical accuracy as an operational standard. Health chatbots that generate responses dynamically rather than retrieving them from a controlled knowledge base are structurally prone to inconsistency. Any information the model surfaces about clinical topics should be derived from a single, versioned, medically reviewed source of record, with a defined process for keeping that content current as the scientific and regulatory landscape evolves.
Pharmaceutical companies are not new to managing the tension between innovation and patient safety. It is the defining operational challenge of the industry. AI deployment is not categorically different; it is a new domain in which that same discipline must be applied.
Closing the gap between current practice and the risks these systems pose requires treating health chatbots as patient-facing systems operating in a regulated environment, rather than as digital content products. Safety and compliance require continuous evaluation as models are updated, user behavior evolves, and regulatory expectations mature.
Have a specific agent in mind that could benefit from red teaming and safety guardrail evaluation? Request to speak with our team here, we're happy to get into the technical details with you.