
Test & Evaluate

What is Large Language Model Test & Evaluation?

Continuous Evaluation

Continuously evaluate and monitor the performance of your AI systems.

Red Teaming

Identify severe risks and vulnerabilities for AI systems.

AI System Certification

Forthcoming certifications of AI applications for safety and capability against the latest standards.

Get Early Access

Understand LLM Capabilities, Risks, and Vulnerabilities


Our Approach to Hybrid Test & Evaluation

Hybrid Red Teaming

Red teaming seeks to elicit undesirable model behavior as a way to assess safety and vulnerabilities. The most effective red teaming pairs automated attack techniques with human experts across a diverse threat surface area.


Hybrid Model Evaluation

Continuous model evaluation is critical for assessing model capability and helpfulness over time. A scalable hybrid approach combines LLM-based evaluation with human insight where it is most valuable.



A holistic and effective approach to model test and evaluation requires participation and coordination from a broad ecosystem of institutional stakeholders.


Key Identifiable Risks of LLMs


Expert Red Teamers

  • Red Team

  • Technical

  • Defense

  • Creatives


Trusted by Federal Agencies and World-Class Companies

Automated systems should be developed with consultation from diverse communities, stakeholders, and domain experts to identify concerns, risks, and potential impacts of the system. Systems should undergo pre-deployment testing, risk identification and mitigation, and ongoing monitoring that demonstrate they are safe and effective based on their intended use, mitigation of unsafe outcomes including those beyond the intended use, and adherence to domain-specific standards.

Blueprint for an AI Bill of Rights

Office of Science and Technology Policy, White House

Robust red-teaming is essential for building successful products, ensuring public confidence in AI, and guarding against significant national security threats. Model safety and capability evaluations, including red teaming, are an open area of scientific inquiry, and more work remains to be done.

Moving AI governance forward


Trusted by the world’s most ambitious AI teams.

Meet our customers

  • U.S. Army
  • CDAO
  • U.S. Air Force
  • Defense Innovation Unit
  • OpenAI
  • Meta
  • Cohere


Learn More About Our LLM Capabilities


Visually Explore Embeddings

Get Started Today