
Test & Evaluate

What is Large Language Model Test & Evaluation?

Continuous Evaluation

Continuously evaluate and monitor the performance of your AI systems.

Red Teaming

Identify severe risks and vulnerabilities for AI systems.

AI System Certification

Forthcoming: certification of AI applications for safety and capability against the latest standards.

Get Early Access

Understand LLM Capabilities, Risks, and Vulnerabilities


Approach

Our Approach to Hybrid Test & Evaluation

Hybrid Red Teaming

Red teaming seeks to elicit undesirable model behavior as a way to assess safety and vulnerabilities. The most effective red teaming pairs automated attack techniques with human experts across a diverse threat surface area.
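
Below is a minimal, illustrative Python sketch of how such a hybrid loop can be wired together, assuming a hypothetical attack-prompt generator, a callable wrapping the target model, and an automated risk scorer; none of these names refer to a real API, and the escalation threshold is an assumption for illustration only. Automated attacks run at scale, and only the highest-risk transcripts are handed to human expert red teamers.

# Illustrative hybrid red-teaming pass: the automated stage generates and scores
# attack transcripts, and only the most severe findings are escalated to human
# expert red teamers. All names here are hypothetical placeholders.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Finding:
    prompt: str
    response: str
    risk_score: float  # 0.0 (benign) to 1.0 (severe)

def hybrid_red_team(
    attack_prompts: List[str],                 # produced by an automated attack generator
    target_model: Callable[[str], str],        # callable wrapping the model under test
    risk_scorer: Callable[[str, str], float],  # automated harm/vulnerability classifier
    escalation_threshold: float = 0.7,         # assumed cutoff for human review
) -> List[Finding]:
    """Run the automated pass, then return the transcripts that need human expert review."""
    findings = []
    for prompt in attack_prompts:
        response = target_model(prompt)
        findings.append(Finding(prompt, response, risk_scorer(prompt, response)))
    # Human experts spend their time on the most severe automated findings first.
    escalated = [f for f in findings if f.risk_score >= escalation_threshold]
    return sorted(escalated, key=lambda f: f.risk_score, reverse=True)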


Hybrid Model Evaluation

Continuous model evaluation is critical for assessing model capability and helpfulness over time. A scalable hybrid approach to evaluation leverages LLM-based evaluation combined with human insights where they are most valuable.
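
As one illustration of this pattern, the Python sketch below assumes a hypothetical LLM judge that returns a score and a confidence for each response; the lowest-confidence judgments are routed to human reviewers while the rest keep their automated scores. The names, signature, and routing rule are assumptions for illustration, not a description of any specific evaluation API.

# Illustrative hybrid evaluation: an LLM-based judge scores everything, and a
# fixed human-review budget is spent where the judge is least confident.
# All names here are hypothetical placeholders.

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class EvalItem:
    prompt: str
    response: str

def hybrid_evaluate(
    items: List[EvalItem],
    llm_judge: Callable[[EvalItem], Tuple[float, float]],  # returns (score, confidence)
    human_review_budget: int = 20,                          # assumed reviewer capacity
) -> Dict[str, list]:
    """Score every item automatically, then route the least certain ones to humans."""
    judged = []
    for item in items:
        score, confidence = llm_judge(item)
        judged.append((item, score, confidence))
    judged.sort(key=lambda t: t[2])  # lowest-confidence judgments first
    needs_human = judged[:human_review_budget]
    auto_scored = judged[human_review_budget:]
    return {
        "human_review_queue": [item for item, _score, _conf in needs_human],
        "auto_scores": [(item, score) for item, score, _conf in auto_scored],
    }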


Ecosystem

A holistic and effective approach to model test and evaluation requires participation and coordination from a broad ecosystem of institutional stakeholders.


Risks

Key Identifiable Risks of LLMs

Experts

Expert Red Teamers

  • Red Team
  • Technical
  • Defense
  • Creatives

Trusted

Trusted by Federal Agencies and World-Class Companies

Automated systems should be developed with consultation from diverse communities, stakeholders, and domain experts to identify concerns, risks, and potential impacts of the system. Systems should undergo pre-deployment testing, risk identification and mitigation, and ongoing monitoring that demonstrate they are safe and effective based on their intended use, mitigation of unsafe outcomes including those beyond the intended use, and adherence to domain-specific standards.

Blueprint for an AI Bill of Rights

Office of Science and Technology Policy, White House

Robust red-teaming is essential for building successful products, ensuring public confidence in AI, and guarding against significant national security threats. Model safety and capability evaluations, including red teaming, are an open area of scientific inquiry, and more work remains to be done.

Moving AI governance forward

OpenAI

Trusted by the world’s most ambitious AI teams. Meet our customers

  • U.S. Army
  • CDAO
  • Air Force
  • Defense Innovation Unit
  • OpenAI
  • Meta
  • Cohere

RESOURCES

Learn More About Our LLM Capabilities

SEE IT IN ACTION

Visually Explore Embeddings

Get Started Today