
Building Trust in AI: Our vision for test and evaluation

August 11, 2023

Today, Scale is announcing our Test & Evaluation offering for LLMs, which is already being leveraged by OpenAI and is available in early access to select customers at the forefront of LLM development.

Since large language models broke into the public consciousness, the conversation has largely fallen into two categories: the exciting realm of new possibilities this technology can unlock, and the risks that could come from opening Pandora’s box.

While many would like to paint these as two opposite ends of the spectrum, the reality is that progress in frontier model capabilities must happen alongside progress in model evaluation and safety. This is not only a moral imperative, but a practical one. In order to advance our knowledge and understanding of LLMs – and therefore advance their capabilities – we must understand their weaknesses as much as their strengths. 

Various studies have found that popular LLMs hallucinate or provide inaccurate information anywhere from 20% to 52% of the time. As AI is increasingly adopted by newsrooms, misinformation has the potential to sway global markets and politics. Other issues, such as unqualified advice, can lead to real-world harm: providing false medical information, reinforcing mental health disorders, and more. LLMs can even be tricked into writing malicious code, opening new vectors for cybersecurity attacks.

For AI to reach its full potential, it’s critical that the AI industry work to address these known challenges in parallel with other advances to its models.

Our Approach to Test & Evaluation

At Scale, we have spent more than seven years building deep expertise across the entire lifecycle of developing, fine-tuning, testing, and evaluating AI models. Our role as the data partner for companies like OpenAI, Meta, Microsoft, and Anthropic has given us a purview across the ecosystem that no other company has. It has allowed us to deeply understand the common challenges that must be addressed, how various players across the industry are working to address them, and ultimately to develop our own technology to effectively test and evaluate LLMs.

Today, there is no standard for test and evaluation (T&E) in AI. Our leadership in this space is why, in May, the White House announced that Scale had been selected to develop the T&E platform for DEF CON’s public evaluation of models built by Anthropic, Cohere, Google, Hugging Face, Meta, NVIDIA, OpenAI, and Stability AI.

We’re excited to see the industry come together for this moment, but it is just one step in the right direction. That’s why, today, we are releasing our stance on Test & Evaluation to help set the standard for what an effective T&E approach looks like. As in other realms of cybersecurity, we need to share best practices and methodologies and learn from each other. The industry needs to create standards that companies can follow and be held to. We hope this is just the start of more collaboration across the industry.

Whether you’re a builder or buyer, we want to enable startups, enterprises, and government agencies to test and evaluate AI for safety and security. Scale’s methodology for a robust T&E process and ecosystem spans three key pillars:

  1. Hybrid model evaluation – conducted through a combination of automated evaluations and human expert evaluations to assess model capability and helpfulness over time (a minimal illustration follows this list).
  2. Hybrid red-teaming – in-depth targeting by human experts of specific harms and techniques where a model may be weak, in order to elicit undesirable behavior. This behavior can then be cataloged, added to an adversarial test set for future tracking, and patched via additional tuning.
  3. T&E ecosystem – the paradigm by which T&E should be institutionally implemented. We view this as a question of localizing the necessary ecosystem components and interaction dynamics across four institutional stakeholder groups: frontier model developers, government, enterprises, and third-party organizations.
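
To make pillar 1 a bit more concrete, the sketch below shows one way automated scores and human expert ratings could be blended into a single capability number that is tracked over time. It is a minimal illustration under our own assumptions: every name, scale, and weight in it is hypothetical and does not represent Scale’s actual tooling or API.

  # Hypothetical hybrid-evaluation sketch: blend automated grader scores
  # with human expert ratings. All names, scales, and weights are illustrative.
  from dataclasses import dataclass
  from statistics import mean
  from typing import List, Optional

  @dataclass
  class EvalCase:
      prompt: str
      model_response: str
      automated_score: Optional[float] = None  # e.g. 0-1 score from an automated grader
      human_score: Optional[float] = None      # e.g. 0-1 rating from a domain expert

  def blended_score(cases: List[EvalCase], human_weight: float = 0.7) -> float:
      """Weight expert judgment more heavily; fall back to whichever score exists."""
      per_case = []
      for case in cases:
          if case.automated_score is not None and case.human_score is not None:
              per_case.append(human_weight * case.human_score
                              + (1 - human_weight) * case.automated_score)
          elif case.human_score is not None:
              per_case.append(case.human_score)
          elif case.automated_score is not None:
              per_case.append(case.automated_score)
      return mean(per_case) if per_case else float("nan")

  # Pillar 2 feeds back into pillar 1: prompts that elicited undesirable behavior
  # during red teaming are added to an adversarial test set and re-scored on
  # every new model version, so regressions surface over time.
  adversarial_set: List[EvalCase] = []

Weighting human ratings more heavily here is a judgment call, not a prescription; the right blend, rubric, and cadence would vary by use case and harm category.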

Within this approach, we break down what T&E is, how to do it, and how to implement it within the broader ecosystem. 

Read our full approach to Test & Evaluation here

Our Test & Evaluation Offering

Enabling effective and secure generative AI will require a comprehensive T&E offering encompassing model evaluation, monitoring, and red teaming, as described in our approach to Test & Evaluation.

That’s why, today, we are excited to expand on our critical safety and evaluation work with leading model builders and announce early access availability of Scale’s first T&E offering, which OpenAI is already leveraging to red team its models.

This offering will be available to select customers at the forefront of LLM development, as it is critical that we work together and keep the broader AI community in mind as we build.

Key components of our offering include: 

  1. Continuous Evaluation – continuously evaluate and monitor the performance of your AI systems
  2. Red Teaming – identify severe risks and vulnerabilities in AI systems
  3. AI System Certification – certify AI applications for safety and capability

To learn more, please visit the product page.

Together, we hope to enable safe, secure AI for all.

