As Scale works with governments and the leading companies around the world, one of the top questions that we are asked is, “How do we ensure that AI is safe to deploy?”
Based on our years of work with the leading frontier model developers and enterprises deploying AI, we believe the best approach is to implement a risk-based test and evaluation framework to ensure that the AI is safe for its intended use case. For decades, test and evaluation has been a standard part of a product development cycle to ensure that products meet the necessary safety requirements to be brought to market or acquired by the government. While it may be most recognized in heavily regulated industries like aviation or medicine, test and evaluation takes place within every industry.
For generative AI, such as large language models (LLMs), test and evaluation, especially red teaming, has the potential to help us understand bias in the models, weed out vulnerabilities, prevent algorithmic discrimination, and ensure that LLM developers and adopters understand the opportunities and limitations of their models.
Today, given that LLMs are still relatively new, the methodologies and industry-wide standards to govern test and evaluation for generative AI are still being established. To help move the conversation forward, Scale recently published the industry’s first comprehensive technical methodology for scalable and effective LLM test and evaluation based on the best practices we have garnered from our work across the industry.
Scale’s approach features a hybrid method that combines both an automated assessment with human evaluation, resulting in a comprehensive, scalable process. While there is still much work to be done to drive industry consensus around a test and evaluation approach, and codify the standards necessary for test and evaluation of generative AI, this is a critical first step to move AI safety forward.
Here are the key takeaways from Scale’s methodology:
Read the full white paper here and our vision for test and evaluation here.