Test & Evaluation: The Right Approach for Safe, Secure, and Trustworthy AI
As Scale works with governments and the leading companies around the world, one of the top questions that we are asked is, “How do we ensure that AI is safe to deploy?”
Based on our years of work with the leading frontier model developers and enterprises deploying AI, we believe the best approach is to implement a risk-based test and evaluation framework to ensure that the AI is safe for its intended use case. For decades, test and evaluation has been a standard part of a product development cycle to ensure that products meet the necessary safety requirements to be brought to market or acquired by the government. While it may be most recognized in heavily regulated industries like aviation or medicine, test and evaluation takes place within every industry.
For generative AI, such as large language models (LLMs), test and evaluation, especially red teaming, has the potential to help us understand bias in the models, weed out vulnerabilities, prevent algorithmic discrimination, and ensure that LLM developers and adopters understand the opportunities and limitations of their models.
Today, given that LLMs are still relatively new, the methodologies and industry-wide standards to govern test and evaluation for generative AI are still being established. To help move the conversation forward, Scale recently published the industry’s first comprehensive technical methodology for scalable and effective LLM test and evaluation based on the best practices we have garnered from our work across the industry.
Scale’s approach features a hybrid method that combines both an automated assessment with human evaluation, resulting in a comprehensive, scalable process. While there is still much work to be done to drive industry consensus around a test and evaluation approach, and codify the standards necessary for test and evaluation of generative AI, this is a critical first step to move AI safety forward.
Here are the key takeaways from Scale’s methodology:
- Comprehensive test and evaluation requires testing of both the model capabilities–defined as a model’s ability to follow instructions responsibly and factually–as well as stress testing of the safety risks associated with the model.
- Scalable test and evaluation can be greatly accelerated with intelligent incorporation of state-of-the-art automated methods for both capability assessment and vulnerability detection, but will still require humans in the loop to ensure accuracy in the most complex situations. We can optimize overall evaluation efficiency without sacrificing accuracy by assessing when it is safe to trust automated approaches through computing quantitative estimates of evaluation model confidence.
- A test and evaluation framework needs to be flexible enough to adapt to the risk associated with the intended use case. For example, AI for medical use cases, such as those governed by the FDA, should require more rigorous test and evaluation than AI being used for generating poetry.