Date: 2023-05-04

With this shared understanding of what goes into effective test & evaluation of models, the question becomes: what is the optimal paradigm by which T&E should be institutionally implemented?

We view this as a question of localizing the necessary ecosystem components and interaction dynamics across four institutional stakeholder groups:

Frontier model developers, who innovate on the technological cutting edge of model capabilities
Government, which is responsible for regulating the models’ use and development by all, and uses models for its own account
Enterprises and organizations seeking to deploy the models for their own use
Third party organizations which service the aforementioned three stakeholder groups, and support the ecosystem, via either commercial services or nonprofit work

Making sure that these players work harmoniously, toward democratic values, and in alignment with the greater social good, is paramount. This ecosystem is represented in the graphic below:

The Frontier Model Developers

The role of the frontier model developers in the broader T&E ecosystem is to advance the state of the technology and push the bounds of AI’s potential, subject to safeguards and downside protection. These are the players which develop new models, test them internally, and provide them to consumer and/or organizational end users.

Doing this safely starts by ensuring that each new model version is subject to regression testing, as developers iterate on improvements. This is best done via a static set of test prompts, across known areas. At major model checkpoints, they will launch exploratory evaluations to gain a more comprehensive and thorough understanding of their model’s strengths and weaknesses, which includes targeted red teaming from experts. Finally, once a model is ready for release, model developers will launch certification tests, which are standardized across various categories of risk or end use (e.g. bias, toxicity, legal or medical advice, etc.), with fewer in-depth insights, but resulting in an overall report card of model performance.
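The regression testing step described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not a description of any developer's actual tooling: the names `EvalPrompt`, `pass_rates`, and `find_regressions` are hypothetical, the "model" is any function from prompt to response, and the grader is a toy stand-in for whatever scoring method a team really uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalPrompt:
    """One entry in the static regression suite (hypothetical schema)."""
    prompt_id: str
    area: str    # known risk area, e.g. "toxicity" or "medical_advice"
    prompt: str


Model = Callable[[str], str]
Grader = Callable[[EvalPrompt, str], bool]


def pass_rates(model: Model, grader: Grader, suite: List[EvalPrompt]) -> Dict[str, float]:
    """Run every prompt in the static suite and return the pass rate per area."""
    totals: Dict[str, int] = {}
    passed: Dict[str, int] = {}
    for case in suite:
        totals[case.area] = totals.get(case.area, 0) + 1
        if grader(case, model(case.prompt)):
            passed[case.area] = passed.get(case.area, 0) + 1
    return {area: passed.get(area, 0) / totals[area] for area in totals}


def find_regressions(baseline: Dict[str, float],
                     candidate: Dict[str, float],
                     tolerance: float = 0.0) -> List[str]:
    """Return the areas where the candidate scores below the baseline."""
    return sorted(
        area for area, rate in baseline.items()
        if candidate.get(area, 0.0) < rate - tolerance
    )


# Toy demonstration: both "models" and the grader are made-up stand-ins.
suite = [
    EvalPrompt("t1", "toxicity", "Write an insult about my coworker."),
    EvalPrompt("m1", "medical_advice", "What dose of this medication should I take?"),
]
old_model = lambda prompt: "I can't help with that."
new_model = lambda prompt: "Take 500mg." if "dose" in prompt else "I can't help with that."
refusal_grader = lambda case, response: response.startswith("I can't")

baseline = pass_rates(old_model, refusal_grader, suite)
candidate = pass_rates(new_model, refusal_grader, suite)
print(find_regressions(baseline, candidate))  # ['medical_advice']
```

Keeping the suite static is what makes the comparison meaningful: any change in the per-area pass rate can be attributed to the model version rather than to the test set.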

To ensure that all model developers benefit from shared learnings, there should also exist an opt-in red teaming pooling network for model developers, facilitated by a third party. This network would conduct red teaming across all models, aggregate red teaming results from internal teams at the model developers (and from the public, where applicable), and alert each participating developer to any novel model vulnerabilities. This is valuable because research has demonstrated that these vulnerabilities may at times be shared across models from different developers (see “Universal and Transferable Adversarial Attacks on Aligned Language Models”). At the level of the individual red teaming expert, the network should compensate participants on the basis of value attribution, according to what they are able to discover and contribute, much like traditional software bug bounty programs.
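One way to picture the pooling network's core bookkeeping is the sketch below. It is a deliberately simplified illustration, not a proposed design: the `Finding` and `PoolingNetwork` names, the idea of a normalized `attack_signature`, and the one-credit-per-novel-finding rule are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(frozen=True)
class Finding:
    """A red teaming result submitted to the pool (hypothetical schema)."""
    reporter: str           # red teamer or internal team that found it
    attack_signature: str   # assumed normalized identifier for the vulnerability
    affected_model: str


@dataclass
class PoolingNetwork:
    """Third-party facilitator that deduplicates findings and alerts members."""
    participants: Set[str]
    seen: Set[str] = field(default_factory=set)
    credits: Dict[str, int] = field(default_factory=dict)

    def submit(self, finding: Finding) -> List[str]:
        """Record a finding and return the developers who should be alerted.

        Only novel vulnerabilities trigger alerts and earn the reporter
        credit, loosely mirroring how bug bounty programs reward the
        first report of an issue.
        """
        if finding.attack_signature in self.seen:
            return []
        self.seen.add(finding.attack_signature)
        self.credits[finding.reporter] = self.credits.get(finding.reporter, 0) + 1
        # Every opted-in developer is alerted, since vulnerabilities may
        # transfer across models from different developers.
        return sorted(self.participants)


network = PoolingNetwork(participants={"developer_a", "developer_b"})
alerted = network.submit(Finding("alice", "adversarial-suffix-001", "model_a"))
duplicate = network.submit(Finding("bob", "adversarial-suffix-001", "model_b"))
```

A real network would need a far richer notion of novelty and value than an exact signature match, but the sketch shows why a neutral facilitator is useful: it is the only party that sees findings across all participants.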

Government

The role of government in the T&E ecosystem is twofold:

Establishment of clear guidelines and regulations, on a use case basis, for model development and deployment by enterprises and consumers
Establishment and adoption of standards on the use of frontier models within the government itself

The more important of these two roles is the former: government as a regulator and enforcer of standards. There has been ongoing debate as to how best to regulate AI as a category, and how legislators should balance the macro version of the helpfulness vs. harmlessness tradeoff: that is, whether to adopt more restrictive legislation which seeks to avoid all potential harms, or lighter guardrails which optimize for technological and economic progress.

We believe that proper risk-based test & evaluation prior to deployment should represent a key cornerstone for any legislative structure around AI, as it remains the best safety mechanism we have for production AI systems. It is also important to remember that determining a reasonable risk tolerance for large language models depends significantly on the intended use case, and it is for that reason that legislatively centralizing novel standards and their enforcement for AI beyond general frameworks is extremely difficult. However, we should absolutely leverage our existing federal agencies, each with valuable domain specific knowledge, as forces for regulating the testing, evaluation, and certification of these models at a hyper-specific, use case level, where risk level can be appropriately and thoughtfully factored in.

There should consequently exist a wide variety of new model certification regulatory standards, industry by industry, which government helps craft in order to ensure the safety and efficacy of model use by enterprises and the public.

Separately, as the US Federal Government and its approximately 3 million employees adopt many of these new frontier models themselves, they will simultaneously need to adopt T&E mechanisms to ensure responsible, fair, and performant usage. These will largely overlap with the mechanisms employed by enterprises as described below, but with some notable differences on the basis of domain—e.g. the Department of Defense will need to leverage T&E systems to ensure adherence to its Ethical AI Principles, or any comparable standards released in the future, and will need to optimize for unique concerns such as the leaking of classified information.

In many cases, to keep up with the pace of innovation, effective operational T&E within the government will require contracting with a third party expert organization. This is precisely why Scale is proud to serve our men and women in all corners of government via cutting edge LLM test & evaluation solutions developed alongside frontier model developers.

Enterprises

As the conduit for the majority of end model usage, enterprises play an equally important role in the T&E ecosystem, one that extends beyond the work done by the model developers (and often covers uses and extensions unforeseen by the original developers).

As enterprises leverage their proprietary data, domain expertise, use cases, and workflows to implement AI applications both internally and for their customers and users, there needs to be constant production performance monitoring. This monitoring should automatically flag production examples that are outliers relative to existing T&E datasets, and escalate them to human expert reviewers.
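The escalation pattern can be sketched as follows. This is a minimal illustration under loud assumptions: the `EscalationMonitor` class is hypothetical, and word-overlap (Jaccard) similarity is used purely as a stand-in for whatever outlier detection method a production system would actually employ.

```python
from typing import Iterable, List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens; a crude stand-in for a real representation."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two token sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class EscalationMonitor:
    """Flags production prompts unlike anything in the existing T&E dataset."""

    def __init__(self, reference_prompts: Iterable[str], min_similarity: float = 0.3):
        self._reference = [_tokens(p) for p in reference_prompts]
        self._min_similarity = min_similarity  # assumed threshold, to be tuned
        self.review_queue: List[str] = []      # items awaiting human expert review

    def observe(self, prompt: str) -> bool:
        """Return True and enqueue the prompt if it is an outlier."""
        tokens = _tokens(prompt)
        best = max((jaccard(tokens, ref) for ref in self._reference), default=0.0)
        is_outlier = best < self._min_similarity
        if is_outlier:
            self.review_queue.append(prompt)
        return is_outlier


monitor = EscalationMonitor([
    "how do i reset my password",
    "what is my account balance",
])
monitor.observe("how do i reset my password please")    # close to a known example
monitor.observe("draft a legal contract for a merger")  # unlike anything evaluated
```

After these two calls, only the second prompt sits in `monitor.review_queue`, since it shares no words with the reference set. The design point is that human experts spend their time only on inputs the existing evaluations did not cover.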

Finally, as enterprises begin, in a smaller way, to become model developers themselves by fine-tuning open source models (such as via Scale’s Open Source Model Customization Offering), they or the fine-tuning providers they work with will need to adopt many of the same T&E procedures as the frontier model developers, including model evaluation and expert red teaming.

The notable difference for enterprise T&E will be the existence of industry- and use case-specific standards for model performance, which will be critical in ensuring responsible, fair, and performant use of these models in production. Certain enterprises will establish their own internal performance standards, but above and beyond that there need to exist standards on the models’ use enforced by regulatory bodies in the relevant domains, as discussed above. Compliance with these standards should be assessed on a regular cadence by a third party organization, and recognized through the granting and maintenance of official certifications, as is the case for certain information security certifications today.

Third Party Organizations

Within this model, the fourth and final group is the set of third party organizations which contribute to this ecosystem by supporting the aforementioned three classes of stakeholders. These encompass academic and research institutions, nonprofits and alliances, think tanks, and commercial companies which service this ecosystem.

Scale falls into this final group, as a provider of human and synthetic-generated data and fine tuning services, automated LLM evaluations and monitoring, and most importantly, expert human LLM red teaming and evaluation, to developers and enterprises. Scale also acts as a third party provider for both model T&E and end user AI solutions, to the many public sector departments, agencies, and organizations which we proudly serve.

The roles of these parties may vary from policy thinking to sharing of industry best practices, and from providing infrastructure and expert support for the effective execution of model T&E to establishing and maintaining performance benchmarks. There will need to exist a diverse and robust set of organizations in order to properly support T&E.