

Company Updates & Technology Articles

February 20, 2024


Scale AI Partners with DoD’s Chief Digital and Artificial Intelligence Office (CDAO) to Test and Evaluate LLMs

Scale AI, the leading test and evaluation (T&E) partner for frontier artificial intelligence companies, is proud to share that we are partnering with the U.S. Department of Defense’s (DoD) Chief Digital and Artificial Intelligence Office (CDAO) to create a comprehensive T&E framework for the responsible use of large language models (LLMs) within the DoD.

Through this partnership, Scale will develop benchmark tests tailored to DoD use cases, integrate them into Scale’s T&E platform, and support CDAO’s T&E strategy for using LLMs. The outcomes will provide the CDAO a framework to deploy AI safely by measuring model performance, offering real-time feedback for warfighters, and creating specialized public sector evaluation sets to test AI models for military support applications, such as organizing the findings from after action reports.

This work will enable the DoD to mature its T&E policies for generative AI by benchmarking quantitative performance and assessing qualitative feedback from users. The evaluation metrics will help identify generative AI models that are ready to support military applications with accurate and relevant results using DoD terminology and knowledge bases. The rigorous T&E process aims to enhance the robustness and resilience of AI systems in classified settings, enabling the adoption of LLM technology in secure environments.

Alexandr Wang, founder and CEO of Scale AI, emphasized Scale’s commitment to protecting the integrity of future AI applications for defense and solidifying the U.S.’s global leadership in the adoption of safe, secure, and trustworthy AI. “Testing and evaluating generative AI will help the DoD understand the strengths and limitations of the technology, so it can be deployed responsibly. Scale is honored to partner with the DoD on this framework,” said Wang.

For decades, T&E has been standard in product development across industries, ensuring products meet safety requirements for market readiness, but AI safety standards have yet to be codified. Scale’s methodology, published last summer, is one of the industry’s first comprehensive technical methodologies for LLM T&E. Its adoption by the DoD reflects Scale’s commitment to understanding the opportunities and limitations of LLMs, mitigating risks, and meeting the unique needs of the military. 

Learn more about Scale’s approach to test and evaluation:


February 13, 2024


Accelerate Generative AI Across Your Enterprise with Scale GenAI Platform

2023 ushered in a wave of excitement about Large Language Models (LLMs) and became the year of the Generative AI proof-of-concept. Enterprises experimented with Generative AI and explored how it may impact their business. According to BCG, Generative AI solutions can deliver up to 50% efficiency and effectiveness gains. However, only 10% of enterprises actually have Generative AI models in production.

Throughout the experimentation process, many enterprises learned that out-of-the-box, generative models are not accurate enough at domain-specific tasks, nor do the models have access to proprietary company data. These companies are turning to model customization via fine-tuning and retrieval augmented generation (RAG) to solve this. This model customization enables them to improve performance on domain-specific tasks, maximize ROI through more capable solutions and reduced token usage, and increase confidence in model reliability, safety, and accuracy.

However, many enterprises lack the expertise, tools, and framework needed to build customized Generative AI models and applications at scale. Capturing value from those Generative AI solutions and accelerating that customization capability across an organization is even more challenging. This is why we built Scale GenAI Platform to help enterprises make more cost-effective investments and more easily build, test, and deploy customized Generative AI applications. By leveraging Scale GP, companies can make 2024 the year of deploying GenAI apps to production and creating real business value.

We wanted to not just stand up a demo or POC, but deploy production-ready infrastructure for an initial use case as a foundation for expansion. With Scale GenAI Platform, we were able to quickly start our first use case: a GenAI solution that makes it easy for users across Global Atlantic to get information from our Enterprise Data Hub using natural language. This will help enable data-driven decision making, shortening the time to insights from days or weeks down to seconds.

Padma Elmgart, CTO, Global Atlantic Financial Group


Scale GenAI Platform

With Scale GenAI Platform, customers use their proprietary data to customize GenAI applications. Our customers are building use cases like:

  • Content-generation systems that enable sales teams to be more effective and efficient.

  • Highly customized wealth management copilots that make advisors more effective by helping them tap into their knowledge bases quickly and accurately.

  • Text2SQL business intelligence applications to make analysts more efficient and embed a culture of data-driven decision-making.

These sophisticated organizations regularly operate at the cutting edge of technology. Yet, even these companies found that they did not have the infrastructure and tools to build enterprise-ready Generative AI applications.

To succeed in their Generative AI journey, we learned that companies need the following:

  • Custom models built with proprietary data & experts

  • Focus on expanding to more use cases, not building infrastructure

  • Flexibility in foundation model selection and cloud service provider

  • Test and Evaluation to maximize performance and ensure safe and responsible AI

Scale GP enables companies to accelerate their Generative AI journeys and create real business value from their investments.

Custom models built with proprietary data & experts

Enterprise data is often spread across many different data stores and formats, with varying degrees of accessibility. This data is often poorly formatted, contains inaccuracies, or is incomplete. Fine-tuned foundation models are extremely sensitive to low-quality data, and even one example of poor data can make the difference between a model that is capable at a specific use case and one that is completely useless.

With Scale GenAI Platform, customers tap into Scale’s industry-leading data expertise by leveraging the Scale Data Engine to transform their proprietary data and generate the highest quality training data for their use cases. Scale then uses this training data to deliver fine-tuned models tailor-made for their unique use cases. Combined with our advanced Retrieval Augmented Generation (RAG) tools, customers can build applications that reference and cite their knowledge base for more accurate responses.

Focus on expanding to more use cases, not building infrastructure

Enterprise customers want to get ROI from their Generative AI investments quickly, so they need to accelerate their ability to customize, build, and deploy Generative AI applications. To do this consistently across their organizations is difficult without centralized infrastructure, which is time and resource-intensive to build.

GenAI Platform does all the heavy lifting by providing streamlined and centrally managed infrastructure to accelerate use cases into production and effortlessly scale up the number of Generative AI applications across the enterprise.

Flexibility in foundation model selection and cloud service provider

Enterprises need the flexibility to keep up with the rapidly developing trends in Generative AI and want to avoid lock-in with a solution or provider that cannot consistently keep pace.

Some enterprises use closed-source models like OpenAI’s GPT-4 or Cohere’s Command model, while others opt for open-source models like Meta’s Llama 2. GenAI Platform supports all major open and closed-source foundation, embedding, and reranking models, including GPT-4 and Llama 2. We are also excited to announce that we have now added Cohere’s Command model and rerank technology to GenAI Platform for fine-tuning, inference, and use in RAG workflows.

Similarly, some customers are on AWS, while others are on Azure, Google Cloud Platform, or have a multi-cloud strategy. We built GenAI Platform so our customers can securely customize and deploy enterprise-grade Generative AI Applications in their own VPC, including AWS and Azure. And we are excited to announce that we will soon be coming to the Azure Marketplace.

Test and Evaluation to maximize performance and ensure safe and responsible AI

Like traditional software applications, organizations must test Generative AI applications to ensure they work as intended. Our customers need test and evaluation (T&E) to be confident that their models perform well and are safe and responsible. However, the tooling, processes, and human expertise for testing Generative AI applications are not widely available today.

The T&E features of GenAI Platform enable our customers to be confident in their customized models with human-in-the-loop testing, evaluation, and monitoring.

Our partnership with Scale helped us build robust GenAI custom solutions for our clients, cutting time-to-market in half. Combining BCG's deep sector and functional experience and focus on value with Scale's proven platform and engineering depth in GenAI, we are uniquely differentiated to help companies realize value quickly with GenAI. This includes customized, multi-model, and production-grade solutions on a scalable multi-cloud infrastructure. We're excited to continue to bring these capabilities to market.

Vladimir Lukic, Managing Director & Senior Partner; Global Leader, Tech and Digital Advantage, BCG



2023 was the year of the Generative AI POC, and we believe 2024 is the year of deploying Generative AI applications to production – delivering real business value. With Scale GenAI Platform, it is now possible to accelerate your Generative AI journey and equip your entire organization to customize, build, test, and deploy enterprise-ready Generative AI models and production applications. Learn more about GenAI Platform here or book a demo below to start today.



February 8, 2024


Scale AI Joins U.S. Artificial Intelligence Safety Institute Consortium

Scale AI is proud to announce a new collaboration with the National Institute of Standards and Technology (NIST) in the Artificial Intelligence Safety Institute Consortium (AISIC) to develop science-based and empirically backed guidelines and standards for AI measurement and policy, laying the foundation for AI safety across the world.

The newly formed AISIC unites the leading AI companies and developers, academics, government and industry researchers, and civil society organizations in support of the development and deployment of safe and trustworthy AI, and will contribute to priority actions outlined in President Biden’s Executive Order, including developing guidelines for red teaming, capability evaluations, risk management, safety and security, and watermarking synthetic content. This effort will help ready the U.S. to address the capabilities of the next generation of AI models and systems with appropriate risk management strategies. 

“The U.S. government has a significant role to play in setting the standards and developing the tools we need to mitigate the risks and harness the immense potential of artificial intelligence. President Biden directed us to pull every lever to accomplish two key goals: set safety standards and protect our innovation ecosystem. That’s precisely what the U.S. AI Safety Institute Consortium is set up to help us do,” said Secretary Raimondo. “Through President Biden’s landmark Executive Order, we will ensure America is at the front of the pack – and by working with this group of leaders from industry, civil society, and academia, together we can confront these challenges to develop the measurements and standards we need to maintain America’s competitive edge and develop AI responsibly.”

Scale looks forward to working with NIST and other industry leaders to create the next set of methodologies to promote trustworthy AI and its responsible use. NIST has long been a leader in establishing industry-wide best practices and frameworks for the most innovative technologies. Scale applauds the Administration for its Executive Order on AI and the leadership at the Department of Commerce for recognizing that test & evaluation and red teaming are the best ways to ensure that AI is safe, secure, and trustworthy. In doing so, we not only contribute to the responsible use of AI, but also reinforce the United States’ position as the global leader in artificial intelligence. 

Learn more about Scale’s test & evaluation and red teaming initiatives here:



January 30, 2024


Unraveling the Mysteries of Inter-Rater Reliability

Imagine you have submitted a research paper to a leading conference in the field of AI. Several reviewers will assess your work, each providing a rating from a set of four categories: accept, weak accept, weak reject, and reject. These ratings will play a crucial role in determining whether your work will eventually be accepted or rejected. Ideally, all reviewers should give the same rating, indicating they are applying the rating criteria consistently. However, in practice, their ratings may vary due to their interpretations of the paper's motivation, implementation, and presentation.

To evaluate this variability and ensure a consistent rating process, the concept of inter-rater reliability (IRR) comes into play. Inter-rater reliability refers to statistical metrics designed to measure the level of observed agreement while controlling for agreement by chance. These metrics have also been called inter-rater agreement, inter-rater concordance, inter-coder reliability, and other similar names. Inter-rater reliability is a crucial metric for data collection in many domains that require rater consistency. When inter-rater reliability is high, raters are interchangeable because the ratings are similar regardless of who gives them. When it is low, ratings vary from individual to individual. This could result from factors like poorly calibrated raters, vague rating definitions, subjectivity, or ratings that are inherently indeterminate.

Inter-rater reliability is useful beyond paper reviews to fields like education, medical diagnosis, and psychometrics. It has also recently become a crucial tool in the development of large language models, primarily as a metric for estimating the quality of training data and assessing model performance. The wide range of applications showcases the extensive adaptability of inter-rater reliability. 

While IRR is a powerful statistical tool, it is not as well-known as related methods such as Pearson correlation, and it comes with its own complexities and nuances. This blog seeks to simplify the understanding of inter-rater reliability, offering easy-to-follow calculations and detailed explanations of its foundations. We'll begin with the fundamental concept of percentage agreement, then examine more intricate metrics, including Kappa coefficients and paradox-resistant coefficients. Topics such as validity coefficients and the role of inter-rater reliability in AI will also be discussed. We'll conclude by introducing how inter-rater reliability is employed in Scale’s quality evaluation system and the various initiatives undertaken to guarantee the utmost standard in data production.

Percentage Agreement

Let’s formulate a simple example to demonstrate metric calculation. Imagine we have 50 papers submitted to an AI conference and 2 raters (reviewers) to rate each subject (paper) as either accept or reject. The rating results are summarized in the table below: both raters accepted the paper 15 times; both raters rejected it 20 times; rater A accepted and rater B rejected 11 times; rater A rejected and rater B accepted 4 times.

                  Rater B: accept   Rater B: reject
Rater A: accept          15                11
Rater A: reject           4                20

Before bringing in IRR, a simple way to assess how much two raters agree is to look at how often they give the same rating across all subjects. This approach is known as percentage agreement or observed percentage agreement. In our example, we determine the observed percentage agreement by calculating the fraction of subjects on which both raters agree (both accept or both reject): (15 + 20) / 50 = 0.7.

In contrast to the IRR metrics we discuss later, percentage agreement is easier to understand and visualize because it represents an actual percentage rather than an abstract coefficient. For instance, in our example, a value of 0.7 translates to the two raters agreeing with each other 70% of the time.
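The calculation is a one-liner; here is a minimal sketch in Python (variable names are our own, counts come from the example table):

```python
# Observed percentage agreement for the 2-rater example.
both_accept = 15   # both raters accept
both_reject = 20   # both raters reject
a_acc_b_rej = 11   # rater A accepts, rater B rejects
a_rej_b_acc = 4    # rater A rejects, rater B accepts

total = both_accept + both_reject + a_acc_b_rej + a_rej_b_acc  # 50 papers
p_a = (both_accept + both_reject) / total  # agreeing subjects / all subjects
print(p_a)  # 0.7
```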

Cohen’s Kappa

While observed percentage agreement is simple and easy to understand, it doesn’t account for the fact that raters could agree by random chance. Imagine a situation where each rater flipped a coin to decide whether to accept or reject each paper: the two raters would still agree a considerable fraction of the time, yielding a positive observed percentage agreement.

In 1960, the statistician and psychologist Jacob Cohen, best known for his work in statistical power analysis and effect size, proposed the Kappa coefficient. Cohen’s Kappa uses an estimated percent chance agreement (denoted by Pₑ) to adjust the observed agreement (denoted by Pₐ). Cohen’s Kappa is calculated by comparing observed agreement Pₐ to the perfect agreement of 1 while deducting both by estimated chance agreement Pₑ. This formula has also evolved into a foundational framework for creating other inter-rater reliability metrics, establishing their range from negative infinity to 1.

The observed agreement Pₐ is calculated in the same way as detailed in the previous section. In estimating the percent chance agreement Pₑ, Cohen’s Kappa assumes that each rater randomly decides whether to accept or reject, using a rater’s observed marginal P(accept) as that rater’s propensity to accept. In our example, rater A accepted 26 papers and rejected 24 in total, so A’s propensity to accept is assumed to be 26/50 and A’s propensity to reject is 24/50. Similarly, rater B’s propensity to accept is 19/50, and B’s propensity to reject is 31/50.

Cohen assumes each rater’s rating propensity is pre-determined before assessing the work. The estimated percentage chance agreement is the probability of the two raters independently giving the same rating according to these propensities. Chance agreement occurs when both raters accept a paper (A accept × B accept = 26/50 × 19/50) or both reject it (A reject × B reject = 24/50 × 31/50). The percentage chance agreement is the sum of the two, which is 0.496. Finally, we apply the formula to calculate Cohen’s Kappa and get 0.405.
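The full computation can be sketched in a few lines (variable names are our own, counts come from the running example):

```python
# Cohen's Kappa for the 2-rater, 2-category example.
both_accept, both_reject = 15, 20          # agreeing cells
a_acc_b_rej, a_rej_b_acc = 11, 4           # disagreeing cells
n = both_accept + both_reject + a_acc_b_rej + a_rej_b_acc  # 50 papers

p_a = (both_accept + both_reject) / n      # observed agreement, 0.7

# Marginal propensities estimated from each rater's observed ratings.
a_accept = (both_accept + a_acc_b_rej) / n   # 26/50
b_accept = (both_accept + a_rej_b_acc) / n   # 19/50
p_e = a_accept * b_accept + (1 - a_accept) * (1 - b_accept)  # ~0.496

kappa = (p_a - p_e) / (1 - p_e)
print(round(kappa, 2))  # 0.41
```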

Understanding Cohen's Kappa values, which range from negative infinity to 1, can be challenging. In 1977, Landis and Koch introduced a scale that categorizes Kappa values into specific levels of agreement, a system widely endorsed by researchers. According to this scale, values between (0.8, 1] suggest almost perfect agreement; (0.6, 0.8] suggest substantial agreement; (0.4, 0.6] suggest moderate agreement; (0.2, 0.4] suggest fair agreement; (0, 0.2] suggest slight agreement; and values less than 0 suggest poor agreement. For our example, a Cohen’s Kappa score of 0.405 falls into the category of moderate agreement. 
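As a quick reference, the Landis and Koch scale is easy to encode directly (the function name and string labels below are our own):

```python
def landis_koch(kappa):
    """Map a Kappa value to its Landis & Koch (1977) agreement label."""
    bands = [(0.8, "almost perfect"), (0.6, "substantial"),
             (0.4, "moderate"), (0.2, "fair"), (0.0, "slight")]
    for lower, label in bands:
        if kappa > lower:       # bands are half-open intervals (lower, upper]
            return label
    return "poor"               # zero or negative Kappa

print(landis_koch(0.405))  # moderate
```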

Fleiss’ Kappa & Krippendorff's Alpha

While Cohen's Kappa effectively measures 2-rater agreement, it is not designed for situations where there are multiple raters assessing the subjects. When involving multiple raters, agreement or disagreement is no longer binary. For example, with three raters, you could have two raters assigning the same rating while one assigns a different rating. This situation doesn't represent complete agreement or disagreement. Instead, the notion of agreement transforms into a spectrum.

In 1971, Joseph L. Fleiss, a distinguished statistician, built upon the groundbreaking work of Jacob Cohen by introducing an improved form of the Kappa coefficient. Named Fleiss' Kappa, this version expanded Cohen's original concept to effectively handle situations with multiple raters. Fleiss' Kappa is derived by applying the general formula that adjusts for chance agreement, leveraging Pₐ and Pₑ. Comprehending Fleiss’ Kappa hinges on understanding the calculation of these two parameters. 

Consider an example with 3 raters reviewing 3 subjects, choosing between ‘accept’ or ‘reject’. The observed percentage agreement, Pₐ, is calculated by determining the proportion of agreeing rater pairs out of all possible pairs for each subject, then averaging these proportions across all subjects. In our example, each subject has three rater pairs (AB, AC, BC), and we calculate the agreeing proportion as the observed agreement for each subject. Then we average these proportions to obtain an overall Pₐ.

To estimate the percentage chance agreement (Pₑ), Fleiss' Kappa assumes that a rater's choice of a particular rating equals the overall observed rate of that rating in the data. This method simplifies the problem by assuming uniform rate propensity across all raters, making Pₑ a measure of how often two raters would randomly agree independently. By inserting both Pₐ and Pₑ into the chance adjustment formula, the value of Fleiss' Kappa is derived.
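Putting Pₐ and Pₑ together, a minimal sketch of Fleiss’ Kappa (the function name and the toy ratings are our own):

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' Kappa. `ratings` is a list of per-subject rating lists,
    each rated by the same number of raters."""
    n = len(ratings[0])                      # raters per subject
    p_a = 0.0
    totals = Counter()                       # pooled category counts
    for subj in ratings:
        counts = Counter(subj)
        totals.update(counts)
        # agreeing rater pairs / all rater pairs for this subject
        p_a += sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))
    p_a /= len(ratings)
    # Chance agreement from overall observed category proportions.
    grand = sum(totals.values())
    p_e = sum((c / grand) ** 2 for c in totals.values())
    return (p_a - p_e) / (1 - p_e)

# 3 raters, 3 subjects: two unanimous subjects and one split 2-1.
print(round(fleiss_kappa([["accept"] * 3,
                          ["accept", "accept", "reject"],
                          ["reject"] * 3]), 2))  # 0.55
```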

Krippendorff's Alpha, developed by Klaus Krippendorff, is another popular measure of inter-rater reliability that shares a foundational similarity with Fleiss' Kappa. Generally, Krippendorff's Alpha tends to produce results that are close to those obtained from Fleiss' Kappa. The primary distinction of Krippendorff's Alpha lies in its capacity to accommodate scenarios with missing ratings, where not every rater evaluates each subject. Furthermore, it incorporates subject size correction terms, which lead to more conservative estimations, particularly when the number of raters falls below five.

Level of Measurement

In our example of paper review, there are two rating categories: 'accept' and 'reject.' These categories are distinct and do not intersect. In this case, the rating's level of measurement is classified as nominal, where 'accept' and 'reject' form a dichotomy. Different ratings represent complete disagreement.

If the rating categories are expanded to include 'accept', 'weak accept', 'weak reject', and 'reject', the level of measurement transitions to ordinal, reflecting an inherent order within the categories. This allows for the establishment of a hierarchical sequence from 'accept' through 'weak accept' and 'weak reject', to 'reject'. In this framework, 'weak accept' is perceived as a lower level of agreement with 'reject', rather than being a complete disagreement.

A specific subtype of the ordinal level of measurement is the interval level, characterized by equal distances between adjacent ratings. If the intervals between 'accept', 'weak accept', 'weak reject', and 'reject' are equal, our example could be classified as interval ratings. However, this might not be entirely appropriate as some may perceive the gap between 'weak accept' and 'weak reject' to be larger than the others, indicating a shift in the direction of the outcome. For a clearer understanding, the table below outlines the three levels of measurement we've covered, complete with relevant examples.

  • Nominal: ratings without inherent order. Example: classifying diseases in medical diagnosis.

  • Ordinal: ratings have a meaningful order, but intervals aren't consistent. Example: a 1-5 star rating in app stores, with 1 star typically indicating poor quality, significantly different from higher ratings.

  • Interval: equal intervals between adjacent ratings. Example: IQ scores, where the difference between scores is designed to be consistent, indicating equal increments in intellectual ability.

To transition from nominal to other measurement levels, weights are applied to transform the binary framework of agreement or disagreement into varying degrees of agreement. Typically, an exact match in the rating category is assigned a weight of 1 and a partial match is assigned a weight below 1. The specific scale of these weights is dictated by the mathematical formulation used to measure the distance between ratings. For instance, with our four ordered categories, a linear weight approach decreases the weight of agreement linearly as the distance between ratings increases:

Distance between ratings    Linear weight
0                           1
1                           2/3
2                           1/3
3                           0

Additionally, there are alternative weighting methods such as quadratic, ordinal, and more.

After selecting a weighting method, it should be consistently applied to both the observed percentage agreement (Pₐ) and the percentage chance agreement (Pₑ). Weighted Cohen’s Kappa, a variation of Cohen’s Kappa, incorporates these weights. As the field of inter-rater reliability evolves, incorporating weights has become a standard practice to encompass various levels of agreement in the metrics.
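A minimal sketch of linearly weighted Cohen’s Kappa follows; the helper names and the toy ratings are our own, and the weights follow the linear scheme w = 1 − |i − j| / (q − 1):

```python
def linear_weight(i, j, q):
    """Linear agreement weight between category indices i and j."""
    return 1 - abs(i - j) / (q - 1)

def weighted_kappa(a, b, q):
    """Weighted Cohen's Kappa; a, b are category indices (0..q-1)
    given by the two raters over the same subjects."""
    n = len(a)
    # Weighted observed agreement: average pair weight over subjects.
    p_a = sum(linear_weight(x, y, q) for x, y in zip(a, b)) / n
    # Weighted chance agreement from each rater's marginal proportions.
    pa_i = [a.count(k) / n for k in range(q)]
    pb_j = [b.count(k) / n for k in range(q)]
    p_e = sum(linear_weight(i, j, q) * pa_i[i] * pb_j[j]
              for i in range(q) for j in range(q))
    return (p_a - p_e) / (1 - p_e)

# 4 ordered categories: 0=accept, 1=weak accept, 2=weak reject, 3=reject
a = [0, 1, 1, 2, 3, 3]
b = [0, 1, 2, 2, 3, 2]
print(round(weighted_kappa(a, b, 4), 2))  # 0.71
```

Note that partial matches such as (1, 2) contribute 2/3 of a full agreement rather than counting as pure disagreement.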

Kappa’s Paradox

While Kappa coefficients are designed to adjust for chance agreement in a straightforward manner, they can sometimes result in unexpectedly low values when compared to the observed percentage agreement (Pₐ). Take, for instance, a scenario where 10 cases of both 'accept' are shifted to both 'reject' in our 2-rater example. In this situation, Cohen’s Kappa significantly drops from 0.405 to 0.220, even though the observed agreement percentage remains constant at 0.7.

This phenomenon, known as Kappa’s paradox, is primarily attributed to changes in the percentage chance agreement estimation. When the observed agreement of 0.7 is compared against a chance agreement of 0.496, it is seen as a good agreement, yielding a Cohen’s Kappa of 0.405. However, against a chance agreement of 0.616, the same observed agreement of 0.7 doesn't appear as strong, thus resulting in a lower Cohen’s Kappa of 0.220. Detailed calculations are presented in the illustration below.
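The paradox can be reproduced numerically with a short sketch (the helper function is our own; the counts are the two contingency tables described above):

```python
def cohens_kappa(both_acc, both_rej, a_acc_b_rej, a_rej_b_acc):
    """Cohen's Kappa from the four cells of a 2x2 contingency table."""
    n = both_acc + both_rej + a_acc_b_rej + a_rej_b_acc
    p_a = (both_acc + both_rej) / n
    a_acc = (both_acc + a_acc_b_rej) / n     # rater A's accept marginal
    b_acc = (both_acc + a_rej_b_acc) / n     # rater B's accept marginal
    p_e = a_acc * b_acc + (1 - a_acc) * (1 - b_acc)
    return (p_a - p_e) / (1 - p_e)

# Original table vs. 10 "both accept" cases shifted to "both reject":
# observed agreement stays at 0.7, yet Kappa drops sharply.
print(round(cohens_kappa(15, 20, 11, 4), 2))  # 0.41
print(round(cohens_kappa(5, 30, 11, 4), 2))   # 0.22
```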

Kappa's paradox becomes particularly evident when observed ratings are skewed towards one or a few categories. In such cases, the percentage of chance agreement is often larger than anticipated, due to the data skewness. This can result in a Cohen’s Kappa value that is near zero or even negative, suggesting that the level of agreement is no better than, or even worse than, what would be expected by random chance. This outcome is paradoxical and often counterintuitive, as it contradicts the consistency level many researchers would perceive under these situations.

One explanation of Kappa’s Paradox stems from its approach to estimating the percentage of chance agreement. Remember that Cohen’s Kappa presumes each rater has a certain propensity to rate in a particular way, even before engaging in the rating process. This chance agreement is calculated by considering each rater’s propensity as an independent event. However, there's a catch: the estimation of each rater’s propensity is derived from their observed ratings, which are influenced by the subjects they rate. This reliance on observed ratings compromises the assumption of independent rating, leading to an imperfect estimation of chance agreement. Kappa’s paradox isn't unique to Cohen’s Kappa; both Fleiss’s Kappa and Krippendorff's Alpha encounter similar issues, as they also use observed marginal probabilities to approximate a rater’s propensity.

Paradox-Resistant Coefficients

In 1981, Robert Brennan and Dale Prediger introduced an agreement coefficient, arguably the simplest among all chance-corrected coefficients aimed at resolving Kappa’s paradox. The Brennan-Prediger coefficient is based on the premise that a rater's likelihood of choosing any particular rating is uniformly distributed across all available categories. Therefore, if there are 𝒒 categories, the probability of a rater selecting any one category is 1/𝒒. With this uniform distribution assumption, the expected chance agreement between two raters is calculated as 𝒒×(1/𝒒)² = 1/𝒒. The diagram below contrasts the chance agreement observed in the previous example with the chance agreement calculated using the Brennan-Prediger coefficient.
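Because the chance term is a constant 1/𝒒, the coefficient reduces to a one-line adjustment; a minimal sketch using the running example’s numbers:

```python
# Brennan-Prediger: chance agreement is fixed at 1/q under the
# uniform rating-propensity assumption, regardless of observed marginals.
p_a = 0.7          # observed agreement from the running example
q = 2              # rating categories: accept / reject
p_e = 1 / q        # 0.5
bp = (p_a - p_e) / (1 - p_e)
print(round(bp, 2))  # 0.4
```

The same 0.4 results whether the agreements are balanced or skewed toward one category, which is exactly why the coefficient sidesteps the paradox.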

The Brennan-Prediger coefficient effectively circumvents Kappa’s paradox by consistently providing a constant level of chance agreement, regardless of whether the observed ratings are balanced or skewed. However, it is crucial to validate the assumption of uniform rating propensity when applying the Brennan-Prediger coefficient. In the context of our paper review example, a 50/50 propensity of accepting or rejecting seems more plausible if raters are strictly evaluating based on content criteria without any quotas for acceptance or rejection. Conversely, if a conference is known for its stringent acceptance rates and raters are subconsciously influenced to maintain a similar level of selectivity, then a 50/50 distribution would not be an accurate reflection of their rating propensity.

In 2008, Kilem Gwet developed a new agreement coefficient known as Gwet’s AC, specifically aimed at overcoming the challenges posed by Kappa’s paradox. Additionally, you may come across the terms AC₁ and AC₂ in this context, where AC₁ is used for nominal rating categories and AC₂ applies to other levels of measurement in ratings. 

Gwet’s theory conceptually categorizes subjects into two distinct types: textbook and non-textbook. Textbook subjects have deterministic rating categories, determined by the universally accessible and understandable public knowledge. Non-textbook subjects, conversely, are marked by their non-deterministic nature, where even the collective knowledge of raters fails to provide definitive answers.  Non-textbook subjects could also involve subjective judgment, where ratings are shaped by individual preferences. Textbook subjects are often called 'easy-to-rate', while their non-textbook counterparts are called 'hard-to-rate' due to their inherent complexity. Gwet proposed that chance agreement is particularly relevant in non-textbook subjects, as these often involve raters making decisions based on personal opinions or selecting from multiple viable options at random.

Ideally, distinguishing between textbook and non-textbook subjects allows for a more precise calculation of chance agreement, especially in non-textbook subjects where greater randomness is expected. However, classifying subjects into these categories is not straightforward and demands extensive understanding of the domain knowledge related to the rating task. Gwet's approach involves estimating the probability of a subject being non-textbook based on the observed rate of disagreement, under the assumption that disagreements are more likely to occur in non-textbook cases. While the precise mathematical formulation of Gwet’s AC is complex and beyond the scope of this blog, a key takeaway is that data skewness towards a few ratings would reduce the likelihood of encountering non-textbook subjects. The underlying assumption is that hard-to-rate subjects are rare when most subjects are assigned into one rating category. This feature effectively constrains the level of chance agreement when ratings are skewed.
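While the full derivation is beyond this blog’s scope, the chance term of AC₁ is compact enough to sketch. Assuming the standard formulation Pₑ = 1/(𝒒−1) · Σₖ πₖ(1−πₖ), where πₖ is the proportion of ratings in category k averaged over raters (the function name and inputs below are our own):

```python
def gwet_ac1(p_a, rater_category_props):
    """Gwet's AC1 (assumed standard formulation). p_a: observed agreement;
    rater_category_props: one list of category proportions per rater."""
    q = len(rater_category_props[0])
    n_raters = len(rater_category_props)
    # Average each category's proportion across raters.
    pi = [sum(r[k] for r in rater_category_props) / n_raters
          for k in range(q)]
    p_e = sum(p * (1 - p) for p in pi) / (q - 1)
    return (p_a - p_e) / (1 - p_e)

# Balanced vs. skewed tables from the paradox example:
# AC1 stays stable where Cohen's Kappa collapsed.
print(round(gwet_ac1(0.7, [[26/50, 24/50], [19/50, 31/50]]), 2))  # 0.41
print(round(gwet_ac1(0.7, [[16/50, 34/50], [9/50, 41/50]]), 2))   # 0.52
```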

To mitigate Kappa’s paradox, another empirical strategy is diversifying the subject pool, although its suitability varies with the research context. Theoretically, a diverse subject pool across the rated dimension should result in balanced rating categories, thereby avoiding Kappa’s paradox. To demonstrate, we simulated two scenarios: one rating distribution predominantly featuring papers likely to be rejected and another with a more even mix of papers likely to be accepted or rejected. The results show that Kappa and paradox-resistant coefficients heavily diverge in the skewed scenario but align in the more balanced rating scenario. In practice, leveraging a set of inter-rater reliability metrics can achieve a more reliable agreement evaluation, adaptable to data distribution.


Related Topics In Practice

Inter-rater reliability is a crucial metric for assessing consistency in ratings, but it does not encompass the entirety of data quality assessment, particularly in the context of validity. When a 'true' rate category is identifiable, validity becomes a measure of the accuracy of ratings. It is important to recognize that reliability and validity may not always align, as consensus among raters doesn't guarantee correctness. Inter-rater reliability metrics can be adapted to only consider consensus on the 'true' category, thereby serving as a validity coefficient. Operationally, the determination of the 'true' rating category typically involves setting a clear definition and consulting experts for their assessments.

Validity isn't always applicable, particularly when a clear 'true' rate category is absent. For instance, in subjective scenarios like rating a fitness app, there's unlikely to be a universally correct rating due to personal preference. Without a definitive 'true' rate set by an operational definition, the concept of validity loses relevance. In such scenarios, it's practical to view raters as representing diverse segments of a larger population, which impacts the expectations and interpretations of inter-rater reliability. 

The metrics for inter-rater reliability we've discussed mainly apply to categorical data, often seen in classifications or Likert scale ratings. For continuous data, where ratings might include decimal points, intra-class correlation coefficients (ICC) are more suitable for assessing reliability. Continuous data introduces the possibility of random noise, as even well-aligned raters may have slight differences in their ratings. Intra-class correlation offers a framework to model variations due to the rater, the subject, and noise, providing input to calculate inter-rater reliability in such contexts.
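As an illustration, a one-way random-effects ICC, often written ICC(1,1), can be computed from an ANOVA decomposition of a subjects-by-raters matrix. This is a simplified sketch; production analyses should use a vetted statistics package.

```python
import numpy as np

def icc_1_1(ratings):
    """One-way random-effects ICC(1,1); `ratings` is subjects x raters."""
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    subject_means = ratings.mean(axis=1)
    grand_mean = ratings.mean()
    # Between-subject and within-subject mean squares from one-way ANOVA
    msb = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)
    msw = np.sum((ratings - subject_means[:, None]) ** 2) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)
```

Perfectly aligned raters give an ICC of 1, while ratings dominated by noise push it toward, or below, zero.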

In the field of AI and large language models (LLMs), inter-rater reliability is increasingly used, especially in evaluating human-annotated training datasets and in assessing human evaluations of model performance. Notable applications include Google's use of Krippendorff's Alpha for dataset assessment and Meta's use of Gwet’s AC in model performance evaluation. In our review of published studies, we generally observed higher inter-rater reliability in dataset annotation than in model evaluation, likely due to the increased subjectivity inherent in the model evaluation process. Furthermore, the difficulty of rating a subject can impact IRR, as Anthropic observed that more sophisticated topics receive lower agreement. There's also a growing call in the research community to disclose data collection specifics and IRR measures, as they shed light on data quality and subject difficulty, offering valuable insights for other researchers.

We have also witnessed innovative applications of IRR beyond data annotation and model evaluation. For instance, Stability AI employed rater engagement and IRR in choosing between various user interface and rate category designs for their web application, leading to more data collected with higher agreement. Moreover, a Stanford research team utilized IRR to understand the alignment between humans and language models, particularly on when to initiate grounding actions in a conversation. These expanding applications indicate the growing importance of inter-rater reliability in the ongoing development of AI.


Hopefully this blog has offered a clear and engaging look into the world of inter-rater reliability. Next time you're involved in academic submissions or peer reviews, you might have a deeper understanding of how ratings influence final outcomes and the role of inter-rater reliability. Here are some key takeaways:

  • Inter-rater reliability quantifies agreement among independent raters who rate, annotate, or assess the same subject.

  • Inter-rater reliability is a statistical tool designed to correct for chance agreement. It is applicable in multi-rater contexts and flexible across diverse levels of measurement.

  • Be cautious about Kappa’s paradox in cases of skewed rating distributions. Consider using both Kappa coefficients and paradox-resistant coefficients for a robust evaluation.

  • Agreement doesn't inherently imply accuracy. Validity measurement is essential alongside reliability when a definitive 'true' rating category exists and can be clearly identified.

  • The field of AI is increasingly adopting inter-rater reliability, particularly for assessing human-annotated training data and evaluating model performance.

At Scale, we have created a system for evaluating data production quality that integrates various indicators of quality. We assess the consistency of ratings using IRR, ensuring alignment with our anticipated standards, which vary based on the level of subjectivity of the rating task. We utilize 'golden' tasks, which are rated by experts, to ensure the validity of the data and to guardrail the competency of the data contributors. We perform linguistic analysis to measure the data's diversity in various aspects. We also use machine learning to assess raters' adherence to guidelines and their writing abilities, and to create tools for correcting grammar and syntax errors during annotation. By combining and validating these different quality signals, we establish a comprehensive and dependable system for assessing data quality.

Scale’s commitment to quality doesn't end with evaluation. We proactively apply insights from quality evaluations by regularly enhancing our rater training programs, making rating categories clearer, and engaging top professionals in data contribution. These combined efforts reflect Scale’s dedication to upholding exceptional data quality standards, ensuring we deliver superior data for advanced AI applications.

Looking ahead, Scale remains deeply invested in research and innovation with statistical methods to ensure the highest quality standards for our products and services. As we continue to push the boundaries of quality evaluation systems, we invite you to collaborate with us on this journey to power the world’s most advanced LLMs. 

Contact our sales team to explore how Scale’s Data Engine can help accelerate your data strategy.

A special thanks to the individuals below for their insightful feedback and suggestions for this blog post.

Scale AI: Dylan Slack, Russell Kaplan, Summer Yue, Vijay Karunamurthy, Lucas Bunzel

External: David Stutz, David Dohan



January 24, 2024


2024: The Year of AI Implementation and Legislation

As we step into 2024, it is clear that artificial intelligence (AI) will continue to dominate global government discussions. Last year saw over 100 new requirements for the federal government from the Executive Order, the OMB implementation memo, and the NDAA. Specifically, these actions recognized that AI-ready data is a national asset, initiated AI standards development work through the launch of the US AI Safety Institute, and established provisions that require external Test and Evaluation (T&E) before the government procures AI. Scale strongly supports these developments because they are critical steps to ensure that AI is safe, secure, and trustworthy for its intended use cases.

Government officials are now focused on implementation and legislation, aiming to build on the robust foundation established in 2023 and sustain the momentum. 


More than 100 new requirements must be implemented over the next year or so, and these requirements are in addition to existing ones from the two previous Executive Orders and bipartisan legislation. This has left agencies no shortage of items to prioritize, positions to create, and work products to kick off. As they work through prioritization, the following three items should be top of mind because they underpin the adoption of responsible AI.

  • Test and Evaluation. The Executive Order, accompanying OMB Agency Implementation Memo, and NDAA all included provisions requiring external T&E prior to government procurement of AI. The federal agencies must now determine how to best implement comprehensive AI T&E by August 1, 2024 to ensure that AI is safe to be deployed on government networks. 

  • Standards. Initial work establishing AI best practices and frameworks must be transitioned into Standards Development Organizations. The newly announced U.S. AI Safety Institute, which will be housed within the Department of Commerce, will be critical to build the frameworks necessary to advance AI trustworthiness and will underpin safety techniques such as red teaming and T&E. 

  • Chief AI Officers. The Executive Order establishes the positions of Chief AI Officers at every agency. This is an important step forward to ensuring the efficient adoption of AI at all federal agencies and that the over 700 identified use cases can be carried out. However, on day one, it will be critical that the newly appointed person correctly prioritizes items like AI-ready data strategies to lay the foundation for successful AI adoption. 


One of the biggest questions is what Congress will do this year related to AI governance. The Executive Order and accompanying OMB Implementation Memo established a strong foundation for the United States' approach to AI governance, but gaps still exist around critical topics like commercial AI safety. Key pieces of the EO must be funded and codified to be fully implemented. Thanks to the leadership of key Members of Congress, AI has remained a bipartisan issue, one that everyone recognizes the United States must lead on. To maintain American leadership in AI, it will be critical that Congress works on three key topics this year:

  • Shifting from Learning to Legislating. Since Senate Majority Leader Chuck Schumer announced in April 2023 that AI was a key issue, Congress has prioritized learning about the complexities of AI. This involved hearings, roundtables, hands-on demonstrations, and Insight Forums involving many of the leading technologists and visionaries in the field. As a longtime builder and expert in the AI space, Scale supported these educational objectives by building the testing & evaluation platform for the generative red team efforts at DEF CON 31, providing expert testimony to Congress, sharing our expertise at the Insight Forums, and giving Congressional members and their staff hands-on red teaming experience with generative AI models. It is critical for Congress to leverage these learnings and take action to craft a legislative package that moves U.S. leadership in AI forward.

  • Maintaining a Pro-innovation Approach to AI Safety. The Administration took these steps for government-procured AI systems by establishing sector-specific, risk-based external test and evaluation requirements. However, due to the limits of executive action, it was unable to cover commercial and enterprise AI use cases. Congress must fill this gap and establish an effective approach to safety for all AI use cases.

  • Funding Key Elements of Government AI Use. Recently, federal agencies submitted over 700 different potential use cases for AI. However, despite this clear signal of interest, federal agencies are not funded to take advantage of AI and lack the AI-ready data foundation to do so. It is critical that Congress prioritizes AI funding in the FY25 appropriations process to help agencies build the right data infrastructure and begin funding some of the first use cases for agencies to employ AI.

Moving Forward

Countries globally are accelerating their development of AI, and the United States must harness the full strength of its innovation ecosystem to maintain American AI leadership. Scale looks forward to continuing our work across the industry and federal spaces to help our nation adopt safe, secure, and trustworthy AI.



December 21, 2023


Scale AI and Austin Community College Host First Public Sector Generative AI Hackathon

Scale AI and Austin Community College District (ACC) recently teamed up to host a hackathon that enabled participants to craft prototypes with practical applications using Donovan, Scale’s AI-powered digital staff assistant. The hackathon, held on December 12 at the ACC Rio Grande Campus ACCelerator, brought together a dynamic blend of talent from ACC students and Soldiers from the Army Software Factory, part of the Army Futures Command, to craft pioneering AI solutions with real-world impact. At the core of this collaborative effort was allowing participants to explore application programming interfaces (APIs) and data to solve relevant challenges.

Students and Soldiers competed to build the most sophisticated projects and usage of Donovan’s model-agnostic chat and retrieval features to explore solutions around model evaluation, real-time knowledge sharing around mission-critical data and more. Industry-leading engineers from Scale provided guidance for participants, sharing their knowledge and experiences around AI, engineering, and building careers in this technical field. Participants worked in teams to share their skills and creativity to develop novel solutions leveraging Donovan’s API-driven capabilities in a supportive and engaging environment.

“This hackathon presented an opportunity to not only show students new artificial intelligence tools and technologies, but also allow them to explore, develop, learn and compete in a friendly environment,” said John Brennan, Scale’s General Manager, Public Sector. “Austin is a great talent hub for organizations like the Army Software Factory thanks to schools like ACC and Texas A&M.”

“Partnering with Scale means that we are not only providing students with the opportunity to share and improve their technical skills, but also that we are sharing with students what a career path at a startup could look like,” said Venancio Ybarra, dean of Computer Science/IT at ACC. “Many of our students here who major in STEM are interested in working in AI, and working with Scale for this event is a springboard to cultivate and explore interest in the industry.” 

The winning team built a system using Donovan to aggregate responses from several different large language models (LLMs) to user queries and dynamically return the optimal response based on the average embeddings, showcasing Donovan’s model-agnostic platform in a strong model evaluation use case. 

The second place team leveraged Donovan to build a training tool inspired by Jeopardy™ that generates new game templates based on user entered topics, enabling students to learn new concepts in a fun and engaging way. 

The third place team used Donovan to build an educational research database for law enforcement focused on counter-narcotics. 

This hackathon underscored the power of collaboration within the innovation ecosystem between next-gen tech leaders at ACC, industry pioneers, and future military leaders. The Scale team would like to extend our appreciation to the entire ACC team–both administrators and students–who helped make this event a success!



December 12, 2023


Efficient and Effective Fine-Tuning Using Mixture-of-Experts PEFT

At Scale, we have always believed that building custom LLMs through fine-tuning is key to unlocking greater performance for any given organization’s specific use case. We work with enterprise customers to implement cutting-edge enterprise Generative AI solutions, combining the best large language models with the latest research techniques and balancing effectiveness with efficiency to optimize model performance.

Recently, Parameter-Efficient Fine-Tuning (PEFT) and Mixture-of-Experts (MoE) techniques have risen in popularity — each with its unique focus. While PEFT prioritizes efficiency, MoE pushes the boundaries of model performance. This blog post will briefly explore the core concepts of PEFT and MoE before diving into a new approach that synergistically combines these methods, offering an efficient and effective way to fine-tune large language models.


Parameter-efficient Fine-tuning (PEFT)

Traditional fine-tuning, where each task requires a distinct set of weights, becomes untenable with models scaling to hundreds of billions of parameters. Not only does hosting a different set of weights for each task become inefficient and cost-prohibitive, but reloading weights for various tasks also proves too slow. PEFT techniques address this by modifying only a small portion of the weights relative to the full model size, keeping the bulk of the model unchanged.

PEFT methods typically require a considerably smaller memory footprint (e.g. < 1% of total parameters) while closely approximating the performance of full fine-tuning. These methods can be broadly categorized:

  • Adapters: Techniques that fine-tune a part of the model or insert small, trainable modules between layers, enabling efficient fine-tuning with minimal additional parameters. Examples include BitFit, (IA)³, and LoRA (and its variants).

  • Prompt Tuning: This involves fine-tuning a set of input “prompts” that guide the model’s responses, adapting output with minimal changes to existing parameters. Methods can be either hand-crafted or learned, with examples like Prefix Tuning and P-Tuning.
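As a concrete illustration of the adapter family, here is a minimal LoRA-style wrapper around a linear layer. This is a sketch of the general idea, not the implementation benchmarked in our paper; the class name and hyperparameter defaults are our own.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update B @ A."""
    def __init__(self, linear_layer, r=8, alpha=16):
        super().__init__()
        self.base = linear_layer
        # Freeze the pretrained weights; only the adapter is trained
        self.base.requires_grad_(False)
        # A projects down to rank r; B projects back up and starts at zero,
        # so training begins from the pretrained model's behavior
        self.lora_a = nn.Parameter(torch.randn(r, linear_layer.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(linear_layer.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling
```

Only `lora_a` and `lora_b` receive gradients, which is why the trainable footprint stays tiny relative to the base model.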

Given the multitude of fine-tuning options to choose from, we performed a comprehensive benchmark across these techniques, detailed in our paper, “Empirical Analysis of the Strengths and Weaknesses of PEFT Techniques for LLMs”. Our findings include a detailed decision framework to choose the best technique given the task type (e.g. classification, generation) along with the data volume. For example, there are several dimensions to consider when deciding between memory, performance, and time constraints. In addition, we found that LoRA/(IA)³ could be further optimized by selectively choosing which parts of the model to train, such as only PEFT’ing the last few layers of an LLM, while maintaining performance.

Mixture of Experts (MoE)

Mixture of Experts (MoE) for language models is a modification of the transformer architecture in which the model consists of various ‘expert’ sub-networks. These sub-networks each specialize in different aspects or types of data. In an MoE model, an additional gating/routing mechanism dynamically determines which expert or combination of experts is best suited for a given input during inference. This approach enables the model to handle a wider array of tasks and understand the nuances of different domains better than a monolithic model. By distributing learning across different specialized experts, MoE models can achieve higher performance and scalability. Recent papers such as “GLaM: Efficient Scaling of Language Models with Mixture-of-Experts” and “Mixture-of-Experts with Expert Choice Routing” offer further insights into this approach. In addition, these methods allow us to trade higher memory consumption for more efficient floating-point operations during training and inference.

Parameter-efficient Mixture of Experts

The paper “Pushing Mixture of Experts to the Limit” came to our attention as it heralds a blend of PEFT and MoE to facilitate both efficient and effective fine-tuning. Intrigued by its potential, we implemented and benchmarked the method ourselves to assess its effectiveness. The work proposes MoE variations of two popular adapter PEFT approaches, LoRA and (IA)³, named MoLORA and MoV respectively. However, this method was only evaluated on the FLAN T-5 family, which are encoder-decoder models.

We will provide an overview of the MoV approach and delve into the implementation of one of the proposed methods. MoLoRA can be implemented similarly, with the main differences being: only modifying the key/value matrices, applying matrix multiplication instead of element-wise products, and adding a dimension during computations for the LoRA rank.

What is MoV?

Mixture of Vectors (MoV) builds upon the foundational concept of (IA)³, where a pretrained model remains largely unchanged except for three learned vectors per attention block. These vectors interact element-wise with the key, value, and feed-forward layers within the transformer’s self-attention block. The image below from the original paper provides a clear depiction.

Overview of (IA)³, taken from the paper.

In Mixture of Vectors (MoV), the overall concept remains the same, but instead of learning one vector for each of the three tensors, we learn \(n\) (the number of experts) of them and combine them through a routing mechanism. The diagram below from the paper gives an overview of this.


Next, we will guide you through the implementation of the MoV and Router layers essential for this methodology. Additionally, we’ll discuss adapting these to fine-tune the LLaMA-2 model.


Router Layer

A router is a linear layer that selects which experts to send the input towards. In MoV, the router’s output is combined with the outputs of our experts, which are (IA)³ modules, allowing for conditional computation instead of using all of our parameters.

import torch.nn as nn
import torch.nn.functional as F

class Router(nn.Module):
    def __init__(self, input_dim, num_experts):
        super().__init__()
        # Dense layer mapping each token representation to per-expert logits
        self.ff = nn.Linear(input_dim, num_experts)

    def forward(self, x):
        logits = self.ff(x)
        # Soft routing probabilities over the experts
        probs = F.softmax(logits, dim=-1)
        return logits, probs

First, we define a linear layer that takes in the original input dimension and outputs a tensor with the corresponding number of experts. In the forward call, we first compute the logits with our dense layer and also return our probabilities from a softmax.

MoV Layer

The MoV layer combines the probabilities of the router network along with the outputs of each expert, which is an (IA)³ vector. We are computing the following equation, where \(s_i\) is the routing probability of the current expert, \(x\) is the token representation, and \(E_i\) is the current (IA)³ vector’s output:

\( y = \sum_{i=1}^{n} s_i \,(E_i \odot x) \)

Although we can also implement Top-k routing, where we zero out the non-selected experts, the authors found that soft merging, which is a “weighted average of all experts computed within a specific routing block”, performed the best.

import torch
import torch.nn as nn

class MoV(nn.Module):
    def __init__(self, linear_layer, num_experts):
        super(MoV, self).__init__()
        # Original (frozen) linear layer
        self.original_layer = linear_layer
        self.router = Router(self.original_layer.in_features, num_experts)
        # One (IA)³ rescaling vector per expert
        self.experts = nn.Parameter(torch.ones(num_experts, linear_layer.out_features))

    def prepare_model_gradients(self):
        # Freeze the wrapped layer; only the router and expert vectors are tunable
        self.original_layer.requires_grad_(False)
        self.router.requires_grad_(True)
        self.experts.requires_grad_(True)

    def forward(self, x):
        frozen_output = self.original_layer(x)
        _, gating_probs = self.router(x)
        # Weighted sum of expert vectors: (b, s, e) x (e, d) -> (b, s, d)
        mov_combined = torch.einsum("bse,ed->bsd", gating_probs, self.experts)
        return frozen_output * mov_combined

To implement the MoV layer, we first store the original linear layer along with initializing the Router layer from the previous section and the experts, which are (IA)³ vectors. In the forward pass, we first compute the original output representation and then router probabilities. Afterward, we compute the weighted sum of the expert outputs along with the gating probabilities and rescale the original output. We also provide a prepare_model_gradients() method to set these tunable parameters and freeze the rest of the model in the next part.

Adding MoV to LLaMA-2

We stepped through the implementation specifics, but to integrate these layers into an actual model, we need to iterate through the entire pretrained LLM and selectively apply these MoV layers. For our experiments, we use AutoModelForCausalLM with the Llama-2 decoder-only models.

import re
from operator import attrgetter

def adapt_model_with_moe_peft(model, experts):
    # Only modify the key/value projections and linear activations
    llama_regex_match = "(.*(self_attn|LlamaAttention).(k_proj|v_proj).weight)|(.*LlamaMLP.down_proj.weight)"
    for n, _ in list(model.named_parameters()):
        if re.search(llama_regex_match, n) is None:
            continue
        # Get the module that the parameter belongs to
        module_name = ".".join(n.split(".")[:-1])
        module = attrgetter(module_name)(model)
        module_parent_name = ".".join(n.split(".")[:-2])
        module_key_name = n.split(".")[-2]
        module_parent = attrgetter(module_parent_name)(model)
        # Replace the original linear layer with a MoV wrapper
        setattr(module_parent, module_key_name, MoV(module, experts))

    # Freeze base model and set MoV weights as tunable
    model.requires_grad_(False)
    for m in model.modules():
        if isinstance(m, MoV):
            m.prepare_model_gradients()

When we print the model object, which is a Llama-2-7B model, we can see the defined embedding, 32 decoder layers, and language modeling head. Diving into the decoder, each layer consists of a self-attention layer, a multi-layer perceptron, input normalization, and a post-attention normalization. First, we define a regex to match the (IA)³ implementation, where the parameters of the key/value projections and linear activations are modified. Then, we iterate through the model’s layers to find the valid parameters to inject our MoV layer. Finally, we need to freeze the base model and set the MoV layers as tunable, excluding the original layer. Note that using one expert is equivalent to applying (IA)³.

There are other tricks to improve our training efficiency, such as gradient checkpointing or mixed-precision training. After we add these trainable MoV layers into the decoder model, we can calculate the total number of parameters being tuned. With the Llama-2-7B model, we get:

# experts      MoV (7B) tunable parameters
1              < 0.001%
10             < 0.01%
20             < 0.02%
60             < 0.05%

This is still incredibly memory efficient!
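The percentages above come from a simple ratio of trainable to total parameters; a generic helper (our own, not part of the blog's training code) looks like this:

```python
import torch.nn as nn

def count_trainable(model):
    """Return (trainable, total, trainable %) for a PyTorch module."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total, 100.0 * trainable / total
```

Running this after `adapt_model_with_moe_peft` reports how small the tunable footprint is relative to the full model.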


We evaluate across 4 different datasets:

  • The ScienceQA dataset is generated from elementary and high school multiple choice questions, where we select around 6000 samples (see our blog How to Fine-Tune GPT-3.5 Turbo With OpenAI API for more on fine-tuning with this dataset).

  • Corpus of Linguistic Acceptability (CoLA) uses 23 linguistics publications to evaluate grammaticality with around 10000 samples. 

  • Microsoft Research Paraphrase Corpus (MRPC) uses newswire articles and checks whether the sentence pairs are or are not paraphrases with 5000 pairs.

  • Recognizing Textual Entailment (RTE) uses news and Wikipedia text for textual entailment, where we are given a premise and hypothesis and evaluate whether these texts logically follow (entail) or do not, with around 3000 samples.

All of our listed experiments use the Llama-2-7b model with the MoV technique. For our MoV runs, we default the learning rate to 2e-4 and run for 10 epochs. For full-tuning, we use a learning rate of 3e-5 and 5 epochs. In addition, we select the checkpoint that corresponds to the lowest validation loss. For evaluation, we use exact string match accuracy across the gold label and prediction.
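The exact string match metric is deliberately strict; a minimal version (the function name and whitespace handling are our own) is just:

```python
def exact_match_accuracy(predictions, golds):
    """Fraction of predictions that exactly match the gold label after stripping whitespace."""
    matches = sum(p.strip() == g.strip() for p, g in zip(predictions, golds))
    return matches / len(golds)
```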








[Results table: exact-match accuracy on Science QA, CoLA, MRPC, and RTE for PEFT, MoE PEFT (MoV with 1 to 60 experts), and full-tuning]
From our results, we observe that the MoE PEFT method consistently outperforms PEFT with an average delta of four percentage points. Additionally, in some scenarios, such as with MRPC and RTE, MoV is equal to or better than full-tuning our model. Lastly, we should note that increasing the number of experts does not always translate to better downstream performance. For example, Science QA and MRPC increase in performance as we scale from 1 to 60 experts, suggesting these tasks could benefit from even more experts. CoLA and RTE, however, drop in performance after the 20th and 10th expert, respectively.

We recommend carefully tuning the number of experts while being mindful of the memory overhead. These trends are similarly observed by the authors. In addition, we can empirically state from our results that MoE PEFT helps close the gap between PEFT methods and full-tuning across both encoder-decoder and decoder-only models.


The MoE PEFT methods have great empirical benefits while being extremely memory efficient. We are excited to further experiment with these methods and to provide a working implementation of MoV with Llama-2 models for anyone to try!

As we continue to test more methods, we will add what works best with LLMs to llm-engine, so stay tuned for new changes that you can experiment with on your own! We also incorporate these methods into our Enterprise Generative AI Platform (EGP) and work with customers to fine-tune models for their unique use cases, implement cutting-edge retrieval augmented generation, and help them build Generative AI applications. We will continue to incorporate the latest research and techniques into our open-source packages, products, and processes as we help organizations unlock the value of AI.



December 6, 2023


We Fine-Tuned GPT-4 to Beat the Industry Standard for Text2SQL

Our machine learning team at Scale has recently fine-tuned GPT-4 to achieve state-of-the-art performance (84% accuracy) for generalized text-to-SQL translation on one of the most popular benchmark datasets, the Spider dev set. In this blog post, we will discuss why text2sql is an important use case, why it is hard in practice, where fine-tuning can help, how we implemented a real-world solution, and finally, what our results were.

Why is Text2SQL important?

Most business decisions today are data-driven decisions. This means that organizations collect, aggregate, and interpret large amounts of available information about their business or the market environment with a set of tools and processes that are often summarized as business intelligence or BI. However, obtaining the relevant pieces of information from the vast amounts of available data typically requires analytical expertise (SQL or similar) and knowledge of the relevant databases, dashboards, or related tools. This often creates a massive bottleneck and reliance on data analysts to build these tools, which then proliferate and become hard to navigate. Multi-billion dollar industries have emerged to provide generalist or highly specialized analytics tools to bridge this gap.

The advent of large language models is poised to change this paradigm with the ability to generate SQL queries directly from natural language questions such as “How many vehicles did we sell last year?”. Building generative models that can robustly generate SQL queries for any given set of databases hence has the potential to disrupt an entire industry and truly democratize access to structured data at large.

Why are LLMs still bad at SQL in the real world?

Running some basic tests using models like OpenAI’s ChatGPT provides very promising results:

Also, looking at the leaderboard of benchmark datasets like the well-known Spider dev set makes it appear that the problem is pretty much solved:

However, despite the impressive code generation capabilities of state-of-the-art language models like GPT-4, they are not immediately useful for generating queries that run on custom, real-world databases. First of all, the LLMs do not know the schema of the databases in question out of the box. The most obvious solution is to provide the schema to the model in addition to the prompt.

However, in many cases, real-world databases will have hundreds of columns with custom names. The schema might not fit into the context window of the prompt and even if it does, the model still does not understand the meaning of the column names and how they relate to each other. For example, does a “date” column in a vehicle sales database record the time the sale was recorded or the time the sale took place? A very robust understanding of typical business terms for the given databases and the column contents is essential, especially to correctly apply aggregation and window functions. The relationships between multiple table schemas are also difficult to convey in the prompt, but this is required for slightly more complex operations like JOINs.
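The "provide the schema in the prompt" baseline can be sketched as follows; the prompt format and function name are illustrative assumptions, not our production prompt.

```python
def schema_to_prompt(tables, question):
    """Serialize a full database schema into a text2sql prompt (naive baseline)."""
    ddl_lines = []
    for table, columns in tables.items():
        cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
        ddl_lines.append(f"CREATE TABLE {table} ({cols});")
    schema = "\n".join(ddl_lines)
    return f"{schema}\n\n-- Answer the question with a SQL query.\n-- Question: {question}\nSELECT"
```

This breaks down exactly as described above once the schema grows to hundreds of columns: the serialized schema overflows the context window, and the model still lacks the business semantics behind names like a bare "date" column.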

How can fine-tuning and retrieval help to resolve these challenges?

At Scale, we are working with enterprise customers across many industries to build customized Generative AI solutions for their respective use cases. In most of these applications, we fine-tune an underlying base model to solve the relevant business problems at the required accuracy level. Fine-tuning not only can improve the performance of a model for a given task, but can also drive model safety and alignment, ensuring a certain tone and behavior. It is also a good way to improve ROI as it can be used to teach smaller (and cheaper) models a very specific skill and eventually even outperform much bigger, generalized models at this task.

Fine-tuning is an effective way to improve the specificity of a certain skill that the model is capable of performing but has not yet mastered. It can be used to teach a model highly specific terms and instructions and improve its capabilities. A good way to figure out if fine-tuning is going to work is by experimenting with prompt engineering. As a rule of thumb, if prompt engineering shows promising results, then fine-tuning will likely be effective for the given task.

Conversely, fine-tuning is not a good way to add new data or knowledge, such as the database schema or even detailed explanations of columns and their relationships to the model. Instead, this type of context information is best infused into a model using Retrieval Augmented Generation or RAG (see this recent blog post for a deep dive on RAG). Hence, a real-world solution will likely have to include both fine-tuning and RAG to achieve acceptable results.

How did we implement our solution?

Our solution reflects a system intended for enterprise customers. Accordingly, we benchmarked multiple techniques that could be used for real-world use cases against a baseline: 

  • Full database schema with off-the-shelf model (Baseline)

  • Schema RAG with In-context-learning (ICL)

  • Fine-tuned model with Schema RAG and ICL

We’ll now walk through each of these in more detail.

Database Schema Retrieval

The Spider dataset is the standard benchmark for comparing natural language to SQL models and methods. However, real-world enterprise SQL databases differ from Spider in both size and complexity. Whereas 90% of the databases in Spider’s Train and Dev datasets contain fewer than 50 columns, enterprise databases contain up to and beyond 1,000 unique columns. This discrepancy renders the common approach of providing the entire database schema in the prompt infeasible for real-world use cases, given token limit constraints and the “lost in the middle” problem.

As many SQL queries require only a fraction of all columns, we solve the above dilemma with a fine-tuned retrieval system, which retrieves those database features relevant to a user’s question. Given a customer’s database schema, we can fine-tune a model to learn the unique ways customers refer to their database. Once the embedding model is deployed into the backend of our Enterprise Generative AI Platform (EGP), we can easily create, populate, and query the retrieval Knowledge Base.

from scale_egp.sdk.client import EGPClient
from scale_egp.sdk.models import S3DataSourceConfig, CharacterChunkingStrategyConfig

# Instantiate client
client = EGPClient()

# This is pseudocode: argument names below are illustrative
# Register the fine-tuned embedding model
embedding_model = client.models().create(
    weights_uri="s3://model_weights/presigned_url",
)

# Create a Knowledge Base backed by the embedding model
knowledge_base = client.knowledge_bases().create(
    embedding_model_id=embedding_model.id,
)

# Configure the data source and chunking strategy, then upload
data_source = S3DataSourceConfig(
    s3_bucket="my-schema-bucket",
)
chunking_strategy_config = CharacterChunkingStrategyConfig()

upload = client.knowledge_bases().uploads().create_remote_upload(
    knowledge_base_id=knowledge_base.id,
    data_source_config=data_source,
    chunking_strategy_config=chunking_strategy_config,
)

# Query the schema for a given question
query = "What was last month's total expense for service provider X?"
retrieved_schema = client.knowledge_bases().query(
    knowledge_base_id=knowledge_base.id,
    query=query,
)

In-Context Learning

With the schema retrieval providing a lower token count, we can supply more relevant context to address misalignment between business terms and the database schema. On initial deployment, we collect terms and information relevant to SQL logic, for example, “the term MPG refers to miles per gallon” or “stock means filter on asset_type=’equity’”. This initial retrieval mechanism is the primer in the engine of our data flywheel. As users interact with the tool, we collect real-world samples that can be retrieved for in-context learning. With this additional corpus, we include SQL queries from similar past questions in the context window provided to the LLM.
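The retrieval step above can be sketched as follows. This is a toy illustration: the corpus, helper names, and the word-overlap similarity (a stand-in for the embedding model actually used) are all hypothetical:

```python
import math
from collections import Counter

# Hypothetical corpus of previously collected (question, SQL) pairs
ICL_CORPUS = [
    ("What is the average MPG per model?",
     "SELECT model, AVG(mpg) FROM vehicles GROUP BY model;"),
    ("List all stock holdings",
     "SELECT * FROM holdings WHERE asset_type = 'equity';"),
]

def similarity(a: str, b: str) -> float:
    """Cosine similarity over word counts (toy stand-in for an embedding model)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def retrieve_icl_examples(question: str, k: int = 1):
    """Return the k most similar (question, SQL) pairs for in-context learning."""
    ranked = sorted(ICL_CORPUS,
                    key=lambda pair: similarity(question, pair[0]),
                    reverse=True)
    return ranked[:k]

examples = retrieve_icl_examples("What is the average MPG per sedan model?")
```

The retrieved pairs are then appended to the prompt as worked examples, so the model can imitate query patterns that are known to be correct for this database.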

Fine Tuning

Tying together prompt engineering and the above retrieval mechanisms, we optimize the density of information available to the out-of-the-box model. To get the final accuracy boost, we turn to fine-tuning. For customers, this means collecting real user data and fine-tuning an LLM to learn the nuances of their data and terminology. The fine-tuning not only fills in gaps in the retrieval data but also homes in on trends or relationships unknown to users. Once the model is complete, each step is seamlessly integrated with the EGP SDK.

To prove out the viability of this system, we leveraged OpenAI’s Fine-tuning API to train GPT-3.5 and GPT-4. For each question-query pair in the Spider training set, we used the above methodology to create prompts with at most 20 features and five relevant (question, SQL query) pairs. For each generated prompt, we simply set the target to the respective SQL query. We used the same approach to generate a validation set of (question, query) pairs from the Spider dev set. After packaging the train and validation sets into files, uploading the data, fine-tuning the models, and generating validation predictions were handled entirely by OpenAI’s robust APIs.
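As a rough sketch, the training file preparation might look like the following. The example pair and file path are illustrative; the API calls at the end are shown as comments since they require credentials:

```python
import json
import os
import tempfile

def to_finetune_record(rag_prompt: str, sql_query: str) -> dict:
    """One training example in the chat format expected by OpenAI's Fine-tuning API."""
    return {
        "messages": [
            {"role": "user", "content": rag_prompt},
            {"role": "assistant", "content": sql_query},
        ]
    }

# Illustrative (prompt, target) pair built with the retrieval methodology above
pairs = [
    ("Schema: singer(id, name, age)\nQuestion: How many singers are there?",
     "SELECT COUNT(*) FROM singer;"),
]

path = os.path.join(tempfile.gettempdir(), "train.jsonl")
with open(path, "w") as f:
    for rag_prompt, sql_query in pairs:
        f.write(json.dumps(to_finetune_record(rag_prompt, sql_query)) + "\n")

# Upload and job launch then go through the API (requires an API key):
# from openai import OpenAI
# client = OpenAI()
# train_file = client.files.create(file=open(path, "rb"), purpose="fine-tune")
# client.fine_tuning.jobs.create(training_file=train_file.id, model="gpt-3.5-turbo")
```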

# Create a custom LLM (pseudocode: argument names are illustrative)
llm_model = client.models().create(
    weights_uri="s3://model_weights/presigned_url",
)

class Text2SQLApplication:

    name = "Text2SQL"
    description = "Natural language to SQL queries for My Company"

    def __init__(self, schema_knowledge_base_id: str, icl_knowledge_base_id: str):
        self.schema_kb_id = schema_knowledge_base_id
        self.icl_kb_id = icl_knowledge_base_id

    def create_prompt(self, question: str, schema_chunks: List[str],
                      icl_chunks: List[str]) -> str:
        # Assemble the retrieved schema and example queries into one prompt
        rag_prompt = "\n".join(schema_chunks + icl_chunks + [question])
        return rag_prompt

    def generate(self, question: str, schema_k: int, icl_k: int) -> str:
        # Retrieve relevant schema information
        schema_chunks = client.knowledge_bases().query(
            knowledge_base_id=self.schema_kb_id,
            query=question,
            top_k=schema_k,
        )
        # Retrieve in-context-learning samples
        icl_chunks = client.knowledge_bases().query(
            knowledge_base_id=self.icl_kb_id,
            query=question,
            top_k=icl_k,
        )
        # Generate prompt from retrieval
        rag_prompt = self.create_prompt(question, schema_chunks, icl_chunks)
        # Generate SQL
        generate_response = client.completions().create(
            model=llm_model.id,
            prompt=rag_prompt,
        )
        return generate_response.completion.text

Validating Against Spider

We validate the system and benchmark performance using the Spider dataset. For schema retrieval, we fine-tuned a Sentence Transformer to match questions with their relevant database columns and achieved 97% recall@20. For in-context learning, we leverage an out-of-the-box Sentence Transformer. For both GPT-3.5 and GPT-4, we measure the execution accuracy of generated SQL queries for a baseline (prompt with the entire database), RAG (schema retrieval and in-context learning), and finally the respective model fine-tuned on the RAG prompts. We observe performance improvements at each stage, which results in a best execution accuracy of 83.6% on the Spider Dev set. Thus, we not only achieve state-of-the-art level performance, but have a system optimized to provide enterprise customers with the best commercially available natural-language-to-SQL capability.
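Execution accuracy, the metric used throughout this evaluation, counts a prediction as correct when it returns the same result set as the gold query on the actual database. A minimal sketch using an in-memory SQLite database (the toy `singer` table is illustrative, not from Spider):

```python
import sqlite3

def execution_match(db: sqlite3.Connection,
                    predicted_sql: str, gold_sql: str) -> bool:
    """Execution accuracy check: the predicted query is correct if it
    returns the same result set as the gold query (order-insensitive)."""
    try:
        pred_rows = db.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # queries that fail to execute count as wrong
    gold_rows = db.execute(gold_sql).fetchall()
    return sorted(pred_rows) == sorted(gold_rows)

# Toy database standing in for a Spider dev-set database
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER)")
db.executemany("INSERT INTO singer VALUES (?, ?, ?)",
               [(1, "A", 30), (2, "B", 40)])

# Two syntactically different queries with identical results still match
ok = execution_match(db,
                     "SELECT name FROM singer WHERE age > 35",
                     "SELECT name FROM singer WHERE age >= 40")
```

Note that this match is order-insensitive but not semantics-complete: two queries can coincide on one database instance yet differ on another, which is a known limitation of execution-based evaluation.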

What are the results?

Baseline: Entire Schema in the Prompt

With the Spider validation data, we can calculate a baseline execution accuracy. For each question-query pair, we pack the entire schema of the respective SQL database into the prompt and ask GPT-4 to answer the question given that context. From this simple baseline, GPT-4 achieves an execution accuracy of 70% (a D+).

Adding Schema RAG and In-context learning (ICL)

Using a structured way to find the relevant parts of the DB schema with RAG shows consistent improvements in performance across models. For GPT-3.5, we see a 6-ppt improvement from 60% to 66%, and for GPT-4 a slightly smaller bump from 70% to 73%.

Adding Schema RAG, ICL, and fine-tuning

When additionally fine-tuning the model with specific prompt-response pairs, we see consistent further performance improvements both for GPT-3.5 and GPT-4. The final, fine-tuned GPT-4 with schema RAG and ICL achieves 84% accuracy on Spider, up from the 70% in the baseline version, which marks an impressive 14 ppts improvement. For GPT-3.5 the increase is even more pronounced, reaching 82% (almost as good as GPT-4) with RAG and fine-tuning, which is up 22 ppts from the baseline of using only prompt engineering. For GPT-3.5, the biggest increase is from fine-tuning itself, pushing performance from 66% to 82% with this technique alone.

Below is a comparison of the performance across the three different approaches for both GPT-3.5 and GPT-4:

Approach                              GPT-3.5    GPT-4
Baseline (full schema in prompt)      60%        70%
Schema RAG + ICL                      66%        73%
Schema RAG + ICL + fine-tuning        82%        84%

Let’s look at a practical query example to show the difference between using GPT-4 out of the box versus the RAG and fine-tuned version.

We can see that the fine-tuned model displayed on the right-hand side not only interprets the terms for the natural language query correctly but also applies a better and more efficient query structure, using a subquery instead of a left join.

What’s next?

Our solution is not quite on top of the Spider dev leaderboard, as most of the submitted architectures rely on prompt engineering and data pre-processing that is extremely tailored to this benchmark. However, our model does achieve top-5 performance and, crucially, demonstrates comparable accuracy even when deployed in much more complex, real-world contexts and databases.

If you’re interested in using our Text2SQL model for your business use case or want to learn more about our solutions in fine-tuning and RAG, book a demo below.


December 5, 2023


Introducing Scale’s Automotive Foundation Model

Autonomous vehicle development requires iterative improvements in perception models through a data engine. These data engines currently rely on a set of task-specific models based around a fixed taxonomy of objects and scenarios to identify. However, there are two critical limitations to existing data engines:

  1. Current models are task-specific and limited to a fixed taxonomy. As data requirements for task types and taxonomies change over time, new models must be trained from scratch

  2. Safe deployment requires detecting a ‘long tail’ of rare events and scenarios, which are not captured in any fixed taxonomy

Today, we are introducing Scale’s Automotive Foundation Model (AFM-1): a new model utilizing transformer modules trained on diverse, large-scale street scene data that is capable of handling multiple vision tasks with new taxonomies without requiring fine-tuning. This model is a major step for the autonomous vehicle research community as it is the first zero-shot model that is generally available and reliable enough to use off the shelf for experimentation. It delivers state-of-the-art performance both in zero-shot regimes and once fine-tuned on specific datasets. Check out the demo to try AFM-1 with your own images.

AFM-1 is a single model that has been trained on millions of densely labeled images across five computer vision tasks: object detection, instance segmentation, semantic segmentation, panoptic segmentation, and classification.


AFM-1 Architecture

AFM-1 is a neural network that fuses text and image features and then decodes them through a transformer into segmentation masks, detection bounding boxes, object classes, and a global image class. The model is trained on a large-scale internal dataset to maximize the similarity between language embeddings and object-based visual features. The architecture connects the language and visual features presented in OpenAI’s CLIP research to dense, object-based perception. Training in this open-vocabulary modality with a contrastively-pretrained text embedding model enables two key breakthroughs: 

  1. Segmentation and detection of similar concepts with reduced training data. For example,“traffic light” and “traffic signal” produce nearly identical results. 

  2. Removing the requirement of training a new model any time the taxonomy changes since concepts are based on language similarity and not training output layers hard-coded to a set of output classes.
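The open-vocabulary mechanism behind both points can be illustrated with a toy sketch: each detected object's visual feature is scored against text embeddings of the current taxonomy by cosine similarity, so swapping label sets requires no retraining. The vectors and labels below are made up for illustration; in practice the embeddings come from a contrastively pretrained text encoder (as in CLIP) and the vision backbone:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def classify(object_feature, text_embeddings):
    """Assign the taxonomy label whose text embedding is most similar
    to the object's visual feature; changing the taxonomy just means
    changing this dict, not retraining output layers."""
    return max(text_embeddings,
               key=lambda label: cosine(object_feature, text_embeddings[label]))

# Toy embeddings standing in for real text-encoder outputs
text_embeddings = {
    "traffic light": [0.9, 0.1, 0.0],
    "pedestrian": [0.0, 0.2, 0.9],
}
label = classify([0.8, 0.2, 0.1], text_embeddings)
```

Because similar phrases ("traffic light", "traffic signal") map to nearby text embeddings, they score similarly against the same visual feature, which is what makes the first breakthrough possible.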

Today, changing data taxonomy is slow and cumbersome. With AFM-1, autonomous vehicles programs can iterate on data requirements without being locked into a fixed taxonomy.

We are initially releasing detection and instance, semantic, and panoptic segmentation capabilities, with classification coming in a future release.

State of the Art Performance

AFM-1 is the most advanced vision foundation model for automotive applications in the world. The model reaches state-of-the-art results on Berkeley Deep Drive (BDD) and Cityscapes segmentation in both zero-shot and fine-tuned regimes.

AFM-1 is state of the art across zero-shot benchmarks and, when fine-tuned, on many benchmarks even without hyperparameter tuning.

Benchmarking against historical progress, our improvement on Cityscapes for segmentation is equal to four years of progress by the entire open source community.

Source: Cityscapes Leaderboard

Currently, zero-shot AFM-1 is available in our public demo. We are also working with our customers to offer custom, fine tuned versions of the model. 

Using AFM-1 to Accelerate Autonomy Development

Autonomy programs improve perception by constructing a data engine to iteratively improve neural networks with higher quantities of high-quality data. Data engines require collecting and labeling data, training a model on that data, evaluating where the model is failing, curating the data to improve performance, and then repeating the cycle:

Today, running a data engine requires significant iteration of task-specific models. As data requirements and taxonomies change over time, new models must be trained from scratch. This iteration process not only slows down development speed, but can be costly to label as well.  

AFM-1 will accelerate the data engine through:

Accelerated Curation
Finding the specific self-driving cases that are needed for model improvement, such as specific scenarios with emergency vehicles, is a significant challenge today. With AFM-1, developers can search with open vocabulary queries for small objects independent of taxonomy - without needing to fine tune curation models to a specific taxonomy in advance.

AFM-1 detection of “emergency vehicle”

Ground Truth Data Labeling
Prior to AFM-1, obtaining high-quality training data for autonomous vehicle use cases required a high degree of human-in-the-loop intervention, as standard ML autolabels would not be of high enough quality to iterate on the perception stack. With AFM-1, nascent perception programs can quickly experiment on a taxonomy and generate a testable model without human intervention at a fraction of the cost. For production programs, AFM-1 prelabeling and linters will bring down labeling costs for our customers while reducing turnaround time. 

“This is perfect timing for our roadmap. We want to incorporate auto-labeling into our solution.” Director of Perception, autonomous vehicle company


AFM-1 is not built for direct consumption of inferences in safety-critical applications in autonomous vehicles. Instead, AFM-1 should be used for curation, taxonomy experimentation, pretraining, evaluation, and other offboard tasks. Further, some data domains, such as aerial imagery, are not yet reliably supported.

The open vocabulary nature of AFM-1 brings similar challenges to the evaluation of models as found in generative AI, such as with large language models, where the massive diversity of tasks supported must be evaluated by humans. To this end, we have released our public demo and will have a free trial period in Scale Nucleus.

In future work, we look forward to releasing more details on the evaluation of our automotive foundation models.

The Future of AFM

We believe AFM-1 is a first step in changing the paradigm of how self-driving programs approach perception. In the future, we expect our autolabels not only to become more accurate but also to work across a wider range of modalities, potentially including object attributes, 3D LiDAR data, aerial imagery, visual question answering, and other applications that push generative AI and Reinforcement Learning from Human Feedback (RLHF) into automotive. We also expect to offer substantial model performance gains for our customers as we fine-tune AFM-1 with their data. We believe the possibilities of future iterations of AFM are broad, and we’re excited to gather even more user feedback to discover new use cases that will accelerate the development of autonomous vehicles. 

Using AFM

To try AFM-1 for yourself, please visit our demo environment here. Further, the model will be integrated into Nucleus, our dataset management product, so that you can run the model over your data, view and query predictions in various ways, and utilize the resulting autolabels for model training and experimentation. If you have any questions about how AFM can be used in your self-driving program, or you’re interested in fine-tuning AFM-1 on your own data, you can reach out to your account manager or book a demo below.


November 27, 2023


Scale collaborates with NVIDIA to power the next generation of LLMs with NeMo SteerLM

Scale AI is proud to collaborate with NVIDIA to power the next generation of LLMs for generative AI. By leveraging Scale’s high-quality training datasets and NVIDIA NeMo SteerLM - a simple LLM alignment technique that allows dynamic steering of models during inference - developers can create applications for a variety of enterprise use cases including education, chat bots, and gaming.


High-quality data is a critical resource in training the world’s leading aligned models, including for techniques like SteerLM. Scale’s Data Engine powers the most advanced LLMs with world-class RLHF, expert-generated data and model test & evaluation. Scale produces more RLHF data than anyone else: on track for two million hours (228 years) of data in 2023.


By combining NVIDIA’s latest research in generative AI with Scale’s datasets, this collaboration provides enterprises with high-quality data that can be used to align LLMs to follow instructions. SteerLM demonstrates new capabilities in education, enterprise, gaming, and more. For example, SteerLM enhances education by tailoring language model responses to individual learning preferences such as verbosity and complexity. For gaming, SteerLM strengthens non-player characters (NPCs) by enabling customizable personalities for immersive, human-like interactions during gameplay.


To support generative AI advancements, NVIDIA and Scale have open sourced the dataset. This dataset contains 37k samples annotated along various dimensions, including helpfulness, correctness, complexity, and verbosity. Releasing this evaluation data enables the broader research community to develop language models and continue to innovate. For example, enterprise software developers can leverage NVIDIA’s open source toolkits and advanced models across a variety of industry applications, without having to build LLMs from scratch.


Use Cases for NVIDIA SteerLM


Powered by Scale's training data, the NVIDIA SteerLM technique can unlock new capabilities in education, gaming, enterprise chat bots, and more. Below, we’ve shared a few early use cases for SteerLM. We look forward to seeing developers customize models to fit their own needs across many more applications.


Education: SteerLM can be used to customize LLM responses based on the personalized needs and preferences of teachers and students. Every student learns at their own pace, and personalized instruction is critical to achieving learning outcomes both inside and outside the classroom. SteerLM has knobs for the Complexity and Verbosity of responses, which can be used to adjust how teachers and students would like the LLM to answer their queries. In addition, SteerLM can be trained to be more factually correct and coherent, making it suitable for education settings.


Gaming: NVIDIA integrated SteerLM into an AI system for more interactive and immersive non-playable character (NPC) experiences. SteerLM strengthens AI NPCs by enabling developers to customize their personality for more emotive, realistic dialogues and emotions. Specifically, SteerLM trains models that align generated text with adjustable attributes, allowing developers to easily modify model behavior through attribute sliders rather than full model retraining.


Retail: SteerLM can be used to power chatbots that generate responses tailored to different groups of retail customers. For example, some customers may prefer more detailed explanations, while other shoppers may want shorter answers. SteerLM enables the chatbot to dynamically tune its responses for each specific user.




Looking Ahead


Our SteerLM collaboration demonstrates the immense potential of combining the latest advancements in generative AI with expert-curated datasets. As NVIDIA and Scale continue to partner on additional data initiatives, including multi-turn retrieval augmented generation (RAG), we hope to unlock even more powerful enterprise applications across industries.


Generative AI is evolving rapidly. As techniques like SteerLM become more capable, the need for high-quality training datasets will continue to grow to build better models. Through our world-class Generative AI Data Engine, Scale is committed to delivering the best datasets tailored to your specific needs.


Together, NVIDIA and Scale are excited to open source high-quality datasets that can be used to train LLMs. We look forward to empowering more organizations to leverage NVIDIA’s managed services and models to transform their industries and shape the future of AI.


Learn more about NVIDIA’s SteerLM dataset by checking out the research paper and dataset. Get started with SteerLM today using NVIDIA NeMo, an end-to-end framework for building, customizing and deploying generative AI models anywhere.


Learn more about Scale’s Data Engine and how it can power your team’s model development.
