Build vs. Buy
A practical guide for enterprise AI leaders on when to build, when to buy, and why a hybrid “build + buy” approach speeds time-to-value while reducing risk.
Enterprise Guide: Build vs. Buy
Enterprise AI leaders face a central tension: the existential risk of being left behind versus the business risk of investing without a clear path to ROI. The biggest AI risk today isn’t just a bad bet, it’s a slow one.
For most companies, the reality of scaling AI is a cycle of progress hampered by:
-
A 70-80% project failure rate.
-
Bottlenecked teams that create huge hidden costs.
-
Stalled projects that lack clear, quantified business value.
These problems force organizations to confront the traditional “build vs. buy” dilemma. Building alone is slow and risky, causing many of the stalled projects and bottlenecks listed above.
This high failure rate is precisely what makes the “buy” option so tempting. It’s why every customer is asking: “Why not just use an OOTB model from OpenAI or Anthropic? Isn’t that ‘good enough’?”
While both approaches can work for simple use cases, each presents significant challenges for high-value AI. Building alone is too slow. But buying alone and subscribing to a generic model locks you into rigid, short-term solutions, creates tech debt, deepens vendor lock-in, and can’t evolve with your business.
The answer is to “buy the build”: buy platforms and partners that give you the ability to build custom AI systems in-house.
The Allure and the Reality of “Build Alone”
For organizations with strong technical talent, building in-house has historically been the winning strategy. It’s a proven model for maintaining control, customization, and owning core IP. It is perfectly logical to assume the same will hold true for AI.
The challenge is that generative AI is different in a few critical ways, presenting new, high-friction obstacles that cause this trusted approach to fail at alarming rates. This is why research from MIT shows that the companies least successful at deploying AI were the ones who tried to build tools themselves, without outside help.
This high project fallout rate is a symptom of two deep, strategic issues:
-
Applying AI to the Wrong Problems: Success requires deep expertise to identify which business challenges are truly suited for AI. As Scale CEO Jason Droege noted in a recent interview, many companies fail because they try to apply the technology to the “wrong kind of problem.” Without expert guidance, teams invest heavily in solutions that AI can’t solve effectively or that don’t deliver meaningful business value. This strategic misstep is often compounded by underestimating the team required to succeed.
-
The True Team Cost and Adoption Risk: A successful AI initiative requires more than just engineers. Without a product-centric team that includes Product Managers and Designers, solutions lack user focus, suffer from low adoption, and fail to be integrated into core business processes.
These challenges are consequential. Every month lost to a slow or stalled build is lost revenue, unrealized efficiencies, and a diminishing competitive advantage.
The Rigidity Trap of “Buy Alone”
The urgency to buy a solution is often a reaction to a problem already underway: 78% of knowledge workers are bringing their own AI tools to work. This “shadow AI” creates serious risk, pressuring leaders to find an official solution. This pressure typically pushes organizations down two common, but flawed, paths.
The first path is simply licensing a foundational model from a major provider. However, as our customers have pointed out, this approach might benefit the employee, but not the company. A model is just an engine, not a complete vehicle. It lacks the critical orchestration and intelligence layer required for enterprise-grade security, integration, and reliable performance.
Worse, this approach means you are failing to capture your own IP. The data generated from your employees’ interactions provides a one-time benefit and then vanishes. It isn’t used to improve your own system. This is why it is critical to own your data, reports, and the feedback loops. Without them, your firm’s knowledge stays scattered, and each employee’s work fails to strengthen your own.
The second path is taking a step up to a packaged tool or service, which has its own set of limitations. These solutions are built for the 80% of a problem that is common to all customers. A business’s true competitive advantage, however, lies in its unique 20%, which is almost always tied to its complex and proprietary data environments. These generic tools simply can’t handle that level of specificity.
Ultimately, both of these paths lead to a tech debt mortgage. Rigidity forces teams to create brittle workarounds. As the business evolves, it inevitably outgrows the tool, leading to a costly “rip-and-replace” project that erases any initial savings.
“Build and Buy” Delivers ROI
The winning strategy resolves the false choice by providing a unified, foundational platform that combines the speed of “buying” with the advantage of a “build” solution.
This “build and buy” model is guided by a core principle: Own what creates a unique business advantage and partner for what provides speed and expertise. This means investing in a centralized AI platform while co-developing the specific application logic that sets the business apart.
This is the strategy that allows you to centralize to lead the market, not chase it, turning your firm’s expertise into true leverage and enabling the reuse of firm-specific patterns. This collaborative model pairs our AI expertise with an organization’s invaluable domain knowledge. A successful AI tool requires an organization’s internal experts to give feedback and constantly improve the system.
Most importantly, this hybrid model is designed to win executive approval. It replaces purely conceptual arguments with the quantified ROI projections leaders require to fund a project. By building on a proven platform with a clear scope for custom development, it delivers a concrete investment case with a predictable path to value.
Activating “Build and Buy” with Scale
What leaders need is a proven partner who can guide them to success without an ulterior motive, like locking them into more compute, selling model credits, or profiting off training with their enterprise data.
Scale activates the “Build and Buy” strategy by acting as that partner, combining our technology, our team, and our access to the latest research. This hands-on, co-development approach ensures our partners build valuable, proprietary IP tailored to their exact needs, not a generic solution that can’t evolve.
-
How We Accelerate Speed-to-Value: Our foundational platform and forward-deployed teams eliminate common bottlenecks, moving projects from concept to production in months, not years.
-
How We Reduce Risk: We bring experience from solving the hardest, most complex AI problems, which is supported by our world-class evaluation suite, security, and red-teaming services. This expertise helps identify the right problems to solve and avoids costly failures.
-
Engagement Flexibility: Our “mix-and-match” model adapts to each team’s needs. We can fill roles a team doesn’t have—providing a full pod of MLEs, SWEs, and PMs—or work alongside the roles they do, augmenting their existing talent to be maximally capital-efficient.
-
Deployment Flexibility: A huge advantage over rigid SaaS solutions, we provide maximum flexibility with how and where solutions are deployed: in a customer’s own environment, in our cloud, or in a hybrid model to meet any security or data governance requirement.
Together, these elements ensure the result isn’t just a functional tool, but a proprietary AI asset that grows in value.
Bottleneck to Breakthrough
The era of exploration is over and the mandate is now measurable ROI. The best way to move forward is to build smarter, agentic solutions that learn directly from your best people.
At Scale, we translate our deep expertise from working with frontier AI labs into a co-development partnership that converts your team’s knowledge into a proprietary, continuously improving asset. Let’s identify one high-value business outcome and build your first agentic solution, turning your team’s unique expertise into a durable competitive advantage.
If you’d like support evaluating the build-versus-buy decision for your organization, we’re here to help. The companies that get the most out of AI are the ones that think foundationally and long-term and Scale partners with you at every step, ensuring you make the right strategic choices and capture lasting value from your AI investments.
CTA: Visit us to learn more about the enterprise AI offering and request to speak to the team here: https://scale.com/genai-platform
About Scale
Scale is fueling the generative AI revolution. Built on a foundation of high-quality data, frontier-grade expertise, and deep partnerships with leading model builders, Scale enables enterprises to build, evaluate, and deploy reliable AI systems for their most important decisions.
Working with Scale, organizations can rapidly develop custom AI agents that learn their unique workflows, tools, and skills—powered by the Scale GenAI Platform, the industry-leading platform for building and controlling advanced, continuously improving agents.
Learn more about our approach for enterprise AI transformation: https://scale.com/enterprise/agentic-solutions
Guide to AI for the Intelligence Community
This guide covers applications of artificial intelligence for the Intelligence Community.
Introduction
Intelligence continues to act as a crucial lever that provides a superior knowledge advantage for national and homeland security. Technology is paramount to securing and maintaining a competitive edge. One of the goals in the latest National Intelligence Strategy from the Office of the Director of National Intelligence specifically outlines that the Intelligence Community (IC) needs to deliver interoperable and innovative solutions at scale by leveraging state-of-the-art technology deliberately, lawfully, and ethically.
Given the recent breakthroughs in generative artificial intelligence (AI) and large language models (LLMs), the Intelligence Community is considering how to take advantage of these new capabilities to detect, assess, disrupt, and defeat threats to the United States. The U.S. government’s initiative to adopt generative AI is already in motion. Last year, the Department of Defense established a generative artificial intelligence task force to assess, synchronize, and employ generative AI capabilities across the department. Soon after, President Biden issued an executive order on the safe, secure, and trustworthy development and use of artificial intelligence. Following the Executive Order, the Department of Defense and the Defense Intelligence Agency (DIA) both released an AI Strategy.
In this guide, we will dive deeper into the importance and benefits of AI, explore essential use cases, and provide insights on effectively implementing AI within the Intelligence Community. You will gain a comprehensive understanding of how AI can be harnessed to enhance intelligence operations and the decision-making processes.
AI for Intelligence: Why is it Important?
Why are recent advancements in generative AI significant? Generative AI and large language models are unique in their ability to understand and generate data across a variety of different modalities including text, image, video, and audio. These innovations offer an unprecedented level of human-like intelligence and capabilities. For the Intelligence Community, AI can act as a force multiplier for your staff. Case in point, Lakshmi Raman, Chief Artificial Intelligence Officer at the CIA, mentioned that “it’s very important that when we’re using AI systems to collaborate with our officers, we make sure their tradecraft is incorporating this new, sometimes novel technology.”
A few benefits of AI for intelligence include:
Improved Efficiency: AI can process and ‘understand’ the content and context contained within thousands of documents. Such documents include classified intelligence reports, historical briefings, open source intelligence (OSINT) such as leaked datasets, signals intelligence (SIGINT) such as encrypted messages, geospatial intelligence (GEOINT) such as synthetic aperture radar imagery, and much more. AI leverages advanced algorithms to process large volumes of data in minutes, whereas humans can typically read around 250 words per minute. AI can also assist with tasks associated with reading comprehension. Intelligence analysts can use AI to quickly gather context across a large corpus of information, conduct analysis including named entity recognition, and semantic relationship mapping.
Enhance Performance: AI uses algorithms and models to make deterministic and probabilistic predictions that rival and can outperform the speed of human analysis. For example, as part of Project SABLE SPEAR executed by the Defense Intelligence Agency (DIA), a project team worked with a startup to expose an illicit network responsible for the global tracking of fentanyl using AI. Using AI to explore DIA and open-source datasets, the company’s AI methods identified 100 percent more companies engaged in illicit activity, 400 percent more people so engaged, and counted 900 percent more illicit activities. As referenced in the DIA’s Lessons Learned, “the AI approach identified analytically relevant variables that our analysts probably would never have come up with and made instantaneous associations for those variables across multiple, often complex, data sets.”
Novel Insights: The amount of data being generated on a daily basis is already exceeding humans’ capacity to consume, process, and make informed decisions at lightning fast speeds. AI can provide exhaustive analysis and surface data that could be overlooked to assist intelligence units with analytical processes, such as preparing a Joint Intelligence Preparation of the Operational Environment (JIPOE).
Task Assistance: Intelligence units are constrained by manpower, time, and resources. Tasks such as daily briefings can be manual, cumbersome, and result in bottlenecks to information sharing. Generative AI capabilities can accelerate and even automate repetitive tasks. Analysts can use AI to take a first pass at writing a SITREP in the format and tone necessary, while including the information end users need. Analysts can provide feedback to a model to improve performance and refine writing capabilities to meet specific forms and styles.

AI in the Intelligence Community: Use Cases
There are a number of different use cases for AI in Intelligence. In this guide, we will focus on how AI can assist with the intelligence cycle. These operations aim to provide policymakers, military leaders, and other senior government leaders with relevant and timely intelligence. These practices are followed by independent agencies including the ODNI and CIA as well as intelligence elements of other departments and agencies including Department of Treasury, Department of State, Department of Justice, and Homeland Security can explore similar use cases for Intelligence Cycles. Similar practices like intelligence operations for the Joint Intelligence Process are followed by DOD intelligence units.

Planning and Direction
Intelligence officers conduct planning exercises to allocate resources for operations. This includes defining priority intelligence requirements (PIR) and request for information (RFI) management for Intelligence, Surveillance, and Reconnaissance (ISR). Planning requires continuous evaluation, assessment, and updating PIR for intelligence needs.
It is a time and labor intensive process to develop a PIR that prioritizes asset collection and analyzes resources in order to synchronize intelligence assets for a mission. Tactically, this can entail intelligence staff collecting information across cross-functional units like cyber and information operations to define requirements, specifying priorities, and developing and refining tasking requirements. Generative AI can help streamline the drafting process by automating the assembly of comprehensive and precise requirement documents and ensuring alignment with military standards and operational objectives. Products like Scale Donovan enable intelligence units to upload documents that will inform requirements, leverage LLMs to triage historical requirements, and rapidly convert new information to actionable requirements.
Using Scale Donovan, intelligence organizations can develop exhaustive requirements for downstream units to produce decision-ready intelligence products. Generative AI capabilities can accelerate PIR creation and transform information to intelligence with predictive and timely analysis.
Collection
A variety of different mediums can be used to capture information. Methods vary from satellite surveillance, human sources, communication or electronic transmission, and open-source platforms like the Internet or commercial databases. Raw information forms the basis of intelligence that is later examined and evaluated. With any means of collection, ISR could benefit from prioritized targeting to optimize resource expenditure and monitor physical domains.
Computer vision machine learning models can assist with automatic target recognition and reduce the noise within information. These models embedded in hardware like small unmanned aircraft systems can automate away the dull, dirty, and dangerous aspects of ISR. Powerful machine learning models require sufficient high quality annotated data for training that improves performance. Intelligence units can leverage platforms like Scale Data Engine to identify model vulnerabilities, curate datasets, and enhance model performance.
For example, the Department of Homeland Security can leverage machine learning models to proactively identify suspicious vehicle patterns using real-time imagery intelligence like videography and radar sensors. The U.S. Customs and Border Protection (CBP) successfully leveraged machine learning models to pin down a suspicious vehicle and arrest a driver hiding narcotics.

Processing and Exploitation
Readying information for analysis requires substantial resources. Furthermore, a high bar for quality, speed, and depth of analysis is necessary for insights to inform decision making. Intelligence units face the challenge of covering massive ground in analysis. They need to evaluate political, military, economic, social, information, and infrastructure systems across different types of intelligence.
AI can expedite information processing to ready information for analysis and dissemination. For example:
-
The Cybersecurity and Infrastructure Security Agency (CISA) uses machine learning to organize vulnerabilities in critical infrastructure like power plants, pipelines, and public transportation.
-
Wiretaps and transcripts can be understood with computer audition and generative AI translation capabilities to convert different files into human-readable formats far faster than human translation.
-
Cryptanalysts can use generative AI models to uncover patterns and decode messages that are encrypted with the intent to avoid human comprehension.
-
By using AI, Open Source Exploitation Officers can prioritize specific sections or documents that require deeper analysis from media, gray literature, or commercial data sources.
-
Targeting analysts can leverage retrieval augmented generation techniques with generative AI to ensure comprehensive research into the accessible corpus of knowledge provided by allies and interagency partners.
Using the latest AI capabilities to solve these use cases will require intelligence teams to select a solution that supports multiple models - adopting the leading capabilities from commercial to the Public Sector.

Analysis and Production
A thorough intelligence briefing offers stakeholders decision-making information via detailed Indications and Warnings (I&W), underpinned by comprehensive intelligence analysis. Analysis can often require specific techniques and extensive methodologies to ensure there is sufficient information to come to a decision. For instance, analysts may follow structured analytic techniques like an analysis of competing hypotheses in order to determine the likelihood that specific indicators have a high probability of a threat.
AI can help teams understand the operational environment by anticipating and providing decision-quality information. Teams can conduct predictive analysis and provide tailored intelligence assessments for decision making. Generative AI models can simulate a persona to follow multi-step instructions to mirror intelligence techniques and provide stellar analysis.
Intelligence units can leverage tools like Scale Donovan to simulate warfighter exercises to derive insights from lessons they learn. Donovan could also reduce blindspots by simulating a red-cell to conduct Team A/Team B exercises, and surface conceivable ways a plan may fail.

Dissemination and Integration
Policymakers, military commanders, and senior government officials receive completed intelligence reports, which inform their decision-making processes. These reports and briefs are delivered on a frequent basis. The Intelligence Community is responsible for critical updates, including the President’s Daily Brief.
Generative AI can play a key role in assembling reports in the desired structure, format, and with the necessary information so that dissemination follows a “write for release” culture. Generative AI solutions can reduce the manual time and effort required to write content for documents like situation reports (SITREPs) and mirror the analytic tradecraft found in finished intelligence. These solutions can help maintain classification guidelines and releasability requirements to avoid misclassification and inhibit accurate intelligence.
Evaluation and Feedback
While evaluation is often glossed over, this process is critical to better meet customer needs. Assessment and feedback are necessary to ensure intelligence priorities, planning, and operations are aligned to support the larger mission.
Generative AI enables intelligence teams to conduct evaluation from a different perspective. Teams can use Scale Donovan to evaluate reports and briefs for relevance, bias, and accuracy. Comparing analysis covered in hundreds of pages is made easier with generative AI as large language models can parse and examine documents for notable differences. Generative AI solutions can also be embedded into tools that are used throughout the intelligence cycle. Scale Donovan can act as a coding assistant to help debug and optimize code on classified networks - expediting the time to develop robust internal tools.
How to Implement AI for the Intelligence Community
In order to operationalize artificial intelligence, the Intelligence Community must maintain a high standard for standard software implementation practices (e.g., DevSecOps, IT infrastructure requirements, cybersecurity) to ensure that deployments are dependable. This is just the baseline. Intelligence units must outline an AI strategy that aligns to the mission. Intertwining AI as part of the intelligence cycle cannot come at the expense of jeopardizing the larger mission.
By considering the following criteria, the Intelligence Community can have a clear path to integrating AI into daily workflows:
-
Define current limitations of existing technologies and pain points: Uncover where existing resources fall short and are a detriment to mission success. For example, if dissemination of briefs are repeatedly delayed, diagnose what piece of the process falls short. Is the unit understaffed? An upstream bottleneck?
-
Identify the use case: Determine where AI can fit in the intelligence cycle. Prioritization should be given to use cases that not only address pressing intelligence questions but also leverage AI's strengths, such as large-scale analysis, pattern recognition, or predictive modeling. Engaging stakeholders to gather input and assess feasibility is crucial.
-
Evaluate build protocol: Come to a decision on building in-house or using external vendors to address the use case. Consider time to deploy, AI expertise, sensitivity, and scalability. While there’s a new popular AI trend every week, the Intelligence Community should build sustainable long term assets that deliver lasting operational value.
-
Use data as an intelligence asset: AI models are susceptible to model drift as real world data and objectives evolve. High quality data is key to training AI models to maintain and improve high precision, accuracy, and reliability. Enriched data assets ensure that the Intelligence Community can leverage AI’s full potential. Read more about enhancing data quality with Scale’s Guide to Data Labeling.
-
Test and Evaluation: Ensure that any system meets technical performance and safety specifications. Measure the mission effectiveness of any solution. Testing artificial intelligence and large language models requires benchmark tests that are specific to use cases and criteria for mission success. Scale has committed to provide the CDAO a framework to deploy AI safely by measuring model performance, offering real-time feedback for warfighters, and creating specialized public sector evaluation sets to test AI models for military support applications. Safety practices including vulnerability scans and red-teaming can probe for model weaknesses and help to maximize AI performance and safety.
Conclusion
Gen AI will revolutionize the intelligence process and significantly expand the capabilities of intelligence agencies beyond their current limits. The Intelligence Community should consider AI adoption in order to stay ahead of adversaries and ensure a decision advantage for national and homeland security. Advancements in Generative AI can soon become a standard tool for the IC. Investments in AI software, infrastructure, and testing and evaluation can help translate advancements in generative AI to mission success. Scale provides a portfolio of products tailored specifically to meet the needs of the Intelligence Community, ensuring readiness for the challenges ahead.
Test and Evaluation Vision
Building Trust in AI: Our vision for testing and evaluation.
Background & Motivation
Over the past year, large language models have quickly risen to dominate the public consciousness and discourse, ushering in a wave of AI optimism and possibility, and upending our world. The applications of this technology are endless—from automation across professional services to augmentation of medical knowledge, and from personal companionship to national security. And the rate of technological progress in the field is not slowing down.
These newly unlocked possibilities undoubtedly represent a positive development for the world. Their impact will touch the lives of billions of people, and unlock step-function advancements across every industry, with potentially greater implications on the future of our world than even the internet. But they are also not without their risks. At its most extreme, AI has the potential to strengthen global democracies and democratic economies—or be the decisive implement that enables the grip of authoritarianism. As Anthropic wrote in a July 2023 announcement, “in summary, working with experts, we found that models might soon present risks to national security, if unmitigated.”
At Scale, our mission is to accelerate the development of AI applications. We have done this by partnering deeply with some of the most technologically advanced AI teams in the world on their most difficult and important data problems—from autonomous vehicle developers to LLM companies, like OpenAI, leading the current charge. From that experience, we know that to accelerate the adoption of LLMs as effectively as possible and mitigate the types of risks which harbor the potential to set back progress, it is paramount we adopt proper safety guardrails and develop rigorous evaluation frameworks.
Here, we outline our vision for what an effective and comprehensive test & evaluation (“T&E”) regime for these models should look like moving forward, how that leverages human experts, as well as how we aim to help service this need with our new Scale LLM Test & Evaluation offering.
Understanding the T&E Surface Area
Defining Quality for an LLM
Unlike other forms of AI, language generation poses particularly unique challenges when it comes to objective and quantitative assessment of quality, given the inherently subjective nature of language.
While there are important quantitative scoring mechanisms by which language can be assessed, and there has been meaningful and important progress in the field of automated language evaluation, the best holistic measures of quality still require assessment by human experts, particularly those with strong writing skills and relevant domain experience.
When we discuss test & evaluation, what do we really mean? There are broadly five axes of “ability” for a language model, and what T&E seeks to enable is effective adjudication of quality against these axes. The axes are:
- Instruction Following—meaning: how well does the model understand what is being asked of it?
- Creativity—meaning: within the context of the model’s design constraints and the prompt instructions it is given, what is the creative (subjective) quality of its generation?
- Responsibility—meaning: how well does the model adhere to its design constraints (e.g. bias avoidance, toxicity)?
- Reasoning—meaning: how well does the model reason and conduct complex analyses which are logically sound?
- Factuality—meaning: how factual are the results from the model? Are they hallucinations?
When viewing a model through this framework, we group evaluations against these axes as either evaluating for capabilities or “helpfulness,” or evaluating for safety or “harmlessness.”
An effective T&E regime should address all of these axes. Conceptually, at Scale we do this by breaking up the question into model evaluation, model monitoring, and red teaming. We envision these as continual, ongoing processes with periodic spikes ahead of major releases and user rollouts, which serve to mitigate the drift and development of large language models.
Model Evaluation
Model evaluation (“model eval”), as conducted through a combination of humans and automated checks, serves to assess model capability and helpfulness over time. This type of evaluation consists of a few elements:
- Version Control and Regression Testing: Conducted on a semi-frequent basis, aligned with the deployment schedule of a new model, to compare model versions.
- Exploratory Evaluation: Periodic evaluation, conducted by experts at major model checkpoints, of the strengths and weaknesses of a model across various areas of ability based on embedding maps. Culminates in a report card on model performance, and accompanying qualitative overlay.
- Model Certification: Once a model is nearing a new major release, model certification consists of a battery of standard tests conducted to ensure minimum satisfactory achievement of some pre-established performance standard. This can be a score against an academic benchmark, an industry-specific test, or a separate regulatory-dictated standard.
Model Monitoring
In addition to periodic model eval, a reliable T&E system requires continuous model monitoring in the wild, to ensure that users are experiencing performance in line with expectations. To do this passively and constantly, monitoring relies on automated review of all or a rolling sample of model responses. When anomalous or problematic responses are detected, they can then be escalated up to an expert human reviewer for adjudication, and incorporated into future testing datasets to prevent the issue.
A T&E monitoring system of this variety should be deeply embedded with reliability and uptime monitoring, as continuous evaluation for model instruction following, creativity, and responsibility become new elements of traditional health checks on service performance.
Red Teaming
Finally, while point-in-time capability evaluation and monitoring is important, they are insufficient on their own in ensuring that models are well-aligned and safe to use. Red teaming consists of in-depth, iterative targeting by automated methods and human experts of specific harms and techniques where a model may be weak, in order to elicit undesirable behavior. This behavior can then be cataloged, added to an adversarial test set for future tracking, and patched via additional tuning. Red teaming exists as a way to assess model vulnerabilities and biases, and protect against harmful exploits.
Effective expert red teaming requires diversity and comprehensiveness across all possible harm types and techniques, a robust taxonomy to understand the threat surface area, and human experts with deep knowledge of both domain subject matter and red teaming approaches. It also requires a dynamic assessment process, rather than a static one, such that expert red teamers can evolve their adversarial approaches based on what they’re seeing from the model. As outlined in OpenAI’s March 2023 GPT-4 System Card:
Our approach is to red team iteratively, starting with an initial hypothesis of which areas may be the highest risk, testing these areas, and adjusting as we go. It is also iterative in the sense that we use multiple rounds of red teaming as we incorporate new layers of mitigation and control, conduct testing and refining, and repeat this process.
The types of harms that one would look for is varied, but includes cybersecurity vulnerabilities, nuclear risks, biorisk, consumer dis/misinformation, and any technique type which may elicit these.
Another factor to consider in red teaming is the expertise and trustworthiness of the humans involved. As Google published in a July 2023 LLM Red Teaming report:
Traditional red teams are a good starting point, but attacks on AI systems quickly become complex, and will benefit from AI subject matter expertise. When feasible, we encourage Red Teams to team up with both security and AI subject matter experts for realistic end-to-end adversarial simulations.
Anthropic echoed this in their own July 2023 announcement, writing:
Frontier threats red teaming requires investing significant effort to uncover underlying model capabilities. The most important starting point for us has been working with domain experts with decades of experience [...] However, one challenge is that this information is likely to be sensitive. Therefore, this kind of red teaming requires partnerships with trusted third parties and strong information security protections.
Because red teamers are often given access to pre-release, unaligned models, these expert individuals must be extremely trustworthy, from both a safety and confidentiality standpoint.
Helpfulness vs. Harmlessness
Empirically, when optimizing a model, there exists a tradeoff between helpfulness and harmlessness, which the model developer community has openly recognized. The Llama 2 paper describes the way Meta’s team has chosen to grapple with this, which is by training two separate reward models—one optimized for helpfulness (“Helpfulness RM”) and another optimized for safety (“Safety RM”). The plots below demonstrate the potential for disagreement between these reward models.
Because there exists this tradeoff between helpfulness vs. harmlessness, the desired landing point on this spectrum is a function of the use case and audience for the model. An educational model designed to serve as a chatbot for children doing their homework may land in a very different place on this spectrum than a model designed for military planning. For T&E, that means that assessing model quality is contextual, and requires an understanding of the desired end use and risk tolerance.
Vision for the T&E Ecosystem
With this shared understanding of what goes into effective test & evaluation of models, the question becomes: what is the optimal paradigm by which T&E should be institutionally implemented?
We view this as a question of localizing the necessary ecosystem components and interaction dynamics across four institutional stakeholder groups:
- Frontier model developers, who innovate on the technological cutting edge of model capabilities
- Government, which is responsible for regulating the models’ use and development by all, and uses models for its own account
- Enterprises and organizations seeking to deploy the models for their own use
- Third party organizations which service the aforementioned three stakeholder groups, and support the ecosystem, via either commercial services or nonprofit work
Making sure that these players work harmoniously, toward democratic values, and in alignment with the greater social good, is paramount. This ecosystem is represented in the graphic below:'
The Frontier Model Developers
The role of the frontier model developers in the broader T&E ecosystem is to advance the state of the technology and push the bounds of AI’s potential, subject to safeguards and downside protection. These are the players which develop new models, test them internally, and provide them to consumer and/or organizational end users.
Doing this safely starts by ensuring that each new model version is subject to regression testing, as developers iterate on improvements. This is best done via a static set of test prompts, across known areas. At major model checkpoints, they will launch exploratory evaluations to gain a more comprehensive and thorough understanding of their model’s strengths and weaknesses, which includes targeted red teaming from experts. Finally, once a model is ready for release, model developers will launch certification tests, which are standardized across various categories of risk or end use (e.g. bias, toxicity, legal or medical advice, etc.), with fewer in-depth insights, but resulting in an overall report card of model performance.
In order to ensure that all model developers are benefitting from shared learnings, there should also exist an opt-in red teaming pooling network for model developers, facilitated by a third party, which conducts red teaming across all models, aggregates red teaming results from internal teams at the model developers (and the public, where applicable), and alerts each participant developer of any novel model vulnerabilities. This is valuable because research has demonstrated that these vulnerabilities may at times be shared across models from different developers (see “Universal and Transferable Adversarial Attacks on Aligned Language Models”). At the red teaming expert level, this model should compensate participants on the basis of value attribution, from what they are able to discover and contribute, not dissimilarly from traditional software bug bounty programs.
Government
The role of government in the T&E ecosystem is twofold:
- Establishment of clear guidelines and regulations, on a use case basis, for model development and deployment by enterprises and consumers
- Establishment and adoption of standards on the use of frontier models within the government itself
The more important of these two roles is the former, as a regulator and enforcer of standards. Debates have been ongoing of late as to how to best regulate AI as a category, and the manner by which legislators should seek to balance the macro version of the helpfulness vs. harmlessness tradeoff—that is, in adopting more restrictive legislation which seeks to avoid all potential harms, vs. lighter guardrails which optimize for technological and economic progress.
We believe that proper risk-based test & evaluation prior to deployment should represent a key cornerstone for any legislative structure around AI, as it remains the best safety mechanism we have for production AI systems. It is also important to remember that determining a reasonable risk tolerance for large language models depends significantly on the intended use case, and it is for that reason that legislatively centralizing novel standards and their enforcement for AI beyond general frameworks is extremely difficult. However, we should absolutely leverage our existing federal agencies, each with valuable domain specific knowledge, as forces for regulating the testing, evaluation, and certification of these models at a hyper-specific, use case level, where risk level can be appropriately and thoughtfully factored in.
There should consequently exist a wide variety of new model certification regulatory standards, industry by industry, which government helps craft in order to ensure the safety and efficacy of model use by enterprises and the public.
Separately, as the US Federal Government and its approximately 3 million employees adopt many of these new frontier models themselves, they will simultaneously need to adopt T&E mechanisms to ensure responsible, fair, and performant usage. These will largely overlap with the mechanisms employed by enterprises as described below, but with some notable differences on the basis of domain—e.g. the Department of Defense will need to leverage T&E systems to ensure adherence to its Ethical AI Principles, or any comparable standards released in the future, and will need to optimize for unique concerns such as the leaking of classified information.
In many cases, to keep up with the pace of innovation, effective operational T&E within the government will require contracting with a third party expert organization. This is precisely why Scale is proud to serve our men and women in all corners of government via cutting edge LLM test & evaluation solutions developed alongside frontier model developers.
Enterprises
As the conduit for the majority of end model usage, the role of enterprises in the T&E ecosystem beyond the work done by the model developers (and often for uses and extensions unforeseen by the original developers) is equally important.
As enterprises leverage their proprietary data, domain expertise, use cases, and workflows to implement AI applications both internally and for their customers and users, there needs to be constant production performance monitoring. This monitoring should allow for escalation to human expert reviewers when automatically flagged examples which are outliers in existing T&E datasets arise.
And finally, as enterprises start to, in a smaller way, become model developers themselves by fine-tuning open source models (such as via Scale’s Open Source Model Customization Offering), they or the fine tuning providers they work with will need to adopt many of the same T&E procedures as the frontier model developers, including model eval and expert red teaming.
The notable difference for enterprise T&E will be the existence of industry- and use case-specific standards for model performance, which will be critical in ensuring responsible, fair, and performant use of these models in production. Certain enterprises will establish their own internal performance standards, but above and beyond that there need to exist standards on the models’ use enforced by regulatory bodies in the relevant domains, as discussed above. The achievement of these standards should be adjudicated on a regular cadence by a third party organization, and be recognized by the bestowment and maintenance of official certifications, as is the case for certain information security certifications today.
Third Party Organizations
Within this model, the fourth and final group is the set of third party organizations which contribute to this ecosystem by supporting the aforementioned three classes of stakeholders. These encompass academic and research institutions, nonprofits and alliances, think tanks, and commercial companies which service this ecosystem.
Scale falls into this final group, as a provider of human and synthetic-generated data and fine tuning services, automated LLM evaluations and monitoring, and most importantly, expert human LLM red teaming and evaluation, to developers and enterprises. Scale also acts as a third party provider for both model T&E and end user AI solutions, to the many public sector departments, agencies, and organizations which we proudly serve.
The roles of these parties may vary from policy thinking to sharing of industry best practices, and from providing infrastructure and expert support for the effective execution of model T&E to establishing and maintaining performance benchmarks. There will need to exist a diverse and robust set of organizations in order to properly support T&E.
Working with Scale
Today, we are excited to announce the early access launch of Scale LLM Test & Evaluation, a platform for comprehensive model monitoring, evaluation, and red teaming. We are proud to have helped pioneer many of these methods hand-in-hand with some of the brightest minds in the frontier model space like OpenAI, as well as government and leading enterprises, and we are ready to continue accelerating the development of responsible AI.
You can find us at DEFCON 31 this year where we are providing the T&E platform for the AI Village’s competitive Generative Red Team event as the White House’s evaluation platform of choice, and learn more about Scale LLM Test & Evaluation.
Guide to AI in Finance
Introduction
AI in finance is rapidly transforming how banks and other financial institutions perform investment research, engage with customers, and manage fraud. While traditional banking institutions are interested in incorporating new technologies, fintechs are adopting this technology more quickly as they try to catch up with larger institutions. To stay ahead of the game, larger financial institutions are investing heavily, with 77% planning to increase their budgets over the next three years, according to Scale's 2023 AI Readiness report.
Financial Services institutions are looking to AI to help them improve customer experience, grow revenue, and improve operational efficiency. Many banks have found that implementing AI requires financial investment and machine learning expertise and tools to fine-tune models on proprietary data to maximize their investments and achieve their goals. In this guide, we will identify several opportunities to apply AI in finance and how to get started so you can stay ahead of the competition.
AI for Finance: Why is it important?
Financial Institutions have much to gain from implementing AI to improve revenues and reduce costs. Accenture estimates that Financial Services companies will add over $1 Trillion in value to global banks by 2035. McKinsey also estimates that AI can deliver up to $1 trillion in value to global banks annually. This significant impact is due to the complexity of financial transactions, enormous amounts of proprietary and third-party data, increasing fraudulent activity, and the large number of customers financial institutions service.
AI provides many benefits for the finance industry:
Improved customer experience: 89% of financial services companies will use AI to improve the customer experience. AI has the potential to revolutionize finance by allowing companies to offer an array of personalized financial services at an affordable price. These companies will also be able to make it easier to learn more about the financial industry and their product offerings and reduce the friction to buying new products. Financial institutions can leverage their vast troves of data to offer personalized investment strategies, swiftly detect fraudulent activity, and efficiently assess fraud claims.
Enhanced operational efficiency: AI accelerates the automation of many activities, such as identity verification, credit scoring, loan approval, and portfolio optimization. Drastically reducing manual effort while improving accuracy, AI enables financial institutions to pass the savings to customers through better prices, making them more competitive. 56% of those surveyed in our report identified operational efficiency as a goal for adopting AI at their organization.
Increased profitability and revenue: 72% of financial services companies surveyed in our report identified growing revenue as a goal for adopting AI in their organization. With increased efficiency, financial institutions will cut costs and increase profits. Banks will increase revenue and have more stability by leveraging AI to make better investment decisions, optimize their portfolios, and mitigate risks. Wealth managers are increasing their efficiency by using AI copilots to summarize large amounts of financial data, automatically generate charts and visualizations, and create personalized portfolios leading to increased revenue at reduced costs.
Improved Fraud Detection: Consumers reported losing over $8 billion in fraud in 2022, with the actual total costs across banking being much higher. Fraud impacts banks' bottom lines and causes consumer prices to increase to offset the direct and indirect costs. AI promises to dramatically improve fraud detection and prevention capabilities by detecting trends and analyzing vast amounts of data, outperforming traditional fraud prevention solutions.
We will now explore some of the top use cases of AI in Finance.
AI in Finance: Use cases
There are numerous applications for AI in Finance, with more likely to emerge in the next few years. For this guide, we will focus on the key data-centric areas identified in our 2023 Zeitgeist AI Readiness Report:
- Investment research
- Fraud detection and anti-money laundering
- Customer-facing process automation
- Personalized assistants/chatbots
- Personalized portfolio analysis
- Exposure modeling
- Portfolio valuation
- Risk modeling
Investment research
AI has been a game-changer for financial analysts and wealth managers, completely altering the scale at which information can be gathered and analyzed. Automatically identifying, extracting, and analyzing relevant information from structured and unstructured data sources increases the quantity and relevancy of data that analysts and managers can incorporate into their processes, making them far more efficient and effective.
Deploying cutting-edge AI tools like Scale's Enterprise Copilot helps analysts and wealth managers summarize large amounts of data, making them more effective and accurate advisors. Leveraging fine-tuned large language models with access to proprietary content, advisors can quickly summarize research and other data sources, create charts and visualizations of client portfolios, and ask for insights on massive knowledge bases with source citations, enabling them to investigate that source content further when necessary. Source content includes financial statements, historical data, news, social media, and research reports. With a Copilot, each Wealth Manager becomes many times more efficient and accurate in their work, multiplying their value to a financial services firm.
Our 2023 Zeitgeist AI Readiness Report, reported that financial service companies use AI to summarize content, detect trends, and classify topics to improve investment decisions. We found that, among financial companies leveraging AI for investment research, 75% use it for content summarization, and 62% of companies use it for trend detection, which involves using AI to identify patterns in data:
Financial services companies use data from financial statements, historical market data, 3rd party databases, social media content, news, and geospatial/satellite imagery to improve their models. Using AI to analyze these disparate data sources increasingly yields improved results that help these companies gain an edge.
While many investment firms rely on fully or partially automated investment strategies, the best results are still achieved by keeping humans in the loop and combining AI insights with human analysts' reasoning capabilities.
Fraud detection and anti-money laundering
AI is proving its value to the finance industry in detecting and preventing fraudulent and other suspicious activity. In 2022, the total cost savings from AI-enabled financial fraud detection and prevention platforms was $2.7 billion globally, and the total savings for 2027 are projected to exceed $10.4 billion.
AI-enabled fraud detection is particularly critical due to the rising fraud rates. The cost of eCommerce fraud alone is projected to surpass $48 billion worldwide in 2023, compared to just over $41 billion in the previous year. Furthermore, fraudsters are becoming more sophisticated and difficult to identify using conventional, rule-based approaches, making it challenging for financial institutions to meet anti-money laundering compliance requirements.
Financial institutions can use ML algorithms to identify fraudulent transactions to spot anomalies in large datasets. A single transaction has a vast number of associated data points, such as location, time, merchant identity, and past spending behavior, and the complexity of this data poses a formidable challenge for manual or rule-based analysis.
Customer-facing process automation
Automation using AI is essential for the financial services industry to meet customer demands for better personalization and enhanced features while reducing costs. By automating repetitive, manual tasks such as document digitization, data entry, and identity verification, financial institutions can expand their offerings to maintain a competitive edge. Optical character recognition (OCR) allows for instant digitization of checks, receipts, and invoices, while AI-powered facial recognition can effortlessly determine whether there is a match between a customer's ID and a selfie while simultaneously confirming that the ID is legitimate.
Aside from chatbots and virtual assistants, ML-powered NLP is a powerful tool for extracting relevant information from documents and generating reports and personalized financial advice. Automating routine tasks reduces the number of tedious tasks to be done by humans (and the associated operating costs) and minimizes human error. The ability to generate automatic reports from data is valuable to both customers and regulators, enhancing both personalization and compliance in a scalable way. With RPA increasingly handling the more mundane tasks, skilled employees can focus on more valuable tasks, leading to greater job satisfaction.
Personalized assistants and chatbots
With the proliferation of financial services firms and offerings, providing good customer service is crucial to maintaining customer engagement and satisfaction. However, the expectation of immediate and round-the-clock assistance makes relying solely on live agents impractical and costly. Fortunately, recent breakthroughs in conversational AI, such as those demonstrated by ChatGPT, have resulted in chatbots that more closely approximate human responses. Powered by generative large language models, these chatbots excel at understanding intent and can redirect customers to human representatives when needed.
While large language models like OpenAI's GPT-4 and Anthropic's Claude work well out of the box, many financial institutions find that they need to customize models to get them to provide the best responses and align with their policies. Techniques like fine-tuning models on proprietary data, prompt engineering, and retrieval help elevate a base model from acceptable responses to a superior customer experience. Many financial institutions leverage their vast data to offer AI-enabled personalized service and guidance. Institutions can provide customers with assistant-like features, including categorizing expenditures, suggesting savings goals and strategies, and providing notice about upcoming transfers. AI can offer personalized financial advice and guidance based on individual customer profiles and preferences and assist users with budgeting, financial planning, and investment decisions.
Financial institutions also leverage AI-powered copilots like Scale's Enterprise Copilot to assist wealth managers internally. These copilots enable wealth managers to extract insights from internal and external documents, enabling informed decisions quickly and efficiently based on large volumes of data. By incorporating copilots into their workflow, wealth managers can significantly enhance their productivity and deliver more valuable insights. These copilots use fine-tuned base models with even greater access to proprietary data than customer-facing chatbots since copilots are meant for authorized employees. This means the copilots are even more powerful, providing a productivity boost for wealth managers while increasing customer satisfaction as investors get personalized advice more quickly.
Personalized portfolio analysis
Robo-advisors are gaining popularity as inflation rates soar, providing a simple and accessible option for passive investing. These automated wealth management platforms use AI to tailor portfolios to each customer's disposable income, risk tolerance, and financial goals. All the investor needs to do is complete an initial survey to provide this information and deposit the money each month - the robo-advisor picks and purchases the assets and re-balances the portfolio as needed to help the customer meet their targets.
With increasingly more capable machine learning models, robo-advisors can analyze more data and provide more personalized investment plans. These models can analyze individual portfolios and provide insights into asset allocation, risk diversification, and performance evaluation. They can even suggest adjustments to optimize portfolio performance based on the customer's goals, risk tolerance, and market conditions. Also, robo-advisors can adapt to changing market dynamics and provide real-time portfolio analysis.
Many robo-advisory platforms also support socially responsible investing (SRI), which has proven attractive for younger investors. These systems can allocate investments according to individual preferences, including or excluding certain asset classes in line with the customer's stated values. For instance, a robo-advisor can automatically curate a personalized portfolio for an investor who wishes to support companies that meet environmental, social, and governance (ESG) criteria or exclude those that sell harmful or addictive substances.
Robo-advisors appeal to those interested in investing but lack the technical knowledge to make investment decisions independently. Much cheaper than human asset managers, they are a popular choice for first-time investors with a small capital base.
Exposure modeling
Exposure modeling estimates the potential losses or impacts a financial institution, or portfolio may experience under different market conditions. It aims to quantify a portfolio's potential vulnerabilities and sensitivities to various risk factors. Exposure modeling involves analyzing the relationship between the portfolio's holdings and different market variables to assess how changes in those variables can affect the portfolio's value or performance.
Financial institutions are increasingly using AI for exposure modeling in finance to assess and manage various types of risks that financial institutions face. Exposure modeling involves estimating the potential losses a firm may experience under different market conditions, such as changes in interest rates, credit defaults, or market volatility. Because AI can model and assess the potential financial exposure to risks such as market fluctuations, credit defaults, and economic events, as well as analyze historical data, market trends, and external factors to estimate potential losses or gains, it's a valuable tool for helping financial institutions make informed decisions regarding risk management and hedging strategies. Optimizing strategies using instruments like equity derivatives and interest-rate swaps may allow institutions to optimize portfolios and offer better prices to customers.
Machine learning can be incorporated into exposure modeling in numerous ways. By analyzing vast amounts of historical financial data to identify patterns and correlations that may be difficult for humans to detect, models can learn and identify potential risks associated with specific market conditions or events. These models can also simulate various risk scenarios and generate probabilistic outcomes, allowing financial institutions to evaluate the potential impact of different market shocks on their portfolios. It may help uncover hidden risks that traditional models may overlook.
By leveraging financial models, institutions can make faster and more informed decisions in response to changing market conditions. To extract relevant insights, They can use models to analyze unstructured data sources, such as news articles, social media feeds, and research reports. By understanding and processing textual information, these models can identify emerging risks, sentiment trends, or market-moving events that could impact exposure levels.
Portfolio Valuation
Valuing a portfolio is crucial for assessing its performance, making investment decisions, and reporting accurate financial information to stakeholders. However, manual valuation can be challenging as various factors influence portfolio value, including market data, pricing models, time horizon, and allocation of diverse investment types such as stocks, bonds, mutual funds, derivatives, and other securities.
Many financial institutions are incorporating AI into their portfolio valuation processes to address these challenges. Financial institutions can enhance accuracy, efficiency, and decision-making with ai-powered asset valuation that is automated and accurate. These models can instantly consider factors such as historical market data, current market behavior, pricing models, proprietary research, and performance indicators.
By leveraging large volumes of financial data, including historical market data, company financials, economic indicators, and news sentiment, models can help companies identify patterns, correlations, and trends that impact portfolio valuation. Financial institutions can also integrate alternative data sources such as satellite imagery, social media, and consumer behavior data into portfolio valuation models to enrich the analysis.
Risk modeling
Accurate risk modeling is critical for financial institutions. These institutions must employ risk modeling to assess and quantify overall risk by analyzing exposure, probability, and potential impact. Risk modeling aims to capture and measure the various types of risks the institution faces and to provide a comprehensive view of the potential downside or volatility associated with those risks.
Because of the complexities involved in risk modeling, this is an area where AI can have a substantial impact. AI enables financial institutions to develop more capable risk models based on large quantities of data, identifying complex patterns that are difficult for humans to replicate. Machine learning models can yield more accurate predictions, allowing financial services firms to manage risk more effectively.
An important subset of risk modeling is credit scoring. Credit scoring powered by machine learning has proven invaluable for the finance industry, enabling rapid and accurate assessments with reduced bias. The key is using AI to assess potential borrowers based on alternative data such as rent payment history, job function, and financial behavior. Not only does this result in more accurate risk analysis by considering important indicators, but it also enables potential borrowers without a credit history to be assessed.
AI-based credit scoring has other clear advantages, such as reducing manual workload and increasing customer satisfaction with rapid credit card and loan application processing.
How to implement AI in finance
When companies implement AI for any use case, it's essential to establish a carefully considered strategy. Finance companies should tie their AI goals to business problems and develop a solid data strategy. In Scale's annual Zeitgeist: AI Readiness Report, we surveyed over 1,600 ML practitioners and business leaders and found that an organization's goals shape the effectiveness of its AI implementation. Finance companies must ensure that the goals of an AI implementation, such as growing revenue, improving operational efficiency, or enhancing customer experience, are aligned with company priorities.
We suggest adhering to the following steps throughout the implementation process:
- Prioritize your use cases: What are the top challenges that you are facing, and what are your company's top priorities? Are you focused on increasing revenue, improving customer experience, or improving operational efficiency? Do you need to improve investment research, fraud detection, or portfolio valuation? Dig deep into defining the problems that you are trying to solve.
- Define a robust data strategy: Once you have prioritized your use cases, the most important thing you can do is to define a robust data strategy. Any AI solution is only as good as the data available. While off-the-shelf base models are impressive at general tasks, they don't perform well on specific finance tasks and don't have access to proprietary data. To improve performance on these tasks, open-source or commercial foundation models must be fine-tuned on your proprietary data. Your internal knowledge-base data, including research reports, historical market data, and customer data, must be accessible to the models for fine-tuning and retrieval. Determine what data you have, the formats in which you need that data, and what it will take to clean and standardize your existing data and improve your data collection mechanisms. For knowledge retrieval, you will need to chunk your text data, convert it into embeddings, store it in vector databases, and perform a similarity search to retrieve that data for the model to incorporate in responses. Doing this correctly and at scale is challenging, and this is a constantly evolving space, so you will need to stay up to date with the latest research and open-source and commercial capabilities.
- Baseline internal capabilities: As machine learning technology advances rapidly, it is essential to understand your internal capabilities. Do you have the internal machine learning expertise to implement an AI strategy properly? Do you have a data strategy and the capabilities and tools to implement that strategy in the near term? Do you have the partners to help you implement your strategy effectively? Before you make significant investments, it is critical to understand this clearly.
- Consider security: Companies in the financial industry regularly work with a variety of confidential and proprietary data. Popular cloud-based models can leak confidential data and pose other security and safety risks, so it's crucial to ensure you're protecting your data. Only use tools and applications that align with your company's security policies.
- Build a "crawl, walk, run" methodology: When building an AI solution for finance, start small by addressing a specific challenge or customer need. Then, innovate quickly and test various solutions using proof of concept implementations or product pilots. Expand your solution to incorporate new use cases based on their impact on company priorities.
Read the guide Generative AI for the Enterprise: From Experimentation to Production for more detailed steps on implementing Generative AI.
Conclusion
This guide covered the most prominent use cases and applications for AI in finance. We covered investment research, fraud detection and anti-money laundering, customer-facing process automation, personalized assistants/chatbots, personalized portfolio analysis, exposure modeling, portfolio valuation, and risk modeling.
As AI continues to shape the financial services landscape, it's crucial that finance companies rapidly invest in AI innovation. Fintechs and traditional banking institutions are investing in this technology, and it promises to give them an edge in revenue growth, improved customer experiences, and operational efficiency. When developing AI solutions, you should follow best practices by following frameworks that emphasize identifying desired outcomes, ensuring you have implemented a solid data strategy, and then experimenting and implementing scalable AI solutions. Companies should tie their goals for AI in finance to business problems and identify performance metrics based on these goals. New models are developing rapidly, and companies in the finance industry need to adapt to new technology quickly.
If you're interested in learning more about how to apply AI for your financial services business, Scale EGP (Enterprise Generative AI Platform) provides a full-stack generative AI platform for your enterprise. For additional details on how to implement Generative AI, read the guide Generative AI for the Enterprise: From Experimentation to Production.
Guide to AI for Insurance
This guide covers the main applications of Artificial Intelligence for the Insurance Industry.
Introduction
With the explosion of AI across every industry, the hyper-competitive business space within the insurance industry is making the adoption of AI a foundational necessity. Improved operational efficiency and enhanced customer experience are the key outcomes of this technology. Today, 87% of surveyed insurers already see their companies invest $5 million or more in AI technology each year, and 74% of insurance executives plan to increase their investment in AI. However, most (78%) of these insurers lack a clear, documented strategy and the in-house capabilities to operationalize AI at production levels.
Insurance companies worldwide seek to leverage AI for claims processing, fraud detection and prevention, insurance pricing, and overall operations management. With this technology and a well-defined strategy, insurance companies can scale their services, provide enhanced customer experiences, and capture game-changing improvements.
AI for Insurance: Why is it important?
McKinsey estimates that AI can deliver $1.1 trillion in potential annual value for the insurance industry across various functions and use cases. With recent advances in the field, the potential applications for AI are enormous. For example, large language models like ChatGPT have broken into the mainstream, making it possible to automatically identify the semantic intent of conversation and generate accurate, effective human-like responses through language-based applications. Additionally, the increased amount of raw data available through Internet of Things (IoT) devices and autonomous vehicles has promoted the development of even more complex models. Applying these models, companies have been able to develop improved safety features to reduce accident frequency, as well as to better assess and adjust rates following insurance events. Additionally, early adopters have been leveraging AI to provide personalized offers to individual customers.
Insurers must be able to process immense volumes of data from disparate sources and make complex predictions every day. As such, the industry has virtually unlimited opportunities to leverage these new technological developments. Insurers can significantly accelerate processes by using AI to aid in ingesting raw data and generate predictions at a new level of speed and volume.
Strategically, adopting AI into their business can be highly beneficial to insurance companies. These benefits include:
- Increasing efficiency: AI increases the efficiency of time-consuming processes such as underwriting, claims management, and customer service.
- Improving accuracy: By automating tasks, AI reduces human errors tied to manual processes.
- Enhancing fraud detection: AI allows companies to stay current with advancements in fraud prevention and in detecting sophisticated fraud patterns.
- Improving customer experience: By leveraging consumer data, insurers can use AI to provide customers with customized, accurate coverages and pricing.
Conversely, deciding not to apply AI poses a significant risk to insurers, preventing them from keeping up with changing customer needs and allowing for uncompetitive operational inefficiencies. These risks and challenges include:
- Manual processes are too slow: Many insurance systems are manual, paper-based, and require much human involvement, leading to long wait times and expensive delays.
- Premium rates and coverage offerings are not effectively customized: Without providing customers with customized premium rates, the premiums they are charged may not be accurate or competitive. Customizing these policies allows insurance companies to adapt to the specific customer and their needs.
- Fraud is prevalent: As fraud becomes even more sophisticated, it is more complex and costly for insurance companies to combat. Manually identifying fraud simply isn’t scalable, and it drives an uncompetitive rate structure.
- Compliance with regulations is difficult: Insurance companies must comply with myriad regulations to adequately protect customer data, including personally identifiable information and health data.
In this guide, we’ll explain how the insurance industry can use AI to solve these challenges. We’ll discuss the top uses for AI and provide a detailed overview for implementing AI within your organization.
AI for Insurance: Top use cases
There are numerous applications for AI in insurance. In this guide, we will focus on the top three use cases as identified in our 2023 Zeitgeist AI Readiness Report:
- Accelerated claims processing
- Claim fraud detection and prevention
- Risk assessment and underwriting
AI-Accelerated Claims Processing
During claims processing, insurers must check a claim for information, validation, and justification. Once that claim is approved, the insurance company then proceeds to process the payment. However, there are challenges to performing claims processing efficiently. Claims processing is often performed manually, making it prone to errors and inefficiencies. This drives up operating costs and creates regulatory and competitive challenges. Because of the complexity and volume of data involved in processing claims, this is a key area of opportunity for AI innovation.
Recent advancements in Generative AI technology have made it possible to democratize internal access to insurance companies' policies, documentation, and claims information. By putting a comprehensive knowledge base at the fingertips of claims adjusters to query for case details, company guidelines, and more, insurers are accelerating settlement times and improving adjuster decision-making.
Deploying cutting-edge AI tools like Scale’s Enterprise Copilot helps deploy these Generative AI tools quickly and securely, customized on an insurer's proprietary and sensitive data. Insurers can stand up multiple versions of these copilots, with enterprise-grade security and role-based access controls to ensure the right stakeholders can access the right data. Further customizing these solutions with domain-specific fine-tuned models helps insurers build a proprietary and competitive asset that enables significant operational and settlement efficiencies.
There are several other areas of the claims process where insurers can leverage AI, including initial claims routing, claims triage, and claims management audits. Examples include:
- Accelerating administrative processes: By using AI to automatically route claims, these claims can be resolved quickly, providing optimal value. Additionally, damage severity can evaluated programmatically from claims reports. Claims can even be validated against external data sources, such as weather reports.
- Claim and customer segmentation: Claims and customer information can be automatically segmented and loaded into an intelligent search engine, making information easier to organize and find, and can be utilized towards pricing and growth efforts.
- Create new insurance policies: Insurers can use AI to automatically create new, customized insurance policies based on internal, customer, third-party, and public data. This allows insurers to deliver a tailored range of insurance products.
- Unlock insights: By using AI to analyze claims data and attributes, managers can better understand claims patterns, guiding managers to take appropriate actions.
This solution involves implementing AI-powered intelligent document processing to review the claim, verify policy details, and perform fraud detection. Computer vision can also be used to assess the cost of damage by analyzing input data such as images and videos. These claims can then be stored, allowing insurers to easily search through historical claims using large language model-based search. After the claim is approved, the process can automatically issue electronic payments.
AI solutions increase the operational efficiency of an insurer, resulting in a significantly improved competitive position. To help implement improved predictions of claims outcomes, we offer the industry Scale’s Claim Intelligence. This platform can effectively manage the complexity of claim breakdown, ingest and process data from multiple models, and provide detailed data and analysis through its intelligence engine.
Claims Fraud Detection and Prevention AI
Preventing and detecting claims fraud is another difficult and time-consuming process. Data-based deception and malicious agents are increasing across the industry, and malicious content is becoming more sophisticated and very hard to detect.
Insurance companies often employ risk modeling for fraud detection to address this challenge. With AI, insurers can rapidly identify the variables and factors that pose the most risk and more effectively prevent fraud. Accurate fraud detection and prevention improves insurer competitiveness and ensures accurate, legitimate payouts. This risk-modeling technology often involves various techniques, including text analysis, logistic regression, and predictive analytics.
With this advanced technology, companies can detect unseen patterns and markers of fraudulent claims. This enables these companies to become a more challenging target and increase their efficiencies while helping to manage loss ratios.
Risk Assessment and Underwriting with AI
Insurers must evaluate and analyze the risks involved in insuring people and assets when providing insurance pricing. By determining the risk of issuing coverage to a person or business, the insurer can set the insurance premium they charge.
Without AI, this process is historically inefficient since application processing requires extracting information from detailed documents. When insurance companies rely on optical character recognition, this process is manual, time-consuming, and error-prone. Customer documents often include different data formats, requiring manual review. Additionally, because regulations for document processing change frequently, insurers often need to update their processes.
Scale’s AI technology can manage a variety of tasks such as determining optimal rates for customers for risk management, reducing the time needed to introduce new pricing frameworks, and building data-informed insurance policies. AI can automate the process of insurance underwriting using machine and deep learning models. Companies will extract relevant data from insurance documents with intelligent document processing. AI is also applicable to improve customer service – Automated customer service apps, also called conversational AI, can handle policyholder interactions and create personalized quotes. AI solutions can predict premiums from previous risk assessments to make risk assessment more precise and enable predictive modeling for dynamic pricing.
How to implement AI for Insurance
When companies implement AI for any use case, it’s important to follow a framework based around AI strategy. This framework involves identifying use cases, prioritizing approaches based on their impact, and understanding the necessary technology. Most importantly, insurance companies should tie their AI strategy to corporate strategy. Our annual Zeitgeist: AI Readiness Report, which surveyed over 1,600 ML practitioners and business leaders, found that an organization’s goals shape the effectiveness of its AI implementation. Insurers must ensure the goals of an AI implementation, such as reducing lost costs from claims payout, improving top-line growth, or enhancing customer experiences, are aligned with company priorities.
By building an AI solution incrementally, your company can design AI that serves its specific objectives. We suggest adhering to the following steps throughout the implementation process:
- Outline key challenges: What are your company's challenges? Are they claims-related, top-line growth or growth against other insurance players?
- Develop a business strategy for implementation: What do you specifically need to accomplish with the AI solution, and how will you measure its success? What are the key performance indicators you intend to monitor? Who is accountable for the success of the program?
- Understand the available technology: What are the best and most current AI-related technologies that address these challenges and business strategies?
- Start with a strong data strategy: Before diving into building a solution, consider the data you have available, as well as any data collection you may need to perform. Determine the type, quantity, and quality of data that you have. If you don't have strong in-house data science or AI expertise in-house, consider working with an experienced third party to help you define and execute your data strategy.
- Build a “crawl, walk, run” methodology: Start small by addressing a specific challenge or customer need when building an AI solution. Then, move quickly and conduct short-term tests on various solutions using proof of concept implementations or product pilots. From there, you can expand the mission to incorporate additional use cases that align with company priorities.
Read the guide Generative AI for the Enterprise: From Experimentation to Production for more detailed steps on implementing Generative AI.
Conclusion
This guide covered the most prominent use cases and applications for AI in insurance. The AI revolution is underway, and it’s already changing the competitive landscape. It’s arguably essential that insurance carriers meet this moment by rapidly investing in AI innovation and process-model evolution. Achieving key corporate objectives will require a well-planned strategic investment in cutting-edge AI innovation.
If you have found this guide informative and want to learn more about how to rapidly and effectively apply AI, Scale AI stands ready to support your efforts. Scale EGP (Enterprise Generative AI Platform) is a proven product that provides cutting-edge Generative AI solutions to generate enormous business value. Scale Spellbook is a great way to get started with building, comparing, and deploying large language model apps. Scale Claims Intelligence is designed to help you predict the future outcome of claims to help you streamline claims management.
Most importantly, start learning more about how AI can help your company and build an AI strategy today.
Generative AI for the Enterprise: From Experimentation to Production
Generative AI is transforming how employees work and customers engage with enterprises. We created this guide to help you understand what Generative AI is and how you can use these models to unlock the power of your data and accelerate your business.
Recent developments in Generative AI have ushered in AI's industrialization era. With ChatGPT reaching 100 million monthly active users within two months of launch, boards and C-suites everywhere are elevating Generative AI to the top of their leadership agendas. We are at the beginning of a technology revolution that will be as impactful as the Internet in the fullness of time.
For the enterprise, Generative AI can dramatically improve employee productivity and transform how companies engage with their customers, from personalized marketing to empathetic, efficient automated customer service. This space is evolving at light speed, and enterprises that fail to adopt Generative AI quickly will be left behind.
According to Scale's 2023 AI Readiness Report, over half of the executives indicated that advances in generative models inspired them to accelerate their existing AI strategy, while over 70% indicated that their companies would "significantly" increase their AI investments each year over the next three years.
At the tip of the spear are companies that have already come to market with Generative AI:
- General Motors is developing an in-car assistant powered by large language models, customized with knowledge of their cars to help drivers change flat tires or evaluate diagnostic lights.
- Morgan Stanley is enabling their wealth managers with AI copilots built on top of LLM's fine-tuned on internal documents.
- Coca-Cola is engaging digital artists to produce Generative AI-powered content with elements of their branding.
These companies are the exception, as we found that while 60% of respondents are experimenting with generative models or plan on working with them in the next year, only 21% have these models in production. And even the enterprises moving quickly to adopt this technology are finding it challenging to move from experimentation to production.
Out-of-the-box commercial models are powerful, but initial experimentation leaves many companies wondering how to execute the heavy customization and fine-tuning required to meet enterprise-level performance and reliability standards. Model hallucinations (confidently made-up facts not found in the training data) and brand safety concerns make model fine-tuning, observability, and monitoring critical but often challenging to manage at the enterprise scale. And many business leadersare concerned about the security and privacy of their proprietary data and IP when using commonly available Generative AI solutions. Addressing these challenges will be critical to enterprises' ability to scale their experiments and deliver tangible ROI.
What Can Generative AI do for Enterprises?
Generative AI enables companies to quickly build new products or services, improves customer experience with personalized interactions, and increases employee productivity. Integrating Generative AI with plugins enables it to take actions such as submitting orders. Retrieval enables LLMs to access enterprise knowledge bases and summarize and cite proprietary data. Let's look at a few examples of how enterprises use Generative AI today.
Generative AI in Financial Services
Financial services companies are building assistants for investment research that analyze financial statements, historical market data, and other proprietary data sources and provide detailed summaries, interactive charts, and even take action with plugins. These tools increase the efficiency and effectiveness of investors by surfacing the most relevant trends and providing actionable insights to help increase returns.

Generative AI in Retail & e-commerce
Retail and e-commerce companies are building customer chatbots that provide engaging discussions, acting as personal assistants to every shopper. They also generate stunning product imagery, social media ads, and lifestyle pictures at scale in seconds.

Generative AI in Insurance
Insurance companies use Generative AI to increase the operational efficiency of claims processing. Claims are often highly complicated, and Generative AI excels at properly routing, summarizing, and classifying these claims. Adjusters are using copilots to query claims data, saving them time from sifting through a large amount of documentation.

We have covered a few representative use cases of Generative AI in the enterprise, but we have only scratched the surface of what is possible. Next, we will cover how companies are deploying this Generative AI.
How are Enterprises Deploying Generative AI Today?
To properly adopt Generative AI for the enterprise, you first need a solid understanding of the Generative AI stack. At the base are foundation models, such as OpenAI's GPT-4, Google's PaLM 2, Cohere's Command model, or Anthropic's Claude in the case of LLMs or Stability AI's Stable Diffusion for image generation models. These models provide the base or foundational capabilities for Generative AI applications. Next is the data engine that provides the data customization and fine-tuning required to enable the base model to use proprietary enterprise data properly. Then a development platform is needed to build LLM apps, compare prompts and model variants, and deploy applications to production.

Typically, enterprise-grade deployments of Generative AI involve some degree of internal development of your own applications. Companies typically do this so they can customize and fine-tune models to optimize performance on their specific use cases, improve security and safety, and ensure observability and reliability.
Customize and fine-tune models and apps for peak performance

Enterprises have unique needs that require extensive fine-tuning and prompt engineering of base foundation models. Open-source and commercial models are great generalists, but for enterprise use cases, they are poor specialists - especially ones that require "knowledge" of domain- or company-specific data. Base models are trained on publicly available internet data, not on a law firm's private documents, a wealth manager's research reports, or any company's internal databases. This specific data and context is the key to helping a model go from generic responses to actionable insights for specific use cases.

Small fine-tuned models are cheaper, faster, and perform better at specific tasks than base foundation models. For example, Google's Med-PaLM 2 is a language model fine-tuned on a curated corpus of medical information and Q&As. Med-PaLM2 is 10 times smaller than GPT-4 but actually performs better on medical exams.

Source: Towards Expert-Level Medical Question Answering with Large Language Models, https://arxiv.org/abs/2305.09617
Another example is Vicuna-13B, a chatbot trained by fine-tuning Meta’s open-source LLaMA model. Vicuna is a 13B parameter model fine-tuned on approximately 70K shared user conversations. Vicuna is more than 13 times smaller than ChatGPT and provides the same response quality as ChatGPT in over 90% of cases.
GOAT is another fine-tuned LLaMA model that outperforms GPT-4 on arithmetic tasks, achieving state-of-the-art performance on the BIG-bench arithmetic sub-task.
Every organization has business-critical tasks that rely on proprietary data and processes and will benefit from a fine-tuned vs. base foundation model. While ChatGPT can provide general tips for a bank's customer support representative, a model fine-tuned on the transcripts of actual calls from a bank's customers can guide reps on specific actions for callers' concerns while following company policies - like the fastest path to resolve a billing dispute or the bank's best checking account option for a given customer segment.
In addition, enterprises may want greater control and flexibility over which models they use and when. Being able to compare multiple models can improve both performance and cost, rather than being locked into using one model or provider for the use case (or many use cases across a business).
Improve security and safety

Off-the-shelf applications typically require data to pass through the app provider's cloud. A custom-built application can remain in an enterprise's virtual private cloud - or even on-premises - for cases where data security is critical.
With a purpose-built application, data stays within the existing environment, so access control of any deployed LLM app can mirror existing role-based access controls.
Ensure observability and reliability

Without rigorous evaluation and monitoring capabilities, generative models are prone to hallucinations and can provide false, harmful, or unsafe results. Companies face significant risks to their brand, especially when deployed in customer-facing settings or when handling sensitive information.
By customizing an enterprise app, your teams can define how they measure the performance of your applications and set up appropriate monitoring processes.
In addition to monitoring traffic and latency, you must consider operational priorities when setting up these monitoring processes. For example, suppose a financial firm has created an insider trading detection app powered by Generative AI. The security and compliance team will need to be immediately alerted about the detection of insider trading and any misclassification. The only way to achieve this is by directly embedding real-time monitoring and logging of prompts and model responses into the custom app. The security and compliance team can then take action from these alerts to prevent further damage.
As we mentioned, organizations need to consider each layer of the stack from Generative AI applications, a robust development platform, a data engine for customizing and fine-tuning models on proprietary data, and base foundation models.

Below we lay out the key considerations for your build vs. buy decision for each layer of the stack.
Applications

Description: These interfaces allow customers or employees to interact with Generative AI models, such as a chat agent to ask questions or a copilot that makes suggestions based on what the user is working on.
Build: Building your own application is best when performance is dependent on access to proprietary data, and when data, model inputs, and/or outputs are highly sensitive.
Buy: Buying these apps is most appropriate for a fast start on less sensitive or generic use cases where proprietary data is not required.
Development Platform

Description: Companies seeking to build their own apps need the tooling to experiment with, develop, and deploy these apps. This tooling helps teams to compare generative models, fine-tune them, play with prompts, and then deploy apps to production.
Build: Typically, building an in-house development platform is limited to those seeking to sell it to customers, such as Google Vertex. Some companies build their own internal platforms strictly for their own use, but these platforms are difficult to maintain and costly to keep up to date, particularly in a fast-moving field.
Buy: Buying a development platform frees your resources to focus on core competencies and building valuable applications for your business instead of standing up another piece of infrastructure that needs to be maintained. Many open-source and commercial solutions on the market today are robust, reliable, and cost-effective for building and deploying Generative AI applications, including Scale Spellbook.
Data Engine

Description: A data engine helps teams to collect, curate, and annotate data, so their Generative AI models can produce high-quality outputs using this high-quality data. This typically includes human experts to validate the data and the tooling to help them do so efficiently and effectively.
Build: A substantial investment is required to build in-house tooling to fine-tune models and assemble and train a workforce of human experts to produce and rank data at a very high quality.
Building a data engine is only appropriate for extremely sensitive use cases with strict data privacy requirements.
Buy: To accelerate deployment time, most enterprises should consider buying to access state-of-the-art tooling immediately and leverage vetted human experts to improve their model performance.
There are also options to buy even for the most sensitive use cases, where the data remains within an organization. External partners can be given access to VPCs with the appropriate role-based access controls to perform the annotation, customization, and fine-tuning work needed to improve model performance.
Base Foundation Models

Description: At the core of any Generative AI application is one or more "base" models such as OpenAI's GPT-4, Anthropic's Claude, Cohere's Command model, or open-source models like T5-FLAN, StableLM, or BLOOM.
Build: Companies may choose to train their own base model (for example, BloombergGPT) when the performance of existing models - even with fine-tuning - is insufficient to meet their needs.
Buy: Commercial providers like OpenAI, Anthropic, and Cohere provide API access to their pre-trained models, typically charging based on usage.
There are tradeoffs to both commercial and open-source models, so it is essential to carefully consider your use case and the models available before investing. For example, some commercial models, such as GPT-4, do not offer fine-tuning. In contrast, open-source models provide this capability and more flexibility but also require the company to host the model themselves.
How do I Get My Team Started on Deploying Generative AI Applications?
1. Prioritize your use cases
A few low-risk, simple use cases are well-served by buying applications directly from the market. However, use cases that drive specific business outcomes and require high-quality, consistently reliable outputs justify the upfront investment to supply the tooling and data required for custom applications.
Across functions, we are seeing rising demand for Generative AI to improve customer experience, increase operational efficiencies, introduce new product capabilities and new products, and improve workforce capabilities. Many of our customers and partners have identified that they are suffering losses from employee attrition and poor performance, dissatisfied customers, or complex processes that are not fully optimized. To help prioritize their Generative AI use cases for the most significant impact, our customers and partners are asking themselves:
- What are our largest cost drivers, and can any of these costs be reduced with automated retrieval, summary, or generation based on our data?
- Where in our business are we processing large quantities of documents?
- How are we organizing our internal knowledge bases today?
- How effective is our customer-facing support?
- Are there roles where training and onboarding of new hires is a bottleneck?
- Where is our organization limited by the availability of resources such as software engineers or data scientists whose work could be accelerated?
- How many data scientists write queries or build dashboards, and how much time do they spend doing so?
Once you have this list of potential use cases for Generative AI, you can prioritize them for development based on the total value at stake to your business, the feasibility of deployment (e.g., technical and operational complexity, change management required, and cost), and the potential risks including risks to your brand, customers, or security. Initially, focus on a few high-impact, high-feasibility, and relatively low-risk use cases as pilots, followed by other use cases after your organization gathers insights from the initial pilots.
2. Scope your requirements
Assemble the relevant technical and business leaders from your organization along with any external experts to assess:
- What does "good" look like in terms of performance?
- What does "bad" look like? What kinds of risks are involved?
- Where will the application be hosted?
- Will we explore and compare multiple generative models?
- How much customization of a broadly-trained model is required?
- What type of data will we use to fine-tune the model? Does this data need to stay in our VPC?
3. Baseline internal capabilities
At a minimum, deploying bespoke applications will require a small team of ML and software engineers with dedicated capacity over a few weeks to experiment with and assess the use case. However, it may also require external talent to inject expertise and upskill internal talent. Building bespoke applications doesn't require hiring large internal teams. Depending on your use case and desired timelines, investing in external expertise to get your teams up and running more quickly could be worthwhile, especially given the tight labor market for expert profiles in this space.
In particular, for use cases that lean heavily on fine-tuning, external workforces and tooling can dramatically accelerate progress for initial use cases. Most enterprises we've seen do not have data "ready to go" for fine-tuning, so standing up the teams to collect, clean, and annotate such data can be a significant lift.
4. Plan for moving from experimentation to production
With the information available to you on current use cases, internal vs. external capabilities, and requirements, you can get a small internal team mobilized on experimentation in just a few weeks. However, moving this effort towards real, sustained impact will require a vision for how to move these experiments into production. You must decide whether to build or buy for every layer of the stack, as well as a plan for organizational impact in the next one to two years and how you will evolve your capabilities in response.
In addition to technical considerations, here are other questions to consider:
- Once deployed, what changes to organizational processes will be required to sustain these use cases? For example, will approval chains need to change?
- How will Generative AI change the way teams or business units interact with one another? What changes to organizational structure might be required?
- How can we stay ahead of regulatory and compliance implications?
- How will we collect data and implement a data engine for continued model improvement?
Conclusion
The last few months have been an exciting period of innovation and experimentation with Generative AI. Enterprises are now grappling with how to compete while responsibly and effectively deploying Generative AI. This journey to scale such applications presents complex challenges for any organization, but with an intentional plan, the proper tooling, and talent, companies should feel well-positioned to become first movers in harnessing the power of Generative AI.
At Scale, our mission is to accelerate the development of AI applications. Since 2016, we have been the trusted partners of the world's most ambitious AI/ML teams, working together to solve their most complex problems. With our Enterprise Generative AI Platform (EGP) platform, we help businesses accelerate the development of Generative AI applications by providing secure and scalable infrastructure, machine learning expertise, and the most sophisticated data engine on the market.
Guide to Large Language Models
Large language models (LLMs) are transforming how we create, understand our world, and how we work. We created this guide to help you understand what LLMs are and how you can use these models to unlock the power of your data and accelerate your business.
What are Large Language Models?
Large language models (LLMs) are machine learning models trained on massive amounts of text data that can classify, summarize, and generate text. LLMs such as OpenAI’s GPT-4, Google’s PaLM 2, Cohere’s Command model, and Anthropic’s Claude, have demonstrated the ability to generate human-like text, often with impressive coherence and fluency. Until the arrival of ChatGPT, the most well-known examples of large language models were GPT-3 and BERT, which have been trained on vast amounts of text data from the internet and other sources. Generally, LLMs are capable of a wide variety of natural language processing (NLP) applications, including copywriting, content summarization, code generation and debugging, chatbots, question answering, and translation.
At a high level, large language models are language prediction models. These models aim to predict the most likely next word given the words provided as input to the model, also called prompts. These models generate text one word at a time based on a statistical analysis of all the “tokens” they have ingested during training (tokens are strings of characters that are combined to form words). LLMs have a wide array of capabilities and applications that we will explore in this guide.
Despite the significant progress in making these models more capable and widely accessible, many organizations are still uncertain about how to adopt them properly. From the Scale Zeitgeist 2023 report, we found that while most respondents (60%) are experimenting with generative models or plan on working with them in the next year, only 21% have these models in production. Many organizations cited a lack of the software and tools, expertise, and changing company culture as key challenges to adoption. We wrote this guide to help you get a better understanding of large language models and how you can start adopting them for your use cases.
Why are Large Language Models important?
Large language models have revolutionized natural language processing and have a wide range of applications. These models are transforming how we create, understand our world, and conduct business. Large language models help us write content like blogs, emails, or ad copy more quickly and creatively. They enable developers to write code more efficiently and help them find bugs in large code bases. Developers can also integrate their applications with LLMs using English-language prompts without needing a machine learning background, accelerating innovation. Large language models summarize long-form content so that we can quickly understand the most critical information from reports, news articles, and company knowledge bases. Chatbots are finally living up to their promise of enabling businesses to streamline operations while improving customer service.
Large Language Models are more available to a wider audience than ever, as companies such as OpenAI, Anthropic, Google, Stability AI, and Cohere, provide APIs or open-source models to be used by the larger community. Additionally, the talent pool of machine learning engineers is growing and new roles such as "prompt engineer" are becoming popular (source).
Due to the large amount of data they have been trained on, large language models generalize to a wide range of tasks and styles. These models can be given an example of a problem and are then able to solve problems of a similar type. However, out of the box, these models are poor specialists. To take full advantage of LLMs, businesses need to fine-tune models on their proprietary data. For example, consider a financial services company looking to perform investment research. As base models only have access to outdated publicly available data, they will provide generic information about stocks or other assets but often will be unable or will flat-out refuse to provide investment advice. Alternatively, a fine-tuned model with access to private research reports and databases is able to provide unique investment insights that can lead to higher productivity and investment returns.
When used properly, LLMs help organizations to empower their employees, increase their efficiency, and are the foundation for better customer experience. We will now explore how these models work and how to deploy them properly to maximize the benefits for your business.
Common Use Cases for Large Language Models
LLMs are capable of a wide variety of tasks, the most common of which we will outline here. We will also discuss some domain-specific tasks for a select few industries.
Classification and Content Moderation
Large language models can perform a wide range of natural language processing tasks, including classification tasks. Classification is the process of assigning a given input to one or multiple predefined categories or classes. For example, a model might be trained to classify a sentence as either positive or negative in sentiment. Beyond sentiment analysis, LLMs can be used to detect the reasons for customers' calls (no more needing to sit through long phone menus to get to the right agent) or properly organize user feedback between UX suggestions, bug reports, or feature requests.
Content moderation is also a common application of LLMs classification power. A common use case is flagging if users are posting toxic and inappropriate content. LLM can be fine-tuned to quickly adapt to new policies, making them highly versatile tools for content moderation.
Classification: Classify each statement as "Bearish", "Bullish", or "Neutral"

Text Generation
One of the most impressive capabilities of large language models is their ability to generate human-like text. Large language models can produce coherent and well-written prose on almost any topic in an instant. This ability makes them a valuable tool for a variety of applications, such as automatically generating responses to customer inquiries or even creating original content for social media posts. Users can request that the response is written in a specific tone, from humorous to professional, and can mimic the writing styles of authors such as William Shakespeare or Dale Carnegie.
Text Extraction
Large language models can also extract key information from unstructured text. This can be particularly helpful for search applications or more real-time use cases like call center optimization, such as automatically parsing a customer's name and address without a structured input. LLMs are particularly adept at text extraction because they importantly understand the context of words and phrases and can filter extraneous information from important details.
Summarization
Large language models can also perform text summarization, which is the process of creating a concise summary of a given piece of text that retains its key information and ideas. This can be useful for analyzing financial statements, historical market data, and other proprietary data sources and providing a summary of the documents for financial analysts. Text extraction combined with text summarization is a very powerful combination. For companies or industries with a large corpus of written data, these two properties can be combined to retrieve relevant information, such as unstructured text or knowledge in a structured database, and then summarize it accordingly for human consumption while also citing the specific source.
Question Answering
LLMs can retrieve data from knowledge bases, documents, or user-supplied text to answer specific questions, such as what types of products to select or what assets provide the highest returns. When a model is fine-tuned on domain-specific data and refined with RLHF, these models can answer questions incredibly accurately. For example, consider an eCommerce chatbot:
Search
Search engines like Bing, Google, and You.com already have or will incorporate LLMs into their search engines. Companies are also looking to implement LLMs as a form of enterprise search. While base foundation models are unreliable for citing facts, they summarize search results well. As we highlight throughout this guide, it is important to ensure that a base model is fine-tuned and aligned for any enterprise use case.
Software Programming
These models are also reshaping how developers write software. Code-completion tools like OpenAI's codex and Github Copilot give developers a powerful tool to increase their efficiency and debug their code. Programmers can ask for functions from scratch, or provide existing functions and ask the LLM to help them debug it. As the context window size increases, these tools will be able to help analyze entire code bases as well (source).
General Assistants
Large language models can be used for tasks such as data analysis, content generation, and even helping to design new products. The ability to quickly process and analyze large amounts of data can also help businesses make better decisions, increase employee productivity, and stay ahead of the competition.
Industry Use Cases
Let's quickly explore how a few industries are adopting Generative AI to improve their business:
- Insurance companies use AI to increase the operational efficiency of claims processing. Claims are often highly complicated, and Generative AI excels at properly routing, summarizing, and classifying these claims.
- Retail and eCommerce companies have for years tried to adopt customer chatbots, but they have failed to live up to their promise of streamlining operations and providing a better user experience. But now, with the latest generative chatbots that are fine-tuned on company data, chatbots finally provide engaging discussions and recommendations that dynamically respond to customer input.
- Financial services companies are building assistants for investment research that analyze financial statements, historical market data, and other proprietary data sources and provide detailed summaries, interactive charts, and even take action with plugins. These tools increase the efficiency and effectiveness of investors by surfacing the most relevant trends and providing actionable insights to help improve returns.
Overall, large language models offer a wide range of potential benefits for businesses.
A Brief History of Language Models
To better contextualize the impact of these models, it is essential to understand the history of natural language processing. While the field of natural language processing (NLP) began in the 1940s after World War II, and the concept for using neural networks for natural language processing dates back to the 1980s, it was not until relatively recently that the combination of processing power via GPUs and data necessary to train very large models became widely available. Symbolic and statistical natural language processing were the dominant paradigms from the 1950s through the 2010s.
Recurrent Neural Networks (RNNs) were popularized in the 1980s. RNNs are a basic form of artificial neural network that can handle sequential data, but they struggle with long-term dependencies. In 1997, LSTMs were invented; LSTMs are a type of RNN that can manage long-term dependencies better due to their gating mechanism. Around 2007, LSTMs began to revolutionize speech recognition, with this architecture being used in many commercial speech-to-text applications.
Throughout the early 2000s and 2010s, trends shifted to deep neural nets, leading to rapid improvement on the state of the art for NLP tasks. In 2017, the now-dominant transformer architecture was introduced to the world (source), changing the entire field of AI and machine learning. Transformers use an attention mechanism to process entire sequences at once, making them more computationally efficient and capable of handling complex contextual relationships in data. Compared to RNN and LSTM models, the transformer architecture is easier to parallelize, allowing training on larger datasets.
2018 was a seminal year in the development of language models built on this transformer architecture, with the release of both BERT from Google and the original GPT from OpenAI.
BERT, which stands for Bidirectional Encoder Representations from Transformers, was one of the first large language models to achieve state-of-the-art results on a wide range of natural language processing tasks in 2018. BERT is widely used in business today for its classification capabilities, as it is a relatively lightweight model and inexpensive to run in production. BERT was state of the art when it was first unveiled, but has now been surpassed in nearly all benchmarks by more modern generative models such as GPT-4.
The GPT family of models differs from BERT in that GPT models are generative and have a significantly larger scale leading to GPT models outperforming BERT on a wide range of tasks. GPT-2 was released in 2019, and GPT-3 was announced in 2020 and made generally available in 2021. Google released the open-source T5/FLAN (Fine-tuned Language Net) model and announced LaMDA in 2021, pushing the state of the art with highly capable models.
In 2022 the open-source BLOOM language model, the more powerful GPT-3 text-davinci-003, and ChatGPT were released, capturing headlines and catapulting LLMs to popular attention.In 2023, GPT-4 and Google's Bard chatbot were announced. Bard was originally running LaMDA, but Google has since replaced it with the more powerful PaLM 2 model.
There are now several competitive models including Anthropic’s Claude, Cohere’s Command Model, Stability AI’s StableLM. We expect to see these models to continue to improve and gain new capabilities over the next few years. In addition to text, multimodal models will be able to ingest and respond with images and videos, with a coherent understanding of the relationships between these modalities. Models will hallucinate less and will be able to more reliably interact with tools and databases. Developer ecosystems will proliferate around these models as a backend, ushering in an era of accelerated innovation and productivity. While we do expect to see larger models, we expect model builders will focus more on high quality data to improve model performance.
Model Size and Performance
Over time, LLMs have become more capable as they've increased in size. Model size is typically determined by its training dataset size measured in tokens (parts of words) or by its number of parameters (the number of values the model can change as it learns).
- BERT (2018) was 3.7B tokens and 240 million parameters (source).
- GPT-2 (2019) was 9.5B tokens and 1.5 billion parameters (source).
- GPT-3 (2020) has 499B tokens and 175B parameters (source).
- PaLM (2022) was 780 Billion tokens and 540 billion parameters (source).

As these models scaled in size, their capabilities continued to increase, providing more incentive for companies to build applications and entire businesses on top of these models. This trend continued until very recently.
But now, model builders are grappling with the fact that we may have reached a sort of plateau, a point at which additional model size yields diminishing performance improvements. Deepmind's paper on training compute-optimal LLMs, (source), showed that for every doubling of model size the number of training tokens should also be doubled. Most LLMs are already trained on enormous amounts of data, including most of the internet, so expanding dataset size by a large degree is increasingly difficult. Larger models will still outperform smaller models, but we are seeing model builders focusing less on increasing size and instead focusing on incredibly high-quality data for pre-training, combined with techniques like supervised fine-tuning (SFT), reinforcement learning with human feedback (RLHF), and prompt engineering to optimize model performance.
Fine-Tuning Large Language Models
What is fine-tuning of an LLM?
Fine-tuning is a process by which an LLM is adapted to specific tasks or domains by training it on a smaller, more targeted dataset. This can help the model better understand the nuances of the specific tasks or domains and improve its performance on those particular tasks. Through human evaluations on prompt distribution, OpenAI found that outputs from their 1.3B parameter InstructGPT model were preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters (source). Fine-tuned models perform better and are much less likely to respond with toxic content or hallucinate (make up information). The approach for fine-tuning these models included a wide array of different domains, though still a tiny subset compared to the entirety of internet data.
This principle of fine-tuning increasing task-specific performance also applies to single domains, such as a particular industry or specific task. Fine-tuning large language models (LLMs) makes them incredibly valuable for businesses. For example, a company that provides language translation services could fine-tune an LLM to understand better the nuances of a particular language or domain, such as legal documents or insurance claims. This understanding helps the model generate more accurate and fluent translations, leading to better customer satisfaction and potentially even higher revenues.
Another example is a business that generates product descriptions for an e-commerce website. By fine-tuning an LLM to understand the characteristics of different products and their features, the model could generate more informative and compelling descriptions, which could increase sales and customer engagement. Fine-tuning an LLM can help businesses tailor the model's capabilities to their specific needs and improve their performance in various tasks and domains.
What is the process of fine-tuning?
Fine-tuning an LLM generally consists of the following high-level process:
- Identify the task or domain you want to fine-tune the model for, which could be anything from language translation to text summarization to generating product descriptions.
- Gather a targeted dataset relevant to the task or domain you want to fine-tune the model for. This dataset should be large enough to provide the model with sufficient information to learn from but not so large that it takes a long time to train. A few hundred training examples is the minimum recommended amount, with more data increasing model quality. Ensure the data is relevant to your industry and specific use case.
- Use a machine learning framework, library, or a tool like Scale GenAI platform to train the LLM on the smaller dataset. This will involve providing the model with the original data, "input text," and the corresponding desired output, such as summarization or classification of the text into a set of predefined categories, and then allowing the model to learn from this data by adjusting its internal parameters.
- Monitor the model's performance as it trains and make any necessary adjustments to the training process to improve the output. This could involve changing the size of the training dataset, adjusting the model's learning rate, or modifying the model's architecture.
- Once the model has been trained, evaluate its performance on the specific task or domain you fine-tuned it for. This will involve providing the model with input text and comparing its actual output to the desired output.
Reinforcement Learning from Human Feedback (RLHF)
What is RLHF?
Reinforcement learning from human feedback (RLHF) is a methodology to train machine learning models by soliciting feedback from human users. RLHF allows for more efficient learning. Instead of attempting to write a loss function that will result in the model behaving more like a human, RLHF includes humans as active participants in the training process. RLHF results in models that align more closely with human expectations, a typical qualitative measure of model performance.

Models trained with RLHF
Models trained with RLHF, such as InstructGPT and ChatGPT, have the benefit of generally being more helpful and more aligned with a user's goals (source). These models are better at following instructions and tend not to make up facts (hallucinate) as often as models trained with other methods. Additionally, these models perform as well as traditional models but at a substantially smaller size (InstructGPT is 1.3 billion parameters, compared to GPT-3 at 175 billion parameters).
InstructGPT
OpenAI API uses GPT-3 based language models to perform natural language tasks on user prompts, but these models can generate untruthful or toxic outputs. To improve the models' safety and alignment with user intentions, OpenAI developed InstructGPT models using reinforcement learning from human feedback (RLHF). These models are better at following instructions and generate less toxic content. They have been in beta on the API for over a year and are now the default language models on the API. OpenAI believes that fine-tuning language models with human input is a powerful way to align them more closely with human values and make them more reliable (source).
ChatGPT
ChatGPT is a large language model that has been developed specifically for the task of conversational text generation. This model was initially trained with supervised fine-tuning with humans interacting to create a conversational dataset. The model was then fine-tuned with RLHF, with humans ranking model outputs which were then used to improve the model.
One of the key features of ChatGPT is its ability to maintain the context of a conversation and generate relevant responses. As such, it is a valuable tool for applications such as search engines or chatbots, where the ability to generate coherent and appropriate responses is essential. In addition, ChatGPT can be fine-tuned for even more specific applications, allowing it to achieve even better performance on specialized tasks.
Overall, ChatGPT has made large language models more accessible to a wider range of users than previous large language models.
While the high-level steps for fine-tuning are simple, to accurately improve the performance of the model for a specific task, expertise is required.
LLM Prompt Engineering
What is Prompt engineering?
Prompt engineering is the process of carefully designing the input text, or "prompt," that is fed into an LLM. By providing a well-crafted prompt, it is possible to control the model's output and guide it to generate more desirable responses. The ability to control model outputs is useful for various applications, such as generating text, answering questions, or translating sentences. Without prompt engineering, an LLM may generate irrelevant, incoherent, or otherwise undesirable responses. By using prompt engineering, it is possible to ensure that the model generates the desired output and makes the most of its advanced capabilities.
Prompt engineering is a nascent field, but a new career is already emerging, that of the "Prompt Engineer."
What does a prompt engineer do?
A prompt engineer for large language models (LLMs) is responsible for designing and crafting the input text, or "prompts," that are fed into the models. They must have a deep understanding of LLM capabilities and the specific tasks and applications it will be used for. The prompt engineer must be able to identify the desired output and then design prompts that are carefully crafted to guide the model to generate that output. In practice, this may involve using specific words or phrases, providing context or background information, or framing the prompt in a particular way. The prompt engineer must be able to work closely with other team members and adapt to changing requirements, datasets, or models. Prompt engineering is critical in ensuring that LLMs are used effectively and generate the desired output.
How do you prompt an LLM?
Prompt Engineering for an LLM generally consists of the following high-level process:
- Identify the task or application you want to use the LLM for, such as generating text, answering questions, or summarizing reports.
- Determine the specific output you want the LLM to generate, which could be a paragraph of text, a single value for classification, or lines of code.
- Carefully design a prompt to guide the LLM to generate the desired output. Be as specific as possible and provide context or background information to ensure that the language is clear.
- Feed the prompt into the LLM and observe the output it generates.
- If the output is not what you desired, modify the prompt and try again.
- Following these high level can help you get the most out of your model and make it more useful for a variety of applications. To quickly get started with prompt engineering for large language models, try out Scale Spellbook today.
Below we provide an overview of a few popular prompt engineering techniques:
Popular Prompt Engineering Techniques
Ensuring Brand Fidelity
In combination with RLHF and domain-specific fine-tuning, prompt engineering can help ensure that model responses reflect your brand guidelines and company policies. By specifying an identity for your model in a prompt, you can enforce the desired model behavior in various scenarios.
For instance, let's say that you are Acme Corp., a financial services company. A user has landed on your website by accident and is asking for advice on a particular pair of running shoes.

This response is an example of an AI hallucination or the model fabricating results. Though the company does not sell running shoes, it gladly responds with a suggestion. Let's update the default prompt, or system message, to cover this edge case.
Default Prompt: We will specify a default prompt, which is added to every session to define the default behavior of the chatbot. In this example, we will use this default prompt:
"You are AcmeBot, a bot designed to help users with financial services questions. AcmeBot responses should be informative and actionable. AcmeBot's responses should always be positive and engaging. If a user asks for a product or service unrelated to financial services, AcmeBot should apologize and simply inform the user that you are a virtual assistant for Acme Corp, a financial services company and cannot assist with their particular request, but that you would be happy to assist with any financial questions the user has."
With this default prompt in place, the model now behaves as we expect:

Improved Information Parsing
By specifying the desired template for the response, you can steer the model to return data in the format that is required by your application. For example, say you are a financial institution integrating existing backend systems with a natural language interface powered by an LLM. Your backend systems require a specific format to accept any data, which an LLM will not provide out of the box. Let's look at an example:

This response is accurate, but it is missing context that our backend systems need to parse this data properly. Let's specify the template we need to receive an appropriate response. Depending on the application, this template can also be added as part of a default prompt.

Now our data can be parsed by our backend system!
Adversarial or “Red-team” prompting
Chat models are often designed to be deployed in public-facing applications, where it's important they do not produce toxic, harmful, or embarrassing responses, even when users intentionally seek such material. Adversarial prompts are designed to elicit disallowed output, tricking or confusing a chat model into violating the policies its creators intended.
One typical example is prompt injection, otherwise referred to as instruction injection. Models are trained to follow user instructions but are also given a directive by a default prompt to behave in certain ways, such as not revealing details about how the model works or what the default prompt is. However, with clever prompting, the model can be tricked to disregard its programming and follow user instructions that conflict with its training or default prompt.
Below we explore a simple example of an instruction injection, followed by an example using a model that has been properly trained, fine-tuned, and with a default prompt that prevents it from falling prey to these common adversarial techniques:
Adversarial prompt with poor response:

Adversarial prompt with desired response:

Adversarial prompt engineering is an entire topic unto itself, including other techniques such as role-playing and fictionalization, unusual text formats and obfuscated tasks, prompt echoing, and dialog injection. We have only scratched the surface of prompt engineering here, but there are a wide array of different techniques to control model responses. Prompt engineering is evolving quickly, and experienced practitioners have spent much time developing an intuition for optimizing prompts for a desired model output. Additionally, each model is slightly different and responds to the same prompts with slightly different behaviors, so learning these differences adds another layer of complexity. The best way to get familiar with prompt engineering is to get hands on and start prompting models.
Conclusion
As we have seen, LLMs are versatile tools that can be applied to a wide variety of use cases. These models have already had a transformative impact on the business landscape, with billions of dollars being spent in 2023 alone. Nearly every industry is working on adopting these models into their specific use cases, from insurance companies looking to optimize claims processing, wealth managers looking for unique insights across a large number of portfolios to help them improve returns, or eCommerce companies looking to make it easier to purchase their products.
To optimize investments in LLMs, it is critical that businesses understand how to properly implement them. Using base foundation models out of the box is not sufficient for specific use cases. These models need to be fine-tuned on proprietary data, improved with human feedback, and prompted properly to ensure that their outputs are reliable and accomplish the task at hand.
At Scale we help companies realize the value of large language models, including those that are building large language models and those that are looking to adopt them to make their businesses better. Our GenAI Platform provides comprehensive testing and evaluation and data solutions to unlock the full value of AI.
Guide to Computer Vision Applications
Understand what computer vision is, how it works, and deep dive into some of the top applications for computer vision by industry.
Introduction
As discussed in our Authoritative Guide to Data Labeling, machine learning (ML) has revolutionized our approach to solving problems in computer vision and natural language processing.
This guide aims to provide an overview of computer vision (CV) applications within the field of machine learning: what it is, how it works, subfields of computer vision, and a breakdown of computer vision use cases by industry.
What is computer vision?
For decades, people have dreamed of developing machines with the characteristics of human intelligence. An important step in creating this artificial intelligence is giving computers the ability to “see” and understand the world around them.
Computer Vision is a field of artificial intelligence that focuses on developing systems that can process, analyze, and make sense of visual data (images, videos, and other sensor data) similar to the way humans do. From an engineering perspective, computer vision systems not only seek to understand the world around them but aim to automate the tasks the human visual system can perform.
How does computer vision work?
Computer vision is inspired by the way human visual systems and brains work. The computer vision algorithms we use today are based on pattern recognition, training models on massive amounts of visual data. For example, suppose we train a model on a million images of flowers. The system will analyze the million images, identify patterns that apply to all flowers, and at the end will learn to detect a flower given a new image.
A type of deep learning algorithm called a convolutional neural network (CNN) is critical to powering computer vision systems. A CNN consists of an input layer, hidden layers, and an output layer, and these layers are applied to find the patterns described above. CNNs can have tens or even hundreds of hidden layers.

Computer vision applications can be trained on a variety of data types, including images, videos, and other sensor data such as light detection and ranging (LiDAR) data, and radio detection and ranging (RADAR) data. Each data type has its strengths and shortcomings.
Images
Pros:
- Large-scale open-source datasets are available for image data (ImageNet, MS COCO, etc.).
- Cameras are inexpensive if you need to collect data from scratch.
- Images are easier to annotate compared to other data types.
Cons:
- Even the most popular large-scale datasets have known quality issues and gaps that can limit the performance of your models.
- If your use case requires depth perception (e.g. autonomous vehicles or robotics), images alone may not provide the accuracy you need.
- Static images alone are not sufficient to develop object-tracking models.
Videos
Pros:
- Again, cameras are inexpensive if you need to collect data from scratch.
- Enables the development of object tracking or event detection models.
Cons:
- More challenging to annotate compared to images, especially if pixel-level accuracy is required.
LiDAR
What is LiDAR?
LiDAR uses laser light pulses to scan its environment. When the laser pulse reaches an object, the pulse is reflected and returned to the receiver. The time of flight (TOF) is used to generate a three-dimensional distance map of objects in the scene.
Pros:
- LiDAR sensors are more accurate and provide finer resolution data than RADAR.
- Allows for better depth perception when developing computer vision systems.
- LiDAR can also be used to determine the velocity of a moving object in a scene.
Cons:
- Advancements in LiDAR technology have brought down costs in the last few years, but it is still a more costly method of data collection than images or videos.
- Performance degrades in adverse weather conditions such as rain, fog, or snow.
- Calibrating multiple sensors for data collection is a challenge.
- Visualizing and annotating LiDAR data is technically challenging, requires more expertise, and can be expensive.
RADAR
What is RADAR?
RADAR sensors work much like LiDAR sensors but use radio waves to determine the distance, angle, and radial velocity of objects relative to the site instead of a laser.
Pros:
- Radio waves have less absorption compared to the light waves used by LiDAR. Thus, they can work over a relatively long distance, making it ideal for applications like aircraft or ship detection.
- RADAR performs relatively well in adverse weather conditions such as rain, fog, or snow.
- RADAR sensors are generally less expensive than LiDAR sensors.
Cons:
- Less angularly accurate than LiDAR and can lose sight of target objects on a curve.
- Less crisp/accurate images compared to LiDAR.
Notable research in computer vision
Advancements in the field of computer vision are driven by robust academic research. In this chapter, we will highlight some of the seminal research papers in the field in chronological order.
ImageNet: A large-scale hierarchical image database
J. Deng, W. Dong, R. Socher, L. -J. Li, Kai Li and Li Fei-Fei, "ImageNet: A large-scale hierarchical image database," 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248-255, doi: 10.1109/CVPR.2009.5206848.
Why it’s important: This paper introduced the Imagenet dataset, which has been the standard in the field of computer vision since 2009.
AlexNet:
A. Krizhevsky, I. Sutskever, and Geoffrey Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Advances in Neural Information Processing Systems (NIPS), 2012.
Why it’s important: This paper put convolutional neural networks (CNNs) on the map as a solution to solve complicated vision classification tasks.
ResNet:
K. He, X. Zhang, S. Ren, Jian Sun, “Deep Residual Learning for Image Recognition,” arXiv, 2015.
Why it’s important: This paper introduced key ideas to help train significantly deeper CNNs. Deeper CNNs are crucial to improving the performance of computer vision models.
MoCO:
K. He, H. Fan, Y. Wu, S. Xie, and Ross Girshick, “Momentum Contrast for Unsupervised Visual Representation Learning,” 2020 IEEE Conference on Computer Vision and Pattern Recognition, 2019.
Why it’s important: This was the first self-supervised learning paper that was competitive with supervised learning (and sparked the field of contrastive learning).
Vision Transformers:
A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and Neil Houlsby, “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” 2021 International Conference on Learning Representations, 2020.
Why it’s important: This paper showed how transformers, which were already dominant in natural language models, could be applied for vision.
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis:
B. Mildenhall, P. Srinivasan, M. Tancik, J. Barron, R. Ramamoorthi, and Ren Ng, “NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis,” 2020 European Conference on Compter Vision, 2020.
Why it’s important: This was a foundational paper for hundreds of papers in the last few years showing how to generate novel views of a 3d scene using a small number of captured images, representing the entire scene implicitly (as opposed to using classical computer graphics representations such as meshes and textures).
Masked Autoencoders:
K. He, X. Chen, S. Xie, Y. Li, P. Dollar, and Ross Girshick, “Masked Autoencoders are Scalable Vision Learners,” arXiv, 2021.
Why it’s important: This paper introduced a new self-supervised learning technique that uses masking ideas that were successful in language. Advantages include that it isn't based on contrastive learning and is efficient.
What are some subfields of computer vision?
There are many different subfields or subdomains within computer vision. Some of the most common include: object classification, object detection, object recognition, object tracking, event detection, and pose estimation. In this chapter, we will provide a brief overview as well as an example of these subfields.
Object Classification
With object classification, models are trained to identify the class of a singular object in an image. For example, given an image of an animal, the model would return what animal is identified in the image (e.g. an image of a cat should come back as “cat”).
Object Detection
With object detection, models are trained to identify occurrences of specific objects in a given image or video. For example, given an image of a street scene and an object class of “pedestrian”, the model would return the locations of all pedestrians in the image.
Object Recognition
With object recognition, models are trained to identify all relevant objects in a given image or video. For example, given an image of a street scene, an object recognition model would return the locations of all objects it has been trained to recognize (e.g. pedestrians, cars, buildings, street signs, etc.)
Object Tracking
Object tracking is the task of taking an initial set of object detections, creating unique IDs for each detection, and then tracking each object as it moves around in a video. For example, given a video of a fulfillment center, an object tracking model would first identify a product, tag it, then track it over time as it is moved around the facility.
Event Detection
With event detection, models are trained to determine when a particular event has occurred. For example, given a video of a retail store, an event detection model would flag when a customer has picked up or bagged a product to enable autonomous checkout systems.
Pose Estimation
With pose estimation, models are trained to detect and predict the position and orientation of a person, object, or keypoints in a given scene. For example, given an ego-centric video of a person opening a door, a pose estimation model will detect and predict the position and orientation of the first person’s hands to unlock and open the door.
Depth Estimation
With depth estimation, models are trained to measure the distance of an object relative to a camera or sensor. Depth can be measured either from monocular (single) or stereo (multiple views) images. Depth estimation is critical to enable applications such as autonomous driving, augmented reality, robotics, and more.
Generation
Generative models or diffusion models are a class of machine learning models that can generate new data based on training data. For more on diffusion models, take a look at our Practical Guide to Diffusion Models.
What are the top computer vision use cases by industry?
Various industries are developing computer vision applications, from automotive, software & internet, healthcare, retail & eCommerce, robotics, government & public sector, and more. In this chapter, we provide a non-exhaustive list of top computer vision use cases for each industry.
Automotive
Autonomous driving is the automotive industry’s most prominent use case for computer vision. Autonomous driving, however, is not a zero or one proposition. Many automotive manufacturers have been incrementally adding safety and self-driving features to their vehicles that vary in degrees of autonomy and human control. The automotive industry has standardized on a scale from 0 to 5 to describe the levels of autonomy.
Advanced Driver-Assistance Systems (ADAS) are technological features in vehicles designed to increase the safety of driving. ADAS features generally fall under level one or two autonomy in the five levels of autonomous driving. Features such as adaptive cruise control, emergency brake assistance, lane-keeping, and parking assistance are examples of ADAS features.
Autonomous vehicles, or self-driving cars, generally fall under level three, four, or five autonomy. AVs are capable of sensing the environment around them and navigating safely with little to no human input.
Software & Internet
The software and internet industry is pioneering computer vision applications in augmented and virtual reality (AR/VR), content understanding, and more.
Augmented Reality (AR) integrates real-world objects with computer-generated information in the form of text, graphics, audio, and other virtual enhancements. Examples of augmented reality include virtual try-on, Snapchat filters, Pokémon Go, interior decorating apps, 3D exploration of diagnostic imaging, equipment/robotics repair, and more. Virtual Reality (VR), on the other hand, fully immerses a user in a virtual world and obscures the real world.
Content Understanding analyzes, enriches, and categorizes content from social media posts, videos, images, and more. One core application of content understanding is content data enrichment or adding metadata to improve model recommendation rankings. By developing an improved understanding of your content, teams can quickly discover growth areas for better personalization. A second core application is trust and safety, through automated detection of user-generated content that violates a platform's guidelines. Content understanding improves personalization, content recommendation systems, and increased user safety.
Healthcare
The healthcare sector is leveraging computer vision technology to enable healthcare professionals to make better decisions and ultimately improve patient outcomes. The standardization of medical imaging in the Digital Imaging and Communications in Medicine (DICOM) format, as well as the increased use of wearable devices, has led to use cases such as:
- Diagnostics
- Patient monitoring
- Research and development
Retail & eCommerce
Computer vision applications can benefit both shoppers and retailers. CV technology can enhance product discovery and deliver a more seamless shopping experience for customers while enhancing customer engagement, cost savings, operational efficiency, and clicks and conversions for retailers. Use cases include:
- Autonomous Checkout
- Product Matching/Deduplication
- AI-generated product imagery
For more on AI for eCommerce, take a look at our Guide to AI for eCommerce.
Robotics
Agriculture, warehousing and manufacturing, and any other industry that uses robotics has begun leveraging computer vision technology to improve their operations and enhance safety. Uses cases include:
- Inventory sorting and handling
- Defect detection
- Automated harvesting
- Plant disease detection
Government & Public Sector
The US government has troves of geospatial data across a number of sensor types including electro-optical (EO), synthetic aperture radar (SAR), and full motion video (FMV). The amount of geospatial data produced by the US government and commercial satellite vendors is increasing while the amount of analysts power is staying the same. The use of computer vision to process, exploit, and disseminate (PED) this data is essential for the government to utilize the full potential of all available data and derive increasingly meaningful insights about the way our world operates.
Use cases include:
- Assessment of natural disasters and war on infrastructure. Take a look at TIME magazine’s Detecting Destruction best AI invention.
- Intelligence, surveillance, and reconnaissance (ISR)
- Perimeter security
- Environmental monitoring from space
Conclusion
A wide range of industries is developing computer vision applications. The success of computer vision models, however, is highly dependent on the quality of training data models are trained on. From generating, and curating, to annotating data, data pipelines must be set up for success.
To generate data for your models, you can collect data from scratch, leverage existing open-source datasets, or synthetically generate or augment datasets. Once you have generated data, you need to curate that data to identify points of failure and rare edge cases to optimize model performance. From there, you can annotate your data. For a more comprehensive guide on how to annotate data for your machine learning projects, take a look at our Authoritative Guide to Data Annotation.
We hope you found this guide helpful as you think about developing your own CV applications. If you already have data and are ready to get started, check out Scale Generative AI Platform, our platform that automatically trains state-of-the-art machine learning models.
Additional Resources
Guide to AI for eCommerce
This guide details the main applications of Artificial Intelligence for the eCommerce Industry.
Introduction
81% of retail executives say AI is at least moderately to fully functional in their organization. However, 78% of retail executives surveyed state it is hard to keep up with the evolving AI landscape. In recent years, eCommerce teams have accelerated the need to adapt to new customer preferences and create exceptional digital shopping experiences. AI adoption is no longer a choice but a necessity for retailers to drive growth at scale and maintain market differentiation. eCommerce companies are now using AI to create new forms of customer engagement, enhance online checkout solutions, and drive cost-effective processes for digital commerce.
This guide will provide a comprehensive overview of the main applications for AI in eCommerce companies and share best practices from Scale’s experience in retail.
AI for eCommerce: Why is it important?
There are several ways AI is beneficial for eCommerce:
Enhance the customer experience: AI solutions for eCommerce can help companies personalize product recommendations, improve search results, and better understand customer sentiment. With accurate personalization and recommendation machine learning models, companies can help reduce time to buy, accurately portray products on product detail pages, and better understand customer behavior. With an investment in accurate ML models, teams can achieve goals of increasing shopping conversion rates and higher customer satisfaction. In addition, eCommerce companies can increase trust and safety by removing content that violates platform guidelines, from user-generated content to merchant-specific data.
Maximize profitability: ML models can help deliver accurate and targeted product recommendations based on shopping and browsing history and segment customer profiling for more accurate advertising. Teams can better understand the content and product landscape By enriching content metadata with AI. This enables eCommerce companies to focus better on product and content growth efforts and narrow in on trends early.
Accelerate Operational Processes: Shopping and content trends move quickly where manual operational processes are too slow. Accelerate operational processes such as new merchant onboarding, demand forecasting, and content optimization. Techniques such as human-in-the-loop can augment machine learning models for human-level accuracy and quality.
Existing processes without AI do not scale to meet the changing needs of consumers. There are three key challenges that eCommerce marketplaces face:
- The cost and investment are exponential: Using in-house operations teams alone to manage eCommerce data and activate new products can often inhibit growth. Manual operations to source, clean, and enrich data are time-consuming. Generating new product assets, such as product descriptions and product photography, is costly.
- Lack of attribute data: Personalization systems are limited by sparse attribute data. Product data may include incorrect information, duplicates, and missing attributes leading to poor search and product recommendations. Insufficiently detailed content metadata on user behavior leads to content recommendation systems that fall short.
- Manual processes are too slow: Consumer behaviors and content trends move quickly. Current systems require too much time and process to discover and surface trending content, and platforms fall behind on retaining customer engagement and conversion.
In this guide, we’ll explain the main use cases to help solve these challenges and provide a roadmap to help grow your business with AI.
AI in eCommerce: Main Use Cases
There are many different applications for AI in eCommerce. In this guide, we will focus on six main categories for data-centric applications in eCommerce:
- Search, Advertising, and Discovery
- Demand Forecasting and Inventory Management
- Chatbots and Customer Service
- Content Understanding
- Enriched Product Data
- AI-Generated Product Imagery
1. Search, Advertising, and Discovery

Strong customer experience starts with highly personalized recommendations, targeted product offers, and search relevance. There are three main use cases for personalized recommendations with AI:
Search relevance and item discovery: 49% of online purchasers scroll past the first page to look for what they want. Search and item discovery are key components to improving the customer shopping experience, and helping customers find the right product. AI-powered search engines use natural language processing (NLP) to process and understand the query. The search engine then uses the meaning to present best ranking search results. With AI-powered search relevance, eCommerce teams can better understand the true intention behind a search term and surface the most relevant results for a customer.
Ad and offer recommendations: Based on search, browse, add to cart, and purchase history, retailers can deliver targeted advertising and offers. Retailers can use machine learning to capture customer data, synthesize insights, and deliver a personalized shopping experience. Machine learning recommender systems use a recommender function, which takes information about the user including their browsing and purchase history, and predicts the rating the user will assign to a given product. Better enhanced data can aid in brands looking to deliver advertising and offers to customers. Targeted advertising aids in new customer acquisition and helps re-engage customers who may have abandoned their cart.
Product recommendations: For commerce teams who are looking to lift product sales, product recommendations are key to improving ROI. ML models analyze purchasing history and build lookalike customer audiences to deliver personalized product recommendations. For example, ML models can provide recommendations for similar products, products frequently purchased together, or products bought from lookalike audiences. Product recommendations add value to retailers by encouraging repeat purchases and increasing the average order value.
2. Demand Forecasting and Inventory Management

AI applications for supply chain management and logistics can dramatically accelerate processes in the global supply chain. There are three main use cases for supply chain management with AI:
Demand Forecasting: One of the greatest challenges in supply chain management is demand volatility. AI-powered demand forecasting uses machine learning algorithms to predict and recognize changes in consumer demand. ML algorithms use both historical time series data, such as pricing and promotions, and any associated data such as product features and categories to determine relationships in large datasets. This allows eCommerce teams to recognize demand patterns and forecast future demand fluctuations to reduce inventory loss.
Inventory management: Accurate AI-enabled demand forecasting has significant downstream impact on inventory management. Improved forecasting can lead up to 65% reduction in lost sales due to inventory that is out of stock. In addition to creating more accurate inventory, AI can help streamline aspects of warehouse management using Internet of Things (IoT) devices. With IoT, retailers can optimize warehouse operations and shipping processes with real-time inventory control.
Dynamic Pricing: With improved demand forecasting and inventory management, retailers can also set dynamic pricing to increase profit. Dynamic pricing enables teams to shift from traditional, manual static pricing to pricing that changes in real-time. AI algorithms use a combination of historical sales and price data, market demand, external events, and competitor pricing to generate a model based on the input parameters. Dynamic pricing has numerous benefits including better market segmentation, reduction of cost, and maximizing ROI.
3. Chatbots and Customer Service

Customer service is an increasingly important component to keep customers engaged and improve customer sentiment. However, keeping up with high-volume customer requests on multiple channels can be challenging. Similarly, live agents can be costly and increase response time. AI-powered chatbots are an integral part to helping solve these challenges for customer support. AI-driven chatbots are virtual assistants that use natural language processing and conversational AI to help respond to customer inquiries. There are four key ways Chatbots can support customer service:
- Engage and respond to customer inquiries: Chatbots can provide guidance for product related questions and answer frequently asked questions about sizing, product variants, or discounts.
- Boost sales processes: Chatbots can help provide product recommendations and reduce cart abandonment by reminding customers of products they may have left in their cart.
- Offer post-sale support: Chatbots can provide order tracking, returns and exchange processing, and collect customer feedback.
4. Content Understanding
Content forms are ever-evolving and growing as media consumer technologies advance. eCommerce websites contain vast amounts of user-generated content, from merchant and seller data to customer reviews. To keep up with the growth of content, provide the highest quality recommendations, and keep users engaged on your platform, eCommerce teams need robust content understanding systems. Building a strong content understanding system on eCommerce website encompasses three core use cases:
Data enrichment: Enriching content metadata with content categorization and identification is at the core of building a strong recommendation system. By enriching content metadata, teams can use granular information for better content ranking, identification of unclassified content, and targeted personalization. In our previous guide, we explained how data labeling is the process of assigning context to data so machine learning algorithms can achieve the desired result. As much of user-generated content is unstructured, data enrichment is useful for content teams to build richer personalization and recommendation systems.

Content intelligence: A deep understanding of emerging trends and content distribution is critical for demand forecasting and understanding customer behavior. One key application of content intelligence is trend detection. Teams can use machine learning to process and label videos to quickly uncover microtrends as they pop up on a daily basis. Human-in-the-loop techniques are then used to detect trend signals using multi-modal inputs. This enables eCommerce teams to better synthesize and act on trends to discover growth opportunities for the business.
Trust and safety: Harmful content and malicious actors are becoming increasingly pervasive across platforms and communities. Scalable detection is necessary to protect your customers and brand. Automate the detection of harmful or malicious user-generated content with robust AI models. eCommerce teams can reduce manual moderation by utilizing AI models with human-level precision that improve with the scale of content.
5. Enriched Product Data

At the core of eCommerce data is high-quality product catalog data. Accurate product catalog data includes detailed attributes that are displayed on the product detail page (PDP), such as product descriptions, color, material, size, brand, and product taxonomy. There are three main use cases eCommerce companies can invest in for catalog data:
Catalog Creation: Catalog creation is a great starting point for eCommerce teams building new shopping experiences on platforms such as social media. Creation enables teams to aggregate, enrich, and refresh product data from seller feeds and the public internet. Machine learning infrastructure can ingest brands, sellers, or sites and provide all available products and associated attributes. Examples of applications include social commerce, where shopping is natively built into a social media platform. This enables new shopping opportunities for customers on existing digital web applications.
Attribute enrichment: Add attribute data to your existing products to help enhance product taxonomy, rank products by relevance, and produce granular search results. Attributes are extracted from an image and text by employing machine learning models that rely on named entity recognition and image classification techniques. Improving underlying product catalog data is important because incorrect data can lead to poor search results, incorrect product category taxonomy, or inaccurate product recommendations. Because search and recommendation systems are built on accurate product attributes, attribute enrichment is key for product teams looking to improve search and relevance.

Detailed product data such as descriptions, attributes, variants, and interactive media have a compound effect on the revenue generated for eCommerce companies.
Product Matching and deduplication: AI-accelerated human annotation can help remove product duplicates, merge product variants, fix inconsistencies on product detail pages, and correct errors to enable item authority. A matching endpoint takes information about two distinct products and identifies whether or not they are a match and the corresponding model confidence score. Finding product matches can help remove product duplicates from the catalog to produce more accurate results for customers.

eCommerce teams can improve engagement, discoverability, and conversion on product websites with accurate and rich product data.
6. AI-Generated Product Imagery
In our previous guide, we explained how diffusion models have the power to generate any image you can imagine. This has a multitude of applications for marketers and brand managers to generate new product imagery for ad creatives, campaigns, and social media. Research has shown that conversion rates double with the number of images of a product.
Currently, retailers and advertisers are limited by the amount and quality of product photography they have to create an engaging shopping experience for customers. The investment needed for photoshoots, scale of product catalogs, and the diverse tastes of audiences adds to this burden.
To help solve this challenge, teams can use generative AI to create a multitude of high-fidelity product images in different scenes while maintaining brand preservation for retail products.
How to implement AI for eCommerce
- Align on product goals: Identifying a business problem first and tying goals to product performance metrics is crucial to implementing AI for eCommerce. By working closely with product teams, teams working in eCommerce can provide a direct correlation with internal metrics.
- Narrow in on a use case: Focus on a specific use case that solves your business problem and enables revenue generation.
- Choose a workforce: Implementing a full-scale solution for eCommerce requires expertise. Bring in experts to help you build a roadmap to solve your business problems.
- Experiment to get started: Don’t limit experimentation with solutions that only provide immediate ROI. You may not know what experiment will give you an exponential return, so have multiple tests in parallel and review the data to understand the impact.
Conclusion
This guide covered the main use cases and applications for AI in eCommerce. As the retail and eCommerce landscape rapidly evolves, it’s essential to accelerate innovation to meet the needs of customers. In a recent study, 69% of retail executives say their organization’s AI initiatives yield more value. At Scale, we believe investing in data is the key to unlocking success for eCommerce companies. We’re excited to see what companies can create with access to the best AI tools.
Training and Building Machine Learning Models
The Foundational Guide
Contents
Training Models for Machine Learning
As we presented in our previous Authoritative Guide to Data Labeling, machine learning (ML) has revolutionized both state of the art research, and the ability of businesses to solve previously challenging or impossible problems in computer vision and natural language processing. Predictive models, trained on vast amounts of data, now have the ability to learn and detect patterns reliably, all without being specifically programmed to execute those tasks.
More broadly, ML models can predict numerical outcomes like temperature or a mechanical failure, recognize cars or retail products, plan better ways to grasp objects, and generate useful and helpful, salient and logical text, all without human involvement. Want to get started training and building models for your business use case? You’ve come to the right place to learn how model training works, and how you too can start building your own ML models!
What Are Machine Learning (ML) Models?
ML models typically take “high-dimensional” sets of data artifacts as inputs and deliver a classification, a prediction, or some other indicator as an output. These inputs can be text prompts, numerical data streams, images or video, audio, or even three-dimensional point cloud data. The computational process of producing the model output is typically called “inference,” a term adopted from cognitive science. The model is making a “prediction” based on historical patterns.
What distinguishes a ML model from simple heuristics (often conditional statements) or hard-coded feature detectors (yes, face recognition used to depend on detecting a specific configuration of circles and lines!) is a series of “weights,” typically floating point numbers, grouped in “layers,” linked by functions. The system is trained through trial and error, adjusting weights to minimize error (a metric typically referred to as “loss” in the ML world) over time. In nearly all ML models, there are too many of these weights to adjust them manually or selectively; they must be “trained” iteratively and automatically, in order to produce a useful and capable model. Ideally, this model has “learned” on the training examples, and can generalize to new examples it hasn’t seen before in the real world.
Because these weights are iteratively trained, the ML engineer charged with designing the system in most cases can only speculate or hypothesize about the contribution of each individual weight to the final model. Instead, she must tweak and tune the dataset, model architecture, and hyperparameters. In a way, the ML engineer “steers the ship” rather than micromanaging the finest details of the model. The goal after many rounds of training and evaluation (known as “epochs”) is to induce the process to reduce model error or loss (as we mentioned above) closer and closer to zero. Typically when a model “converges,” loss decreases to a global minimum where it often stabilizes. At this point, the model is deemed “as good as it’s going to get,” in the sense that further training is unlikely to yield any performance improvements.
Sometimes it’s possible to detect that a model’s performance metrics have stabilized and engage a technique known as “early stopping.” It doesn’t make sense to spend additional time and compute spend on additional training that doesn’t meaningfully improve the model. At this stage, you can evaluate your model to see if it’s ready for production or not. Real-world user testing is often helpful to determine if you’re “ready” to launch the product that encapsulates your model, or you need to continue tweaking, adding more data, and re-training. In most applications, externalities will cause model failures or drift, requiring a continued process of maintenance and improvement of your model.
Divvying up your data
In order to train a model that can properly “generalize” to data it has never seen before, it’s helpful to train the model on most on 50-90% of available data, while leaving 5-20% out in a “validation” set to tune hyperparameters, and then also save 5-20% to actually test model performance. It’s important not to “taint” or “contaminate” the training set with data the model will later be tested on, because if there’s identical training assets between train and test, the model can “memorize” the result, thereby overfitting on that example, compromising its ability to generalize, which is typically an important attribute of nearly every successful ML model. Some researchers refer to the test set (that the model has never seen before) as the “hold-out” or “held out” set of data. You can think of this as the “final exam” for the model, which the model shouldn’t have seen verbatim before exam day, even if it has seen similar examples in the metaphorical problem sets (to which it has checked against the answer key) during prior training.
What types of data can ML models be trained on?
Tabular data
If you’re simply interested in computer vision, or some of the more sophisticated and recent data types that ML models can tackle, skip ahead to the Computer Vision section. That said, working with tabular data is helpful to understand how we arrived at deep learning and convolutional neural networks for more complex data types. Let’s begin with this simpler data type.
Tabular data typically consists of rows and columns. Columns are often different data types that correspond to each row entry, which might be a timestamp, a person, a transaction, or some other granular entry. Collectively, these columns can serve as “features” that help the model reliably predict an outcome. Or, as a data scientist, you can choose to multiply, subtract, or otherwise combine multiple columns and train the model on those combinations. For tabular data, there are a wide variety of possible models one can apply, to predict a label, a score, or any other (often synthesized) metric based on the inputs. Often, it’s helpful to eliminate columns that are “co-linear,” although some models are designed to deprioritize columns that are effectively redundant in terms of determining a predictive outcome.
Tabular data continues the paradigm of separating training and test data, so that the model doesn’t “memorize” the training data and overfit—regurgitate examples it has seen, but fumble its response to ones it hasn’t. It even enables you to dynamically shift the sections of the table that you’ll use (best practice is to randomize the split, or randomize the table first) such that test, train, and evaluation sets are all in windows that can be slid or swapped across your dataset. This is known as cross-fold validation, or n-folds validation, where n represents the number of times the table is “folded” to divvy up your training and test sets in different portions of the table.

Source: Wikipedia
A final point about tabular data that we’ll revisit in the computer vision section is that data often has to be scaled to a usable range. This might mean scaling a range of values from 1 to 1000000 into a floating point number such that the range is between 0 and 1.0 or -1.0 and 1.0. Machine Learning engineers often need to experiment with different types of scaling (perhaps logarithmic is also useful for some datasets) in order for the model to reach its most robust state—generating the most accurate predictions possible.
Text
No discussion of machine learning would be complete without discussing text. Large language models have stolen the show in recent years, and they generally function to serve two roles:
- Translate languages from one to another (even a language with only minimal examples on the internet)
- Predict the next section of text—this might be a synthesized verse of Rumi or Shakespeare, a typical response to a common IT support request, or even an impressively cogent response to a specific question
In addition to deep and large models, there are a number of other techniques that can be applied to text, often in conjunction with large language models, including unsupervised techniques like clustering, principal components analysis (PCA), and latent dirichlet allocation (LDA). Since these aren’t technically “deep learning” or “supervised learning” approaches, feel free to explore them on your own! They may prove useful in conjunction with a model trained on labeled text data.
For any textual modeling approach, it’s also important to consider “tokenization.” This involves splitting words in the training text into useful chunks, which might be individual words, full sentences, stems, or even syllables. Although it’s now quite old, the Python-based Natural Language Toolkit (NLTK) includes Treebank and Snowball tokenizers, which have become industry standard. SpaCy and Gensim also include more modern tokenizers, and even PyTorch, a cutting-edge, actively developed Python ML library includes its own tokenizers.
But back to large language models: it’s typically helpful to train language models on very large corpuses. Generally speaking, since text data requires much less storage than high resolution imagery, you can either train models on massive text datasets (generally known as “corpuses,” or if you’re a morbid scholar of Latin, maybe you can write corpora) such as the entire Shakespeare canon, every Wikipedia article ever written, every line of public code on GitHub, every public-domain book available on Project Gutenberg, or if you’d rather writing a scraping tool, as much of the textual internet as you can save on a storage device quickly accessible by the system on which you plan to train your model.
Large language models (LLMs) can be “re-trained” or “fine tuned” on text data specific to your use case. These might be common questions and high quality responses paired with each question, or simply a large set of text common to the same company or author, from which the model can predict the next n words in the document. (In Natural Language, every body of text is considered a “document”!) Similarly, translation models can start with a previously trained, generic LLM and be fine-tuned to support the input-output use case that translation from one language to another requires.
Images
Images are one of the earliest data types on which researchers trained truly “deep” neural networks, both in the past decade, and in the 1990s. Compared to tabular data, and in some cases even audio, uncompressed images take up a lot of storage. Image size not only scales with the width and height (in pixels) but also in color depth. (For example, do you want to store color information or only brightness values? In how many bits? And how many channels?) Handwritten digit detection is one of the simpler images to detect as it requires only comparing binary pixel values at relatively low resolution. In the 1990s, most neural networks were laboriously trained on large sequential computers to adjust the model weights so that handwriting recognition (in the form of Pitney Bowes’ postal code detector in conjunction with Yann LeCun’s LeNet network), and MNIST’s DIGITS dataset is still considered a useful baseline computer vision dataset for proving a minimal baseline of viability for a new model architecture.
More broadly speaking, images include digital photographs as most folks know and love them, captured on digital cameras or smartphones. Images typically include 3 channels, one each for red, blue, and green. Brightness is encoded usually in the form of 8- or 10-bit sequences for each color channel. Some ML model layers will simply look at brightness (in the form of a grayscale image) while others may learn their “features” from specific color channels while ignoring patterns in other channels.
Many models are trained on CIFAR (smaller, 10 or 24 classes of labels), ImageNet (larger, 1000 label classes), or Microsoft’s COCO dataset (very large, includes per-pixel labels). These models, once reaching a reasonable level of accuracy, can be re-trained or fine-tuned on data specific to a use case: for example, breeds of dogs and cats, or more practically, vehicle types.
Video
Video is simply the combination of audio and a sequence of images. Individual frames can be used as training data, or even the “delta” or difference between one frame and the next. (Sometimes this is expressed as a series of motion vectors representing a lower-resolution overlay on top of the image.) Generally speaking, individual video frames can be processed with a model just like an individual image frame, with the only difference that adjacent frames can leverage the fact that there might be overlap between a detected object in one frame and its (sometimes) nearby location in the next frame. Contextual clues can assist per-frame computer vision, including speech recognition or sound detection from the paired audio track.
Audio
Sound waves, digitized in binary representation, are useful not just for playback and mixing, but also for speech recognition and sentiment analysis. Audio files are often compressed, perhaps with OGG, AAC, or MP3 codecs, but they typically all decompress to 8, 16, or 24 bit amplitude values, with sample rates anywhere between 8 kHz and 192 kHz (typically at multiples-of-2 increments). Voice recordings generally require less quality or bit-depth to capture, even for accurate recognition, and (relatively) convincing synthesis. While traditional speech to text services used Hidden Markov Models (HMMs), long short-term memory networks (LSTMs) have since stolen the spotlight for speech recognition. They typically power voice-based assistants you might use or be familiar with, such as Alexa, Google Assistant, Siri, and Cortana. Training text to speech and speech to text models typically takes much more compute than training computer vision models, though there is work ongoing to reduce barriers to entry for these applications. As with many other use cases, transformers have proven valuable in increasing the accuracy and noise-robustness of speech-to-text applications. While digital assistants like Siri, Alexa, and Google Assistant demonstrate this progress, OpenAI’s Whisper also demonstrates the extent to which these algorithms are robust to mumbling, background interference, and other obstacles. Whisper is unique in that it can be called via API rather than used in a consumer-oriented end product.
3D Point Clouds
Point clouds encode the position of any number of points, in three-dimensional Cartesian (or perhaps at the sensor output, polar) space. What might at first look like just a jumble of dots on screen typically can be “spun” on an arbitrary axis with the user’s mouse, revealing that the arrangement of points is actually three-dimensional. Often, this data is captured by laser range-finders that spin radially, mounted on the roof of a moving vehicle. Sometimes a “structured light” infrared camera or “time-of-flight” camera can capture similar data (at higher point density, usually) for indoor scenes. You may be familiar with similar cameras thanks to the Wii game console remote controller (Wiimote), or the Xbox Kinect camera. In both scenarios, sufficiently detailed “point clouds” can be captured to perform object recognition on the object in frame for the camera or sensor.
Multimodal Data
It turns out the unreasonable effectiveness of deep learning doesn’t mandate that you only train on one data type at a time. In fact, it’s possible to ingest and train your model on images and audio to produce text, or even train on a multi-camera input in which all cameras capturing an object at a single point in time are combined into a single frame. While this might be confusing as training data to a human, certain deep neural networks perform well on mixed input types, often treated as one large (or “long”) input vector. Similarly, the model can be trained on pairs of mismatched data types, such as text prompts and image outputs.
What are some common classes or types of ML models?
Support Vector Machines (SVMs)
SVMs are one of the more elementary forms of machine learning. They are typically used as classifiers (yes, it’s OK to think of “hot dog” versus “not hot dog,” or preferably cat versus dog as labels here), and while no longer the state of the art for computer vision, they are still useful for handling “non-linear” forms of classification, or classification that can’t be handled with a simple linear or logistic regression. (think nearest neighbor, slope finding on a chart of points, etc.) You can learn more about SVMs on their documentation page at the scikit-learn website. We’ll continue to use scikit-learn as a reference for non-state-of-the-art models, because their documentation and examples are arguably the most robust available. And scikit-learn (discussed in greater depth below) is a great tool for managing your dataset in Python, and then proving that simpler, cheaper-to-train, or computationally less complex models aren’t suitable for your use case. (You can also use these simpler models as baselines against which you can compare your deep neural nets trained on Scale infrastructure!)
Random Forest Classifiers
Random Forest Classifiers have an amazing knack of finding just the right answer, whether you’re trying to model tabular data with lots of collinearities or you need a solution that’s not computationally complex. The “forest” of trees is dictated by how different buckets of your data impact the output result. Random Forest Classifiers can find non-linear boundaries between adjacent classes in a cluster map, but they typically don’t track the boundary perfectly. You can learn more about Random Forest Classifiers, again, over at scikit-learn’s documentation site.

Source: Wikipedia
Gradient-Boosted Trees
Gradient-boosting is another technique/modification applied to decision trees. These models can also handle co-linearities, as well as a number of hyperparameters that can limit overfitting. (Memorizing the training set so that training occurs with high accuracy, but does not extend to the held-out test set.) XGBoost is the framework that took Kaggle by storm, and LightGBM and Catboost also have a strong following for models of this class. Finally, this level of model complexity begins to depart from some of the models in scikit-learn that derive from simpler regressions, as hyperparameter count increases. (Basically there are now more ways in which you can tune your model, increasing complexity.) You can read all about how XGBoost works here. While there are some techniques to attribute model outputs to specific columns or “features” on the input side, with Shapley values perhaps, XGBoost models certainly demonstrate that not every ML model is truly “explainable.” As you might guess, explainability is challenged by model complexity, so as we dive deeper into complex neural networks, you can begin to think of them more as “black boxes.” Not every step of your model will necessarily be explainable, nor will it necessarily be useful to hypothesize why the values of each layer in your model ended up the way they did, after training has completed.
Feedforward Neural Networks
These are the simplest form of neural networks in that they accept a fixed input length object and provide classification as an output. If the model extends beyond an input layer and an output layer (each with its own set of weights), it might have intermediate layers that are termed “hidden.” The feedforward aspect means that there are no backwards links to earlier layers: each node in the graph is connected in a forward fashion from input to output.
Recurrent Neural Networks
The next evolution in neural networks was to make them “recurrent.” That means that nodes closer to the output can conditionally link back to nodes in earlier layers—those closer to the input in the inference or classification pipeline. This back-linking means that some neural networks can be “unrolled” into simpler feedforward networks, whereas some connections mean they cannot. The variable complexity resulting from the classification taking loops or not means that inference time can vary from one classification run to the next, but these models can perform inference on inputs of varying lengths. (Thus not requiring the “fitting” mentioned above in the tabular data section.)
Convolutional Neural Networks
Convolutional Neural Nets have now been practically useful for roughly 10 years, including on higher resolution color imagery, in large part thanks to graphics processors (GPUs). There’s much to discuss with CNNs, so let’s start with the name. Neural nets imply a graph architecture that mimics the interconnected neurons of the brain. While organic neurons are very much analog devices, they do send signals with electricity much like silicon-based computers. The human vision system is massively parallel, so it makes sense that a parallel computing architecture like a GPU is properly suited to compute the training updates for a vision model, and even perform inference (detection or classification, for example) with the trained model. Finally, the word “convolution” refers to a mathematical technique that includes both multiplication and addition, which we’ll describe in greater detail in a later section on model layers.
Long Short-Term Memory Networks (LSTMs)
If you’ve been following along thus far, you might notice that most models you’ve encountered up until this point have no notion of “memory” from one classification to the next. Every “inference” performed at runtime is entirely dependent on the model, with no influence from any other inference that came before it. That’s where “Long Short Term Networks” or LSTMs come in. These networks have a “gate” at each node in the network that allows the weights to remain unchanged. This can sometimes mitigate the “vanishing gradient” problem in RNNs, in which all weights in a layer might “vanish” to 0 if a change of the weight is mandatory in every epoch (a single step-wise update of all of the weights in the model based on the output loss function) of the training run. Because weights can persist for many epochs, this is similar to physiological “short term memory,” encoded in the synaptic connections in the human brain: some are weakened or strengthened or left alone as time passes. Let’s turn back to applications from theory, though: some earlier influential LSTMs became renowned for their ability to detect cat faces in large YouTube-derived datasets like YouTube-8M. Eventually the model could operate in reverse, recalling the rough outline of a cat face, given the “cat” label as an input.
Q-Learning
Q-Learning is a “model-free” approach to learning, using reinforcement with an “objective function” to guide updates to the layers in the network. DeepMind instituted this process in their famous competition against world champion Lee Sedol at the game of Go. Q-Learning has since shown to be incredibly successful at learning other historically significant Atari Games, as well as RPG strategy games like WarCraft and StarCraft.

—OpenAI and MuJoCo, from OpenAI Gym
Word2Vec
Thus far, we haven’t spent much time or words on text models, so it’s time to begin with Word2Vec. Word2Vec is a series of models that matches word pairs with cosine similarity scores, so that they can be productively mapped in vector space. Word2Vec can produce a distributed representation of words or a continuous “bag-of-words,” or a continuous “skip-gram.” CBOW is faster, while “skip-gram” seems to handle infrequent words better. You can learn more about word2vec in the documentation for the industry-standard gensim Python package. If you’re looking to synthesize biological sequences like DNA, RNA, or even proteins, word2vec also proves useful in these scenarios: it handles sequences of biological and molecular data in the same way it does words.


Transformers
After LSTMs and RNNs reigned as the state of the art for natural language processing for several years, in 2017, a group of researchers at Google Brain formulated a set of multi-head attention layers that began to perform unreasonably well on translation workloads. These “attention units” typically consist of a scaled dot product. The only drawback to this architecture was that training on large datasets and verifying performance on long input strings during training was both computationally intensive and time-consuming. Attention(Q,K,V), or attention as a function of the matrices Q, K, and V, is equivalent to the softmax(QKT/sqrt(dk))*V. Modifications to this approach are typically focused on reducing the computational complexity from O(N2) to O(N ln N) with Reformers, or to O(N) with ETC or BigBird, where N is the input sequence length. The larger, in the case of BERT, “teacher” model is typically a product of self-supervised learning, starting with unsupervised pre-training on large Internet corpuses, followed by supervised fine-tuning. Common tasks that transformers can reliably perform include:
- Paraphrasing
- Question Answering
- Reading Comprehension
- Sentiment Analysis
- Next Sentence Prediction/Synthesis
As the authors of this model class named their 2017 research paper, “Attention Is All You Need.” This title was prescient, as transformers are now a lower-case category, and have influenced vision systems as well (known as Vision Transformers, or ViT for short), CLIP, DALL-E, GPT, BLOOM, and other highly influential models. Next, we’ll jump into a series of specific and canonically influential models—you’ll find the transformer-based models at the end of the list that follows.
What are some commonly used models?
AlexNet (historical)
AlexNet was the model that demonstrated that compute power and convolutional neural nets could scale to classify as many as 1000 different classes in Stanford’s canonical ImageNet dataset (and corresponding image classification challenge). The model consisted of a series of activation layers, hidden layers, ReLUs, and some “pooling” layers, all of which we’ll describe in a later section. It was the first widely reproduced model to be trained on graphics processors (GPUs), two NVIDIA GTX 580s, to be specific. Nearly every successor model past this point was also trained on GPUs. AlexNet won the 2012 N(eur)IPS ImageNet challenge, and it would become the inspiration for many successor state-of-the-art networks like ResNet, RetinaNet, and EfficientDet. Whereas predecessor neural networks such as Yann LeCun’s LeNet could perform fairly reliable classification into 10 classes (mapping to the 10 decimal digits), AlexNet could classify images into any of 1000 different classes, complete with confidence scores.
Original paper here and Papers with Code.
ResNet
Residual Networks—ResNet for short—encompass a series of image classifiers of varying depth. (Depth, here, roughly scaling with classification accuracy and also compute time.) Often, while training networks like AlexNet, the model won’t converge reliably. This “exploding/vanishing” (“vanishing” was explained above, while “exploding” means the floating point value rapidly increases to the maximum range of the data type) gradient problem becomes more challenging as the ML engineer adds more layers or blocks to the model. Compared to previous “VGG” nets of comparable accuracy, designating certain layers as residual functions drastically reduces complexity, enabling models up to 152 layers deep that still have reasonable performance in terms of inference time and memory usage. In 2015, ResNet set the standard by winning 1st place in ImageNet and COCO competitions, for detection, localization, and segmentation.
Original paper here on arXiv and Papers with Code.
Single Shot MultiBox Detector (SSD)
Single Shot MultiBox Detector (SSD) was published at NeurIPS in 2015 by Wei Li, a member of Facebook AI Research. Written in Caffe, it provided an efficient way for neural networks to also identify bounding boxes for objects, and dynamically update the object detection bounding boxes dynamically. While AlexNet proved the value of image classification to the market, SSD paved the way for increasingly accurate forms of object detection, even at high frame rates, as can be found on a webcam (60 frames per second or “FPS”), autonomous vehicle, or security camera (often lower, perhaps 24 FPS).
Original paper here on arXiv and Papers with Code.
Faster R-CNN and Mask R-CNN
It’s helpful to have bounding boxes (rectangles, usually) to identify objects in an image, but sometimes even greater levels of detail can be useful. Happily, it’s possible to train models on datasets that include images and matching per-pixel labels. Faster R-CNN (published by Matterport, a residential 3D scanning company) and its spiritual successor, Mask R-CNN are notable and iterative models that generate robust pixel-wise predictions in a relatively short amount of time. These models can be particularly useful for robotics applications such as picking objects out of boxes, or re-arranging objects in a scene. Mask R-CNN was published under the name “Detectron” on GitHub by Facebook AI Research. We’ll cover its successor, Detectron2, below.
Original papers here and here on arXiv and here and here on Papers with Code.
You Only Look Once (YOLO) and YOLOv3
Along the lines of SSD, mentioned above, Joseph Redmon at University of Washington decided to eschew the then-burgeoning TensorFlow framework in favor of hand-coding another “single shot” object detector with the goal of making it run extremely fast on C and CUDA (GPU programming language) code alone. His model architecture lives on in the form of Ultralytics, a business organized around deploying YOLOv5, now in PyTorch, models (currently) to customers. YOLO is an architecture that has stood the test of time and is somewhat relevant today, pushing the barriers of high-quality and high-speed object detection.
Original papers here and here on arXiv and here and here Papers with Code.
Inception v3 (2015), RetinaNet (2017) and EfficientDet (2020)
In the decade that has passed since Alex Krizhevsky published AlexNet at the University of Toronto, every few years a new model would win in the annual ImageNet and MSCOCO challenges. Today, it may seem as though high-accuracy, high-speed object detection is a “solved” problem, but of course there is always room to discover smaller, simpler models that might perform better on speed, quality, or some other metric. There have been some steps forward in the state-of-the-art for objective detection based on “neural architecture search,” or using a model to select different configurations and sizes and types of layers. Yet, today’s best models learn from those experiments but no longer explicitly search for better model configurations with an independent ML model.
Original papers for Inception v3, RetinaNet, and EfficientDet are available on arXiv. You can find Inception v3, RetinaNet, and EfficientDet on Papers with Code as well.

Mingxing Tan et al., 2020, Comparative Performance of Several Models
U-Net
Deep neural networks found a very effective use case in the field of diabetic retinopathy (eye scans to diagnose diabetes), and thus a number of research efforts sprung up to pursue semantic segmentation models for other forms of radiology, including medical scans like Computerized Tomography (CT scans). U-Net, also originally implemented in Caffe, used a somewhat novel mirror-image architecture with high dimensional inputs and outputs corresponding to the per-pixel labels, for both cell detection and anomaly labeling. U-Net is still relevant today, as it has influenced more modern models like Detectron2 and even Stable Diffusion for both semantic segmentation and image generation.
Original paper here on Springer and Papers with Code.
Detectron2
In 2019, Facebook Research moved the bar up another notch for both semantic segmentation and object detection with the same paper, Panoptic-DeepLab, freely available in the Detectron2 repository on GitHub. Today, Detectron2 serves as a state-of-the-art low latency, high accuracy gold standard for semantic segmentation and object detection. It seems that every couple years or so the state of the art is solidly surpassed, sometimes at the expense of compute resources, but soon enough these models are compressed or simplified to aid in deployment and training. Perhaps in another couple years, compute and memory requirements will be drastically reduced, or an indisputably higher quality benchmark will be established.
Original paper here on arXiv and Papers with Code.
GPT-3
Switching over to Natural Language use cases, the industry focused much on translation, with popular services like Google Translate developing markedly better performance thanks to switching from word-by-word translations through dictionary lookups to using transformer models to predict the next word of the translated output, using the entire source phrase as an input. Outside of Google, the next watershed moment for Natural Language computation came with OpenAI’s GPT. Trained on Common Crawl, Wikipedia, Books and web text-based datasets, the final trained model size includes 175 billion parameters, a decisively large model for its time, and still an impressive size as of writing. Just as BERT (a predecessor transformer-based language model) inspired more efficient, smaller derivatives before it, since GPT-3’s release in 2020, the smaller InstructGPT model, released in 2022, has shown a greater ability to respond to directive prompts instructing the model to do something specific, all while including fewer weights. One important ingredient in this mix is RLHF, or Reinforcement Learning from Human Feedback, which seems to enable enhanced performance despite a smaller model size. A modified version of GPT-3 is trained on public Python code available in GitHub, known as Codex, and is deployed as “GitHub Copilot” in Microsoft’s open source (VS) Code integrated development environment (IDE).
Original paper here on arXiv and GitHub.
BLOOM
BLOOM is a new model intended to serve a similar purpose to GPT-3, as a large language model for text generation, except the entire model and training dataset are all open-source and freely accessible, unlike GPT-3, which is only shared by OpenAI and Microsoft as a paid API. The trained BLOOM model (full size) is 176 billion parameters, meaning to train it from scratch required significant compute expense, so the model was trained on the IDRIS supercomputer near Paris, as part of a BigScience workshop organized in conjunction with government and academic research entities in France. BLOOM is arguably more accessible for model fine-tuning for specific applications because the training data and model code are all fully open source.
Code and context available on the HuggingFace blog.
The following section includes references to paper covered in our guide to Diffusion Models.
DALL·E 2
Released in 2022, DALL·E 2 impressed the industry with its uncanny ability to generate plausible, aesthetically pleasing imagery from text prompts. Similar to GPT-3 and running LSTM classification inference in reverse, DALL·E 2 is trained on a large dataset of imagery and its corresponding description text, scraped from the internet. Faces are removed, and a combination of models and human feedback are used to remove explicit or unsafe content, in order to decrease the likelihood that the model generates unsafe outputs. DALL·E 2 is available as an API, and is used in Microsoft’s Designer and Image Creator apps.
The DALL·E 2 paper is available on the OpenAI website. Unfortunately no model code is available at this time.
Stable Diffusion
In contrast to OpenAI’s approach with DALL·E 2, a group of startup founders including Stability.ai’s Emad Mostaque and the members of RunwayML decided to publish a functionally equivalent model and dataset to OpenAI’s API. Termed Stable Diffusion, it builds on a technique developed a year earlier at LMU in Munich, using a series of noising and denoising steps to generate images from text prompts, or from a source image. It is also possible to use a technique called “outpainting,” to ask the model to expand the canvas in a specified direction, running the model additional times to synthesize tiles that expand the final image. The model architecture embeds U-net as a central component, and can be run on consumer graphics cards. (More than 10GB VRAM is preferred.) Unlike DALL·E 2, since Stable Diffusion can be run locally on the user’s workstation, harmful text prompts are not filtered. Thus it is mainly the responsibility of the user to only use the model for productive rather than harmful purposes.
Code and context available on the HuggingFace model card page.
What are some common layer types in ML models?
Input Image (or other input matrix p × q × r)
Typically images are scaled or padded to fixed input dimensions, 256 × 256 pixels, for example. Modern networks can learn from color values as well as brightness, so models can focus on learning features from color data as well as brightness. Color cues might be essential to classify one bird species from another, or classify one type of manufacturing defect from another. If an input image to a model at inference time is too large to scale down to 256 × 256 without losing significant information, the image can be iteratively scanned, one row at a time, with a sliding classification box that strobes the image from left to right, top to bottom. The distance between one scanning window’s location and the next one is known as the model’s “stride.” Meanwhile, on the training side, much research has been devoted to identifying the best ways to crop, scale, and pad images with “blank” or “zero” pixels such that the model remains robust at inference time.
Convolutional Layer
A convolution is a signal processing technique and mathematical function that takes two signals as inputs and outputs the integral of the product of those two functions. The second input function is flipped across the y-axis and moved along the baseline of the function to calculate the output signal over time. Convolutions are a common function for switching a signal between the time domain and frequency domain, or vice versa.
Sigmoid
With a linear activation function, a neural network can only learn a linear boundary between one class and the other. Many real-world problems have non-linear boundaries between the classes, and thus they need a non-linear activation function.
ReLU (Rectification Linear Unit)
A ReLU layer is computationally much simpler than a sigmoid as it simply sets all negative number inputs to zero and scales all positive inputs linearly. It turns out that ReLU is a fast, functional replacement for the much more expensive Sigmoid function to connect layers.
Pooling layer
This layer down-samples features found in various regions of the image, making the output “invariant” to minor translations of objects or features found in the image. While it is possible to adjust the “stride” of a convolution (the distance between the centers of successive convolution “windows”), it is more common to use a pooling layer to “summarize” the features discovered in each region of an image.
Softmax
This is typically the last activation function of a neural network, used to normalize the output to a probability distribution over predicted output classes. It basically translates the intermediate values of the neural network back into the intended output labels or values. In a way, the softmax function is the reverse of any input-side transformation the training and test data may undergo (downsampling to 256×256 resolution, or conversion from an integer to a floating point number between 0 and 1, for example).
ML Modeling Frameworks/Libraries
Scikit-learn
Scikit-learn is a Swiss Army knife of frameworks in that it supports everything from linear regression to convolutional neural networks. ML engineers typically don’t use it for bleeding edge research or production systems, except for maybe its data loaders. That said, it is incredibly useful for establishing utilities for splitting training from evaluation and test data, and loading a number of base datasets. It also can be used for baseline (simpler) models, such as Support Vector Machines (SVMs) and Histogram of Gradients (HOGs) to use as sanity checks, comparing algorithmically and computationally simpler models to their more sophisticated, modern counterparts.
XGBoost
XGBoost is a framework that allows the simple training of gradient-boosted trees, which has also proven its value across a wide range of tabular use cases on Kaggle, an online data science competition platform. In many scenarios, before training an LSTM to make a prediction, it’s often helpful to try an Xgboost model first, to disprove the hypothesis that a simpler model might possibly do the job equally well, if not better.
Caffe (historical)
Yangqing Jia wrote the Caffe framework while he was a PhD student at UC Berkeley with the intent of giving developers access to machine learning primitives that he wrote in C++ and CUDA (GPU computation language), with a Python interface. The framework was primarily useful for image classification and image segmentation with CNNs, RCNNs, LSTMS, and fully connected networks. After hiring Jia, Facebook announced Caffe2, supporting RNNs for the first time, and then soon merged the codebase into PyTorch (covered two sections later).
TensorFlow
Originally launched in 2015, TensorFlow quickly became the standard machine learning framework for training and deploying complex ML systems, including with GPU support for both training and inference. While TensorFlow is still in use in many production systems, it has seen decreased popularity in the research community. In 2019, Google launched TensorFlow 2.0, merging Keras’ high-level capabilities into the framework. TensorBoard, a training and inference stats and visualization tool also saw preliminary support for PyTorch, a rival, more research-oriented framework.
PyTorch
Roughly a year after TensorFlow’s release, Facebook (now Meta) released PyTorch, a port of NYU’s Torch library from the relatively obscure Lua programming language over to Python, hoping for mainstream adoption. Since its release, PyTorch has seen consistent growth in the research community. As a baseline, it supports scientific computing in a manner similar to NumPy, but accelerated on GPUs. Generally speaking, it offers more flexibility to update a model’s graph during training, or even inference. For some researchers the ability to perform a graph update is invaluable. The PyTorch team (in conjunction with FAIR, the Facebook AI Research group) also launched ONNX with Microsoft, an open source model definition standard, so that models could be ported easily from one framework to another. In September 2022, Meta (formerly Facebook) announced that PyTorch development would be orchestrated by its own newly created PyTorch Foundation, a subsidiary of the Linux Foundation.
Keras (non-TensorFlow support is now only historical)
In 2015, around the same time as TensorFlow’s release, ML researcher François Chollet released Keras, in order to provide an “opinionated” library that helps novices get started with just a few lines of code, and experts deploy best-practices prefab model elements without having to concern themselves with the details. Initially, Keras supported multiple back-ends in the form of TensorFlow, Theano, and Microsoft’s Cognitive Toolkit (CNTK). After Chollet joined Google a few years later, Keras was integrated into the version 2.0 release of TensorFlow, although its primitives can still be called via its standalone Python library. Chollet still maintains the library with the goal of automating as many “standard” parts of ML training and serving as possible.
Chainer (historical)
Initially released in 2015 by the team at Japan’s Preferred Networks, Chainer was a Python-native framework especially useful in robotics. While TensorFlow had previously set the industry standard, using the “define then run” modality, Chainer pioneered the “define-by-run” approach, which meant that every run of a model could redefine its architecture, rather than relying on a separate static model definition. This approach is similar to that of PyTorch, and ultimately Chainer was re-written in PyTorch to adapt to a framework that was independently growing in popularity and matched some of Chainer’s founding design principles.
YOLOv5
Although we have already addressed YOLO above as a model class, there have been significant updates to this class, with YOLOv5 serving as Ultralytics’ PyTorch rewrite of the classic model. Ultimately, the rewrite uses 25% of the memory of the original and is designed for production applications. YOLO’s continued support and development is a testament to the efficiency of its original design, and successive iterations of YOLO continue to provide competitive, high-performance semantic segmentation and object detection, even though the original model consciously chose to avoid using industry-standard frameworks like TensorFlow and PyTorch.
HuggingFace Transformers
The world of machine learning frameworks and libraries is no stranger to higher-level abstractions, with Keras serving as the original example of this approach. A wider ranging, and more recent attempt at building a higher level abstraction library for dataset ingest, training, and inference, Transformers provides easy few-line access in Python to numerous APIs in Natural Language, Computer Vision, audio, and multimodal models. Transformers also offers cross-compatibility with PyTorch, TensorFlow, and Jax. It is accessible through a number of sample Colab notebooks as well as on HuggingFace’s own Spaces compute platform.
Choosing Model Metrics
When choosing metrics for training your model, you might start with loss, or default to accuracy, but it’s also important to make sure that your metrics are business-aligned. This might mean that you care more about certain classification or prediction scenarios than others. In effect, you will want to use your model to match your business priorities, rather than distract or detract from them.
Minimizing loss
As you train the model and use a function to continuously update the weights in various layers, the goal is always to “minimize loss” in the form of error. Thus, if loss is continuously decreasing, your model is converging towards a state in which it might make useful classifications or predictions. Loss is the metric the model attempts to optimize as it iteratively updates the individual weights—the end user of a model might care more about other, derivative metrics like accuracy, precision, and recall, which we’ll discuss next.
Maximizing accuracy
Every few “epochs” or so of updating the weights in the layers of your neural network, you’ll evaluate the model’s accuracy to ensure that it’s improved rather than degraded. Typically accuracy is evaluated on the test set and the training set to make sure that performance isn’t maximized on one but minimized on the other (this is known as “overfitting” where the model performs well on training data, but can’t generalize to the test data it’s never seen before).
Precision and recall
Precision defines what proportion of positive identifications were correct and is calculated as follows:
Precision = True Positives / True Positives + False Positives
- A model that produces no false positives has a recall of 1.0
Recall defines what proportion of actual positives were identified correctly by the model, and is calculated as follows:
Recall = True Positives / True Positives + False Negatives
- A model that produces no false negatives has a recall of 1.0
F-1 or F-n Score
F-1 is the harmonic mean of Precision and Recall, so it’s a metric that factors both scores as inputs. Perfect precision and recall yields a score of 1.0, while F-1 is 0 if either precision or recall is 0. Adopting a different value of n means you can bias the harmonic mean to favor either precision or recall depending on your needs. For example, do you care more about reducing false positives or false negatives? Safety-critical scenarios will often dictate that you can tolerate false negatives or false positives but not vice versa.
IoU (Intersection-over-Union)
This metric is the percent overlap between ground truth (data annotations) and the model’s prediction. This typically applies for 2D and 3D object detection problems, and is particularly relevant for autonomous vehicles. You can think of two intersecting rectangles, where the intersection represents your success target: the bigger the overlap, the better the model. IoU is the quotient (the result from division) when you divide the overlap over the total area of the prediction and ground truth bounding rectangles. This concept can be projected to three dimensions in which the target is the overlap of two rectilinear prisms, and the IoU is the quotient of this overlapping prism over the sum of the two intersection prisms (again, ground truth and prediction).
Best Practices and Considerations for Training Models
Time required
Some simpler model architectures and smaller datasets are quicker to train on, while more complex model architectures, higher input resolutions, and larger datasets are typically slower to train on—they take more time to achieve a performant model. While training on larger and larger datasets is helpful to improve model accuracy (datasets on the order of magnitude of 1,000 or even 10,000 samples are a great place to start), at a certain point, if additional data samples contain no new information, they aren’t apt to improve model performance, particularly on rare or “edge” cases.
Model size
A secondary constraint can be the total sum of weights in the model. The more weights, the more (GPU) memory the model consumes as it trains. There’s greater complexity when you train a model that spans multiple memory banks, whether that’s across one system to the next, or across GPUs in the same system.
Dataset size
A minimum of 1000 images or 1000 rows of tabular data is typically required to build a meaningful model. It may be possible to train on fewer examples, but with small datasets you are likely to encounter problems with your model like overfitting. Ideally your use case has the largest order-of-magnitude dataset size available to you, and if it does not, you may consider approaches like synthetic augmentation or simulation.
Compute requirements
Typically, training models is both compute-intensive and time-consuming. Some models are suited to distributed training (particularly if the graph can be frozen and copied across multiple machines). Logistic regressions are comparatively easy and quick to run, as well as SVMs, whereas larger models can take hours or even days to train. Even rudimentary algorithms like K-means clustering can be computationally intensive, and can see vast speedups from training and inference on GPUs or even dedicated AI hardware like TPUs.
Spectrum: easy to hard
- If there’s a simpler tool for the job, it makes sense to start there. Simpler models include logistic regression and support vector machines (SVMs).
- That said, the boundaries between one class and another won’t always be smooth or “differentiable” or “continuous.” So sometimes more sophisticated tools are necessary.
- Random Forest Classifiers and Gradient Boosted Machines can typically handle easier vision classification challenges like the Iris dataset.
- Larger datasets and larger models typically require longer training times. So when more sophisticated models are required to achieve high accuracy, often it will take more time, more fine tuning, and a larger compute/storage budget to train the next model.
Conclusion
Training ML models shouldn’t be a daunting proposition given the cornucopia of tools available today. It’s never been easier to get started.That said, don’t let ML be a solution looking for a problem. Prove that simpler approaches aren’t more effective before you land on a CNN, DNN, or LSTM as the ideal solution. Training models can feel like an art, but really it’s best to approach it as a science. Hyperparameters and quirks in your dataset may dictate outcomes, but dig in deeper: eventually you can prove or disprove hypotheses as to what’s causing better or worse outcomes. It’s important to be systematic and methodical: it’s best to keep an engineer’s log or notebook. Tracking hyperparameters in GitHub is largely insufficient; you should ruthlessly test your models for regressions, track aggregate and specific performance metrics, and use tooling where possible to take the tediousness out of writing scripts.
Additional Resources
Product Pages:
- Scale Launch and ML Infrastructure
- Scale InstantML
- Scale Nucleus
Use Cases:
Blog Posts:
Diffusion Models: A Practical Guide
Diffusion models have the power to generate any image that you can imagine. This is the guide you need to ensure you can use them to your advantage whether you are a creative artist, software developer, or business executive.
Contents
Introduction
With the Release of Dall-E 2, Google’s Imagen, Stable Diffusion, and Midjourney, diffusion models have taken the world by storm, inspiring creativity and pushing the boundaries of machine learning.
These models can generate a near-infinite variety of images from text prompts, including the photo-realistic, the fantastical, the futuristic, and of course the adorable.

These capabilities redefine what it means for humanity to interact with silicon, giving us superpowers to generate almost any image that we can imagine. Even with their advanced capabilities, diffusion models do have limitations which we will cover later in the guide. But as these models are continuously improved or the next generative paradigm takes over, they will enable humanity to create images, videos, and other immersive experiences with simply a thought.
In this guide, we explore diffusion models, how they work, their practical applications, and what the future may have in store.
What are Diffusion Models?
Generative models are a class of machine learning models that can generate new data based on training data. Other generative models include Generative adversarial networks (GANs), Variational Autoencoders (VAEs), and Flow-based models. Each can produce high-quality images, but they all have limitations that make them inferior to diffusion models.
At a high level, Diffusion models work by destroying training data by adding noise and then learn to recover the data by reversing this noising process. In Other words, Diffusion models can generate coherent images from noise.
Diffusion models train by adding noise to images, which the model then learns how to remove. The model then applies this denoising process to random seeds to generate realistic images.


Combined with text-to-image guidance, these models can be used to create a near-infinite variety of images from text alone by conditioning the image generation process. Inputs from embeddings like CLIP can guide the seeds to provide powerful text-to-image capabilities.
Diffusion models can complete various tasks, including image generation, image denoising, inpainting, outpainting, and bit diffusion.
Popular diffusion models include Open AI’s Dall-E 2, Google’s Imagen, and Stability AI's Stable Diffusion.
- Dall-E 2: Dall-E 2 revealed in April 2022, generated even more realistic images at higher resolutions than the original Dall-E. As of September 28, 2022 Dall-E 2 is open to the public on the OpenAI website, with a limited number of free images and additional images available for purchase.
- Imagen is Google’s May 2022, version of a text-to-image diffusion model, which is not available to the public.
- Stable Diffusion: In August 2022, Stability AI released Stable Diffusion, an open-source Diffusion model similar to Dall-E 2 and Imagen. Stability AI’s released open source code and model weights, opening up the models to the entire AI community. Stable Diffusion was trained on an open dataset, using the 2 billion English label subset of the CLIP-filtered image-text pairs open dataset LAION 5b, a general crawl of the internet created by the German charity LAION.
- Midjourney is another diffusion model released in July 2022 and available via API and a discord bot.
Simply put, Diffusion models are generative tools that enable users to create almost any image they can imagine.
Diffusion Models: Why Are They Important?
Diffusion models represent that zenith of generative capabilities today. However, these models stand on the shoulders of giants, owing their success to over a decade of advancements in machine learning techniques, the widespread availability of massive amounts of image data, and improved hardware.
For some context, below is a brief outline of significant machine learning developments.
- In 2009 at CVPR, the seminal Imagenet paper and dataset were released, which contained over 14 million hand-annotated images. This dataset was massive at the time, and it remains relevant to researchers and businesses building models today.
- In 2014, GANs were introduced by Ian Goodfellow, establishing powerful generative capabilities for machine learning models.
- In 2018 LLM’s hit the scene with the original GPT release, followed shortly by its successors GPT-2, and the current GPT-3, which have text generation capabilities.
- In 2020, NeRFs allowed the world to produce 3D objects from a series of images, and known camera poses.
- Over the past few years, Diffusion models have continued this evolution, giving us even more powerful generative capabilities.
What about diffusion models makes them so strikingly different from their predecessors? The most apparent answer is their ability to generate highly realistic imagery and match the distribution of real images better than GANs. Also, diffusion models are more stable than GANs, which are subject to mode collapse, where they only represent a few modes of the true distribution of data after training. This mode collapse means that in the extreme case, only a single image would be returned for any prompt, though the issue is not quite as extreme in practice. Diffusion models avoid the problem as the diffusion process smooths out the distribution, resulting in diffusion models having more diversity in imagery than GANs.
Diffusion models also can be conditioned on a wide variety of inputs, such as text for text-to-image generation, bounding boxes for layout-to-image generation, masked images for inpainting, and lower-resolution images for super-resolution.
The applications for diffusion models are vast, and the practical uses of these models are still evolving. These models will greatly impact Retail and eCommerce, Entertainment, Social Media, AR/VR, Marketing, and more.
Getting Started with Diffusion Models
Web applications such Open AI’s Dall-E 2 and Stable Diffusion’s DreamStudio make diffusion models readily available. These tools provide a quick and easy way for beginners to start with diffusion models, allowing you to generate images with prompts and perform inpainting and outpainting. DreamStudio offers more control over the output parameters, while Dall-E 2’s interface is simpler with fewer frills. Each platform provides free credits to new users, but will charge a usage fee once those credits are depleted.
- DreamStudio: DreamStudio from Stability AI is a quick way for users to experience Stable Diffusion without worrying about the infrastructure details. There is tooling for image generation, inpainting, and outpainting. Uniquely, the interface enables users to specify a random seed, providing the ability to traverse the latent space while holding a prompt fixed (more to come on this later). New users get 200 free credits.
- Dall-E 2: OpenAI recently announced that Dall-E 2 is now generally available to all users, coming out of its previously closed beta. Dall-E 2 provides a simple user interface without many frills, to generate images, inpainting, and outpainting.
- Local Installation:
- Stability AI broke headlines when it announced that it was open-sourcing both the model weights and source code for its Diffusion model Stable Diffusion.
- You can download and install Stable Diffusion on your local computer and integrate its capabilities into applications and workflows.
- Other models, such as Dall-E 2, are currently only available via API or web app as their models are not open-source like Stable Diffusion.
You can also search through a wide array of curated images on aggregation sites like Lexica.art, which provides an even easier way to get started and get inspired by what the broader community has been creating and learn how to build better prompts
Diffusion Model Prompt Engineering
Prompts are how you can control the outputs for Diffusion models. Diffusion models are verbose and take two primary inputs and translate these into a fixed point in its model’s latent space, a seed integer, and a text prompt. The seed integer is generally automatically generated, and the user provides the text prompt. Continuous experimentation via Prompt engineering is critical to getting the perfect outcomes. We explored Dall-E 2 and Stable Diffusion and have consolidated our best tips and tricks to getting the most out of your prompts, including prompt length, artistic style, and key terms to help you sculpt the images you want to generate.
How to prompt
In general, there are three main components to a prompt:
Frame + Subject + Style + an optional Seed.
1. Frame - The frame of an image is the type of image to be generated. This is combined with the Style later in the prompt to provide an overall look and feel of the image. Examples of Frames include photograph, digital illustration, oil painting, pencil drawing, one-line drawing, and matte painting.
- The following examples are modified versions of the base prompt "Painting of a person in a Grocery Store," in the frame of an oil painting, a digital illustration, a realistic photo, and a 3D cartoon.

- Diffusion models typically default to a “picture” frame if not specified, though this is dependent on the subject matter. By specifying a frame of an image, you control the output directly.

- By modifying the frame to “Polaroid” you can mimic the output of a polaroid camera, complete with large white borders.

- Pencil Drawings can be produced as well.

- And as already covered, different painting techniques can be applied.

- Frames provide a rough guide for the output type the diffusion model should generate. But in order to create remarkable images, a good subject and refined style should also be added to your prompts. We will cover subjects next and then detail tips and tricks for combining frames, subjects, and styles to fine-tune your images.
2. Subject - The main subject for generated images can be anything you can dream up.
- Diffusion models are built largely from publicly available internet data and are able to produce highly accurate images of objects that exist in the real world.



- However, Diffusion models often struggle with compositionality, so ideally, limiting your prompts to one to two subjects is best.
- Sticking to one or two subjects produces generally good results, for example "Chef Chopping Carrots on a cutting board."

- Even though there is some confusion here with a knife chopping another knife, there are chopped carrots in the scene, which is generally close to the original prompt.
- However, expanding to more than two subjects can produce unreliable and sometimes humorous results:

- Diffusion models tend to fuse two subjects into a single subject if the subjects are less common. For example, the prompt “a giraffe and an elephant” yields a giraffe-elephant hybrid rather than a single giraffe and a single elephant. Interestingly, there are often two animals in the scene, but each is typically a hybrid.

- Some attempts to prevent this, including adding in a preposition like “beside,” have mixed results but are closer to the original intent of the prompt.

- This issue appears subject-dependent, as a more popular pair of animals, such as “a dog and a cat,” generates distinct animals without a problem.

3. Style - The style of an image has several facets, key ones being the lighting, the theme, the art influence, or the period.
- Details such as “Beautifully lit”, “Modern Cinema”, or “Surrealist”, will all influence the final output of the image.
- Referring back to the prompt of "chefs chopping carrots," we can influence this simple image by applying new styles. Here we see a “modern film look” applied to the frames of “Oil Painting” and “Picture.”


- The tone of the images can be shaped by a style, here we see “spooky lighting.”

- You can fine-tune the look of the resulting images by slightly modifying the style. We start with a blank slate of “a house in a suburban neighborhood.”

- By adding “beautifully lit surrealist art” we get much more dynamic and intense images.

- Tweaking this we can get a spooky theme to the images by replacing “beautifully lit” with the phrase “spooky scary.”

- Apply this to a different frame to get the desired output, here we see the same prompt with the frame of an oil painting.

- We can then alter the tone to “happy light” and see the dramatic difference in the output.

- You can change the art style to further refine the images, in this case switching from “surrealist art” to “art nouveau.”

- As another demonstration of how the frame influences the output, here we switch to “watercolor” with the same style.

- Different seasons can be applied to images to influence the setting and tone of the image.

- There is a near-infinite variety of combinations of frames and styles and we only scratch the surface here.




- Artists can be used to fine-tune your prompts as well. The following are versions of the same prompt, "person shopping at a grocery store," styled to look like works of art from famous historic painters.

- By applying different styles and frames along with an artist, you can create novel artwork.
- Start with a base prompt of “painting of a human cyborg in a city {artist} 8K highly detailed.”

- While the subject is a bit unorthodox for this group, each painting fits the expected style profile of each artist.
- We can alter the style by modifying the tone, in this case, to “muted tones”:

- You can further alter the output by modifying both the frame and the tone to get unique results, in this case, a frame of a “3D model painting” with neon tones.

- Adding the qualifier, “the most beautiful image you’ve ever seen” yields eye-catching results.

- And depictions such as “3D model paintings” yield unique, novel works of art.

- By modifying the frame and style of the image, you can yield some amazing and novel results. Try different combinations of style modifiers, including “dramatic lighting”, or “washed colors” in addition to the examples that we provided to fine-tune your concepts further.
- We hardly scratched the surface in this guide, and look forward to amazing new creations from the community.
4. Seed
- A combination of the same seed, same prompt, and same version of Stable Diffusion will always result in the same image.
- If you are getting different images for the same prompt, it is likely caused by using a random seed instead of a fixed seed. For example, "Bright orange tennis shoes, realistic lighting e-commerce website" can be varied by modifying the value of the random seed.

- Changing any of these values will result in a different image. You can hold the prompt or seed in place and traverse the latent space by changing the other variable. This method provides a deterministic way to find similar images and vary the images slightly.
- Varying the prompt to "bright blue suede dress shoes, realistic lighting e-commerce website" and holding the seed in place at 3732591490 produces results with similar compositions but matching the desired prompt. And again, holding that prompt in place and traversing the latent space by changing the seed produces different variations:

To summarize a good way to structure your prompts is to include the elements of “[frame] [main subject] [style type] [modifiers]” or “A [frame type] of a [main subject], [style example]” And an optional seed. The order of these exact phrases may alter your outcome, so if you are looking for a particular result it is best to experiment with all of these values until you are satisfied with the result.
4. Prompt Length
Generally, prompts should be just as verbose as you need them to be to get the desired result. It is best to start with a simple prompt to experiment with the results returned and then refine your prompts, extending the length as needed.
However, many fine-tuned prompts already exist that should be reused or modified.
Modifiers such as "ultra-realistic," "octane render," and "unreal engine" tend to help refine the quality of images, as you can see in some of the examples below.
- “A female daytrader with glasses in a clean home office at her computer working looking out the window, ultra realistic, concept art, intricate details, serious, highly detailed, photorealistic, octane render, 8 k, unreal engine”

- “portrait photo of a man staring serious eyes with green, purple and pink facepaint, 50mm portrait photography, hard rim lighting photography–beta –ar 2:3 –beta –upbeta”

- “Extremely detailed wide angle photograph, atmospheric, night, reflections, award winning contemporary modern interior design apartment living room, cozy and calm, fabrics and textiles, geometric wood carvings, colorful accents, reflective brass and copper decorations, reading nook, many light sources, lamps, oiled hardwood floors, color sorted book shelves, couch, tv, desk, plants”

- “Hyperrealistic and heavy detailed fashion week runway show in the year 2050, leica sl2 50mm, vivid color, high quality, high textured, real life”

- “Full-body cyberpunk style sculpture of a young handsome colombian prince half android with a chest opening exposing circuitry and electric sparks, glowing pink eyes, crown of blue flowers, flowing salmon-colored silk, fabric, raptors. baroque elements. full-length view. baroque element. intricate artwork by caravaggio. many many birds birds on background. trending on artstation, octane render, cinematic lighting from the right, hyper realism, octane render, 8k, depth of field, 3d”

- “Architectural illustration of an awesome sunny day environment concept art on a cliff, architecture by kengo kuma with village, residential area, mixed development, high - rise made up staircases, balconies, full of clear glass facades, cgsociety, fantastic realism, artstation hq, cinematic, volumetric lighting, vray”

5. Additional Tips
A few additional items are worth mentioning.
Placing the primary subject of the image closer to the beginning of the prompt tends to ensure that subject is included in the image. For instance, compare the two prompts
- "A city street with a black velvet couch" at times will miss the intent of the prompt entirely and the resulting image will not include a couch.

- By rearranging the prompt to have the keyword "couch" closer to the beginning of the prompt, the resulting images will almost always contain a couch.

There are combinations of subject and location that tend to yield poor results. For instance, "A black Velvet Couch on the surface of the moon" yields uneven results, with different backgrounds and missing couches entirely. However, a similar prompt, "A black velvet couch in a desert" tends to reflect the intent of the prompt, capturing the velvet material, the black color, and the characteristics of the scene more accurately. Presumably, there are more desert images contained in the training data, making the model better at creating coherent scenes for deserts than the moon.


Prompt engineering is an ever-evolving topic, with new tips and tricks being uncovered daily. As more businesses discover the power of diffusion models to help solve their problems, it is likely that a new type of career, "Prompt Engineer" will emerge.
Diffusion Model Limitations
As powerful as they are, Diffusion models do have limitations, some of which we will explore here. Disclaimer: given the rapid pace of development, these limitations are noted as of October 2022.
- Face Distortion: Faces become substantially distorted when the number of subjects exceeds 3. For example, "a family of six in a conversation at a cafe looking at each other and holding coffee cups, a park in the background across the street leica sl2 50mm, vivid color, high quality, high textured, real life", the faces become substantially distorted.

- However, increasing the number of subjects in the prompt causes the faces to become substantially distorted. For example this updated prompt, “a family of six in a conversation at a cafe looking at each other and holding coffee cups, a park in the background across the street leica sl2 50mm, vivid color, high quality, high textured, real life,” results in the following:

- Text generation: In an ironic twist, diffusion models are notoriously bad at generating text within images, even though the images are generated from text prompts, which diffusion models handle well. For the prompt "a man at a conference wearing a black t shirt with the word SCALE written in neon text" the generated image will include words on the shirt in the best case, but will not recreate "Scale", in this case instead including the letters "Sc-sa Salee". In other cases, the words will be on signs, the wall, or not included at all. This will likely to be fixed in future versions of these models, but it is interesting to note.

- Limited Prompt Understanding: For some images it does require a lot of massaging of the prompt to get the desired output, reducing the potential efficiency of these models for a productivity tool, though they are still a net productivity add.
Diffusion Models: Additional Capabilities and Tooling
Diffusion models' flexibility gives them more capabilities than just pure image generation.
- Inpainting is an image editing capability that enables users to modify regions of an image and replace those regions with generated content from the diffusion model. The model references surrounding pixels to ensure that the generations fit the context of the original image. Many tools enable changes within an image (real-world images or generated images) by "erasing" or applying a mask to a specific image region and then asking the model to fill in the image with new content. With inpainting, you start with a real-world or generated image. In this case, the image is of a model leaning against a wall in a modern city.

- You then apply a "mask" to the areas of the image you would like to replace, similar to erasing areas of the image. First, we will replace the jacket that the model is wearing.

- Then, we will generate new clothing for the model to wear, in this case a beautiful fancy gown shiny with pink highlights.

- Or a bright purple and pink wool jacket with orange highlights and highly detailed.

- The images above maintain the original pose, but these models can also suggest new poses, in this case, moving the arm down to the side, as seen in this example, "A glowing neon jacket purple and pink"

- Diversity of materials is demonstrated as well with "A leather jacket with gold studs"

- And also with "A shiny translucent coat neon"

- Inpainting can also replace background elements, either by simply removing them and filling in the background, or generating new backgrounds that did not exist in the original image. Start with Masking out the poles in the background:

- To generate a new background, apply a mask to and generate new scenes, in this case "a futuristic cityscape with flying cars."

As you can see, inpainting is quite powerful for editing images quickly and generating new scenes dynamically.
In the future, expect the capabilities of these tools to be even more efficient, without requiring the user to edit a mask. By simply describing the desired edits, i.e., “Replace the background with a futuristic cityscape," the image will automatically be edited, with no mouse clicks or keystrokes needed.
2. Outpainting
Outpainting enables users to extend their creativity by continuing an image beyond its original borders - adding visual elements in the same style, or taking a story in new directions simply by using natural language description.
Starting with a real-world image or a generated image, you can extend that image beyond the original borders until you have a larger, coherent scene.
- Real-world image outpainting
- Start by uploading an image and selecting a region on the outside border where you would like to extend the image.

- Similar to inpainting, you then generate prompts that generate coherent scenes that extend the original image.

- Generated Image Inpainting works in much the same way. Start by generating a scene that will serve as the seed. Outside scenes lend themselves to more expansive outpainting, so we will start with “central park looking at the skyline with people on the grass in impressionist style."

- Now add in a man playing frisbee with his dog.

- Note that the prompt does not specify that the skyline should remain consistent with the original image. The model automatically maintains and extends the background style, so you can focus on adding the elements you want to see.
- Finally, we will add in a family having a picnic and we have our finished image.

Outpainting requires an extra layer of prompt refinement in order to generate coherent scenes, but enables you to quickly create large images that would take significantly longer to create with traditional methods.

Outpainting enables impressive amounts of creativity and the ability to build large-scale scenes that remain coherent in theme and style. Similar to inpainting, there is room to improve this capability by making it even simpler to generate the desired scenes by providing a single prompt and getting the exact image you are looking for.
3. Diffusion for Video Generation
As we have seen, generating static images is exciting and offers many practical applications. Several recently announced models take the capabilities of diffusion models and extend them to create videos. These capabilities have not yet made it to the hands of a broader audience but will be coming very soon.
- Meta’s Make-A-Video is a new AI system that lets people turn text prompts into brief, high-quality video clips.
- Google’s Imagen Video generates approximately 5 second videos in a similar fashion.
4. Diffusion Model Image Curation Sites
Curation sites like Lexica.art provide highly curated collections of generated images and the prompts that were used to create them. Millions of images are indexed, so there is a good chance that the image that you originally thought you would have to generate already exists and is just a quick search away. Searches are low latency, you don’t need to wait for one to two minutes for a diffusion model to generate images, and you get the images nearly instantly. This is great for experimenting or searching for types of images or exploring prompts. Lexica is also a great way to learn how to prompt to get the results you are looking for (include a couple of examples here)
Diffusion Models: Practical Applications for today and tomorrow
The obvious application for diffusion models is to be integrated into design tools to empower artists to be even more creative and efficient. In fact, the first wave of these tools has already been announced, including Microsoft Designer which integrates Dall-E 2 into its tooling. There are significant opportunities in the Retail and eCommerce space, with generative designs for products, fully generated catalogs, alternate angle generation, and much more.
Product design will be empowered with powerful new design tools, that will enhance their creativity and provide the capability to see what products look like in the context of homes, offices, and other scenes. With advancements in 3D diffusion, full 3D renders of products can be created with a prompt. Taking this to the extreme, these 3D renders can then be printed as a 3D model and come to life in the real world.
Marketing will be transformed, as ad creative can be dynamically generated, providing massive efficiency gains, and the ability to test different creatives will increase the effectiveness of ads.
The entertainment industry will begin incorporating diffusion models into special effects tooling, which will enable faster and more cost-effective productions. This will lead to more creative and wild entertainment concepts that are limited today due to the high costs of production. Similarly, Augmented and Virtual Reality experiences will be improved with the near-real-time content generation capabilities of the models. Users will be able to alter their world at will, with just the sound of their voice.
A new generation of tooling is being developed around these models, which will unlock a wide range of capabilities.
Conclusion
The vast capabilities of Diffusion models are inspiring, and we don't yet know the true extent of their limitations.
Foundation models are bound to expand their capabilities over time, and progress is accelerating rapidly. As these models improve, the way humanity interacts with machines will change fundamentally. As Roon stated in his blog Text is the Universal Interface, "soon, prompting may not look like "engineering" at all but a simple dialogue with the machine."
The opportunities for advancing our society, art, and business are plentiful, but technology needs to be embraced quickly to see these benefits. Businesses need to take advantage of this new functionality or risk falling dramatically behind. We look forward to a future where humans are a prompt away from creating anything we can imagine, unlocking unlimited productivity and creativity. The best time to get started on this journey is now, and we hope that this guide serves as a stong foundation for that journey.
Data Labeling: The Authoritative Guide
The success of your ML models is dependent on data and label quality. This is the guide you need to ensure you get the highest quality labels possible.
Data Labeling for Machine Learning

Machine learning has revolutionized our approach to solving problems in computer vision and natural language processing. Powered by enormous amounts of data, machine learning algorithms are incredibly good at learning and detecting patterns in data and making useful predictions, all without being explicitly programmed to do so.
Trained on large amounts of image data, computer models can predict objects with very high accuracy. They can recognize faces, cars, and fruit, all without requiring a human to write software programs explicitly dictating how to identify them.
Similarly, natural language processing models power modern voice assistants and chatbots we interact with daily. Trained on enormous amounts of audio and text data, these models can recognize speech, understand the context of written content, and translate between different languages.
Instead of engineers attempting to hand-code these capabilities into software, machine learning engineers program these models with a large amount of relevant, clean data. Data needs to be labeled to help models make these valuable predictions. Data labeling is one of machine learning's most critical and overlooked activities.
This guide aims to provide a comprehensive reference for data labeling and to share practical best practices derived from Scale's extensive experience in addressing the most significant problems in data labeling.
What is Data Labeling?
Data labeling is the activity of assigning context or meaning to data so that machine learning algorithms can learn from the labels to achieve the desired result.
To better understand data labeling, we will first review the types of machine learning and the different types of data to be labeled. Machine learning has three broad categories: supervised, unsupervised, and reinforcement learning. We will go into more detail about each type of machine learning in Why is Data Annotation Important?
Supervised machine learning algorithms leverage large amounts of labeled data to “train” neural networks or models to recognize patterns in the data that are useful for a given application. Data labelers define ground truth annotations to data, and machine learning engineers feed that data into a machine learning algorithm. For example, data labelers will label all cars in a given scene for an autonomous vehicle object recognition model. The machine learning model will then learn to identify patterns across the labeled dataset. These models then make predictions on never before seen data.
Types of Data
Structured vs. Unstructured Data
Structured data is highly organized, such as information in a relational database (RDBMS) or spreadsheet. Customer information, phone numbers, social security numbers, revenue, serial numbers, and product descriptions are structured data.
Unstructured data is data that is not structured via predefined schemas and includes things like images, videos, LiDAR, Radar, some text data, and audio data.
Images
Camera sensors output data initially in raw format and then converted to .png or preferably .jpg files, which are compressed and take up less storage than .png, which is a serious consideration when dealing with the large amounts of data needed to train machine learning models. Image data is also scraped from the internet or collected by 3rd party services. Image data powers many applications, from face recognition to manufacturing defect detection to diagnostic imaging.

Videos
Video data also come from camera sensors in raw format and consist of a series of frames stored as .mp4, .mov, or other video file formats. MP4 is a standard in machine learning applications due to its smaller file size, similar to .jpg for image data. Video data enables applications like autonomous vehicles and fitness apps.
3D Data (LiDAR, Radar)
3D data helps models overcome the lack of depth information from 2D data such as traditional RGB camera sensors, helping machine learning models get a deeper understanding of a scene.

LiDAR (Light Detection and Ranging) is a remote sensing method that uses light to generate precise 3D images of scenes. LiDAR data is stored as point clouds in raw format and the .las file format and are often converted to JSON file format for processing by machine learning applications.
Radar (Radio Detection and Ranging) is a remote sensing method that uses radio waves to determine an object's distance, angle, and radial velocity relative to the radar source.
Audio
Typically stored as .mp3 or .wav file formats, audio data enables speech recognition for your favorite smart assistant and real-time multilingual machine translation.
Text
Text data made of characters representing information, often stored in .txt, .docx, or .html files. Text powers Natural Language Processing (NLP) applications such as virtual assistants when they answer your questions, automated translation, text-to-speech, speech-to-text, and document information extraction.
Why is Data Annotation Important?
Machine learning powers revolutionary applications made possible by vast amounts of high-quality data. To better understand the importance of data labeling, it is critical to understand the different types of machine learning: supervised, unsupervised, and reinforcement learning.
Reinforcement Learning leverages algorithms to take actions in an environment to maximize a reward. For instance, Deepmind’s AlphaGo used reinforcement learning to play games against itself to master the game of GO and become the strongest player in history. Reinforcement learning does not rely on labeled data but instead maximizes a reward function to achieve a goal.
Supervised Learning vs. Unsupervised Learning
Supervised learning is behind the most common and powerful machine learning applications, from spam detection to enabling self-driving cars to detect people, cars, and other obstacles. Supervised learning uses a large amount of labeled data to train a model to accurately classify data or predict outcomes.
Unsupervised learning helps analyze and cluster unlabeled data, driving systems like recommendation engines. These models learn from features of the dataset itself, without any labeled data to "teach" the algorithm the expected outputs. A common approach is K-means clustering, which aims to partition n observations into k clusters and assign each observation to the nearest mean.
While there are many fantastic applications for unsupervised learning, supervised learning has driven the most high-impact applications due to its high accuracy and predictive capabilities.
Machine learning practitioners have turned their attention away from model improvement to improving data, coining a new paradigm: data-centric ai. Only a tiny fraction of real-world ML systems are composed of ML code. More high-quality data and accurate data labels are necessary to power better AI. As the methods to create better machine learning models shift to data-centricity, it is essential to understand the entire process of a well-defined data pipeline, from data collection methods to data labeling to data curation.
This guide focuses on the most common types of data labels and the best practices for high quality so that you can get the most out of your data and therefore get the most out of your models.
How to Annotate Data
To create high-quality supervised learning models, you need a large volume of data with high-quality labels. So, how do you label data? First, you will need to determine who will label your data. There are several different approaches to building labeling teams, and each has its benefits, drawbacks, and considerations. Let's first consider whether it is best to involve humans in the labeling process, rely entirely on automated data labeling, or combine the two approaches.
1. Choose Between Humans vs. Machines
Automated Data Labeling
For large datasets consisting of well-known objects, it is possible to automate or partially automate data labeling. Custom Machine Learning models trained to label specific data types will automatically apply labels to the dataset.
Establishing high-quality ground-truth datasets early on, and only then can you leverage automated data labeling. Even with high-quality ground truth, it can be challenging to account for all edge cases and to fully trust automated data labeling to provide the highest quality labels.
Human Only Labeling
Humans are exceptionally skilled at tasks for many modalities we care about for machine learning applications, such as vision and natural language processing. Humans provide higher quality labels than automated data labeling in many domains.
However, human experiences can be subjective to varying degrees and training humans to label the same data consistently is a challenge. Furthermore, humans are significantly slower and can be more expensive than automated labeling for a given task.
Human in the Loop (HITL) Labeling
Human-in-the-loop labeling leverages the highly specialized capabilities of humans to help augment automated data labeling. HITL data labeling can come in the form of automatically labeled data audited by humans or from active tooling that makes labeling more efficient and improves quality. The combination of automated labeling plus human in the loop nearly always outpaces the accuracy and efficiency of either alone.

2. Assemble Your Labeling Workforce
If you choose to leverage humans in your data labeling workforce, which we highly recommend, you will need to figure out how to source your labeling workforce. Will you hire an in-house team, convince your friends and family to label your data for free, or scale up to a 3rd Party labeling company? We provide a framework to help you make this decision below.
In-House Teams
Small startups may not have the capital to afford significant investments in data labeling, so they may end up having all the team members, including the CEO, label data themselves. For a small prototype, this approach may work but is not a scalable solution.
Large, well-funded organizations may choose to keep in-house labeling teams to keep control over the entire data pipeline. This approach allows for much control and flexibility, but it is expensive and much work to manage.
Companies with privacy concerns or sensitive data may choose in-house labeling teams. While a perfectly valid approach, this can be difficult to scale.
Pros: Subject matter expertise, tight control over data pipelines
Cons: Expensive, overhead in training and managing labelers
Crowdsourcing
Crowdsourcing platforms provide a quick and easy way to quickly complete a wide array of tasks by a large pool of people. These platforms are fine for labeling data with no privacy concerns, such as open datasets with basic annotations and instructions. However, if more complex labels are needed or sensitive data is involved, the untrained resource pool from crowdsource platforms is a poor choice. Resources found on crowdsourcing platforms are not trained well and lack domain expertise, often leading to poor quality labeling.
Pros: Access to a larger pool of labelers
Cons: Quality is suspect; significant overhead in training and managing labelers
3rd Party Data Labeling Partners
3rd Party data labeling companies provide high-quality data labels efficiently and often have deep machine learning expertise. These companies can act as technical partners to advise you on best practices for the entire machine learning lifecycle, including how to best collect, curate, and label your data. With highly trained resource pools and state-of-the-art automated data labeling workflows and toolsets, these companies offer high-quality labels for a minimal cost.
To achieve extremely high quality (99%+) on a large dataset requires a large workforce (1,000+ data labelers on any given project). Scaling to this volume at high quality is difficult with in-house teams and crowdsourcing platforms. However, these companies can also be expensive and, if they are not acting as a trusted advisor, can convince you to label more data than you may need for a given application.
Pros: Technical expertise, minimal cost, high quality; The top data labeling companies have domain-relevant certifications such as SOC2 and HIPAA.
Cons: Relinquish control of the labeling process; Need a trusted partner with proper certifications to handle sensitive data
3. Select Your Data Labeling Platform
Once you have determined who will label your data, you need to find a data labeling platform. There are many options here, from building in-house, using open source tools, or leveraging commercial labeling platforms.
Open Source Tools
These tools are free to use by anyone, with some limitations for commercial use. These tools are great for learning and developing machine learning and AI, personal projects, or testing early commercial applications of AI. While free, the tradeoff is that these tools are not as scalable or sophisticated as some commercial platforms. Some label types discussed in this guide may not be available in these open-source tools.
The list below is meant to be representative, but not exhaustive so many great open source alternatives may not be included.
- CVAT: Originally developed by Intel, CVAT is a free, open-source web-based data labeling platform. CVAT supports many standard label types, including rectangles, polygons, and cuboids. CVAT is a collaborative tool and is excellent for introductory or smaller projects. However, web users are limited to 500 MB of data and only ten tasks per user, reducing the appeal of the collaboration features on the web version. CVAT is available locally to avoid these data constraints.
- LabelMe: Created by CSAIL, LabelMe is a free, open-source data-labeling platform supporting community collaboration on datasets for computer vision research. You can contribute to other projects by labeling open datasets and label your data by downloading the tool. Labelme is quite limited compared to CVAT, and the web version no longer accepts new accounts.
- Stanford Core NLP: A fully featured NLP labeling and natural language processing platform, Stanford's CoreNLP is a robust open source tool offering Named Entity Recognition (NER), linking, text processing, and more.
In-house Tools
Building in-house tools is an option selected by some large organizations that want tighter control over their ML pipelines. You have direct control over which features to build, support your desired use cases, and address your specific challenges. However, this approach is costly, and these tools will need to be maintained and updated to keep up with the state-of-the-art.
Commercial Platforms
Commercial platforms offer high-quality tooling, dedicated support, and experienced labeling workforces to help you scale and can also provide guidance on best practices for labeling and machine learning. Supporting many customers improves the quality of the platforms for all customers, so you get access to state-of-the-art functionality that you may not see with in-house or open-source labeling platforms.
Scale Studio is the industry-leading commercial platform, providing best-in-class labeling infrastructure to accelerate your team, with labeling tools to support any use case and orchestration to optimize the performance of your workforce. Easily annotate, monitor, and improve the quality of your data.
High-Quality Data Annotations
Whatever annotation platform you use, maximizing the quality of your data labels is critical to getting the most out of your machine learning applications.
The classic computer science axiom "Garbage in Garbage Out" is especially acute in machine learning, as data is the primary input to the learning process. With poor quality data or labels, you will have poor results. We aim to provide you with the most critical quality metrics and discuss best practices to ensure that you are maximizing your labeling quality.
Different Ways to Measure Quality
We cover some of the most critical quality metrics and then discuss best practices to ensure quality in your labeling processes.
Label Accuracy
It is essential to analyze how closely labels follow your instructions and match your expectations. For instance, say you have assigned tasks to data labelers to annotate pedestrians. In your instructions, you have specified labels should include anything carried (i.e., a phone or backpack), but not anything that is pushed or pulled. When you review sample tasks, are the instructions followed, or are strollers (pushed) and luggage (pulled) included in the annotations?
How accurate are labelers on benchmark tasks? These are test tasks to determine overall labeler accuracy and give you more confidence that other labeled data will also be correct. Is labeling consistent across labelers or types of data? If label accuracy is inconsistent across different labelers, this may indicate that your instructions are unclear or that you need to provide more training to your labelers.
Model Performance Improvement
How accurate is your model at its specified task? This output metric is not solely dependent on labeling quality, the quantity and quality of data play a prominent role, but labeling quality is a significant factor to consider.
Let's review some of the most critical model performance metrics.
Precision
Precision defines what proportion of positive identifications were correct and is calculated as follows:
Precision = True Positives / True Positives + False Positives
A model that produces no false positives has a precision of 1.0
Recall
Recall defines what proportion of actual positives were identified correctly by the model, and is calculated as follows:
Recall = True Positives / True Positives + False Negatives
A model that produces no false negatives has a recall of 1.0

Precision Recall Curve
A model with high recall but low precision returns many results, but many of the predicted labels are incorrect compared to the ground truth labels. On the other hand, a model with high precision but low recall is just the opposite, returning very few results, but most of its predicted labels are correct when compared to the ground truth labels. An ideal model with high precision and high recall will return many results, with all results labeled correctly.
Precision recall curves provide a more nuanced understanding of model performance than any single metric. Precision and recall are a tradeoff; the exact desired values depend on your model and its application.
For instance, for a diagnostic imaging application tasked to detect cancerous tumors, higher recall is desired as it is better to predict that a non-cancerous tumor is cancerous than the alternative of labeling a cancerous tumor as non-cancerous.
Alternatively, applications like SPAM filters require high precision so that important emails are not incorrectly flagged SPAM, even though this may allow more actual SPAM to enter our inbox.

Intersection Over Union (IoU)
An indirect method of confirming label quality in computer vision applications is an evaluation metric called Intersection over Union (IoU).
IoU compares the accuracy of a predicted label over the ground truth label by measuring the ratio of the area of overlap of the predicted label to the area of the union of both the predicted label and the ground truth label.
The closer this ratio is to 1, the better trained the model is.

As we discussed earlier, the essential factors for model training are high-quality data and labels. So, IoU is an indirect measure of the quality of data labels. You may have a different threshold of quality that you are focused on depending on the class of object. For instance, if you are building an augmented reality application focused on human interaction, you may need your IoU at 0.95 for identifying human faces but only need an IoU of 0.70 for identifying dogs.
Again, it is critical to remember that other data-related factors can influence IoU, such as bias in the dataset or an insufficient quantity of data.
Confusion Matrices
Confusion matrices are a simple but powerful tool to understand class confusion in your model better. A confusion matrix is a grid of classes comparing predicted and actual (ground truth) classifications. By examining the confusion matrix, you can quickly understand misclassifications, such as your model predicting a traffic sign when the ground truth is a train.
Combining these confusions with confidence scores can provide an easy way to prioritize the object confusion, looking at those instances where the model is highly confident in an incorrect prediction. Class confusions are likely due to an issue with incorrect or missing labels or with an inadequate quantity of data containing the traffic sign and train classes.

Best Practices for Achieving High-Quality Labels:
- Collect the best data possible: Your data should be high quality and as consistent as possible while avoiding the bias that may reduce your model's usefulness. Ideally, your data collection pipeline is integrated into your labeling pipeline to increase efficiency and minimize turnaround times.
- Hire the right labelers for the job: Ensure that your labelers speak the right language, are from a specific region, or are familiar with a particular domain. Also, ensure that your labelers are properly incentivized to provide high-quality labels.
- Combine humans and machines: use ML-powered labeling tooling with humans in the loop (HITL) for the highest accuracy labels.

- Provide clear and comprehensive instructions: This will help to ensure that different labelers will label data consistently.

- Curate Your data: As you look to improve your model performance, you will also want to curate your data. Use a data curation tool such as Scale Nucleus to explore your data and identify data that is completely missing labels or improperly labeled. Review your dataset's IoU, ROC curve, and confusion matrix to understand poor model performance better. The best data curation tools will also allow you to visually interact with these charts to visually inspect data related to a specific confusion and even send it to your labeling team to correct. You may also discover that you are missing data, in which case you will need to collect more data to label.
- Benchmark tasks and screening: Collect high-confidence responses to a subset of labeling tasks and use these tasks to estimate the quality of your labelers. Mix these benchmark tasks into other tasks. Use the performance on benchmark tasks to determine if an individual labeler understands your instructions and is capable of providing your desired quality. You can screen labelers who do not pass your benchmark tasks and either retrain them or exclude them from your project.

- Inspect common answers for specific data: Looking at common answers for labels can help you identify trends in labeling errors. If all data labelers are incorrectly classifying a particular object or mislabeling a piece of data, then maybe the issue is not with the labeler but lies somewhere else. Reevaluate your ground truth, instructions, and training processes to ensure that your expectations have been clearly communicated. Once identified, add common mistakes to your instructions to avoid these issues in the future.
- Update your instructions and golden datasets as you encounter edge cases
- Create calibration batches to ensure that your instructions are clear and that quality is high on a small sample of your data before scaling up your labeling tasks.
- Establish a consensus pipeline: Implement a consensus pipeline for classification or text-based tasks with more subjectivity. Use a majority vote or a hierarchical approach based on the experience or proven quality of an individual or group of data labelers.
- Establish layers of review: Establish a hierarchical review structure for computer vision tasks to ensure that the labels are as accurate as possible.
- Randomly sample labeled data for manual auditing: Randomly sample your labeled data and audit it yourself to confirm the quality of the sample. This approach will not guarantee that the entire dataset is labeled accurately but can give you a sense of general performance on labeling tasks.
- Retrain or remove poor annotators: If an annotator's performance does not improve over time and with retraining, then you may need to remove them from your project.
- Measure your model performance: Whether your model performance improves or degrades can often be directly reflected in the quality of your data labels. Use model validation tools such as Scale validate to critically evaluate precision, recall, intersection over union, and any other metrics critical to your model performance.
Data Labeling for Computer Vision
Computer vision is a field in artificial intelligence focused on understanding data from 2D images, videos, or 3D inputs and making predictions or recommendations based on that data. The human vision system is particularly advanced, and humans are very good at computer vision tasks.
In this chapter, we explore the most relevant types of data labeling for computer vision and provide best practices for labeling each type of data.
1. Bounding Box
The most commonly used and simplest data label, bounding boxes are rectangular boxes that identify the position of an object in an image or video
Data labelers draw a rectangular box over an object of interest, such as a car or street sign. This box defines the object's X and Y coordinates.

By "bounding" an object with this type of label, machine learning models have a more precise feature set from which to extract specific object attributes to help them conserve computing resources and more accurately detect objects of a particular type.
Object detection is the process of categorizing objects along with their location in an image. These X and Y coordinates can then be output in a machine-readable format such as JSON.
Typical Bounding Box Applications:
- Autonomous driving and robotics to detect objects such as cars, people, or houses
- Identifying damage or defects in manufactured objects
- Household object detection for augmented reality applications
- Anomaly detection in medical diagnostic imaging
Best Practices:
- Hug the border as tightly as possible. Accurate labels will capture the entire object and match the edges as closely as possible to the object's edges to reduce confusion for your model.
- Avoid item overlap. Due to IoU, bounding boxes work best when there is minimal overlap between objects. If objects overlap significantly, using polygon or segmentation annotations may be better.
- Object size: Smaller objects are better suited for bounding boxes, while larger objects are better suited for instance segmentation. However, annotating tiny objects may require more advanced techniques.
- Avoid Diagonal Lines: Bounding boxes perform poorly with diagonal lines such as walkways, bridges, or train tracks as boxes cannot tightly hug the borders. Polygons and instance segmentation are better approaches in these cases.
2. Classification
Object classification means applying a label to an entire image based on predefined categories, known as classes. Labeling images as containing a particular class such as "Dog," "Dress," or "Car" helps train an ML model to accurately predict objects of the same class when run on new data.

Typical Classification Applications:
- Activity Classification
- Product Categorization
- Image Sentiment Analysis
- Hot Dog vs. Not Hot Dog
Best Practices:
- Create clearly defined, easily understandable categories that are relevant to the dataset.
- Provide sufficient examples and training to your data labelers so that the requirements are clear and ambiguity between classes is minimized.
- Create benchmark tests to ensure label quality.
3. Cuboids
Cuboids are 3-dimensional labels that identify the width, height, and depth of an object, as well as the object's location.
Data labelers draw a cuboid over the object of interest such as a building, car, or household object, which defines the object's X, Y, and Z coordinates. These coordinates are then output in a machine-readable format such as JSON.
Cuboids enable models to precisely understand an object's position in 3D space, which is essential in applications such as autonomous driving, indoor robotics, or 3D room planners. Reducing these objects to geometric primitives also makes understanding an entire scene more manageable and efficient.

Typical Cuboid Applications:
- Develop prediction and planning models for autonomous vehicles using cuboids on pedestrians and cars to determine predicted behavior and intent.
- Indoor objects such as furniture for room planners
- Picking, safety, or defect detection applications in manufacturing facilities
Best Practices:
- Capture the corners and edges accurately. Like bounding boxes, ensure that you capture the entire object in the cuboid while keeping the label as tight to the object as possible.
- Avoid Overlapping labels where possible. Clean, non-overlapping cuboid data annotations will help your model improve object predictions and localizations in 3D space.
- Axis alignment is critical. Ensure that the alignment of your bounding boxes is on the same axis for objects of the same class.
- Keep your camera intrinsics in mind. Applying cuboids without understanding the camera's location will lead to poor prediction results when objects are not in the same position related to the camera in the future. The front face of a "true" cuboid will likely not be a perfect 90-degree rectangle, especially if it isn't facing the camera head-on. Furthermore, the edges of a cuboid parallel to the ground should converge to the horizon, while the top and bottom edges of the right side of the above annotation are parallel.
- Pair 2D Data with 3D Depth Data such as LiDAR.2D images inherently lack depth information, so pairing your 2D data with 3D depth data such as LiDAR will yield the best results for applications dependent on depth accuracy. See the section below on 3D Sensor fusion for more information on this topic.
4. 3D Sensor Fusion
3D sensor fusion refers to the method of combining the data from multiple sensors to accommodate for the weaknesses of each sensor. 2D images alone are not enough for current machine learning models to make sense of entire scenes. Estimating depth from a 2D image is challenging, and occlusion and limited field of view make relying on 2D images tricky. While some approaches to autonomous driving rely solely on cameras, a more robust approach is to overcome the limitations of 2D by supplementing with 3D systems using sensors such as LiDAR and Radar.

LiDAR (Light Detection and Ranging) is a method for determining the distance of objects with a laser to determine the depth of objects in scenes and create 3D representations of the scene.
Radar (Radio detection and ranging) uses radio waves to determine objects' distance, angle, and radial velocity.
This demo provides an interactive 3D sensor fused scene, and the video below gives a high-level overview of a similar scene.
Typical 3D Sensor Fusion Applications
- Autonomous Vehicles
- Geospatial and mapping applications
- Robotics and automation
Best Practices
- Ensure that your data labeling platform is calibrated to your sensor intrinsics (or better yet, ensure that your tooling is sensor agnostic) and supports different lens and sensor types, for example, fisheye and panoramic cameras.
- Look for a data labeling platform that can support large scenes, ideally with support for infinitely long scenes.
- Ensure that object tracking is consistent throughout a scene, even when an object leaves and returns to the scene.
- Include attribute support for understanding correlations between objects, such as truck cabs and trailers.
- Leverage linked instance IDs describing the same object across the 2D and 3D modalities.
5. Ellipses
Ellipses are oval data labels that identify the position of objects in an image. Data labelers draw an ellipse label on an object of interest such as wheels, faces, eyes, or fruit. This annotation defines the object's location in 2D space. The X and Y coordinates of the four extremal vertices of the ellipse can then be output in a machine-readable format such as JSON to fully define the location of the ellipse.

Applications
- Face Detection
- Medical Imaging Diagnosis
- Wheel Detection
Best Practices:
- The data to be labeled should be oval or circular; i.e., it is not helpful to label rectangular boxes with ellipses when a bounding box will yield better results.
- Use ellipses where there would be high overlap for bounding boxes or where objects are tightly clustered or occluded, such as in bunches of fruit. Ellipses can tightly hug the borders of these objects and provide a more targeted geometry to your model.
6. Lines
Lines identify the position of linear objects like roadway markers. Data labelers draw lines over areas of interest, which define the vertices of a line. Labeling images with lines helps to train your model to identify boundaries more accurately. The X and Y coordinates of the vertices of the lines can then be output in JSON.

Typical Lines Applications
- Label roadway markers with straight or curved lines for autonomous vehicles
- Horizon lines for AR/VR applications
- Define boundaries for sporting fields
Best Practices
- Label only the lines that matter most to your application.
- Match the lines to the shape of the lines in the image as closely as possible.
- Depending on the use case, it could be important for lines not to intersect.
- Center the line annotation within the line in the image to improve model performance.
7. Points
Points are spatial locations in an image used to define important features of an object. Data labelers place a point on each location of interest, representing that location's X and Y coordinates. These points may be related to each other, such as when annotating a human shoulder, elbow, and wrist to identify the larger moving parts of an arm. These labels help machine learning models more accurately determine pose estimations or detect essential features of an object.

Typical Points Applications
- Pose estimation for fitness or health applications or activity recognition
- facial feature points for face detection
Best Practices
- Label only the points that are most critical to your application. For instance, if you are building a face detection application, focus on labeling salient points on the eyes, nose, mouth, eyebrows, and the outline of the face.
- Group points into structures (hand, face, and skeletal keypoints), and the labeling interface should make it efficient for taskers to visualize the interconnections between points in these structures.
8. Polygons
While bounding boxes are quick and easy for data labelers to draw, they are not precise in mapping to irregular shapes and can leave large gaps around an object. There is a tradeoff between accuracy and efficiency in using bounding boxes and polygons. For many applications, bounding boxes provide sufficient accuracy for a machine learning model with minimal effort. However, some applications require the increased accuracy of polygons at the expense of a more costly and less efficient annotation.
Data labelers draw a polygon shape over an object of interest by clicking on relevant points of the object to complete an entirely connected annotation. These points define the vertices of the polygon. The X and Y coordinates of these vertices are then output in JSON.

Typical Polygons Applications
- Irregular objects such as buildings, vehicles, or trees for autonomous vehicles
- Satellite imagery of houses, pools, industrial facilities, planes, or landmarks
- Fruit detection for agricultural applications
Best Practices
- Objects with holes or those split into multiple polygons due to occlusion (a car behind a tree, for example) require special treatment. Subtract the area of each hole from the object.
- Avoid slight overlaps between polygons that are next to each other.
- Zoom in closely to each object to ensure that you place points close to each object's borders.
- Pay close attention to curved edges, making sure to add more vertices to 'smooth' these edges as much as possible.
- Leverage the Auto Annotate Polygon tool to efficiently label objects. Automatically and quickly generate high-precision polygon annotations by highlighting specific objects of interest with an initial, approximate bounding box.
Follow these steps to achieve success with the Auto Annotate Polygon tool:
- Include all parts of the object of interest.
- Exclude overlapping object instances and other objects as much as possible.
- Keep the bounding box tight to the borders of the object.
- Use click to include/exclude to refine the automatically-generated polygon by instantly performing local edits - include and exclude specific areas of interest.
- Further, refine the polygon by increasing or decreasing vertex density to smooth curved edges.
9. Segmentation
Segmentation labels relate to pixel-wise labels on an image and come in three common types, semantic segmentation, instance segmentation, and panoptic segmentation.
Semantic Segmentation
Label each pixel of an image with a class of what is being represented, such as a car, human, or foliage. Referred to as "dense prediction," this is a time-consuming and tedious process.
With semantic segmentation, you do not distinguish between separate objects of the same class (see instance segmentation for this).

Instance segmentation
Label every pixel of each distinct object of an image. Unlike semantic segmentation, instance segmentation distinguishes between separate objects of the same class (i.e., identifying car 1 as separate from car 2)

Panoptic Segmentation
Panoptic segmentation is the combination of instance segmentation and semantic segmentation. Each point in an image is assigned a class label (semantic segmentation) AND an instance label (instance segmentation). Each instance can represent distinct objects such as cars, people, or regions such as the road or sky. Panoptic segmentation provides more context than instance segmentation and is more detailed than semantic segmentation, making them useful for deeper scene understanding.
Typical Segmentation Applications
- Autonomous Vehicles and Robotics: Identify pedestrians, cars, trees
- Medical Diagnostic imaging: tumors, abscesses in diagnostic imaging
- Clothing: Fashion retail
Best Practices
- Carefully trace the outlines of each shape to ensure that all pixels of each object are labeled.
- Use ML-assisted tooling like the boundary tool to quickly segment borders and objects of interest.
- After segmenting borders, use the flood fill tool to fill in and complete segmentation masks quickly.
- Use active tools like Autosegment to increase the efficiency and accuracy of your labelers.
Explore Coco-Stuff on Nucleus for a large collection of data with segmentation labels!
10. Special considerations for Video Labeling
You can apply many of the same labels to images and videos, but there are some special considerations for video labeling.
Temporal linking of labels
Video annotations add the dimension of time to the data fed to your models, so your model needs a way to understand related objects and labels between frames.
Manually tracking objects and adding labels through many video frames is time and resource intensive. You can leverage several techniques to increase the efficiency and accuracy of temporally linked labels.
- First, you can leverage video Interpolation to interpolate between frames of your video to smooth out the images and labels to make it easier to track labels through the video.
- You should also look for tools that automatically duplicate annotations between frames, minimizing human intervention needed to correct for misaligned labels.
- If you are working with videos, ensure that the tools you are using can handle the storage capacity of the video file and can stitch together hour-long videos so that you retain the same context no matter how long the video.
In videos, objects may leave the camera view and may return at a later time. Leverage tools that enable you to track these objects automatically or make sure to remember to annotate these objects with the same unique ids.
Multimodal
Multimodal machine learning attempts to understand the world from multiple, related modalities such as 2D images, audio, and text.
Combining multiple labeling types such as human keypoints, bounding boxes, and transcribed audio with entity recognition all connected in rich scenes.
Typical Multimodal Applications
- AR/VR full scene understanding Video/GIF/Image (Object Detection + Human Keypoints + Audio Transcription + Entity Recognition)
- Sentiment analysis by combining video gestures and voice data
Best Practices
- Incorporate temporal linking to ensure that models fully understand the entire breadth of each scene.
- Identify which modalities are best suited for your application. For instance, if you are working on sentiment analysis for AR/VR applications, you will want to consider not only 2D video object or human keypoint labels but also audio transcription and entity recognition in addition to sentiment classification so that you have a rich understanding of the entire scene and how the individuals in the scene contribute towards a particular sentiment (i.e., if the person is yelling and gesturing wildly you can determine the sentiment is "upset").
- Include Human in the loop to ensure consistency across modalities. Assign complex scenes to only the most experienced taskers.
Synthetic Data
Synthetic data is digitally-generated data that mimics real-world data. This data is often generated from artists using graphics computer tooling or generated programmatically from models like Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), or Neural Radiance Fields (NERFs). Synthetic data includes perfect ground truth labels automatically, without requiring additional human intervention to label the data.

Typical Synthetic Data Applications
- Digital Humans for autonomous vehicles and robotics, particularly in long-tail edge cases such as pedestrians walking on shoulders
- Digital humans for fitness and health applications. Getting enough real-world human pose data is difficult and expensive, whereas synthetic data is relatively easy to generate and cheaper.
- Manufacturing defect detection
Best Practices
- If using a mix of Synthetic and real-world data, ensure that the labels of your real data are as accurate as possible. Synthetic data generates perfectly accurate labels, so any label inaccuracies in your real data will degrade your model's predictive capabilities.
- Leverage synthetic data for data that is difficult to collect due to privacy concerns, rare edge cases, to avoid bias, or prohibitively expensive data collection and labeling methods.
- Integrate synthetic data with your existing data pipelines to maximize your ROI
- Curate your data using best-in-class tools to ensure that you are surfacing the edge cases for which you need more data. Ideally, your dataset curation tool will also integrate into your labeling pipelines.
NLP Data Labeling
Labeling text enables natural language processing algorithms to understand, interact with, and generate text for various applications ranging from chatbots to machine translation to product review sentiment analysis.
Like computer vision, there is a wide variety of text label types, and we will cover the most common labels in this guide.
1. Part of Speech Tagging (POS)
Categorizing words in a text (corpus) with a particular part of speech depending on the word's definition and context. This basic tagging enables machine learning models to understand natural language better.
Labeling parts of speech enables chatbots and virtual assistants to have more relevant conversations and better understand the content with which it interacts.

quote source: Scale Zeitgeist, Eric Schmidt, Schmidt Ventures
2. Named Entity Recognition (NER)
Named Entity recognition is similar to speech tagging but focuses on classifying text into predefined categories such as person names, organizations, locations, and time expressions. Linking entities to establish contextual relationships (i.e., Ada Lovelace is the child of Lord and Lady Byron) adds another layer of depth to your model's understanding of the text.

Applications
- Improve search terms
- Ad serving models
- Identify terms in customer interactions (i.e., support threads, chatbots, social media posts) to map to specific topics, brands, etc.
3. Classification
Classify text into predefined categories, such as customer sentiment or chatbots to accurately monitor brand equity, trends, etc.

Applications
- Customer sentiment
- GPT-3 Fine Tuning
- Intent on social media or chatbots
- Active monitoring of brand equity
4. Audio
Transcribe audio data into text for natural language models to make sense of the data. In this case, the text data becomes the label for the audio data. Add further depth to text data with named entity recognition or classification.

5. Best Practices for labeling Text
- Use native speakers, ideally those with a cultural understanding that mirrors the source of the text.
- Provide clear instructions on the parts of speech to be labeled and train your labelers on the task.
- Set up benchmark tasks and build a consensus pipeline to ensure quality and avoid bias.
- Leverage rule-based tagging/heuristics to automatically label known named entities (i.e., "Eiffel Tower") and combine this with humans in the loop to improve efficiency and avoid subtle errors for critical cases.
- Deduplicate data to reduce labeling overhead.
- Leverage native speakers and labelers with relevant cultural experience to your use case to avoid confusion around subtle ambiguities in language. For example, Greeks will associate the color "purple" with sadness. At the same time, those from China and Germany will consider purple emotionally ambivalent, and those from the UK may think purple to be positive.
Conclusion
Hopefully, you have found this guide helpful as you progress in your understanding of machine learning and that you can apply these practical insights to improve your data labeling pipelines. We wanted to share the best practices we learned from providing billions of annotations to some of the largest companies in the world.
As the machine learning market grows, there is an ever greater need for high-quality data and labels. We suggest you take a holistic approach to machine learning, from identifying your core business challenges to collecting and labeling your data.
This guide aims to equip you with the knowledge you need to set up high-quality data annotation pipelines. If you are using your own labeling workforce, check out Scale Studio, which offers a best-in-class annotation infrastructure built by expert annotators. Alternatively, Scale Rapid provides an easy way to offload your data labeling and ramp to production volumes if you need labeling resources. If you have any other questions, please reach out, and we will be happy to help.
Additional Resources
1. Scale Studio
Scale's labeling infrastructure, just bring your own labeling workforce.
2. Scale Rapid
Scale up to production workloads with Scale's expert labeling workforce.
Curate your data and identify opportunities to improve annotations.
Dive deeper into the importance of human annotations for deep learning models in this paper written by Scale's own Zeyad Emam, Andrew Kondrich, Sasha Harrison, Felix Lau, Yushi Wang, Aerin Kim, and Elliot Branson.
5. How to Set Up Data Annotation Pipelines
Important factors to consider when setting up you data annotation pipelines, from tooling to automation approaches.