
Scale has spent years helping build the data infrastructure behind today's language models. Recently, we've been working with teams tackling a similarly difficult problem: teaching AI to understand and generate human speech. While text models could learn from the entire internet, there's no equivalent training corpus for voice—that data has to be purpose-built. Here's why we think this challenge is worth solving.
Large Language Models have demonstrated that text is the universal interface, in large part because it is the native language of computing. Code, web browsing, and even chess can be reduced to sequences of text. Since the advent of the Transformer, these models have moved beyond the Turing Test to excel at evaluations once considered the frontier of human ability, such as Humanity’s Last Exam.
Yet the physical world we inhabit is just beginning to be transformed by AI. This new era will be guided by agents that interact with their surroundings as we do: through voice and vision. This requires models that can discern the rich, nuanced data layered within human speech, far beyond the words themselves. The problem is that this data doesn't exist in a prepackaged library. While LLMs could be trained on the enormous corpus of the public internet, no such equivalent exists for the millions of hours of labeled, diverse, emotional speech needed to teach an AI to truly listen.
The human voice is our original interface. Before the GUI or the printing press, civilization was built on the oral tradition. And in our visions of the future, we have long imagined machines that talk to us naturally. As Anna Piñol and Valerie Osband Mahoney of NFX write, “Every time we imagine a computer in science fiction it has a voice. Jarvis. C-3PO. Samantha from Her. Humanoid or not, our idea of a futuristic machine is one that speaks to us.” Speech is an information-dense medium: a single spoken phrase can contain layers of meaning absent from its written counterpart. Sarcasm, misdirection, and humor can all be encoded into the same sequence of words through paralinguistic elements like tone, prosody, and inflection.
While the vision for voice AI is clear, the technology is still in its early stages. The primary technical hurdle is achieving real-time conversation that feels as natural as authentic human speech. The industry has converged on a key threshold for this: sub-300ms latency. Reaching this speed is prompting a shift toward new, natively Speech-to-Speech (S2S) models.
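To make the 300ms target concrete, here is a minimal sketch of the latency budget that pushes teams toward native S2S architectures. The component timings are illustrative assumptions, not benchmarks of any particular system.

```python
# Hedged sketch: illustrative latency budgets (all numbers are assumptions, not measurements).

# A cascaded pipeline pays for each stage before the first audio sample comes back.
cascaded_ms = {
    "automatic_speech_recognition": 150,  # assumed time to a final transcript
    "llm_first_token": 200,               # assumed time to the first text token
    "text_to_speech_first_audio": 100,    # assumed time to the first synthesized sample
}

# A native speech-to-speech model emits audio directly from audio input.
s2s_ms = {
    "speech_to_speech_first_audio": 220,  # assumed end-to-end time to the first sample
}

def time_to_first_audio(stages: dict[str, int]) -> int:
    """Total time before the user hears anything, ignoring network overhead."""
    return sum(stages.values())

target_ms = 300
for name, stages in [("cascaded", cascaded_ms), ("native S2S", s2s_ms)]:
    total = time_to_first_audio(stages)
    verdict = "meets" if total <= target_ms else "misses"
    print(f"{name}: {total} ms ({verdict} the {target_ms} ms target)")
```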
To understand the data researchers need, we start by asking what products are being built for end users and work backwards. The ambition for voice AI is reshaping both our personal and professional lives.
The holy grail is a persistent voice companion with memory and personality, acting as a true digital counterpart. This vision extends to more specific applications:
In business, reliability, steerability, and observability are critical. Trusting an agent in production requires robust knowledge retrieval and an understanding of specific professional domains.
Many of these applications could advance far faster, but all are hamstrung by the same fundamental challenge: the data bottleneck.
The breakneck improvement of LLMs was facilitated by a simple foundation: the entire public internet served as a training library for text models. As Dr. Fei-Fei Li has noted, this vast corpus of data fueled rapid progress. But for voice, no such public repository has ever existed. So we built it: Scale’s dynamic, high-quality audio data platform.
Building a state-of-the-art voice model requires a structured approach to data, moving from a broad foundation to specialized, high-quality refinement. We provide the data for each phase of this process.
Sutton's 'Bitter Lesson' teaches that massive compute applied to general methods is the winning strategy. Accordingly, the foundation of any great voice model is a massive and diverse dataset. We begin with vast amounts of unscripted conversational audio designed to teach the model the fundamental patterns of real-world human speech.
While pre-training builds the foundation, SFT data defines the upper limit of a model's performance. We produce this data at pristine studio quality, with every hour engineered by seasoned audio professionals through multiple stages of proprietary post-processing.
“Every single contributor is auditioned and then checked week after week to ensure consistent, high-quality audio standards are maintained. Utilizing proprietary workflows... we keep our audio quality at the same high level for all contributors worldwide, regardless of their equipment or recording environment.”
- Jacob C., Audio Engineer
Our SFT data is designed to teach specific, high-value skills:
Topical Expertise: Data from expert conversations and scripted business meetings to teach the model deep knowledge about specific domains like law, medicine, or finance.
Linguistic Precision: A diverse range of accents, dialects, and even tongue twisters to ensure the model's pronunciation is clear and consistent across any challenge (see Figure 2).
Technical Transcripts: For training strategies that require them, we provide high-fidelity, word-level transcripts and detailed phonetic data to ensure optimal performance and accuracy.
Emotions and Paralinguistics: Scenarios designed to teach the model how to understand and express nuanced emotions for applications like de-escalating a support call, making a persuasive sale, or roleplaying. This requires capturing the dynamics of real-world, multi-turn conversations, complete with interruptions and overlapping speech (see Figure 1).
“When you train an AI for voice work, you’re not just performing—you’re teaching. You’re helping the model learn how to interpret emotion, tone, timing, and personality, making it more relatable and emotionally intelligent.”
- Michelle E., Voice Actor

Figure 1: An example of a full duplex, multi-turn conversation. This level of detail, capturing interruptions and precise timestamps, is crucial for training models on the complex dynamics of real-world human speech.
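As a rough illustration of the kind of record Figure 1 describes, the sketch below shows one hypothetical schema for a full duplex, multi-turn exchange, with per-turn timestamps and overlapping speech. The field names are ours for illustration, not a Scale data format.

```python
# Hypothetical schema for one full duplex conversation segment (field names are illustrative).
conversation = {
    "audio_file": "call_0042.wav",  # assumed source recording
    "sample_rate_hz": 16000,
    "turns": [
        {"speaker": "agent",  "start_s": 0.00, "end_s": 2.35,
         "text": "Thanks for calling, how can I help?"},
        {"speaker": "caller", "start_s": 1.90, "end_s": 4.10,  # begins before the agent finishes
         "text": "Yeah, hi, my order never arrived."},
    ],
}

def overlapping_turns(turns):
    """Yield consecutive turn pairs whose time spans overlap, i.e. interruptions."""
    for a, b in zip(turns, turns[1:]):
        if b["start_s"] < a["end_s"]:
            yield a, b

for a, b in overlapping_turns(conversation["turns"]):
    print(f'{b["speaker"]} overlaps {a["speaker"]} by {a["end_s"] - b["start_s"]:.2f} s')
```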

Figure 2: Detailed phonetic transcription used to fine-tune a model’s pronunciation. This ensures the AI can achieve clear, consistent, and human-like pronunciations for any word.
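The word-level phonetic detail shown in Figure 2 can be represented as a simple alignment structure. The sketch below is a hypothetical example using ARPABET-style phone labels; the format and field names are assumptions for illustration.

```python
# Hypothetical word-level transcript entry with phone-level timing (illustrative only).
word_alignment = {
    "word": "data",
    "start_s": 3.12,
    "end_s": 3.48,
    "phones": [
        {"phone": "D",  "start_s": 3.12, "end_s": 3.20},
        {"phone": "EY", "start_s": 3.20, "end_s": 3.33},
        {"phone": "T",  "start_s": 3.33, "end_s": 3.39},
        {"phone": "AH", "start_s": 3.39, "end_s": 3.48},
    ],
}

# Example sanity check: phone spans should tile the word span without gaps or overlaps.
spans = word_alignment["phones"]
assert spans[0]["start_s"] == word_alignment["start_s"]
assert spans[-1]["end_s"] == word_alignment["end_s"]
assert all(a["end_s"] == b["start_s"] for a, b in zip(spans, spans[1:]))
print("alignment is contiguous")
```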
Once a foundational model is trained, it must be rigorously tested, refined, and aligned with human preferences to be useful and safe. This process involves a deep partnership between AI and human evaluators across several key areas.
Human Preference Scoring (RLHF)
For subjective qualities like “naturalness” or “helpfulness,” human judgment is the only ground truth. Our process involves direct conversational evaluation, where human testers rank model responses to create the preference data that powers Reinforcement Learning from Human Feedback (RLHF). This feedback is captured with standard instruments such as Likert ratings and the Mean Opinion Score (MOS).
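As a minimal sketch of how ranked evaluations become training signal, the code below converts one rater's ranking into pairwise preference records (the format most RLHF pipelines consume) and reduces 1-5 Likert ratings to a Mean Opinion Score. The record layout is illustrative, not an internal schema.

```python
from itertools import combinations
from statistics import mean

# One evaluation item: candidate voice responses and a human ranking, best first.
ranking = ["resp_b", "resp_a", "resp_c"]  # rater judged resp_b most natural

# Pairwise preferences: every higher-ranked response is "chosen" over every lower-ranked one.
preference_pairs = [
    {"chosen": winner, "rejected": loser}
    for winner, loser in combinations(ranking, 2)
]
print(preference_pairs)

# Likert ratings (1-5) for "naturalness" from several raters, reduced to a Mean Opinion Score.
likert_ratings = {"resp_a": [4, 3, 4], "resp_b": [5, 4, 5], "resp_c": [2, 3, 2]}
mos = {resp: round(mean(scores), 2) for resp, scores in likert_ratings.items()}
print(mos)
```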
Real-World Robustness Testing
A model that performs well in a lab can fail in a noisy cafe. We ensure real-world robustness by collecting data “in-the-wild” across a huge diversity of verified environments (meetings, public transit, etc.) and devices (phones, earphones, wearables). This is made possible by our large, distributed contributor base, which provides the scale necessary for statistically significant evaluation.
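One way to read evaluation across verified environments is to stratify results by recording condition before aggregating, so a strong lab score cannot hide a weak in-the-wild one. The sketch below uses made-up word error rates; the environment labels and numbers are assumptions.

```python
from collections import defaultdict
from statistics import mean

# Made-up per-utterance word error rates, tagged with the verified environment and device.
results = [
    {"environment": "quiet_room",     "device": "phone",   "wer": 0.04},
    {"environment": "cafe",           "device": "earbuds", "wer": 0.11},
    {"environment": "public_transit", "device": "phone",   "wer": 0.17},
    {"environment": "cafe",           "device": "phone",   "wer": 0.09},
]

# Aggregate per environment rather than over the whole pool.
by_environment = defaultdict(list)
for r in results:
    by_environment[r["environment"]].append(r["wer"])

for env, wers in sorted(by_environment.items()):
    print(f"{env}: mean WER {mean(wers):.2f} over {len(wers)} utterances")
```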
Advanced Reward Modeling
Simple preference data is powerful, but complex tasks require more sophisticated reward signals. A simple reward signal can be enough to create surprisingly intelligent agents; the core challenge in a subjective domain like voice is constructing that signal. Building on the success of foundational algorithms like Proximal Policy Optimization (PPO), newer methods like Group Relative Policy Optimization (GRPO) show a promising path forward. However, their efficacy hinges on the quality of the reward model itself. The key is the craftsmanship in constructing complex, multi-dimensional rubrics that can effectively guide the model, especially for business cases where we can tie rewards to verified outcomes like a resolved support ticket or a completed sale.
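To make the reward-modeling point concrete, here is a minimal sketch of the two pieces: a rubric that scores a response along several weighted dimensions, and GRPO-style group-relative advantages, where each sampled response is normalized against the mean and spread of its own sampling group. The rubric dimensions and weights are illustrative assumptions.

```python
from statistics import mean, pstdev

# Illustrative multi-dimensional rubric: each dimension is scored 0-1 by a judge or verifier.
RUBRIC_WEIGHTS = {"helpfulness": 0.4, "tone": 0.3, "task_resolved": 0.3}  # assumed weights

def rubric_reward(scores: dict[str, float]) -> float:
    """Collapse per-dimension rubric scores into a single scalar reward."""
    return sum(RUBRIC_WEIGHTS[dim] * scores[dim] for dim in RUBRIC_WEIGHTS)

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize each reward against its sampling group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Four sampled responses to the same voice prompt, scored against the rubric.
group_scores = [
    {"helpfulness": 0.9, "tone": 0.8, "task_resolved": 1.0},
    {"helpfulness": 0.6, "tone": 0.9, "task_resolved": 0.0},
    {"helpfulness": 0.4, "tone": 0.5, "task_resolved": 0.0},
    {"helpfulness": 0.8, "tone": 0.7, "task_resolved": 1.0},
]
rewards = [rubric_reward(s) for s in group_scores]
print([round(a, 2) for a in group_relative_advantages(rewards)])
```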
Adversarial Testing & Red Teaming
The safety and alignment requirements for voice AI are far more stringent than for text. We have novel frameworks to protect against adversarial attacks like voice spoofing, phishing, and confidence scams. Our red teams vet models to ensure they don’t leak sensitive information or provide harmful advice, ensuring safety at the model, product, and system levels.
Agentic Tool Use
The ultimate goal of a voice agent is to move from conversation to action. We train models to reliably use tools, call APIs, and execute tasks based on voice commands. This is achieved through sophisticated techniques like Process Supervision, where the model is rewarded for following the correct steps in a complex task (not just the final outcome), and Verified Rewards, which ties model success to concrete events like a successfully booked flight or a completed order.
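A hedged sketch of how process supervision and verified rewards can combine for a tool-using voice agent: each intermediate step earns partial credit, and the final reward depends on a verifiable outcome (here, a hypothetical booking confirmation). The step names, weighting, and checker are illustrative, not a description of our production setup.

```python
# Illustrative trajectory of a voice agent booking a flight through tool calls.
trajectory = [
    {"step": "parse_request",    "correct": True},   # judged by a process reward model or rubric
    {"step": "call_search_api",  "correct": True},
    {"step": "select_itinerary", "correct": True},
    {"step": "call_booking_api", "correct": False},  # wrong passenger name passed to the tool
]

def booking_confirmed(traj) -> bool:
    """Hypothetical verifier: did the trajectory end in a confirmed booking?"""
    return all(step["correct"] for step in traj)

def trajectory_reward(traj, process_weight: float = 0.5) -> float:
    """Blend step-level process reward with a verified outcome reward."""
    process_reward = sum(step["correct"] for step in traj) / len(traj)
    outcome_reward = 1.0 if booking_confirmed(traj) else 0.0
    return process_weight * process_reward + (1 - process_weight) * outcome_reward

print(trajectory_reward(trajectory))  # partial credit for correct steps despite the failed booking
```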
Multimodal Perception (Voice + Vision)
As companies like Meta, Snap, and Sesame develop AI-powered glasses, voice is emerging as the primary user interface for interacting with the physical world. True understanding in this context requires multimodality. We merge our industry-leading video and image annotation capabilities with our audio data engine to train models that can see what you see and hear what you hear, enabling truly contextual, real-world assistants.
Expressive & Creative Generation
The same underlying technology that masters conversational speech can be extended to all forms of generative audio. By training models on tagged music samples, sound effects, and other rich audio, we enable agents to express themselves in a fuller medium, whether by generating music, a sound effect, or even singing.
The journey from text-based models to truly interactive, voice-driven AI marks a fundamental shift in how we will interact with technology. While the challenges of data complexity are immense, the path forward is no longer theoretical. The benchmarks are being set, the technical frontiers are being explored, and the applications that will define the next decades are being designed today. The future of human-computer interaction will be a conversation.
Core Contributors: Gokul Prabhakaran, Rodrigo Belfiore, Matias Devecchi Di Bella, Gorka Ludlow, Michael Chen, Brandon Kyung, George Varelas, Gaurang Goel, Niveditha Patil, Olivia Fu, Ali Khan, Daniel Quigley, Taku Yamada, Alejo Jimenez Chaidez, Utkarsh Tyagi, Advait Gosai & Kilian Butler.