by Matthew Siegel, Kilian Butler, Advait Gosai, Utkarsh Tyagi and Gokul Prabhakaran

Scale has spent years helping build the data infrastructure behind today's language models. Recently, we've been working with teams tackling a similarly difficult problem: teaching AI to understand and generate human speech. While text models could learn from the entire internet, there's no equivalent training corpus for voice—that data has to be purpose-built. Here's why we think this challenge is worth solving.
Large Language Models have demonstrated that text is the universal interface, in large part because it is the native language of computing. Code, web browsing, and even chess can be reduced to sequences of text. Since the advent of the Transformer, these models have moved beyond the Turing Test to excel at evaluations once considered the frontier of human ability, such as Humanity’s Last Exam.
Yet the physical world we inhabit is just beginning to be transformed by AI. This new era will be guided by agents that interact with their surroundings as we do: through voice and vision. This requires models that can discern the rich, nuanced data layered within human speech, far beyond the words themselves. The problem is that this data doesn't exist in a prepackaged library. While LLMs could be trained on the enormous corpus of the public internet, no such equivalent exists for the millions of hours of labeled, diverse, emotional speech needed to teach an AI to truly listen.
The human voice is our original interface. Before the GUI or the printing press, civilization was built on the oral tradition. Central to our prophecies of the future, we have long projected that our machines will talk naturally. As Anna Piñol and Valerie Osband Mahoney of NFX write, “Every time we imagine a computer in science fiction it has a voice. Jarvis. C-3PO. Samantha from Her. Humanoid or not, our idea of a futuristic machine is one that speaks to us.” As an information-dense medium, a single spoken phrase can contain layers of meaning absent from its written counterpart. Sarcasm, misdirection, and humor can all be encoded into the same sequence of words through paralinguistic elements like tone, prosody, and inflection.
While the vision for voice AI is clear, the technology is still in its early stages. The primary technical hurdle is achieving real-time conversation that feels as fluid as authentic human speech. The industry has converged on a key threshold for this: sub-300ms latency. Reaching this speed is prompting a shift toward new, natively Speech-to-Speech (S2S) models.
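To make the 300ms threshold concrete, here is a rough back-of-the-envelope latency budget. The stage names and millisecond figures below are illustrative assumptions, not measured numbers from any particular system; they show why a cascaded ASR → LLM → TTS pipeline struggles to stay under budget while a natively S2S model has more headroom.

```python
# Hypothetical latency budget for a voice agent (illustrative numbers only;
# real figures vary widely by model, hardware, and network conditions).
CASCADED_MS = {
    "vad_endpointing": 100,   # confirming the user has stopped speaking
    "asr": 150,               # speech-to-text transcription
    "llm_first_token": 200,   # text model time-to-first-token
    "tts_first_audio": 120,   # text-to-speech time-to-first-audio
}

S2S_MS = {
    "vad_endpointing": 100,   # same endpointing cost
    "s2s_first_audio": 150,   # a single model emits audio directly
}

def total_latency_ms(budget: dict) -> int:
    """Sum the per-stage latencies to get time-to-first-audible-response."""
    return sum(budget.values())

if __name__ == "__main__":
    print("cascaded:", total_latency_ms(CASCADED_MS), "ms")  # 570 ms, over budget
    print("s2s:     ", total_latency_ms(S2S_MS), "ms")       # 250 ms, under 300 ms
```

Under these assumed numbers, the cascaded pipeline nearly doubles the target; collapsing the stages into one model is what makes the sub-300ms goal plausible.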
To understand the data researchers need, we start by asking what products are being built for end users and work backwards. The ambition for voice AI is reshaping both our personal and professional lives.
The holy grail is a persistent voice companion with memory and personality, acting as a true digital counterpart. This vision extends to more specific applications:
In business, reliability, steerability, and observability are critical. Trusting an agent in production requires robust knowledge retrieval and an understanding of specific professional domains.
Many of these future applications are poised to advance far faster, but all are hamstrung by the same fundamental challenge: the data bottleneck.
The breakneck improvement of LLMs was facilitated by a simple foundation: the entire public internet served as a training library for text models. As Dr. Fei-Fei Li has noted, this vast corpus of data fueled rapid progress. But for voice, no such public repository has ever existed. So we built it. Scale’s dynamic, high-quality audio data platform is supercharged with:
Building a state-of-the-art voice model requires a structured approach to data, moving from a broad foundation to specialized, high-quality refinement. We provide the data for each phase of this process.
Sutton's 'Bitter Lesson' teaches that massive compute applied to general methods is the winning strategy. Accordingly, the foundation of any great voice model is a massive and diverse dataset. We begin with vast amounts of unscripted conversational audio designed to teach the model the fundamental patterns of real-world human speech. This data captures:
While pre-training builds the foundation, SFT data defines the upper limit of a model's performance. We produce this data at pristine studio quality, with every hour engineered by seasoned audio professionals through multiple stages of proprietary post-processing.
“Every single contributor is auditioned and then checked week after week to ensure consistent, high-quality audio standards are maintained. Utilizing proprietary workflows... we keep our audio quality at the same high level for all contributors worldwide, regardless of their equipment or recording environment.”
- Jacob C., Audio Engineer
Our SFT data is designed to teach specific, high-value skills:
“When you train an AI for voice work, you’re not just performing—you’re teaching. You’re helping the model learn how to interpret emotion, tone, timing, and personality, making it more relatable and emotionally intelligent.”
- Michelle E., Voice Actor

Figure 1: An example of a full duplex, multi-turn conversation. This level of detail, capturing interruptions and precise timestamps, is crucial for training models on the complex dynamics of real-world human speech.
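A full-duplex transcript like the one in Figure 1 can be represented as a list of time-stamped segments, where overlapping intervals mark interruptions. The sketch below uses a minimal, hypothetical record format; the field names and the example utterances are illustrative, not Scale's actual schema.

```python
# Minimal, hypothetical schema for a full-duplex, multi-turn conversation.
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str     # "user" or "agent"
    start_s: float   # segment start time, seconds
    end_s: float     # segment end time, seconds
    text: str

def overlaps(a: Segment, b: Segment) -> bool:
    """True when two segments overlap in time, i.e. one party barges in."""
    return a.start_s < b.end_s and b.start_s < a.end_s

conversation = [
    Segment("user",  0.00, 2.40, "Can you move my meeting to--"),
    Segment("agent", 2.10, 3.00, "To Thursday?"),   # barge-in: overlaps the user
    Segment("user",  3.10, 4.00, "Yes, Thursday."),
]

# Detect every interrupting pair by checking later segments for overlap.
interruptions = [
    (a.speaker, b.speaker)
    for i, a in enumerate(conversation)
    for b in conversation[i + 1:]
    if overlaps(a, b)
]
print(interruptions)  # [('user', 'agent')]
```

Precise start/end timestamps are what let a model learn when an interruption is cooperative (finishing a sentence) versus when it should yield the floor.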

Figure 2: Detailed phonetic transcription used to fine-tune a model’s pronunciation. This ensures the AI can achieve clear, consistent, and human-like pronunciations for any word.
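One common way to encode the kind of phonetic detail shown in Figure 2 is a pronunciation lexicon mapping words to phoneme sequences, as in ARPAbet notation (digits mark vowel stress). The tiny lexicon below is a toy example with assumed entries; production lexicons cover far more vocabulary and dialectal variants.

```python
# Toy pronunciation lexicon in ARPAbet notation (entries are illustrative).
LEXICON = {
    "data":  ["D", "EY1", "T", "AH0"],
    "scale": ["S", "K", "EY1", "L"],
    "voice": ["V", "OY1", "S"],
}

def phonetic(sentence: str) -> list[list[str]]:
    """Map each word to its phoneme sequence. Unknown words fall back to
    their spelling, flagging them for human phonetic annotation."""
    out = []
    for word in sentence.lower().split():
        out.append(LEXICON.get(word, list(word.upper())))
    return out

print(phonetic("scale voice data"))
# [['S', 'K', 'EY1', 'L'], ['V', 'OY1', 'S'], ['D', 'EY1', 'T', 'AH0']]
```

The fallback path is the interesting part in practice: out-of-lexicon words (names, brands, jargon) are exactly where human annotators add the most value.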
Once a foundational model is trained, it must be rigorously tested, refined, and aligned with human preferences to be useful and safe. This process involves a deep partnership between AI and human evaluators across several key areas.
Human Preference Scoring (RLHF)
Real-World Robustness Testing
Advanced Reward Modeling
Adversarial Testing & Red Teaming
Agentic Tool Use
Multimodal Perception (Voice + Vision)
Expressive & Creative Generation
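For the human preference scoring and reward modeling areas above, the raw material is typically a preference pair: one prompt, two candidate audio responses, and rater judgments. The record below is a hypothetical sketch of such a datum; the field names, file paths, and rubric axes are assumptions for illustration, not a real annotation spec.

```python
# Hypothetical RLHF preference-pair record for speech output.
preference_pair = {
    "prompt_audio": "prompts/example_request.wav",   # illustrative path
    "response_a": "responses/a.wav",
    "response_b": "responses/b.wav",
    "winner": "a",                                   # overall rater choice
    "rubric": {                                      # per-axis 1-5 ratings
        "naturalness":  {"a": 5, "b": 3},
        "helpfulness":  {"a": 4, "b": 4},
        "latency_feel": {"a": 4, "b": 2},
    },
}

def margin(pair: dict, axis: str) -> int:
    """Score gap between responses on one rubric axis. A reward model can be
    trained so its score difference agrees with the sign of these margins."""
    ratings = pair["rubric"][axis]
    return ratings["a"] - ratings["b"]

print(margin(preference_pair, "naturalness"))   # 2: a clearly preferred
print(margin(preference_pair, "helpfulness"))   # 0: tied on this axis
```

Per-axis rubrics matter for voice in particular: a response can win on naturalness while tying on helpfulness, and a scalar "winner" label alone would hide that distinction from the reward model.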
The journey from text-based models to truly interactive, voice-driven AI marks a fundamental shift in how we will interact with technology. While the challenges of data complexity are immense, the path forward is no longer theoretical. The benchmarks are being set, the technical frontiers are being explored, and the applications that will define the next decades are being designed today. The future of human-computer interaction will be a conversation.
Core Contributors: Gokul Prabhakaran, Rodrigo Belfiore, Matias Devecchi Di Bella, Gorka Ludlow, Michael Chen, Brandon Kyung, George Varelas, Gaurang Goel, Niveditha Patil, Olivia Fu, Ali Khan, Daniel Quigley, Taku Yamada, Alejo Jimenez Chaidez, Utkarsh Tyagi, Advait Gosai & Kilian Butler.