AI Doesn’t Live in Text Alone

Scale has spent years helping build the data infrastructure behind today's language models. Recently, we've been working with teams tackling a similarly difficult problem: teaching AI to understand and generate human speech. While text models could learn from the entire internet, there's no equivalent training corpus for voice—that data has to be purpose-built. Here's why we think this challenge is worth solving.
Large Language Models have demonstrated that text is the universal interface, in large part because it is the native language of computing. Code, web browsing, and even chess can be reduced to sequences of text. Since the advent of the Transformer, these models have moved beyond the Turing Test to excel at evaluations once considered the frontier of human ability, such as Humanity’s Last Exam.
Yet the physical world we inhabit is just beginning to be transformed by AI. This new era will be guided by agents that interact with their surroundings as we do: through voice and vision. This requires models that can discern the rich, nuanced data layered within human speech, far beyond the words themselves. The problem is that this data doesn't exist in a prepackaged library. While LLMs could be trained on the enormous corpus of the public internet, no such equivalent exists for the millions of hours of labeled, diverse, emotional speech needed to teach an AI to truly listen.
Humanity’s First Protocol
The human voice is our original interface. Before the GUI or the printing press, civilization was built on the oral tradition. We have long projected that the machines of our future will talk to us naturally. As Anna Piñol and Valerie Osband Mahoney of NFX write, “Every time we imagine a computer in science fiction it has a voice. Jarvis. C-3PO. Samantha from Her. Humanoid or not, our idea of a futuristic machine is one that speaks to us.” Speech is an information-dense medium: a single spoken phrase can contain layers of meaning absent from its written counterpart. Sarcasm, misdirection, and humor can all be encoded into the same sequence of words through paralinguistic elements like tone, prosody, and inflection.
Mapping the Frontier of Voice
While the vision for voice AI is clear, the technology is still in its early stages. The primary technical hurdle is achieving real-time conversation that feels as natural as authentic human speech. The industry has converged on a key threshold for this: sub-300ms latency. Reaching this speed is prompting a shift toward new, natively Speech-to-Speech (S2S) models.
- Qwen Omni approximates native S2S with a Speech-to-Text-to-Speech pipeline that uses “token streaming”, vocalizing each token as it is predicted. However, this approach sacrifices some naturalness for speed.
- Hume has pioneered Semantic Space Theory, categorizing the layered emotional tones of utterances. Their Empathic Voice Interface models can detect and express 53 dimensions of emotion with prosody.
- OpenAI’s new Speech-to-Speech model is noteworthy for Function Calling and Internet search informing responses. They co-mingle a "real time conversational model" and "a slower smarter model" for complex tasks.
- Eleven Labs is powering real-time, multilingual voice assistants, and is most notable for a well-stocked developer platform offering steerable TTS, S2S, Cloning, and Voice Design.
- NVIDIA has built a flexible sound machine, Fugatto, that can seamlessly mash up speech, music, and soundscapes. It lets users shape how voices and tunes morph in real time while inventing entirely new sounds.
- Sesame’s groundbreaking work employs a multimodal backbone with a smaller audio decoder, simultaneously leveraging semantic and acoustic data via residual vector quantization tokens (see the sketch after this list).
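To make the residual vector quantization idea concrete, here is a minimal sketch of how a single acoustic frame can be encoded into a short stack of discrete tokens. The codebook sizes and frame dimension are illustrative assumptions, not Sesame’s actual configuration.

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Residual vector quantization: each codebook quantizes what the
    previous stages failed to capture, yielding a short stack of token
    ids per audio frame (codebook shapes here are illustrative)."""
    residual = frame.copy()
    token_ids = []
    for codebook in codebooks:                       # shape (num_codes, dim)
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))                  # nearest code
        token_ids.append(idx)
        residual = residual - codebook[idx]          # pass the residual on
    return token_ids

# Toy example: 3 quantizer stages over a 64-dimensional acoustic frame.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 64)) for _ in range(3)]
frame = rng.normal(size=64)
print(rvq_encode(frame, codebooks))                  # e.g. [17, 203, 88]
```

Because each stage only encodes what the previous one missed, the earliest tokens tend to carry coarse, semantic content while later ones refine acoustic detail.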
Starting with the Future
To understand the data researchers need, we start by asking what products are being built for end users and work backwards. The ambition for voice AI is reshaping both our personal and professional lives.
Consumer Applications
The holy grail is a persistent voice companion with memory and personality, acting as a true digital counterpart. This vision extends to more specific applications:
- Dynamic Tour Guides that can whisper historical secrets in a quiet museum, not just recite facts.
- Truly Interactive Podcasts where you can ask the host to elaborate on a point or skip a segment with a simple command.
- Augmented Reality that comes alive through voice, from a simple heads-up display to in-the-wild interactions with agents like Meta's Ray-Bans or Cluely’s AI assistant.
- Immersive VR and Gaming Worlds where non-player characters have unique personalities and can hold unscripted, natural conversations.
Enterprise & Business Applications
In business, reliability, steerability, and observability are critical. Trusting an agent in production requires robust knowledge retrieval and an understanding of specific professional domains.
- Vertical-Specific Agents that understand the nuances of, for example, case law, medicine, or specific sales processes. The possibilities range from rote tasks like patient intake to inspired tools like context-aware meeting assistants.
Many of these applications could advance far faster, but they are hamstrung by the same fundamental challenge: the data bottleneck.
The Data Bottleneck
The breakneck improvement of LLMs was facilitated by a simple foundation: the entire public internet served as a training library for text models. As Dr. Fei-Fei Li has noted, this vast corpus of data fueled rapid progress. But for voice, no such public repository has ever existed. So we built it. Scale’s dynamic, high-quality audio data platform is supercharged with:
- Language, Dialect, and Accent Coverage
- Highest “Studio” Recording Quality
- Versatility for Topics, Contexts, and Scripts
- Complete Metadata
- Trained Voice Actors
- Diverse Evaluators and Testers
A Data Engine for the Full Model Lifecycle
Building a state-of-the-art voice model requires a structured approach to data, moving from a broad foundation to specialized, high-quality refinement. We provide the data for each phase of this process.
Phase 1: Pre-training
Sutton's 'Bitter Lesson' teaches that massive compute applied to general methods is the winning strategy. Accordingly, the foundation of any great voice model is a massive and diverse dataset. We begin with vast amounts of unscripted conversational audio designed to teach the model the fundamental patterns of real-world human speech. This data captures:
- Real-world Dynamics: Conversations with 2, 3, 4, or even 5 speakers, complete with the natural interruptions, laughter, and disfluencies that define authentic speech.
- Global Representation: Deep coverage across a range of languages, dialects, and accents to ensure the model works for everyone.
Phase 2: Supervised Finetuning (SFT)
While pre-training builds the foundation, SFT data defines the upper limit of a model's performance. We produce this data at pristine studio quality, with every hour engineered by seasoned audio professionals through multiple stages of proprietary post-processing.
“Every single contributor is auditioned and then checked week after week to ensure consistent, high-quality audio standards are maintained. Utilizing proprietary workflows... we keep our audio quality at the same high level for all contributors worldwide, regardless of their equipment or recording environment.”
- Jacob C., Audio Engineer
Our SFT data is designed to teach specific, high-value skills:
- Topical Expertise: Data from expert conversations and scripted business meetings to teach the model deep knowledge about specific domains like law, medicine, or finance.
- Linguistic Precision: A diverse range of accents, dialects, and even tongue twisters to ensure the model's pronunciation is clear and consistent across any challenge (see Figure 2).
- Technical Transcripts: For training strategies that require them, we provide high-fidelity, word-level transcripts and detailed phonetic data to ensure optimal performance and accuracy.
- Emotions and Paralinguistics: Scenarios designed to teach the model how to understand and express nuanced emotions for applications like de-escalating a support call, making a persuasive sale, or roleplaying. This requires capturing the dynamics of real-world, multi-turn conversations, complete with interruptions and overlapping speech (see Figure 1).
“When you train an AI for voice work, you’re not just performing—you’re teaching. You’re helping the model learn how to interpret emotion, tone, timing, and personality, making it more relatable and emotionally intelligent.”
- Michelle E., Voice Actor
Figure 1: An example of a full duplex, multi-turn conversation. This level of detail, capturing interruptions and precise timestamps, is crucial for training models on the complex dynamics of real-world human speech.
Figure 2: Detailed phonetic transcription used to fine-tune a model’s pronunciation. This ensures the AI can achieve clear, consistent, and human-like pronunciations for any word.
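As a rough illustration of the structure Figures 1 and 2 describe, a full-duplex segment with word-level timestamps, overlapping speech, and paralinguistic tags might be represented as follows. This is a hypothetical format chosen for clarity, not Scale’s actual delivery schema.

```python
# Hypothetical full-duplex transcript segment (illustrative schema only):
# word-level timestamps, an overlapping interruption, and emotion tags.
segment = {
    "sample_rate_hz": 48_000,
    "speakers": ["A", "B"],
    "turns": [
        {
            "speaker": "A",
            "emotion": "calm",
            "words": [
                {"text": "So",     "start_s": 0.00, "end_s": 0.18},
                {"text": "the",    "start_s": 0.18, "end_s": 0.29},
                {"text": "refund", "start_s": 0.29, "end_s": 0.71},
            ],
        },
        {
            "speaker": "B",                 # overlaps speaker A at 0.55s
            "event": "interruption",
            "words": [
                {"text": "sorry,", "start_s": 0.55, "end_s": 0.82},
                {"text": "go",     "start_s": 0.82, "end_s": 0.95},
                {"text": "ahead",  "start_s": 0.95, "end_s": 1.20},
            ],
        },
    ],
}
```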
Phase 3: Refinement and Alignment with Human Feedback
Once a foundational model is trained, it must be rigorously tested, refined, and aligned with human preferences to be useful and safe. This process involves a deep partnership between AI and human evaluators across several key areas.
Human Preference Scoring (RLHF)
For subjective qualities like “naturalness” or “helpfulness,” human judgment is the only ground truth. Our process involves direct conversational evaluation, where human testers rank model responses to create the preference data that powers Reinforcement Learning from Human Feedback (RLHF). This feedback is captured using standard statistical metrics like Mean Opinion Score (MOS) and Likert scales.
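A minimal sketch of how such judgments might be aggregated, assuming a hypothetical rating record that pairs a 1–5 Likert score for naturalness with an optional pairwise preference:

```python
from statistics import mean

# Hypothetical rater submissions (illustrative format only): a 1-5 Likert
# score for "naturalness" plus an optional pairwise choice between two
# model responses to the same prompt.
ratings = [
    {"response_id": "a", "naturalness": 4, "preferred_over": "b"},
    {"response_id": "a", "naturalness": 5, "preferred_over": "b"},
    {"response_id": "b", "naturalness": 3, "preferred_over": None},
]

def mos(ratings, response_id, dimension):
    """Mean Opinion Score: the average Likert rating for one response."""
    scores = [r[dimension] for r in ratings if r["response_id"] == response_id]
    return mean(scores)

# Preference pairs in the (chosen, rejected) form an RLHF reward model consumes.
pairs = [(r["response_id"], r["preferred_over"])
         for r in ratings if r["preferred_over"]]

print(mos(ratings, "a", "naturalness"))   # 4.5
print(pairs)                              # [('a', 'b'), ('a', 'b')]
```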
Real-World Robustness Testing
A model that performs well in a lab can fail in a noisy cafe. We ensure real-world robustness by collecting data “in-the-wild” across a huge diversity of verified environments (meetings, public transit, etc.) and devices (phones, earphones, wearables). This is made possible by our large, distributed contributor base, which provides the scale necessary for statistically significant evaluation.
Advanced Reward Modeling
Simple preference data is powerful, and a simple reward signal can be enough to create surprisingly intelligent agents, but in a subjective domain like voice the core challenge is constructing that signal. Building on the success of foundational algorithms like Proximal Policy Optimization (PPO), newer methods like Group Relative Policy Optimization (GRPO) show a promising path forward. However, their efficacy hinges on the quality of the reward model itself. The key is the craftsmanship in constructing complex, multi-dimensional rubrics that can effectively guide the model, especially for business cases where we can tie rewards to verified outcomes like a resolved support ticket or a completed sale.
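A minimal sketch of the two pieces described above: a weighted multi-dimensional rubric collapsed into a scalar reward, and a GRPO-style advantage computed relative to the group of responses sampled for the same prompt. The rubric dimensions and weights are illustrative assumptions, not a production rubric.

```python
import numpy as np

def rubric_reward(scores, weights):
    """Collapse a multi-dimensional rubric (e.g. task completion, empathy,
    clarity) into a single scalar reward."""
    return sum(weights[k] * v for k, v in scores.items())

def group_relative_advantages(rewards):
    """GRPO-style advantage: normalize each sampled response's reward
    against the group drawn for the same prompt (zero mean, unit variance)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Illustrative rubric tied to a verified business outcome.
weights = {"resolved_ticket": 0.6, "empathy": 0.25, "clarity": 0.15}
group = [
    rubric_reward({"resolved_ticket": 1.0, "empathy": 0.7, "clarity": 0.9}, weights),
    rubric_reward({"resolved_ticket": 0.0, "empathy": 0.9, "clarity": 0.8}, weights),
    rubric_reward({"resolved_ticket": 1.0, "empathy": 0.4, "clarity": 0.6}, weights),
]
print(group_relative_advantages(group))   # higher for responses that resolve the ticket
```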
Adversarial Testing & Red Teaming
The safety and alignment requirements for voice AI are far more stringent than for text. We have novel frameworks to protect against adversarial attacks like voice spoofing, phishing, and confidence scams. Our red teams vet models to ensure they don’t leak sensitive information or provide harmful advice, ensuring safety at the model, product, and system levels.
Phase 4: Extending to Advanced Capabilities
Agentic Tool Use
The ultimate goal of a voice agent is to move from conversation to action. We train models to reliably use tools, call APIs, and execute tasks based on voice commands. This is achieved through sophisticated techniques like Process Supervision, where the model is rewarded for following the correct steps in a complex task (not just the final outcome), and Verified Rewards, which ties model success to concrete events like a successfully booked flight or a completed order.
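A minimal sketch of how process supervision and a verified outcome could be combined into a single reward for a flight-booking episode. The step names, weighting, and success check are hypothetical, chosen only to show the shape of the idea.

```python
# Hypothetical verified-reward check for a voice agent's tool trace:
# process supervision credits each correct intermediate step, and the
# verified outcome (a confirmed booking) gates the remaining reward.
EXPECTED_STEPS = ["search_flights", "select_flight", "confirm_payment"]

def score_episode(tool_calls, booking_confirmed):
    step_credit = sum(
        1.0 for expected, actual in zip(EXPECTED_STEPS, tool_calls)
        if expected == actual
    ) / len(EXPECTED_STEPS)
    outcome_credit = 1.0 if booking_confirmed else 0.0
    return 0.5 * step_credit + 0.5 * outcome_credit   # illustrative weighting

print(score_episode(["search_flights", "select_flight", "confirm_payment"], True))   # 1.0
print(score_episode(["search_flights", "confirm_payment"], False))                   # ~0.17
```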
Multimodal Perception (Voice + Vision)
As companies like Meta, Snap, and Sesame develop AI-powered glasses, voice is emerging as the primary user interface for interacting with the physical world. True understanding in this context requires multimodality. We merge our industry-leading video and image annotation capabilities with our audio data engine to train models that can see what you see and hear what you hear, enabling truly contextual, real-world assistants.
Expressive & Creative Generation
The same underlying technology that masters conversational speech can be extended to all forms of generative audio. By training models on tagged music samples, sound effects, and other rich audio, we enable agents to express themselves in a fuller medium, whether by generating music, a sound effect, or even singing.
The Future is Listening
The journey from text-based models to truly interactive, voice-driven AI marks a fundamental shift in how we will interact with technology. While the challenges of data complexity are immense, the path forward is no longer theoretical. The benchmarks are being set, the technical frontiers are being explored, and the applications that will define the next decades are being designed today. The future of human-computer interaction will be a conversation.
Acknowledgements
Core Contributors: Gokul Prabhakaran, Rodrigo Belfiore, Matias Devecchi Di Bella, Gorka Ludlow, Michael Chen, Brandon Kyung, George Varelas, Gaurang Goel, Niveditha Patil, Olivia Fu, Ali Khan, Daniel Quigley, Taku Yamada, Alejo Jimenez Chaidez, Utkarsh Tyagi, Advait Gosai & Kilian Butler.