
Millions of users are now talking to AI, moving us from text to the far messier world of sound. Human speech is full of nuance, punctuated by hesitation, interruption, and mid-sentence corrections, like a diner changing their mind midway through their order. Even the most advanced AI systems struggle to handle it. Accuracy in the lab, it turns out, doesn't always translate to accuracy in the wild.
To build voice agents that actually work, we need to stop grading them on their reading comprehension and start testing their conversational intelligence. This is where Audio MultiChallenge comes in: the first benchmark designed to stress-test the conversational robustness of native Speech-to-Speech (S2S) models.
Audio MultiChallenge tests models on dialogues ranging from 2 to 8 turns, isolating four capabilities that distinguish a robust agent from a brittle one:
Voice Editing is the most critical stress test for real-world usability. In natural speech, human intent is "non-monotonic": we backtrack, edit, and interrupt ourselves. Consider the diner example: "I'll have a salad, dressing on the side... actually, make that a wienerschnitzel." A text-based model sees the final transcript, often already cleaned of these disfluencies. A native audio model, by contrast, hears the tokens for "salad" first in a linear stream. To succeed, it must recognize the correction and "overwrite" its initial understanding in real time. Audio MultiChallenge tests these self-corrections, barge-ins, and asides to see whether the model can keep up with a user's changing mind.
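To make the expected behavior concrete, here is a minimal, hypothetical check in the spirit of a Voice Editing item. The field names and grading logic below are illustrative assumptions, not the benchmark's actual format:

```python
# Illustrative only: a hypothetical Voice Editing check, not an actual benchmark item.
# The raw utterance contains a self-correction; a robust model should ground its
# answer in the user's final intent, not in the retracted one.
item = {
    "utterance": "I'll have a salad, dressing on the side... actually, make that a wienerschnitzel.",
    "question": "What did the customer order?",
    "accept": ["wienerschnitzel"],  # final, corrected intent
    "reject": ["salad"],            # superseded by the correction
}

def passes(model_answer: str, item: dict) -> bool:
    """Naive string check: pass only if the answer reflects the correction."""
    answer = model_answer.lower()
    return (any(term in answer for term in item["accept"])
            and not any(term in answer for term in item["reject"]))

print(passes("One wienerschnitzel, coming up.", item))         # True
print(passes("A salad with the dressing on the side.", item))  # False
```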
Standard benchmarks often transcribe audio to text before grading, which strips away vital context, since so much of the meaning in speech is carried by intonation. Audio MultiChallenge tests for this context: can a model pick up on details that exist only in the raw audio, not in the text transcript? We call this capability Audio-Cue Gated Inference Memory. In our testing, models performed considerably worse on these tasks than on standard semantic recall. Models are good at processing language, but they are still learning how to process sound.
Instruction Retention tests the ability to follow a directive, such as "speak in a specific persona," "use a limited number of words," or "avoid negative phrases," for the entire duration of a multi-turn dialogue. We found that audio models frequently drop these instructions even in short interactions, and their performance degrades further on "conditional" instructions such as "present a counterargument for all of my ideas." Unlike text models, which maintain such constraints well, audio models struggle to separate acoustic processing from rule-following, failing to adhere to directives regardless of conversation length.
Finally, we evaluate the model’s internal consistency. Self-Coherence tests whether the model maintains its own facts, timeline, and persona without contradicting itself. In a long voice conversation, a model might claim to be a fitness instructor in Turn 1 but offer contradictory, unhealthy advice in Turn 5. This metric ensures that the agent you are talking to at the end of the conversation is the same one you started with.
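For readers who think in data structures, here is one way such an item could be represented. The layout and field names below are assumptions for illustration, not the benchmark's released schema:

```python
# Hypothetical record layout for one evaluation item (illustrative; this is not
# the released Audio MultiChallenge format). Each conversation targets one of the
# four axes, spans 2-8 turns, and carries rubric criteria checked on each reply.
from dataclasses import dataclass, field

AXES = (
    "voice_editing",
    "audio_cue_gated_inference_memory",
    "instruction_retention",
    "self_coherence",
)

@dataclass
class Turn:
    audio_path: str    # the raw user audio; the transcript alone may miss key cues
    transcript: str    # reference transcript, useful for cascade baselines and analysis
    rubric: list[str] = field(default_factory=list)  # criteria graded on the model's reply

@dataclass
class Conversation:
    axis: str          # one of AXES
    turns: list[Turn]  # between 2 and 8 user turns

example = Conversation(
    axis="instruction_retention",
    turns=[
        Turn("turn_1.wav", "From now on, answer in under ten words.",
             rubric=["Reply is under ten words."]),
        Turn("turn_2.wav", "Okay, now explain how tides work.",
             rubric=["Reply is under ten words.", "Explanation of tides is accurate."]),
    ],
)
```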
When we put today's frontier models to the test, the results revealed a clear hierarchy—and a stark reality about the state of native audio.
Google’s Gemini 3 Pro Preview emerged as the top-performing model with an Average Rubric Score (ARS) of 54.7%, demonstrating the strongest grasp of conversational nuance. It was followed by Gemini 2.5 Pro (46.9%) and Gemini 2.5 Flash (40.0%). OpenAI’s GPT-4o Audio Preview trailed with a score of 25.4%, highlighting that while latency and voice quality are improving across the board, complex reasoning in the audio modality remains a significant differentiator.
Across the board, models show a 36.5% relative drop in performance when they are challenged to retain not only semantic information but also audio cues such as ambient noise and speaker paralinguistics (e.g., emotion or tone). Voice Editing is our most challenging axis, highlighting that current models are not robust to mid-utterance hesitations or to backtracking across turns. These interaction patterns are fundamental to multi-turn audio conversations and require further attention.
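To be precise about what "relative" means here: the figure is the gap between semantic and audio-cue scores expressed as a fraction of the semantic score. A quick sketch with placeholder numbers (not our actual per-track scores):

```python
# "Relative" degradation: the drop expressed as a fraction of the baseline score,
# not as an absolute difference in percentage points. The scores below are
# placeholders chosen for illustration, not actual benchmark results.
semantic_score = 60.0   # hypothetical score on semantic inference-memory tasks
audio_cue_score = 38.1  # hypothetical score on audio-cue gated tasks

relative_drop = (semantic_score - audio_cue_score) / semantic_score
print(f"{relative_drop:.1%}")  # 36.5%
```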

APR (%) scores on Audio-Cue vs. Semantic Inference Memory tasks. (T) indicates text output, (A) indicates audio output.
We found that many native models, particularly those configured to output text instead of audio, perform significantly better on synthetic Text-to-Speech (TTS) data than on real human recordings. The natural disfluencies of human speech act as noise that degrades performance by up to 21.5% for some architectures, confirming that training and evaluating on clean, synthetic audio masks true failure modes. Curiously, the top-performing Gemini models were robust to this noise, performing slightly better on real human speech than on TTS.
Native Speech-to-Speech models still lag behind traditional "Cascade" systems (Automatic Speech Recognition → Text LLM) on semantic tasks, primarily because these systems can leverage state-of-the-art LLM reasoning. Frontier LLMs like GPT-5 and Claude Opus 4.5 score 51.2% and 39.22%, respectively, much higher than most of the end-to-end (E2E) speech architectures in this setup.
Notably, on our non-semantic "audio-cue" tasks, these cascaded systems can still achieve scores of up to 36% despite being fed only the transcripts and surrounding dialogue. This result indicates that audio understanding, particularly of paralinguistics and emotion, can improve through strong text-only reasoning post-training, whereas S2S architectures currently trade off some "IQ" for speed and expressiveness.
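To make the comparison concrete, here is a minimal sketch of the two architectures, with `transcribe`, `text_llm`, and `s2s_model` as placeholder callables rather than any specific vendor API; it is not the exact harness used for the benchmark:

```python
# Minimal sketch of a cascade (ASR -> text LLM) baseline vs. a native S2S model.
# `transcribe`, `text_llm`, and `s2s_model` stand in for whatever ASR system,
# text LLM, and speech-to-speech model are being compared.

def cascade_turn(audio: bytes, history: list[dict], transcribe, text_llm) -> str:
    """The LLM only ever sees text: intonation, emotion, and ambient sound are gone."""
    user_text = transcribe(audio)
    history.append({"role": "user", "content": user_text})
    reply = text_llm(history)
    history.append({"role": "assistant", "content": reply})
    return reply

def s2s_turn(audio: bytes, history: list[bytes], s2s_model) -> bytes:
    """A native S2S model consumes raw audio, so paralinguistic cues stay available."""
    history.append(audio)
    return s2s_model(history)
```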
To catch these failures, we needed data that was specifically engineered to expose the cracks in current architectures. We built the Audio MultiChallenge dataset using a human-in-the-loop adversarial protocol.
First, an automated "Planner Agent" generated complex conversation blueprints designed to test specific logic and memory constraints. Then, instead of reading from a script or using text-to-speech, we had real humans act out these scenarios with a model in the loop, and gave them a specific mandate to "break" it. To ensure these complex interactions were graded fairly, we validated an LLM-as-a-judge approach (using o4-mini) that achieved high agreement with human raters.
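As an illustration of what rubric-based LLM judging can look like, here is a minimal sketch using the OpenAI Python SDK with o4-mini; the prompt wording and yes/no aggregation are assumptions for this sketch, not our exact judging setup:

```python
# Illustrative rubric judging with an LLM judge (o4-mini via the OpenAI Python SDK).
# The prompt format and the binary aggregation are assumptions for this sketch,
# not the exact configuration used for Audio MultiChallenge.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge_criterion(dialogue: str, model_reply: str, criterion: str) -> bool:
    """Ask the judge a single yes/no question about one rubric criterion."""
    prompt = (
        "You are grading a voice assistant's reply against one rubric criterion.\n\n"
        f"Conversation so far:\n{dialogue}\n\n"
        f"Assistant reply:\n{model_reply}\n\n"
        f"Criterion: {criterion}\n"
        "Answer with exactly YES or NO."
    )
    response = client.chat.completions.create(
        model="o4-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")

# A conversation's score can then be the fraction of rubric criteria judged YES,
# averaged over conversations to produce an aggregate score.
```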
This approach allowed us to capture the messy, spontaneous phenomena that synthetic benchmarks miss. By encouraging our human testers to interact naturally, we built a dataset that reflects the actual difficulty of speaking to a machine in the real world.
To close the gap between today’s Speech-to-Speech models and their text-based counterparts, we need evaluations that stop treating speech as just "text aloud." We need benchmarks that embrace the messiness, the interruptions, and the cognitive load of real human interaction. With Audio MultiChallenge, Scale is providing the means to measure our way forward, ensuring that the next generation of AI doesn't just hear us, but actually understands us.