Audio MultiChallenge is a multi-turn benchmark that evaluates conversational intelligence in spoken dialogue systems and audio-language models, including both speech-to-speech and audio-input models. Unlike ASR- or TTS-based evaluations, it measures whether models can follow instructions, integrate prior context, remain self-consistent, and handle natural speech corrections across extended dialogues.
The dataset is built entirely from human speech containing disfluencies, interruptions, non-monotonic phrasing, and ambient noise, conditions under which current models frequently fail. Each conversation concludes with a set of atomic rubrics that precisely define the requirements for the model’s final response.
Using a fixed-context evaluation protocol, models are judged solely on their final turn, enabling rigorous measurement of long-horizon reasoning, speech-context integration, and robustness to real acoustic variability. Audio MultiChallenge highlights where speech-to-speech systems still lag behind their text-based counterparts.
Conversations: 452
Rubrics: 1,712
Unique Speakers: 47
Total User Audio Duration: 14.99 hours
Realism: natural, conversational human speech
Sampling Rate: 48 kHz high-fidelity audio capturing breath, tone, and environmental cues
Turns per Conversation: Min: 3, Median: 5, Max: 8
Each conversation ends on a user turn with rubrics evaluating only the final assistant response.
Each conversation is assigned a single primary axis; Audio-Cue-Gated is a subset of Inference Memory.
Inference Memory: 132 conversations, including 42 Audio-Cue-Gated tasks
Instruction Retention: 120
Self-Coherence: 83
Voice Editing: 117
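Putting this structure together, a single benchmark record can be pictured roughly as follows. This is a minimal sketch with hypothetical field names (Turn, Conversation, audio_path, and so on), not the released schema:

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    role: str                        # "user" or "assistant"
    audio_path: str | None = None    # user turns are recorded human speech
    text: str | None = None          # assistant turns from the model under test

@dataclass
class Conversation:
    conversation_id: str
    primary_axis: str                # "Inference Memory", "Instruction Retention",
                                     # "Self-Coherence", or "Voice Editing"
    audio_cue_gated: bool            # subset flag within Inference Memory
    turns: list[Turn] = field(default_factory=list)   # 3-8 turns, ending on a user turn
    rubrics: list[str] = field(default_factory=list)  # grade only the final assistant reply

# Illustrative record: the model must now produce the final assistant response,
# and only that response is scored against the rubrics.
example = Conversation(
    conversation_id="amc-0001",
    primary_axis="Inference Memory",
    audio_cue_gated=True,
    turns=[
        Turn(role="user", audio_path="audio/amc-0001_turn1.wav"),
        Turn(role="assistant", text="Sure, I can help with that."),
        Turn(role="user", audio_path="audio/amc-0001_turn3.wav"),
    ],
    rubrics=["The response acknowledges the background sound present only in the audio of turn 1."],
)
```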


Audio MultiChallenge is constructed through a two-stage process designed to surface realistic, reproducible failure modes in audio-language models. The pipeline combines adversarial synthetic generation with human-recorded natural conversation to ensure both coverage and authenticity. Human conversations are constrained to 3–8 turns, ensuring failures arise from reasoning and state tracking rather than unbounded context length.
We begin with a multi-agent system that identifies potential weaknesses in target models. A Planner Agent proposes strategies to break a model along a specific axis (e.g., Inference Memory), while a Tester Agent generates multi-turn probes exploring that strategy. These probes are converted to speech using TTS and run against an Audio LLM. When a failure occurs, the Planner distills the tactic into a concise, human-friendly blueprint (a strategic guide, not a script), ensuring creativity and variation in the final dataset.
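The loop above can be summarized in pseudocode. A rough sketch, assuming hypothetical agent interfaces (planner.propose, tester.generate_probes, tts.synthesize, and so on are placeholders, not the actual implementation):

```python
def mine_blueprints(planner, tester, tts, target_model, judge, axis, max_attempts=20):
    """Adversarial loop: propose a strategy, probe the target with synthetic
    speech, and keep only tactics that actually elicit a failure."""
    blueprints = []
    for _ in range(max_attempts):
        strategy = planner.propose(axis)                  # e.g. axis = "Inference Memory"
        probes = tester.generate_probes(strategy)         # multi-turn text probes
        audio_turns = [tts.synthesize(p) for p in probes]
        responses = target_model.chat(audio_turns)        # Audio LLM answers the spoken probes
        if judge.detects_failure(responses, strategy):
            # Distill the winning tactic into a concise, human-friendly blueprint:
            # a strategic guide for contributors, not a verbatim script.
            blueprints.append(planner.distill(strategy, responses))
    return blueprints
```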
Human contributors use these blueprints to record spontaneous speech conversations with an Audio LLM. These recordings capture:
Natural disfluencies and hesitation
Non-monotonic speech: mid-sentence corrections, restarts, and barge-ins
Acoustic realism: accent variation, room noise, breath and timing artifacts
Contributors continue the dialogue until they detect a concrete failure, surfacing errors that synthetic data alone would not expose.
After each conversation, contributors create axis-specific, atomic rubrics defining exactly how the final assistant turn should be evaluated. These rubrics:
Apply only to the last model response
Are binary and self-contained
Capture the precise requirement of the failure mode
Define strict, axis-specific criteria for success (not optional preferences)
The dataset includes 1,712 rubrics across 452 conversations.
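For illustration, two rubric entries for a Voice Editing conversation might look like the following. The field names are hypothetical; only the properties listed above (binary, atomic, final-turn-only, axis-specific) are taken from the benchmark description:

```python
# Each rubric is an independent pass/fail check applied only to the final assistant response.
rubrics = [
    {
        "axis": "Voice Editing",
        "criterion": "Uses the corrected date the user gave after the mid-sentence restart, not the original one.",
        "applies_to": "final_assistant_turn",
        "scoring": "binary",
    },
    {
        "axis": "Voice Editing",
        "criterion": "Does not ask the user to repeat information already provided earlier in the conversation.",
        "applies_to": "final_assistant_turn",
        "scoring": "binary",
    },
]
```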
Evaluation is performed using o4-mini, selected for its strong performance in rubric-based grading. Across 1,712 rubric decisions, the judge achieved Cohen’s κ ≈ 0.87 and Macro F1 ≈ 0.94, indicating high agreement with human annotators. No evaluated model is used as a judge, avoiding self-preference bias.
Average Pass Rate (APR): Primary metric. A task passes only if all rubrics are satisfied.
Average Rubric Score (ARS): Diagnostic metric. Measures partial correctness and failure structure.
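Under these definitions the two metrics reduce to a small computation. A minimal sketch, assuming `results` maps each conversation to its per-rubric pass/fail verdicts and that ARS averages the per-task rubric fraction (the exact aggregation may differ):

```python
def average_pass_rate(results: dict[str, list[bool]]) -> float:
    """APR: a task passes only if every one of its rubrics is satisfied."""
    return sum(all(verdicts) for verdicts in results.values()) / len(results)

def average_rubric_score(results: dict[str, list[bool]]) -> float:
    """ARS: mean fraction of rubrics satisfied per task (partial credit)."""
    return sum(sum(v) / len(v) for v in results.values()) / len(results)

# Example: one task fully passed, one task with 1 of 2 rubrics passed.
results = {"amc-0001": [True, True], "amc-0002": [True, False]}
print(average_pass_rate(results))     # 0.5
print(average_rubric_score(results))  # 0.75
```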
To measure the modality gap, models are evaluated in two modes:
Text Output: Native text generation.
Audio Output: Native speech generation. Most models output both audio and text streams in this setup. The text stream produced by the model is evaluated using the same rubrics.
This isolates the performance cost of producing speech rather than text for architectures that support both modes.
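In harness terms, the two modes differ only in what the model is asked to emit and which stream is handed to the judge. A sketch with a hypothetical client interface (client.respond and output_mode are placeholders):

```python
def final_response_for_grading(client, turns, output_mode):
    """Run the fixed-context conversation and return the text the judge grades.

    output_mode="text":  native text generation.
    output_mode="audio": native speech generation; the model's accompanying
    text stream is graded with the same rubrics.
    """
    reply = client.respond(turns, output_mode=output_mode)
    return reply.text  # in audio mode, this is the model's own text stream
```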
The Hardest Skills for Today’s Models
Voice Editing avg APR: 17.99%
Inference Memory avg APR: 21.55%
In Voice Editing, models struggled to apply mid-sentence and prior-turn corrections; in the Audio-Cue-Gated Inference Memory tasks, they struggled to retrieve details present only in the acoustic signal (e.g., background noise or speaker paralinguistics) rather than in the text transcript.
Models show a “modality gap”: they are consistently stronger when generating text than when generating audio. For GPT-4o Audio Preview, performance dropped from 25.44% to 23.23% when switching to native speech generation, reflecting the need for stronger S2S post-training not only for speech quality but also for multi-turn coherence and state tracking.
Real Speech ≠ Synthetic Speech
Native S2S models struggle with unscripted human audio. Most models show a relative improvement in scores when evaluated on synthetic TTS instead of real speech, confirming that clean training audio masks real-world failure modes.
Model performance remains stable across 3–8 conversational turns, indicating that turn count is not the limiting factor. In contrast, performance degrades steadily as the total duration of user audio increases: the longest conversations, exceeding 8 minutes of cumulative user audio, drop to roughly 13% on average. These longest-duration results are based on a small number of samples, but they point to a clear challenge in long-form audio processing.
Leaderboard (Average Pass Rate, % ± CI):
gemini-3-pro-preview*: 54.65 ± 4.57
gemini-2.5-pro*: 46.90 ± 4.58
gemini-2.5-flash (Thinking)*: 40.04 ± 4.50
Voxtral-Small-24B-2507*: 26.33 ± 4.05
gemini-2.5-flash*: 26.11 ± 4.04
gpt-4o-audio-preview-2025-06-03*: 25.44 ± 4.00
Qwen3-Omni-30B-A3B-Instruct†: 24.34 ± 3.95
gpt-realtime-2025-08-28*: 23.45 ± 3.90
gpt-4o-audio-preview-2025-06-03†: 23.23 ± 3.88
gpt-realtime-2025-08-28†: 20.35 ± 3.70
MiMo-Audio-7B-Instruct (Thinking)*: 19.69 ± 3.66
MiMo-Audio-7B-Instruct*: 18.58 ± 3.58
gemma-3n-E4B-it*: 15.49 ± 3.33
Phi-4-multimodal-instruct*: 15.49 ± 3.33
gpt-4o-mini-audio-preview-2024-12-17*: 14.82 ± 3.28
Kimi-Audio-7B-Instruct*: 13.72 ± 3.17
gpt-4o-mini-audio-preview-2024-12-17†: 13.05 ± 3.11
Qwen2.5-Omni-7B*: 11.95 ± 2.99
Kimi-Audio-7B-Instruct†: 10.40 ± 2.82
LFM2-Audio-1.5B†: 9.29 ± 2.69
Rank (UB): 1 + the number of models whose lower CI bound exceeds this model’s upper CI bound.
* indicates Text output. † indicates Audio output.
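The Rank (UB) rule can be computed directly from the reported score ± CI pairs. A minimal sketch:

```python
def rank_ub(scores: list[tuple[str, float, float]]) -> dict[str, int]:
    """Rank (UB) = 1 + number of models whose lower CI bound exceeds this model's upper CI bound.

    `scores` holds (model_name, mean, ci_half_width) triples.
    """
    ranks = {}
    for name, mean, ci in scores:
        upper = mean + ci
        strictly_better = sum(1 for _, m, c in scores if m - c > upper)
        ranks[name] = 1 + strictly_better
    return ranks

# Example with the top three leaderboard entries:
print(rank_ub([
    ("gemini-3-pro-preview*", 54.65, 4.57),
    ("gemini-2.5-pro*", 46.90, 4.58),
    ("gemini-2.5-flash (Thinking)*", 40.04, 4.50),
]))
# {'gemini-3-pro-preview*': 1, 'gemini-2.5-pro*': 1, 'gemini-2.5-flash (Thinking)*': 2}
```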