Whisper-AuT: Domain-Adapted Audio Encoder for Efficient Audio-LLM Training

Published 12 Apr 2026 in cs.SD | (2604.10438v1)

Abstract: Audio-native LLMs (audio-LLMs) commonly use Whisper as their audio encoder. However, Whisper was trained exclusively on speech data, producing weak representations for music and environmental sound. This forces downstream audio-LLMs to compensate through extensive training on large-scale non-speech data. We present Whisper-AuT, a domain-adapted audio encoder obtained by fine-tuning Whisper-large-v3 on a curated mixture of speech (80%), environmental sound (10%), and music (10%) totaling approximately 20M samples. The full encoder-decoder is trained end-to-end with a seq2seq captioning objective; the decoder is then discarded and only the encoder is retained. Linear probe evaluations show that Whisper-AuT achieves +23.0% on ESC-50 (environmental sound), +5.0% on GTZAN (music genre), and +0.7% on Speech Commands (keyword spotting) compared to the original Whisper-large-v3 encoder. Whisper-AuT is designed as a drop-in replacement for Whisper in audio-LLM architectures, with the goal of reducing downstream training cost by providing stronger initial audio representations for non-speech domains.


Authors (14)

Summary

The paper introduces Whisper-AuT, a domain-adapted variant of Whisper-large-v3 that raises ESC-50 environmental sound classification accuracy by 23.0 percentage points over the original encoder in linear-probe evaluation.
The paper employs end-to-end fine-tuning on 20M audio-text pairs spanning speech, music, and environmental sounds to achieve efficient, robust cross-domain audio representations.
The paper shows that incorporating a small fraction of non-speech data (20% of the mixture) improves linear-probe accuracy on non-speech benchmarks without degrading keyword-spotting accuracy, with the goal of reducing downstream training overhead.

Introduction

This work addresses the persistent domain mismatch in audio encoding for LLMs operating on diverse audio modalities. Standard practice involves using Whisper-large-v3 as the primary audio encoder due to its superior ASR capabilities. However, the encoder's speech-only pretraining limits its representational capacity for non-speech audio such as music and environmental sounds, imposing a training and efficiency burden on downstream audio-LLMs. The proposed solution, Whisper-AuT, is a domain-adapted variant of Whisper-large-v3, fine-tuned on a heterogeneous corpus incorporating speech, environmental audio, and music. The objective is to yield a plug-and-play encoder that provides robust initial representations across modalities, reducing the overhead and inefficiencies encountered during end-to-end audio-LLM adaptation.

Methodology

Whisper-AuT is obtained by fine-tuning Whisper-large-v3 end-to-end on a curated dataset of approximately 20M (audio, text) pairs. The domain allocation within the dataset is 80% speech, 10% music, and 10% environmental sound, with text annotations for non-speech samples synthesized using commercial captioning models. Training employs the standard encoder-decoder seq2seq paradigm, maintaining the original architecture and updating all parameters end-to-end. Post training, the decoder is discarded, and only the encoder is retained for use.
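The 80/10/10 domain allocation described above can be illustrated with a small sampling sketch. This is not the authors' data pipeline: the function names, the seed, and the idea of drawing a domain label per example are assumptions made for illustration, and only the mixture weights and the approximate corpus size come from the text.

```python
import random
from collections import Counter

# Mixture weights stated in the text: 80% speech, 10% music, 10% environmental sound.
DOMAIN_WEIGHTS = {"speech": 0.8, "music": 0.1, "environmental": 0.1}


def expected_counts(total, weights=DOMAIN_WEIGHTS):
    """Expected number of (audio, text) pairs per domain for a corpus of `total` pairs."""
    return {name: round(total * w) for name, w in weights.items()}


def sample_domains(num_samples, weights=DOMAIN_WEIGHTS, seed=0):
    """Draw one domain label per example according to the mixture weights.

    Hypothetical helper: a real loader would then pick an example from that domain.
    """
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[name] for name in names]
    return rng.choices(names, weights=probs, k=num_samples)


if __name__ == "__main__":
    # Roughly 20M pairs in total, per the text.
    print(expected_counts(20_000_000))
    # Empirical mixture over a small draw; proportions fluctuate around 80/10/10.
    print(Counter(sample_domains(10_000)))
```

Under these weights, a corpus of about 20M pairs would contain roughly 16M speech pairs and about 2M each of music and environmental sound.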

The training uses a batch size of 128, runs for 2 epochs (∼23 hours on 8 H200 GPUs), and leverages DeepSpeed ZeRO-2 with bfloat16 precision. The procedure is computationally efficient compared to alternatives that train from scratch on multi-domain data.
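The stated training figures imply a rough compute budget, which the following sketch works out. It assumes that the batch size of 128 is the global batch size and that the corpus contains exactly 20M samples; neither is confirmed by the text, so the derived numbers are estimates only.

```python
# Figures stated in the text.
NUM_SAMPLES = 20_000_000  # approximate corpus size, treated here as exact
EPOCHS = 2
BATCH_SIZE = 128          # ASSUMPTION: interpreted as the global batch size
NUM_GPUS = 8
WALL_CLOCK_HOURS = 23


def training_budget(num_samples, epochs, batch_size, num_gpus, hours):
    """Derive optimizer steps, GPU-hours, and throughput from the headline figures."""
    samples_seen = num_samples * epochs
    return {
        "samples_seen": samples_seen,
        "optimizer_steps": samples_seen // batch_size,
        "gpu_hours": num_gpus * hours,
        "samples_per_second": samples_seen / (hours * 3600),
    }


if __name__ == "__main__":
    budget = training_budget(NUM_SAMPLES, EPOCHS, BATCH_SIZE, NUM_GPUS, WALL_CLOCK_HOURS)
    for key, value in budget.items():
        print(f"{key}: {value:,.1f}")
```

Under these assumptions the run amounts to 312,500 optimizer steps and 184 GPU-hours. If 128 were instead a per-device batch size, the step count would be eight times smaller.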

Evaluation Protocol

Evaluation utilizes linear probing to quantify the linear separability of encoder representations across domains. For each benchmark, mean-pooled encoder representations are extracted and a linear classifier is trained to predict benchmark labels. Three representative datasets are employed:

ESC-50: Environmental sound classification (50 classes).
GTZAN: Music genre classification (10 classes).
Speech Commands: Keyword spotting (12 classes).

The primary metric is classification accuracy, allowing direct comparison between Whisper-large-v3 and Whisper-AuT encoders.
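The probing protocol above (mean-pool the frozen encoder outputs, then fit a linear classifier) can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the authors' evaluation code: the toy features produced by `make_toy_clip` are a hypothetical stand-in for real encoder outputs, and the classifier is a plain softmax regression trained with SGD. In practice one would likely use NumPy or scikit-learn.

```python
import math
import random


def mean_pool(frames):
    """Average a sequence of frame vectors (T x D) into a single D-dimensional vector."""
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LinearProbe:
    """Multinomial logistic regression trained with SGD on frozen features."""

    def __init__(self, dim, num_classes):
        self.W = [[0.0] * dim for _ in range(num_classes)]
        self.b = [0.0] * num_classes

    def logits(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(self.W, self.b)]

    def predict(self, x):
        z = self.logits(x)
        return z.index(max(z))

    def fit(self, X, y, epochs=50, lr=0.1, seed=0):
        rng = random.Random(seed)
        order = list(range(len(X)))
        for _ in range(epochs):
            rng.shuffle(order)
            for i in order:
                probs = softmax(self.logits(X[i]))
                for c, p in enumerate(probs):
                    # Gradient of cross-entropy with respect to logit c.
                    grad = p - (1.0 if c == y[i] else 0.0)
                    self.b[c] -= lr * grad
                    for d, xd in enumerate(X[i]):
                        self.W[c][d] -= lr * grad * xd
        return self


def accuracy(probe, X, y):
    return sum(probe.predict(x) == t for x, t in zip(X, y)) / len(y)


def make_toy_clip(label, num_frames, dim, rng):
    """HYPOTHETICAL stand-in for encoder output: noisy frames around a class-specific mean."""
    return [[(3.0 if d == label else 0.0) + rng.gauss(0, 1) for d in range(dim)]
            for _ in range(num_frames)]


if __name__ == "__main__":
    rng = random.Random(0)
    num_classes, dim = 3, 8
    labels = [i % num_classes for i in range(150)]
    feats = [mean_pool(make_toy_clip(label, 20, dim, rng)) for label in labels]
    probe = LinearProbe(dim, num_classes).fit(feats[:120], labels[:120])
    print("held-out accuracy:", accuracy(probe, feats[120:], labels[120:]))
```

Because the encoder stays frozen and only the linear layer is trained, the resulting accuracy reflects how linearly separable the pooled representations are, which is the quantity the protocol is meant to measure.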

Results

Whisper-AuT yields substantial linear-probe improvements in non-speech domains without degrading speech performance, as measured by keyword spotting:

ESC-50: Accuracy increases from 54.5% (Whisper-large-v3) to 77.5% (Whisper-AuT), an absolute gain of 23.0 percentage points.
GTZAN: Accuracy increases from 81.0% to 86.0%, an absolute gain of 5.0 percentage points.
Speech Commands: Accuracy increases slightly from 87.6% to 88.3% (+0.7 percentage points), indicating that keyword-spotting performance is preserved.
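The reported gains are absolute differences in accuracy, which are easy to confuse with relative improvements. The sketch below computes both, plus the share of baseline errors removed, from the accuracies listed above. The `summarize` helper and the choice of derived metrics are illustrative additions and do not appear in the paper.

```python
# Linear-probe accuracies (%) reported in the text: (Whisper-large-v3, Whisper-AuT).
RESULTS = {
    "ESC-50": (54.5, 77.5),
    "GTZAN": (81.0, 86.0),
    "Speech Commands": (87.6, 88.3),
}


def summarize(baseline, adapted):
    """Return (absolute gain in points, relative gain in %, error reduction in %)."""
    absolute = adapted - baseline
    relative = 100.0 * absolute / baseline
    error_reduction = 100.0 * absolute / (100.0 - baseline)
    return absolute, relative, error_reduction


if __name__ == "__main__":
    for name, (baseline, adapted) in RESULTS.items():
        absolute, relative, error_reduction = summarize(baseline, adapted)
        print(f"{name}: +{absolute:.1f} points, "
              f"{relative:.1f}% relative gain, "
              f"{error_reduction:.1f}% of baseline errors removed")
```

For ESC-50, the 23.0-point gain corresponds to a relative accuracy improvement of about 42% and removes roughly half of the baseline encoder's errors.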

These results empirically support the hypothesis that even limited exposure to non-speech data (20% of the training mixture) suffices to substantially narrow the representational gap of the original encoder on non-speech audio. Training dynamics indicate rapid convergence without overfitting (eval loss declines monotonically), suggesting potential for further improvement with extended training or larger datasets.

Analysis and Implications

Whisper-AuT obtains these results with only ~20M samples and modest compute. Approaches such as Qwen3-Omni's AuT encoder achieve strong multi-domain generalization by training from scratch on 20M hours of audio. Whisper-AuT instead builds on the inductive bias of Whisper's original 680K-hour speech pretraining and improves non-speech representations at a fraction of the data and cost.

From a practical standpoint, Whisper-AuT is drop-in compatible with existing audio-LLM architectures, including pipelines such as xVox-Audio-Captioner. The improved encoder is expected to:

Reduce the quantity of non-speech training data needed downstream.
Accelerate convergence during multi-domain training.
Improve final downstream performance, especially on non-speech captioning tasks.

Incorporating Whisper-AuT into audio-LLMs directly addresses the inefficiency present in pipelines where the Whisper encoder must learn non-speech representations indirectly through the LLM objective. Theoretical implications extend to the design of multi-domain pretraining curricula, suggesting that modest proportions of non-speech data are sufficient to endow deep audio encoders with substantially broadened representational power.

Whisper-AuT's protocol and findings are positioned within ongoing efforts to bridge speech and general audio representation learning. Prior efforts such as Whisper-AT appended lightweight heads for non-speech tasks but did not update encoder parameters. Other models, e.g., Qwen3-Omni AuT, construct new architectures and train from scratch on massive, balanced corpora. Whisper-AuT differentiates itself by fine-tuning an established, high-performing speech encoder with a modest, carefully curated non-speech mixture, ensuring both efficiency and compatibility with prevailing audio-LLM designs.

Conclusion

Whisper-AuT is a pragmatic step toward efficient, general-purpose audio representation learning in LLM-centric pipelines. By fine-tuning Whisper-large-v3 on a blend of speech, music, and environmental sounds, it achieves substantial linear-probe gains for non-speech domains with modest data and compute, without degrading keyword-spotting accuracy. As a drop-in audio encoder, it is expected to reduce training cost, accelerate model development, and improve performance in future audio-LLMs. Integration and validation within downstream systems such as xVox-Audio-Captioner are natural next steps for this line of research.

Reference: "Whisper-AuT: Domain-Adapted Audio Encoder for Efficient Audio-LLM Training" (2604.10438)
