USAD: Universal Speech and Audio Representation via Distillation (2506.18843v1)
Abstract: Self-supervised learning (SSL) has revolutionized audio representations, yet models often remain domain-specific, focusing on either speech or non-speech tasks. In this work, we present Universal Speech and Audio Distillation (USAD), a unified approach to audio representation learning that integrates diverse audio types - speech, sound, and music - into a single model. USAD employs efficient layer-to-layer distillation from domain-specific SSL models to train a student on a comprehensive audio dataset. USAD offers competitive performance across various benchmarks and datasets, including frame and instance-level speech processing tasks, audio tagging, and sound classification, achieving near state-of-the-art results with a single encoder on SUPERB and HEAR benchmarks.
Summary
- The paper introduces USAD, a novel framework that distills knowledge from separate speech and general audio self-supervised models into a single encoder capable of creating universal representations for diverse audio tasks.
- USAD employs a dual-teacher, sparse layer-to-layer distillation strategy with an efficient L1-cosine loss, cutting distillation compute by roughly 75% relative to dense layer-to-layer matching.
- Experimental results show that USAD achieves competitive performance across a wide range of speech and non-speech audio benchmarks (SUPERB, HEAR, AudioSet), demonstrating that a single unified encoder can serve diverse downstream applications efficiently.
Universal Speech and Audio Distillation: A Unified Approach to Audio Representation Learning
The paper "USAD: Universal Speech and Audio Representation via Distillation" (2506.18843) addresses the persistent fragmentation in self-supervised audio representation learning, where models are typically specialized for either speech or non-speech (sound/music) domains. The authors propose Universal Speech and Audio Distillation (USAD), a unified framework that leverages knowledge distillation from domain-specific self-supervised learning (SSL) models to train a single encoder capable of extracting general-purpose representations across speech, sound, and music.
Methodology
USAD is built upon the insight that while speech and non-speech audio share underlying signal characteristics, existing SSL models are optimized for their respective domains, leading to suboptimal cross-domain generalization. The USAD framework employs a dual-teacher, sparse layer-to-layer (L2L) distillation strategy:
- Dual-Teacher Distillation: USAD simultaneously distills knowledge from two pre-trained SSL models—one specialized in speech (e.g., WavLM Base+) and one in general audio (e.g., ATST Frame). Both teachers process the same mixed-domain input, and the student model is trained to match their intermediate representations.
- Sparse L2L Distillation: Instead of dense, computationally expensive layer-wise matching, USAD distills only from a subset of layers (e.g., 4 out of 12), leveraging the redundancy between adjacent transformer layers. This reduces computational overhead by approximately 75% compared to dense L2L approaches.
- Loss Function: The distillation objective combines an L1 distance term with a cosine-similarity term between the student's predicted features and the teachers' feed-forward network (FFN) outputs, avoiding contrastive losses and negative sampling for efficiency (a sketch of this sparse, dual-teacher objective follows the list).
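To make the objective concrete, here is a minimal PyTorch-style sketch of a sparse, dual-teacher L1-cosine distillation loss. It is an illustration under stated assumptions, not the authors' exact implementation: the distilled layer indices, the weighting `lam`, and the presence of per-teacher prediction heads (so that student and teacher dimensions match) are placeholders.

```python
import torch
import torch.nn.functional as F


def l1_cosine_loss(pred: torch.Tensor, target: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    """L1 distance plus a cosine term between student predictions and teacher features.

    pred, target: (batch, time, dim). The exact weighting and functional form in the
    paper may differ; lam is an illustrative hyperparameter.
    """
    l1 = F.l1_loss(pred, target)
    cos = F.cosine_similarity(pred, target, dim=-1).mean()
    return l1 + lam * (1.0 - cos)


def dual_teacher_sparse_l2l(pred_speech, pred_audio, speech_teacher, audio_teacher,
                            layers=(3, 6, 9, 12)):
    """Sparse layer-to-layer distillation against a speech teacher and an audio teacher.

    pred_speech / pred_audio: student predictions (after hypothetical per-teacher
    prediction heads) for each distilled layer; speech_teacher / audio_teacher: the
    corresponding teacher FFN outputs. All arguments are dicts mapping a layer index
    to a (batch, time, dim) tensor. Distilling only a subset of layers (here 4 of 12,
    indices chosen for illustration) is what makes the matching "sparse".
    """
    loss = 0.0
    for layer in layers:
        loss = loss + l1_cosine_loss(pred_speech[layer], speech_teacher[layer])
        loss = loss + l1_cosine_loss(pred_audio[layer], audio_teacher[layer])
    return loss / (2 * len(layers))
```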
A critical design choice is the use of frame-based feature extraction for both teachers and the student, ensuring temporal alignment and preserving fine-grained information necessary for speech tasks, while maintaining sufficient generality for non-speech audio.
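Frame alignment matters because the student is trained to match teacher features position by position. If a teacher's temporal stride differed from the student's, its features would first have to be resampled; the sketch below shows one simple way to do that with linear interpolation. This is a hypothetical fallback rather than a step the paper necessarily performs, since USAD deliberately selects frame-based teachers whose strides line up with the student's.

```python
import torch
import torch.nn.functional as F


def align_to_student(teacher_feats: torch.Tensor, num_student_frames: int) -> torch.Tensor:
    """Resample teacher features along time to the student's frame count.

    teacher_feats: (batch, time_teacher, dim). Linear interpolation over the time
    axis; only needed when teacher and student strides differ, which USAD avoids
    by choosing frame-based teachers.
    """
    x = teacher_feats.transpose(1, 2)                  # (batch, dim, time_teacher)
    x = F.interpolate(x, size=num_student_frames, mode="linear", align_corners=False)
    return x.transpose(1, 2)                           # (batch, num_student_frames, dim)
```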
Experimental Results
USAD is evaluated on a comprehensive suite of benchmarks, including SUPERB (speech), HEAR (holistic audio), AudioSet (audio tagging), and ESC-50 (sound classification). The training corpus, Mix126k-B, is a balanced mixture of large-scale speech, sound, and music datasets, with upsampling to ensure domain parity.
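One simple way to realize the balanced mixture is to draw each training example's domain with equal probability, which implicitly upsamples the smaller domains. The sketch below illustrates the idea with arbitrary placeholder sizes; the actual composition of Mix126k-B and the upsampling scheme used are described in the paper.

```python
import random

# Illustrative relative domain sizes (arbitrary units), NOT the real Mix126k-B composition.
domain_sizes = {"speech": 10.0, "sound": 3.0, "music": 2.0}


def repeat_factors(sizes: dict) -> dict:
    """Upsampling factor that gives every domain the same effective size."""
    target = max(sizes.values())
    return {domain: target / size for domain, size in sizes.items()}


# Sampling the domain of each example uniformly achieves the same balance in
# expectation as materializing the upsampled datasets.
batch_domains = random.choices(list(domain_sizes), k=32)
```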
Key findings include:
- Competitive Performance Across Domains: USAD Base (94M parameters) achieves a SUPERB average score of 787.0, outperforming all audio SSL baselines and closely matching or surpassing domain-specific teacher models in both speech and audio tasks.
- Scalability: Increasing model size (USAD Large, 330M parameters) yields further gains, with an average SUPERB score of 851.7 and strong results on HEAR, closing the gap with state-of-the-art task-specific models.
- Efficiency: Sparse L2L distillation and the L1-cosine loss enable USAD to reach high performance with significantly reduced computational cost compared to dense distillation or contrastive learning approaches.
- Ablation Studies: Teacher selection, data distribution, and distillation strategy all measurably affect downstream performance. Notably, frame-based teachers and balanced training data are essential for robust cross-domain generalization.
Notable Numerical Results
- On SUPERB, USAD Base achieves scores of 868.9 (frame-level speech), 938.0 (instance-level speech), and 554.2 (audio), for an overall average of 787.0.
- On HEAR, USAD Large attains an average score of 79.7, surpassing the concatenated teacher topline (78.5) and approaching the best per-task results on several benchmarks.
- USAD models consistently outperform single-domain SSL models in joint evaluations, demonstrating the effectiveness of the unified approach.
Implications and Future Directions
USAD demonstrates that a single encoder, distilled from multiple domain-specific SSL experts, can achieve near state-of-the-art performance across a wide range of speech and audio tasks. This unification has several practical and theoretical implications:
- Simplified Downstream Integration: Multimodal and audio-enabled systems (e.g., audio-LLMs, speech-to-audio generation) can leverage a single, general-purpose encoder, reducing system complexity and maintenance overhead.
- Resource Efficiency: Sparse distillation and unified training reduce the need for maintaining and deploying multiple large models, which is particularly beneficial for edge and real-time applications.
- Foundation for Multimodal AI: As audio becomes increasingly central in multimodal AI, universal representations such as those produced by USAD are likely to become foundational components for large-scale, cross-domain models.
The authors identify several avenues for future work, including extending USAD to multilingual speech, improving robustness to domain shifts, and integrating the framework into large audio-LLMs. The demonstrated scalability and efficiency of USAD suggest that further gains are possible with larger models and more diverse training data.
Conclusion
USAD provides a principled and efficient solution to the challenge of universal audio representation learning. By distilling from multiple domain-specific SSL models using a sparse, frame-aligned strategy, USAD achieves strong, balanced performance across speech, sound, and music tasks. The approach offers a practical path toward unified audio encoders, with significant implications for the development of generalist AI systems and multimodal applications.
Related Papers
- UniAudio: An Audio Foundation Model Toward Universal Audio Generation (2023)
- Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks (2023)
- SSAST: Self-Supervised Audio Spectrogram Transformer (2021)
- EnCodecMAE: Leveraging neural codecs for universal audio representation learning (2023)
- Distilling a speech and music encoder with task arithmetic (2025)