USAD: Universal Speech and Audio Representation via Distillation (2506.18843v1)

Published 23 Jun 2025 in cs.SD, cs.CL, and eess.AS

Abstract: Self-supervised learning (SSL) has revolutionized audio representations, yet models often remain domain-specific, focusing on either speech or non-speech tasks. In this work, we present Universal Speech and Audio Distillation (USAD), a unified approach to audio representation learning that integrates diverse audio types - speech, sound, and music - into a single model. USAD employs efficient layer-to-layer distillation from domain-specific SSL models to train a student on a comprehensive audio dataset. USAD offers competitive performance across various benchmarks and datasets, including frame and instance-level speech processing tasks, audio tagging, and sound classification, achieving near state-of-the-art results with a single encoder on SUPERB and HEAR benchmarks.

Summary

  • The paper introduces USAD, a novel framework that distills knowledge from separate speech and general audio self-supervised models into a single encoder capable of creating universal representations for diverse audio tasks.
  • USAD employs a dual-teacher, sparse layer-to-layer distillation strategy with an efficient L1-cosine loss, reducing computational overhead by approximately 75% compared to dense distillation.
  • Experimental results show that USAD achieves competitive performance across a wide range of speech and non-speech audio benchmarks (SUPERB, HEAR, AudioSet), demonstrating that unified audio representation learning is both effective and efficient, and that it simplifies downstream applications.

Universal Speech and Audio Distillation: A Unified Approach to Audio Representation Learning

The paper "USAD: Universal Speech and Audio Representation via Distillation" (2506.18843) addresses the persistent fragmentation in self-supervised audio representation learning, where models are typically specialized for either speech or non-speech (sound/music) domains. The authors propose Universal Speech and Audio Distillation (USAD), a unified framework that leverages knowledge distillation from domain-specific self-supervised learning (SSL) models to train a single encoder capable of extracting general-purpose representations across speech, sound, and music.

Methodology

USAD is built upon the insight that while speech and non-speech audio share underlying signal characteristics, existing SSL models are optimized for their respective domains, leading to suboptimal cross-domain generalization. The USAD framework employs a dual-teacher, sparse layer-to-layer (L2L) distillation strategy:

  • Dual-Teacher Distillation: USAD simultaneously distills knowledge from two pre-trained SSL models—one specialized in speech (e.g., WavLM Base+) and one in general audio (e.g., ATST Frame). Both teachers process the same mixed-domain input, and the student model is trained to match their intermediate representations.
  • Sparse L2L Distillation: Instead of dense, computationally expensive layer-wise matching, USAD distills only from a subset of layers (e.g., 4 out of 12), leveraging the redundancy between adjacent transformer layers. This reduces computational overhead by approximately 75% compared to dense L2L approaches.
  • Loss Function: The distillation objective combines L1 distance and cosine similarity between the student’s predicted features and the teachers’ feed-forward network (FFN) outputs, eschewing contrastive losses and negative sampling for efficiency.

A critical design choice is the use of frame-based feature extraction for both teachers and the student, ensuring temporal alignment and preserving fine-grained information necessary for speech tasks, while maintaining sufficient generality for non-speech audio.
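To make the distillation objective concrete, the following PyTorch-style sketch combines L1 distance and cosine similarity over a sparse subset of layers for two teachers. It is illustrative only: the specific layer indices, the linear prediction heads, and the equal weighting of the two loss terms and two teachers are assumptions, not the paper's exact configuration.

```python
# Minimal sketch of dual-teacher, sparse layer-to-layer distillation with an
# L1 + cosine loss. Layer indices, head structure, and loss weighting are
# assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


def l1_cosine_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """L1 distance plus (1 - cosine similarity), averaged over frames."""
    l1 = F.l1_loss(pred, target)
    cos = 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
    return l1 + cos


class SparseL2LDistiller(nn.Module):
    """Distill a subset of teacher layers into matching student layers."""

    def __init__(self, dim: int, student_layers=(3, 6, 9, 12), num_teachers: int = 2):
        super().__init__()
        self.student_layers = student_layers
        # One linear prediction head per (distilled layer, teacher) pair.
        self.heads = nn.ModuleList(
            nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_teachers))
            for _ in student_layers
        )

    def forward(self, student_feats, speech_targets, audio_targets):
        # student_feats: dict {layer_idx: (B, T, dim)} student hidden states
        # *_targets:     dict {layer_idx: (B, T, dim)} teacher FFN outputs,
        #                frame-aligned with the student features
        loss = 0.0
        for heads, layer in zip(self.heads, self.student_layers):
            h = student_feats[layer]
            loss = loss + l1_cosine_loss(heads[0](h), speech_targets[layer])
            loss = loss + l1_cosine_loss(heads[1](h), audio_targets[layer])
        return loss / len(self.student_layers)
```

Because only a few layers are matched and no negative samples are drawn, each training step requires far fewer target projections and comparisons than dense layer-wise or contrastive objectives, which is where the reported compute savings come from.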

Experimental Results

USAD is evaluated on a comprehensive suite of benchmarks, including SUPERB (speech), HEAR (holistic audio), AudioSet (audio tagging), and ESC-50 (sound classification). The training corpus, Mix126k-B, is a balanced mixture of large-scale speech, sound, and music datasets, with upsampling to ensure domain parity.
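As a rough illustration of balancing domains by upsampling, the sketch below repeats files from the smaller domains until each domain contributes roughly equally. The dataset names and sizes are placeholders and do not reflect the actual composition of Mix126k-B.

```python
# Illustrative domain-balancing via upsampling; names and counts are toy values.
import random


def build_balanced_mixture(domains: dict[str, list[str]]) -> list[str]:
    """Upsample each domain's file list to match the largest domain."""
    target = max(len(files) for files in domains.values())
    mixture = []
    for name, files in domains.items():
        repeats, remainder = divmod(target, len(files))
        upsampled = files * repeats + random.sample(files, remainder)
        mixture.extend(upsampled)
    random.shuffle(mixture)
    return mixture


# Example with toy per-domain file lists.
mix = build_balanced_mixture({
    "speech": [f"speech_{i}.wav" for i in range(1000)],
    "sound": [f"sound_{i}.wav" for i in range(400)],
    "music": [f"music_{i}.wav" for i in range(250)],
})
```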

Key findings include:

  • Competitive Performance Across Domains: USAD Base (94M parameters) achieves a SUPERB average score of 787.0, outperforming all audio SSL baselines and closely matching or surpassing domain-specific teacher models in both speech and audio tasks.
  • Scalability: Increasing model size (USAD Large, 330M parameters) yields further gains, with an average SUPERB score of 851.7 and strong results on HEAR, closing the gap with state-of-the-art task-specific models.
  • Efficiency: Sparse L2L distillation and the L1-cosine loss enable USAD to reach high performance with significantly reduced computational cost compared to dense distillation or contrastive learning approaches.
  • Ablation Studies: The choice of teacher models, the data distribution, and the distillation strategy are all shown to affect downstream performance. Notably, frame-based teachers and balanced training data are essential for robust cross-domain generalization.

Notable Numerical Results

  • On SUPERB, USAD Base achieves scores of 868.9 (frame-level speech), 938.0 (instance-level speech), and 554.2 (audio), with an overall average of 787.0.
  • On HEAR, USAD Large attains an average score of 79.7, surpassing the concatenated teacher topline (78.5) and approaching the best per-task results on several benchmarks.
  • USAD models consistently outperform single-domain SSL models in joint evaluations, demonstrating the effectiveness of the unified approach.

Implications and Future Directions

USAD demonstrates that a single encoder, distilled from multiple domain-specific SSL experts, can achieve near state-of-the-art performance across a wide range of speech and audio tasks. This unification has several practical and theoretical implications:

  • Simplified Downstream Integration: Multimodal and audio-enabled systems (e.g., audio-LLMs, speech-to-audio generation) can leverage a single, general-purpose encoder, reducing system complexity and maintenance overhead.
  • Resource Efficiency: Sparse distillation and unified training reduce the need for maintaining and deploying multiple large models, which is particularly beneficial for edge and real-time applications.
  • Foundation for Multimodal AI: As audio becomes increasingly central in multimodal AI, universal representations such as those produced by USAD are likely to become foundational components for large-scale, cross-domain models.

The authors identify several avenues for future work, including extending USAD to multilingual speech, improving robustness to domain shifts, and integrating the framework into large audio-LLMs. The demonstrated scalability and efficiency of USAD suggest that further gains are possible with larger models and more diverse training data.

Conclusion

USAD provides a principled and efficient solution to the challenge of universal audio representation learning. By distilling from multiple domain-specific SSL models using a sparse, frame-aligned strategy, USAD achieves strong, balanced performance across speech, sound, and music tasks. The approach offers a practical path toward unified audio encoders, with significant implications for the development of generalist AI systems and multimodal applications.
