How Should We Extract Discrete Audio Tokens from Self-Supervised Models? (2406.10735v1)

Published 15 Jun 2024 in cs.SD, cs.AI, cs.CL, and eess.AS

Abstract: Discrete audio tokens have recently gained attention for their potential to bridge the gap between audio and language processing. Ideal audio tokens must preserve content, paralinguistic elements, speaker identity, and many other audio details. Current audio tokenization methods fall into two categories: Semantic tokens, acquired through quantization of Self-Supervised Learning (SSL) models, and Neural compression-based tokens (codecs). Although previous studies have benchmarked codec models to identify optimal configurations, the ideal setup for quantizing pretrained SSL models remains unclear. This paper explores the optimal configuration of semantic tokens across discriminative and generative tasks. We propose a scalable solution to train a universal vocoder across multiple SSL layers. Furthermore, an attention mechanism is employed to identify task-specific influential layers, enhancing the adaptability and performance of semantic tokens in diverse audio applications.
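To make the abstract's two core ideas concrete, here is a minimal sketch of (1) semantic tokens obtained by k-means quantization of SSL layer features and (2) a learned attention over layers that surfaces task-relevant ones. Everything in it is an assumption for illustration (a HuBERT-style 12-layer, 768-dim model, 128-entry codebooks, random tensors standing in for real SSL features); it is not the paper's implementation.

```python
# Illustrative sketch only: sizes and stand-in data are assumptions.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

NUM_LAYERS, HIDDEN, CODEBOOK, FRAMES = 12, 768, 128, 1000  # assumed sizes

# (1) Quantize each layer's frame-level features with its own codebook.
#     In practice layer_feats[l] would come from layer l of a pretrained SSL model.
layer_feats = [torch.randn(FRAMES, HIDDEN) for _ in range(NUM_LAYERS)]
codebooks = [KMeans(n_clusters=CODEBOOK, n_init=4).fit(f.numpy()) for f in layer_feats]
tokens = [torch.as_tensor(cb.predict(f.numpy())) for cb, f in zip(codebooks, layer_feats)]
print(tokens[0].shape)  # (FRAMES,) discrete token ids for layer 0

# (2) Softmax-weighted sum over layers; the scalar scores are learned
#     jointly with a downstream head, so the resulting weights indicate
#     which layers are influential for that task.
class LayerAttention(nn.Module):
    def __init__(self, num_layers: int):
        super().__init__()
        self.scores = nn.Parameter(torch.zeros(num_layers))

    def forward(self, embs: torch.Tensor) -> torch.Tensor:
        # embs: (num_layers, frames, hidden) -> (frames, hidden)
        w = torch.softmax(self.scores, dim=0)
        return torch.einsum("l,lfh->fh", w, embs)

mixed = LayerAttention(NUM_LAYERS)(torch.stack(layer_feats))
```

After training, inspecting the softmax weights gives a per-task ranking of layers, which is the kind of layer-selection signal the abstract describes.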

Authors (7)
  1. Pooneh Mousavi (9 papers)
  2. Jarod Duret (10 papers)
  3. Salah Zaiem (17 papers)
  4. Luca Della Libera (14 papers)
  5. Artem Ploujnikov (6 papers)
  6. Cem Subakan (35 papers)
  7. Mirco Ravanelli (72 papers)
Citations (6)
