Discrete Audio Tokens: More Than a Survey! (2506.10274v2)

Published 12 Jun 2025 in cs.SD, cs.AI, cs.CL, and eess.AS

Abstract: Discrete audio tokens are compact representations that aim to preserve perceptual quality, phonetic content, and speaker characteristics while enabling efficient storage and inference, as well as competitive performance across diverse downstream tasks. They provide a practical alternative to continuous features, enabling the integration of speech and audio into modern LLMs. As interest in token-based audio processing grows, various tokenization methods have emerged, and several surveys have reviewed the latest progress in the field. However, existing studies often focus on specific domains or tasks and lack a unified comparison across various benchmarks. This paper presents a systematic review and benchmark of discrete audio tokenizers, covering three domains: speech, music, and general audio. We propose a taxonomy of tokenization approaches based on encoder-decoder architectures, quantization techniques, training paradigms, streamability, and application domains. We evaluate tokenizers on multiple benchmarks for reconstruction, downstream performance, and acoustic language modeling, and analyze trade-offs through controlled ablation studies. Our findings highlight key limitations, practical considerations, and open challenges, providing insight and guidance for future research in this rapidly evolving area. For more information, including our main results and tokenizer database, please refer to our website: https://poonehmousavi.github.io/dates-website/.

An In-Depth Analysis of Discrete Audio Tokenization

In "Discrete Audio Tokens: More Than a Survey," the authors present a comprehensive examination of discrete audio tokenization techniques, focusing on their potential to revolutionize the integration of audio processing within LLMs. This paper offers a systematic review and benchmarking of diverse tokenization methods across three primary domains: speech, music, and general audio. By assessing multiple facets of tokenization, including encoder-decoder architectures, quantization techniques, and training paradigms, the authors aim to establish a cohesive understanding of how discrete audio tokens can serve modern, multimodal AI systems.

Key Findings

  1. Taxonomy of Tokenization Approaches: The paper introduces a taxonomy that categorizes tokenization methods along five major axes: encoder-decoder architecture, quantization technique, training paradigm, streamability, and application domain. This structure clarifies how existing methods relate to one another and highlights the critical architectural choices in designing effective audio tokenization systems (a minimal quantization sketch follows this list).
  2. Benchmark Evaluation: Detailed benchmarks cover audio reconstruction, downstream task performance, and acoustic language modeling, leveraging existing and newly introduced suites such as Codec-SUPERB, DASB, and SALMon. These analyses reveal significant performance differences across tokenizers trained under various conditions and underscore the importance of a consistent, standardized evaluation protocol for obtaining comparable metrics (a reconstruction-metric sketch also follows this list).
  3. Ablation Studies: The authors run controlled experiments to evaluate the impact of specific design choices on tokenizer training, including quantization method, sampling rate, and single-domain vs. multi-domain training data, all within a standardized framework (ESPnet-Codec). These studies suggest that domain-specific training improves reconstruction quality within that domain but often fails to generalize, emphasizing the need for future research into cross-domain tokenization strategies.
  4. Implications and Future Directions: The paper explores both the theoretical implications and practical applications of discrete audio tokens, highlighting their role in bridging the gap between text and audio processing and their efficiency in storage, transmission, and integration within multimodal models. The authors argue that continued advances in robust tokenization, including semantic distillation and better quantization strategies, will further enhance their utility in generative AI tasks alongside traditional audio applications.
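
Since quantization sits at the center of the taxonomy, a minimal sketch of residual vector quantization (RVQ), the scheme popularized by codec-style tokenizers such as SoundStream and EnCodec, may help make the idea concrete. The codebooks below are random stand-ins chosen purely for illustration; real tokenizers learn them jointly with the encoder and decoder:

```python
import numpy as np

# Minimal residual vector quantization (RVQ) sketch. Codebooks are random
# stand-ins for illustration; real tokenizers learn them during training.
rng = np.random.default_rng(0)
latent_dim, codebook_size, num_codebooks = 64, 1024, 8
codebooks = rng.normal(size=(num_codebooks, codebook_size, latent_dim))

def rvq_encode(z: np.ndarray) -> list[int]:
    """Quantize one latent frame z into num_codebooks discrete token ids."""
    residual, tokens = z.copy(), []
    for cb in codebooks:
        # Pick the nearest codeword, then quantize what it failed to capture.
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        tokens.append(idx)
        residual -= cb[idx]
    return tokens

def rvq_decode(tokens: list[int]) -> np.ndarray:
    """Reconstruct the latent frame by summing the selected codewords."""
    return sum(cb[idx] for cb, idx in zip(codebooks, tokens))

z = rng.normal(size=latent_dim)            # one encoder output frame
tokens = rvq_encode(z)                     # 8 token ids for this frame
z_hat = rvq_decode(tokens)
print(tokens, np.linalg.norm(z - z_hat))   # error shrinks as codebooks are added
```

Each codebook quantizes the residual left by the previous one, so reconstruction error shrinks as codebooks are added; this is what lets codec-style tokenizers trade bitrate against fidelity by varying the number of active codebooks.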
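On the evaluation side, reconstruction quality is measured by comparing the decoded waveform against the original. As one illustrative signal-level metric (benchmark suites such as Codec-SUPERB combine several signal-level and perceptual measures, not just this one), scale-invariant SNR can be computed as follows:

```python
import numpy as np

def si_snr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-noise ratio in dB between two waveforms."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to remove any gain mismatch.
    target = (np.dot(estimate, reference) / (np.dot(reference, reference) + eps)) * reference
    noise = estimate - target
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))

# Toy usage: a lightly corrupted sine wave scores a high (positive) dB value.
t = np.linspace(0, 1, 16000)
clean = np.sin(2 * np.pi * 440 * t)
print(si_snr(clean, clean + 0.01 * np.random.randn(t.size)))
```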

Speculative Outlook on Future Developments in AI

  1. Enhanced Multimodal Integration: As AI systems increasingly handle multimodal inputs, discrete audio tokens can play a pivotal role in integrating audio into text-centric LLM frameworks, enabling richer interaction and better synthesis, understanding, and reasoning across tasks that combine auditory and textual data.
  2. Scalability and Efficiency: Token-based frameworks promise lower storage and computational overhead and faster processing, benefiting real-time applications from automated transcription to interactive voice response systems (see the back-of-the-envelope bitrate comparison after this list). This efficiency stems from the compact, modular nature of tokens, which transfer readily across tasks like speech synthesis, translation, and enhancement.
  3. Potential for Generalization: Cross-domain generalization remains a key limitation, but research into models that capture abstract representations shared across speech, music, and general audio may enable universal tokenization frameworks that transcend domain-specific barriers. Solving these challenges will be integral to building more general, scalable, and robust tokenization systems across diverse areas of AI.
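
To make the storage and bandwidth argument concrete, a token stream's bitrate is simply frame rate × codebooks per frame × bits per token. A back-of-the-envelope comparison, using illustrative numbers in the range reported for codecs like EnCodec at 24 kHz, shows the gap to raw PCM:

```python
import math

# Back-of-the-envelope bitrate comparison: discrete tokens vs. raw PCM.
# Numbers are illustrative, in the range of codecs like EnCodec at 24 kHz.
frame_rate = 75          # token frames per second
num_codebooks = 8        # RVQ codebooks per frame
codebook_size = 1024     # entries per codebook -> 10 bits per token
bits_per_token = math.log2(codebook_size)

token_bitrate = frame_rate * num_codebooks * bits_per_token  # 6,000 bps
pcm_bitrate = 24_000 * 16                                    # 384,000 bps (16-bit mono)

print(f"tokens: {token_bitrate/1000:.1f} kbps, PCM: {pcm_bitrate/1000:.0f} kbps, "
      f"~{pcm_bitrate/token_bitrate:.0f}x smaller")          # ~64x smaller
```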

In summary, "Discrete Audio Tokens: More Than a Survey" provides a solid foundation for ongoing research in discrete audio tokenization. As multimodal systems mature, tighter text-audio integration promises substantial gains in how machines understand and interact with complex auditory environments.

Authors (21)
  1. Pooneh Mousavi (9 papers)
  2. Gallil Maimon (8 papers)
  3. Adel Moumen (7 papers)
  4. Darius Petermann (11 papers)
  5. Jiatong Shi (82 papers)
  6. Haibin Wu (84 papers)
  7. Haici Yang (10 papers)
  8. Anastasia Kuznetsova (5 papers)
  9. Artem Ploujnikov (6 papers)
  10. Ricard Marxer (21 papers)
  11. Bhuvana Ramabhadran (47 papers)
  12. Benjamin Elizalde (26 papers)
  13. Loren Lugosch (13 papers)
  14. Jinyu Li (164 papers)
  15. Cem Subakan (35 papers)
  16. Phil Woodland (7 papers)
  17. Minje Kim (53 papers)
  18. Hung-yi Lee (325 papers)
  19. Shinji Watanabe (416 papers)
  20. Yossi Adi (96 papers)