An Overview of "SpeechTokenizer: Unified Speech Tokenizer for Speech LLMs"
The paper "SpeechTokenizer: Unified Speech Tokenizer for Speech LLMs" presents an innovative approach to improving the efficiency and effectiveness of speech LLMs (SLMs) by introducing a unified speech tokenizer, named SpeechTokenizer. The motivation behind this work is to address the limitations associated with existing speech tokens, which have been traditionally bifurcated into semantic and acoustic tokens. These token types have not been specifically optimized for speech LLMing, leading to inefficiencies and performance bottlenecks.
Key Contributions and Methodology
The primary contribution of this work is SpeechTokenizer itself, which employs an encoder-decoder architecture with residual vector quantization (RVQ). By unifying semantic and acoustic tokens, it disentangles different aspects of speech information hierarchically across RVQ layers: the first quantizer is guided, via distillation from a self-supervised semantic teacher (HuBERT), to capture content information, while subsequent quantizers encode the residual acoustic details such as timbre. This lets a single tokenizer satisfy the dual requirements of preserving speech information and staying aligned with textual content.
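To make the layered quantization concrete, here is a minimal NumPy sketch of residual vector quantization. The dimensions, codebook sizes, and random codebooks are illustrative assumptions, not the paper's trained configuration; the point is only the mechanic that each layer quantizes what the previous layers left behind.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; the paper's actual configuration differs.
num_quantizers = 8      # RVQ depth (one token stream per layer)
codebook_size = 1024    # entries per codebook
dim = 64                # latent dimension of the encoder output

# One random codebook per RVQ layer (a trained tokenizer learns these).
codebooks = [rng.normal(size=(codebook_size, dim)) for _ in range(num_quantizers)]

def rvq_encode(x, codebooks):
    """Quantize a batch of latent frames x of shape (T, dim) layer by layer.

    Each layer quantizes the *residual* left by the previous layers, so
    early layers capture coarse structure and later layers fine detail.
    Returns one token index per frame per layer, plus the reconstruction.
    """
    residual = x.copy()
    tokens, recon = [], np.zeros_like(x)
    for cb in codebooks:
        # Nearest codebook entry for each frame of the current residual.
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)
        quantized = cb[idx]
        tokens.append(idx)
        recon += quantized
        residual -= quantized
    return np.stack(tokens), recon

frames = rng.normal(size=(100, dim))   # stand-in for encoder output
tokens, recon = rvq_encode(frames, codebooks)
print(tokens.shape)                    # (num_quantizers, 100)
```

Because each layer only sees what earlier layers failed to encode, dropping later layers degrades detail gracefully; this residual structure is what allows SpeechTokenizer to dedicate the first layer to content and the rest to acoustics.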
The authors also introduce SLMTokBench, the first benchmark designed specifically to assess the suitability of speech tokens for building speech LLMs, along two axes: alignment with text and preservation of speech information. Experiments on this benchmark show that neither existing semantic nor acoustic tokens adequately meet both needs, supporting the case for a unified approach.
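As a rough illustration of what a text-alignment probe can look like, the sketch below defines a tiny CTC classifier from speech tokens to characters: the lower this loss can be driven during training, the more textual content the tokens expose. This is a hypothetical stand-in for the idea, not SLMTokBench's actual estimation procedure; the vocabulary size and probe architecture are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical probe: can a tiny model read the text out of speech tokens?
vocab_size = 1024      # size of the speech-token codebook (assumed)
num_chars = 29         # 26 letters + space + apostrophe + CTC blank (id 0)

probe = nn.Sequential(
    nn.Embedding(vocab_size, 128),   # embed discrete speech tokens
    nn.Linear(128, num_chars),       # frame-wise character logits
)
ctc = nn.CTCLoss(blank=0)

def alignment_loss(speech_tokens, char_targets, target_lens):
    """Lower trainable CTC loss => tokens expose more textual content.

    speech_tokens: (B, T) int token ids from one RVQ layer
    char_targets:  (B, S) int character ids (padded, no blanks)
    """
    logits = probe(speech_tokens)                       # (B, T, num_chars)
    log_probs = logits.log_softmax(-1).transpose(0, 1)  # (T, B, C) for CTC
    input_lens = torch.full((speech_tokens.size(0),),
                            speech_tokens.size(1), dtype=torch.long)
    return ctc(log_probs, char_targets, input_lens, target_lens)

# Toy batch: 2 utterances, 50 token frames each, short "transcripts".
tokens = torch.randint(0, vocab_size, (2, 50))
targets = torch.randint(1, num_chars, (2, 8))
loss = alignment_loss(tokens, targets, torch.tensor([8, 8]))
print(float(loss))
```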
To validate their approach, the authors built a Unified Speech Language Model (USLM) on top of SpeechTokenizer. The model combines an autoregressive component, which models content by predicting the first-layer tokens from text, with a non-autoregressive component that completes the remaining layers, enabling conversion from text to speech with high-quality, contextually relevant output.
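The decoding split is easiest to see in code. Below is a hedged sketch of the two-stage flow: an autoregressive stage emits the first-layer (content) tokens one by one, then a non-autoregressive stage fills in each remaining RVQ layer for all frames at once. `ar_model` and `nar_model` are hypothetical stubs standing in for the trained models, so the control flow runs end to end.

```python
import torch

# Hypothetical stubs standing in for trained transformers; a real USLM
# learns these mappings from paired text-speech data.
def ar_model(text_ids, prefix):
    """Predict the next first-layer token given text and the tokens so far."""
    return 1024 if len(prefix) >= 20 else int(torch.randint(0, 1024, (1,)))

def nar_model(text_ids, layers, layer_index):
    """Predict an entire RVQ layer in one parallel step."""
    return torch.randint(0, 1024, (layers[0].numel(),))

def generate_speech_tokens(text_ids, num_layers=8, max_frames=500, eos_id=1024):
    # Stage 1 (autoregressive): the first RVQ layer carries the content,
    # so it is decoded token by token, conditioned on the text.
    first_layer = []
    while len(first_layer) < max_frames:
        tok = ar_model(text_ids, first_layer)
        if tok == eos_id:
            break
        first_layer.append(tok)

    # Stage 2 (non-autoregressive): the remaining layers add acoustic
    # detail and are each predicted for all frames in parallel.
    layers = [torch.tensor(first_layer)]
    for k in range(1, num_layers):
        layers.append(nar_model(text_ids, layers, layer_index=k))
    return torch.stack(layers)  # (num_layers, T), fed to the decoder

tokens = generate_speech_tokens(text_ids=torch.tensor([12, 7, 99]))
print(tokens.shape)  # torch.Size([8, 20])
```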
Experimental Results
The experimental evaluation demonstrates several compelling findings:
- Speech Reconstruction: SpeechTokenizer performs comparably to EnCodec on speech reconstruction, indicating that unifying the token types does not come at the cost of audio quality.
- Text-to-Speech: The USLM outperforms VALL-E on zero-shot text-to-speech (TTS), achieving lower word error rates (WER; see the sketch after this list) and higher speaker-similarity scores, indicating stronger handling of diverse and complex speech tasks.
- Evaluation on SLMTokBench: SpeechTokenizer shows strong alignment with text while preserving essential speech information, outperforming prior standalone semantic or acoustic tokens.
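For reference, the WER figure cited above is the standard word-level edit distance between an ASR transcript of the generated speech and the reference text, normalized by reference length. A minimal self-contained implementation of that definition (not tied to the paper's evaluation toolchain):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions + insertions + deletions)
    divided by reference length -- the standard WER definition."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table of edit distances between prefixes.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1] / max(len(ref), 1)

print(word_error_rate("the cat sat", "the cat sat down"))  # ~0.33
```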
These results underpin the claim that the SpeechTokenizer approach effectively reduces redundancy and complexity inherent in multi-stage SLM architectures, offering a more streamlined and efficient framework for speech processing.
Implications and Future Directions
The introduction of SpeechTokenizer could have significant implications for the field of speech processing, prompting a shift towards systems that integrate both semantic and acoustic features in a unified tokenization space. The potential applications are vast, ranging from more efficient deployment of SLMs in resource-constrained environments to improved synthesis and recognition systems that require less training data and processing power.
For future developments, expanding the scope of SpeechTokenizer to encompass multilingual contexts could open further possibilities in global speech applications, where language-based nuances need to be accounted for. Additionally, exploring how this unified tokenization can enhance other modalities of language processing, such as video-to-text or multimodal understanding, presents a promising avenue for extending this research.
In conclusion, SpeechTokenizer and its accompanying findings mark a significant advance in the design and utility of speech LLMs, changing how speech tokenization is conceptualized and implemented within such systems.