Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS
The paper "Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS" explores the universality and efficacy of discrete speech tokens across multiple speech processing tasks, specifically focusing on Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). Using discrete tokens derived from Self-Supervised Learning (SSL) models, the authors aim to compare these with traditional feature representations in speech processing tasks, offering potential improvements in storage efficiency and model performance.
Methodological Insights
The researchers conducted an extensive study using discrete tokens generated by four prominent pretrained models: vq-wav2vec, EnCodec (a neural audio codec rather than an SSL model), HuBERT, and WavLM. These tokens were assessed for their utility in ASR and TTS tasks:
- ASR Study: Discrete tokens were used to train end-to-end (E2E) ASR models on datasets including LibriSpeech and GigaSpeech. The researchers introduced specialized data augmentation strategies to counter overfitting and improve robustness when training on discrete tokens. Models were evaluated by Word Error Rate (WER) and Character Error Rate (CER), i.e., edit distance at the word or character level normalized by reference length. A sketch of the tokenization and augmentation steps follows this list.
- TTS Study: The TTS evaluation focused on resynthesis, which gauges the upper bound of synthesis quality achievable with discrete tokens. Techniques such as CTX-vec2wav, enhanced with mel-spectrogram prompts, were applied and compared against synthesis from traditional mel-spectrogram features.
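The tokenization pipeline at the heart of both studies can be illustrated with a short sketch. The example below is a minimal, assumption-laden illustration, not the paper's exact setup: it quantizes hidden states from a Hugging Face HuBERT checkpoint with a scikit-learn k-means codebook, then applies a SpecAugment-style token-masking function of the kind one might use for augmentation. The layer index (9), codebook size (500), masking parameters, and file paths are all placeholder assumptions.

```python
import numpy as np
import torch
import torchaudio
from sklearn.cluster import KMeans
from transformers import Wav2Vec2FeatureExtractor, HubertModel

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def ssl_features(wav_path: str, layer: int = 9) -> torch.Tensor:
    """Frame-level hidden states [T, D] from one HuBERT transformer layer."""
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)
    inputs = extractor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        out = hubert(**inputs, output_hidden_states=True)
    return out.hidden_states[layer].squeeze(0)

# Fit a k-means codebook on features pooled from (a subset of) the corpus,
# then map each utterance to a compact sequence of integer token IDs.
train_feats = torch.cat([ssl_features(p) for p in ["a.wav", "b.wav"]])
kmeans = KMeans(n_clusters=500, n_init=10).fit(train_feats.numpy())
tokens = kmeans.predict(ssl_features("a.wav").numpy())  # shape [T], int IDs

def mask_tokens(ids: np.ndarray, num_spans: int = 2, span: int = 10,
                mask_id: int = 500) -> np.ndarray:
    """SpecAugment-style time masking on a token sequence; an illustration,
    not the paper's exact augmentation recipe."""
    out = ids.copy()
    for _ in range(num_spans):
        start = np.random.randint(0, max(1, len(out) - span))
        out[start:start + span] = mask_id  # a reserved mask token ID
    return out
```

Storage efficiency follows directly from this representation: each frame collapses from a high-dimensional float vector to a single small integer index.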
Key Findings
- ASR Performance: Discrete tokens from HuBERT and WavLM performed competitively with traditional FBank features, especially in low-resource scenarios, whereas tokens from EnCodec and vq-wav2vec were notably less effective.
- TTS Performance: In TTS tasks, discrete tokens from all models except EnCodec delivered high-quality audio comparable to mel-spectrogram features. Notably, DAC tokens achieved superior resynthesis quality without additional fine-tuning. (A codec round-trip sketch follows this list.)
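The resynthesis idea behind these findings can be shown with the public EnCodec package alone: encode a waveform into discrete codec tokens, then decode audio back from nothing but those tokens. This is a minimal sketch of generic codec round-trip resynthesis, not the paper's CTX-vec2wav pipeline or its mel-spectrogram prompting; the input path and target bandwidth are placeholder assumptions.

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # bandwidth sets how many codebooks are used

wav, sr = torchaudio.load("sample.wav")  # placeholder input file
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    frames = model.encode(wav)                          # list of (codes, scale)
    codes = torch.cat([f[0] for f in frames], dim=-1)   # [B, n_q, T] token IDs
    resynth = model.decode(frames)                      # audio rebuilt from tokens

torchaudio.save("resynth.wav", resynth.squeeze(0), model.sample_rate)
```

Comparing `resynth.wav` against the original input gives exactly the kind of upper-bound quality measurement the TTS study relies on.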
Implications and Future Work
This paper highlights the potential for discrete tokens to replace traditional speech features in a range of applications, with advantages in storage and processing cost. The empirical results suggest that these tokens can match, and in some settings exceed, the performance of conventional features in both ASR and TTS tasks.
The theoretical implications extend to cross-modal modeling, where discrete tokens can serve as a bridge between speech and text representations. Future research might examine how well these tokens generalize across languages and pursue further optimization for multi-task scenarios.
The research serves as a baseline for continued work toward more efficient and effective universal speech-processing models that unify the representation of spoken and written language. The authors' open-source release supports collaborative progress in this domain.