Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS
The paper "Towards Universal Speech Discrete Tokens: A Case Study for ASR and TTS" explores the universality and efficacy of discrete speech tokens across multiple speech processing tasks, specifically focusing on Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). Using discrete tokens derived from Self-Supervised Learning (SSL) models, the authors aim to compare these with traditional feature representations in speech processing tasks, offering potential improvements in storage efficiency and model performance.
Methodological Insights
The researchers conducted an extensive study using discrete tokens generated by four prominent pretrained models: vq-wav2vec, EnCodec (a neural audio codec rather than an SSL model), HuBERT, and WavLM. These tokens were assessed for their utility in ASR and TTS tasks:
- ASR Study: Discrete tokens were used to train end-to-end (E2E) ASR models on datasets including LibriSpeech and GigaSpeech. The researchers introduced specialized data augmentation strategies to counter overfitting and improve robustness when training on discrete tokens. Models were evaluated by Word Error Rate (WER) and Character Error Rate (CER), i.e., edit distance at the word or character level normalized by reference length. A sketch of the tokenization and augmentation steps follows this list.
- TTS Study: The TTS evaluation focused on resynthesis, which gauges the upper bound of synthesis quality achievable with discrete tokens. Techniques such as CTX-vec2wav, enhanced with mel-spectrogram prompts, were applied and compared against synthesis from traditional mel-spectrogram features.
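The tokenization pipeline at the heart of both studies can be illustrated with a short sketch. The example below is a minimal, assumption-laden illustration, not the paper's exact setup: it quantizes hidden states from a Hugging Face HuBERT checkpoint with a scikit-learn k-means codebook, then applies a SpecAugment-style token-masking function of the kind one might use for augmentation. The layer index (9), codebook size (500), masking parameters, and file paths are all placeholder assumptions.

```python
import numpy as np
import torch
import torchaudio
from sklearn.cluster import KMeans
from transformers import Wav2Vec2FeatureExtractor, HubertModel

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def ssl_features(wav_path: str, layer: int = 9) -> torch.Tensor:
    """Frame-level hidden states [T, D] from one HuBERT transformer layer."""
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)
    inputs = extractor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        out = hubert(**inputs, output_hidden_states=True)
    return out.hidden_states[layer].squeeze(0)

# Fit a k-means codebook on features pooled from (a subset of) the corpus,
# then map each utterance to a compact sequence of integer token IDs.
train_feats = torch.cat([ssl_features(p) for p in ["a.wav", "b.wav"]])
kmeans = KMeans(n_clusters=500, n_init=10).fit(train_feats.numpy())
tokens = kmeans.predict(ssl_features("a.wav").numpy())  # shape [T], int IDs

def mask_tokens(ids: np.ndarray, num_spans: int = 2, span: int = 10,
                mask_id: int = 500) -> np.ndarray:
    """SpecAugment-style time masking on a token sequence; an illustration,
    not the paper's exact augmentation recipe."""
    out = ids.copy()
    for _ in range(num_spans):
        start = np.random.randint(0, max(1, len(out) - span))
        out[start:start + span] = mask_id  # a reserved mask token ID
    return out
```

Storage efficiency follows directly from this representation: each frame collapses from a high-dimensional float vector to a single small integer index.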
Key Findings
- ASR Performance: Discrete tokens from HuBERT and WavLM performed competitively with traditional FBank features, especially in low-resource scenarios, whereas tokens from EnCodec and vq-wav2vec were notably less effective.
- TTS Performance: In TTS tasks, discrete tokens from all models except EnCodec delivered high-quality audio comparable to mel-spectrogram features. Notably, DAC tokens achieved superior resynthesis quality without additional fine-tuning. (A codec round-trip sketch follows this list.)
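The resynthesis idea behind these findings can be shown with the public EnCodec package alone: encode a waveform into discrete codec tokens, then decode audio back from nothing but those tokens. This is a minimal sketch of generic codec round-trip resynthesis, not the paper's CTX-vec2wav pipeline or its mel-spectrogram prompting; the input path and target bandwidth are placeholder assumptions.

```python
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()
model.set_target_bandwidth(6.0)  # bandwidth sets how many codebooks are used

wav, sr = torchaudio.load("sample.wav")  # placeholder input file
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    frames = model.encode(wav)                          # list of (codes, scale)
    codes = torch.cat([f[0] for f in frames], dim=-1)   # [B, n_q, T] token IDs
    resynth = model.decode(frames)                      # audio rebuilt from tokens

torchaudio.save("resynth.wav", resynth.squeeze(0), model.sample_rate)
```

Comparing `resynth.wav` against the original input gives exactly the kind of upper-bound quality measurement the TTS study relies on.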
Implications and Future Work
This paper highlights the potential for discrete tokens to replace traditional speech features in a range of applications, with advantages in storage and processing cost. The empirical results suggest that these tokens can match, and in some settings exceed, the performance of conventional features in both ASR and TTS tasks.
The theoretical implications extend to cross-modal modeling, where discrete tokens can serve as a bridge between speech and text representations. Future research might examine how well these tokens generalize across languages and pursue further optimization for multi-task scenarios.
The research serves as a baseline for continued work toward more efficient and effective universal speech-processing models that unify the representation of spoken and written language. The authors' open-source release supports collaborative progress in this domain.