Vec-Tok Speech: speech vectorization and tokenization for neural speech generation (2310.07246v2)

Published 11 Oct 2023 in cs.SD and eess.AS

Abstract: Language models (LMs) have recently flourished in natural language processing and computer vision, generating high-fidelity texts or images in various tasks. In contrast, the current speech generative models are still struggling regarding speech quality and task generalization. This paper presents Vec-Tok Speech, an extensible framework that resembles multiple speech generation tasks, generating expressive and high-fidelity speech. Specifically, we propose a novel speech codec based on speech vectors and semantic tokens. Speech vectors contain acoustic details contributing to high-fidelity speech reconstruction, while semantic tokens focus on the linguistic content of speech, facilitating language modeling. Based on the proposed speech codec, Vec-Tok Speech leverages an LM to undertake the core of speech generation. Moreover, Byte-Pair Encoding (BPE) is introduced to reduce the token length and bit rate for lower exposure bias and longer context coverage, improving the performance of LMs. Vec-Tok Speech can be used for intra- and cross-lingual zero-shot voice conversion (VC), zero-shot speaking style transfer text-to-speech (TTS), speech-to-speech translation (S2ST), speech denoising, and speaker de-identification and anonymization. Experiments show that Vec-Tok Speech, built on 50k hours of speech, performs better than other SOTA models. Code will be available at https://github.com/BakerBunker/VecTok .

Overview of "Vec-Tok Speech: Speech Vectorization and Tokenization for Neural Speech Generation"

The paper "Vec-Tok Speech: Speech Vectorization and Tokenization for Neural Speech Generation" introduces a framework for enhancing speech generation systems. The framework, Vec-Tok Speech, is built around a novel codec for speech vectorization and tokenization that addresses the limitations of existing speech generative models in speech quality and task generalization.

Core Innovations

  1. Novel Speech Codec: The core innovation in Vec-Tok Speech is a new codec that combines speech vectors and semantic tokens. This dual representation captures both the acoustic and the linguistic elements of speech: speech vectors retain the detailed acoustic features needed for high-fidelity reconstruction, while semantic tokens encapsulate linguistic content, facilitating efficient language modeling (a minimal sketch of this dual representation follows the list).
  2. Large-Scale Data Utilization: The Vec-Tok Speech model is trained on a massive dataset of 50,000 hours of multi-domain speech, allowing it to perform competitively across various speech tasks such as voice conversion (VC), text-to-speech (TTS), and speech-to-speech translation (S2ST), both intra- and cross-lingually.
  3. Byte-Pair Encoding (BPE) for Token Optimization: To reduce token length and improve the efficiency of language models (LMs), the framework applies Byte-Pair Encoding (BPE) to the semantic token sequences. Shorter sequences lower exposure bias and extend context coverage, which enhances the flexibility and robustness of speech generation tasks (a toy BPE example also follows the list).
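
Where the codec's dual representation may be unclear, the following minimal sketch shows one way to realize it: continuous frame-level features stand in for the speech vectors, and k-means cluster ids over those features stand in for the semantic tokens. The WavLM checkpoint and the cluster count are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the dual representation: continuous "speech vectors"
# for acoustic detail, discrete "semantic tokens" for linguistic content.
# Assumptions: WavLM as the feature extractor and 300 k-means clusters.
import torch
from sklearn.cluster import KMeans
from transformers import Wav2Vec2FeatureExtractor, WavLMModel

extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-large")
model = WavLMModel.from_pretrained("microsoft/wavlm-large").eval()

def speech_vectors(wav_16khz: torch.Tensor) -> torch.Tensor:
    """Continuous frame-level features preserving acoustic detail."""
    inputs = extractor(wav_16khz.numpy(), sampling_rate=16000,
                       return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, frames, dim)
    return hidden.squeeze(0)

def semantic_tokens(vectors: torch.Tensor, n_clusters: int = 300) -> list[int]:
    """Discretize vectors into cluster ids that act as semantic tokens.
    In practice the clusters would be fit offline on a large corpus;
    fitting per (sufficiently long) utterance here only keeps the
    sketch self-contained."""
    km = KMeans(n_clusters=n_clusters, n_init=10)
    return km.fit_predict(vectors.numpy()).tolist()
```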
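
The effect of BPE in item 3 is easiest to see on a toy sequence: the most frequent adjacent token pair is repeatedly merged into a fresh token id, shortening the sequence the LM has to model. The token values and merge count below are invented for illustration.

```python
# Toy BPE compression over a semantic-token sequence.
from collections import Counter

def bpe_compress(seq: list[int], num_merges: int, next_id: int) -> list[int]:
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))   # counts of adjacent pairs
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]  # most frequent pair
        merged, i = [], 0
        while i < len(seq):                  # replace (a, b) with next_id
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                merged.append(next_id)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq, next_id = merged, next_id + 1
    return seq

tokens = [7, 7, 12, 7, 7, 12, 3, 7, 7, 12]  # hypothetical semantic tokens
print(len(tokens), len(bpe_compress(tokens, num_merges=2, next_id=300)))  # 10 4
```

Two merges shrink the ten-token sequence to four tokens, which is the mechanism behind the lower bit rate and longer effective context noted above.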

Experimental Results

The experiments showed that the model outperforms state-of-the-art (SOTA) systems on several key metrics:

  • Speech Quality: Vec-Tok Speech achieved higher mean opinion scores (MOS) for speech naturalness than LM-VC on zero-shot VC and VALL-E X on zero-shot TTS.
  • Speaker Identity Preservation: The model maintains the speaker's identity across conversions and translations, as evidenced by high cosine similarity between speaker embeddings (see the sketch after this list).
  • Zero-shot Capability: The framework exhibits robust zero-shot performance, particularly for TTS applications, allowing for style transfer using separate prompts for speaker and style identity, which is a novel capability not offered by peer models.
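
The cosine-similarity measurement above is straightforward to reproduce in spirit: embed the reference and generated utterances with a speaker verifier and compare the embeddings. The sketch below assumes SpeechBrain's pretrained ECAPA-TDNN model as the embedding extractor; the paper's actual verifier may differ.

```python
# Hedged sketch of speaker-similarity scoring via speaker embeddings.
import torch
import torch.nn.functional as F
from speechbrain.pretrained import EncoderClassifier

# Assumed embedding model: a public ECAPA-TDNN speaker verifier.
verifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb")

def speaker_similarity(wav_ref: torch.Tensor, wav_gen: torch.Tensor) -> float:
    """Cosine similarity between speaker embeddings of two (1, samples)
    mono 16 kHz waveforms; values near 1.0 suggest the same speaker."""
    emb_ref = verifier.encode_batch(wav_ref).squeeze()
    emb_gen = verifier.encode_batch(wav_gen).squeeze()
    return F.cosine_similarity(emb_ref, emb_gen, dim=0).item()
```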

Theoretical and Practical Implications

Vec-Tok Speech marks a significant stride toward bridging the gap between the text and speech modalities using large-scale language models. The dual focus on high-fidelity reconstruction and efficient tokenization addresses bottlenecks in existing speech generative frameworks, making the model both scalable and adaptable across lingual boundaries. This advancement is pertinent to diverse applications requiring fast and accurate speech synthesis and conversion, including real-time translation and personalized assistive technologies.

Future Directions

Future research could explore further optimization in token compression techniques and investigate the extension of these methods to other languages and dialects, potentially involving cross-modal learning paradigms. Additionally, refining the model for real-time applications and expanding its adaptability to different acoustic environments and languages would be valuable.

Vec-Tok Speech represents a step forward in the synthesis of high-quality, expressive, and adaptive speech, providing a robust framework for multiple speech processing applications while laying the groundwork for integrating speech generation technologies with broader AI systems.

References (44)
  1. Hifi++: a unified framework for neural vocoding, bandwidth extension and speech enhancement. CoRR, abs/2203.13086, 2022.
  2. Voice conversion with just nearest neighbors. CoRR, abs/2305.18975, 2023.
  3. Jason Baldridge. Verbmobil: Foundations of Speech-to-Speech Translation, by Wolfgang Wahlster (editor). Springer, 2000. ISBN 3-540-67783-6. Price £44.50 (hardback). xii+679 pages. Nat. Lang. Eng., 10(2):200–204, 2004.
  4. James Betker. Better speech synthesis through scaling. CoRR, abs/2305.07243, 2023.
  5. AudioLM: A language modeling approach to audio generation. IEEE ACM Trans. Audio Speech Lang. Process., 31:2523–2533, 2023a.
  6. SoundStorm: Efficient parallel audio generation. CoRR, abs/2305.09636, 2023b.
  7. Language models are few-shot learners. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  8. ByteDance. GigaS2S: Large-scale English-to-X speech-to-speech translation. https://github.com/SpeechTranslation/GigaS2S, 2023.
  9. YourTTS: Towards zero-shot multi-speaker TTS and zero-shot voice conversion for everyone. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp. 2709–2720. PMLR, 2022.
  10. GigaSpeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio. In Proc. Interspeech, 2021.
  11. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process., 16(6):1505–1518, 2022.
  12. w2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2021, Cartagena, Colombia, December 13-17, 2021, pp.  244–250. IEEE, 2021.
  13. High fidelity neural audio compression. CoRR, abs/2210.13438, 2022.
  14. ECAPA-TDNN: emphasized channel attention, propagation and aggregation in TDNN based speaker verification. In Proc. Interspeech, pp.  3830–3834. ISCA, 2020.
  15. Philip Gage. A new algorithm for data compression. The C Users Journal archive, 12:23–38, 1994.
  16. Speak, read and prompt: High-fidelity text-to-speech with minimal supervision. CoRR, abs/2302.03540, 2023.
  17. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020.
  18. Melgan: Generative adversarial networks for conditional waveform synthesis. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché-Buc, Emily B. Fox, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pp.  14881–14892, 2019.
  19. Autoencoding beyond pixels using a learned similarity metric. In Maria-Florina Balcan and Kilian Q. Weinberger (eds.), Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, volume 48 of JMLR Workshop and Conference Proceedings, pp. 1558–1566. JMLR.org, 2016.
  20. Voicebox: Text-guided multilingual universal speech generation at scale. CoRR, abs/2306.15687, 2023.
  21. Emotional voice conversion with cycle-consistent adversarial network, 2020.
  22. Translatotron 3: Speech to speech translation with monolingual data. CoRR, abs/2305.17547, 2023.
  23. A time delay neural network architecture for efficient modeling of long temporal contexts. In Interspeech, 2015.
  24. Deepfilternet2: Towards real-time speech enhancement on embedded devices for full-band audio, 2022.
  25. Naturalspeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers. CoRR, abs/2304.09116, 2023.
  26. Styles2st: Zero-shot style transfer for direct speech-to-speech translation. CoRR, abs/2305.17732, 2023.
  27. Privacy and utility of x-vector based speaker anonymization. IEEE ACM Trans. Audio Speech Lang. Process., 30:2383–2395, 2022.
  28. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023.
  29. CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit. 2016.
  30. Neural codec language models are zero-shot text to speech synthesizers. CoRR, abs/2301.02111, 2023a.
  31. LM-VC: Zero-shot voice conversion via speech generation based on language models. IEEE Signal Process. Lett., 30:1157–1161, 2023b.
  32. Image quality assessment: from error visibility to structural similarity. IEEE Trans. Image Process., 13(4):600–612, 2004.
  33. Speechgen: Unlocking the generative power of speech language models with prompts. CoRR, abs/2306.02207, 2023a.
  34. Audiodec: An open-source streaming high-fidelity neural audio codec. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  1–5, 2023b.
  35. Hifi-codec: Group-residual vector quantization for high fidelity audio codec. CoRR, abs/2305.02765, 2023a.
  36. HYBRIDFORMER: improving squeezeformer with hybrid attention and NSR mechanism. CoRR, abs/2303.08636, 2023b.
  37. Wenet: Production oriented streaming and non-streaming end-to-end speech recognition toolkit. In Hynek Hermansky, Honza Cernocký, Lukás Burget, Lori Lamel, Odette Scharenborg, and Petr Motlícek (eds.), Interspeech 2021, 22nd Annual Conference of the International Speech Communication Association, Brno, Czechia, 30 August - 3 September 2021, pp.  4054–4058. ISCA, 2021.
  38. GigaST: A 10,000-hour pseudo speech translation corpus. CoRR, abs/2204.03939, 2022.
  39. Deid-vc: Speaker de-identification via zero-shot pseudo voice conversion. In Hanseok Ko and John H. L. Hansen (eds.), Interspeech 2022, 23rd Annual Conference of the International Speech Communication Association, Incheon, Korea, 18-22 September 2022, pp.  2593–2597. ISCA, 2022.
  40. Soundstream: An end-to-end neural audio codec. IEEE ACM Trans. Audio Speech Lang. Process., 30:495–507, 2022.
  41. Libritts: A corpus derived from librispeech for text-to-speech. In Gernot Kubin and Zdravko Kacic (eds.), Interspeech 2019, 20th Annual Conference of the International Speech Communication Association, Graz, Austria, 15-19 September 2019, pp.  1526–1530. ISCA, 2019.
  42. WenetSpeech: A 10,000+ hours multi-domain Mandarin corpus for speech recognition. In Proc. ICASSP, pp. 6182–6186. IEEE, 2022.
  43. Speak foreign languages with your own voice: Cross-lingual neural codec language modeling. CoRR, abs/2303.03926, 2023.
  44. Multi-speaker expressive speech synthesis via multiple factors decoupling. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  1–5, 2023.
Authors (8)
  1. Xinfa Zhu (29 papers)
  2. Yuanjun Lv (12 papers)
  3. Yi Lei (40 papers)
  4. Tao Li (440 papers)
  5. Wendi He (4 papers)
  6. Hongbin Zhou (28 papers)
  7. Heng Lu (41 papers)
  8. Lei Xie (337 papers)
Citations (11)