
StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion (2401.11053v5)

Published 19 Jan 2024 in eess.AS and cs.SD

Abstract: Recent language model (LM) advancements have showcased impressive zero-shot voice conversion (VC) performance. However, existing LM-based VC models usually apply offline conversion from source semantics to acoustic features, demanding the complete source speech and limiting their deployment to real-time applications. In this paper, we introduce StreamVoice, a novel streaming LM-based model for zero-shot VC, facilitating real-time conversion given arbitrary speaker prompts and source speech. Specifically, to enable streaming capability, StreamVoice employs a fully causal context-aware LM with a temporal-independent acoustic predictor, while alternately processing semantic and acoustic features at each time step of autoregression which eliminates the dependence on complete source speech. To address the potential performance degradation from the incomplete context in streaming processing, we enhance the context-awareness of the LM through two strategies: 1) teacher-guided context foresight, using a teacher model to summarize the present and future semantic context during training to guide the model's forecasting for missing context; 2) semantic masking strategy, promoting acoustic prediction from preceding corrupted semantic and acoustic input, enhancing context-learning ability. Notably, StreamVoice is the first LM-based streaming zero-shot VC model without any future look-ahead. Experiments demonstrate StreamVoice's streaming conversion capability while achieving zero-shot performance comparable to non-streaming VC systems.

Overview of StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion

StreamVoice presents a notable advancement in the application of language models (LMs) to zero-shot voice conversion (VC). The model distinguishes itself by achieving streaming operation without any future look-ahead, making it suitable for real-time voice conversion. Voice conversion transfers the vocal characteristics of one speaker to another while preserving the linguistic content; zero-shot voice conversion does so with only a single example utterance from the target speaker, broadening practical applications such as dubbing, privacy protection, and real-time communication.

Previous zero-shot VC models, particularly those based on language models, predominantly operate offline because they depend on the entire source utterance for conversion. In contrast, StreamVoice uses a streaming framework that processes input frame by frame, eliminating the dependence on complete source speech and enabling real-time conversion.

Streamable Architecture

StreamVoice pairs a fully causal, context-aware LM with a temporal-independent acoustic predictor that continuously transforms semantic input into acoustic representations. At each autoregressive step the model alternates between consuming a semantic frame and producing the corresponding acoustic frame, so output is generated frame by frame without waiting for the complete utterance. This keeps latency low, which is essential for live applications: generation runs more than 2.4 times faster than real time on a single high-end GPU such as an A100.
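To make the alternating autoregression concrete, the sketch below shows how such a streaming loop might be wired up. All interfaces here (`asr`, `lm`, `predictor`, `codec` and their methods) are hypothetical placeholders for the paper's components, not its actual API.

```python
import torch

@torch.no_grad()
def stream_convert(source_chunks, speaker_prompt, asr, lm, predictor, codec):
    """Convert speech chunk by chunk with no future look-ahead (sketch)."""
    # Condition the causal LM on the target-speaker prompt once, up front.
    lm_state = lm.init_state(speaker_prompt)
    for chunk in source_chunks:            # audio arrives in real time
        for s_t in asr.encode(chunk):      # streaming semantic frames
            # Step 1: feed the semantic frame to get a context-aware state.
            h_t, lm_state = lm.step(s_t, lm_state)
            # Step 2: the temporal-independent predictor maps the hidden
            # state to acoustic tokens for this frame only.
            a_t = predictor(h_t)
            # Step 3: feed the acoustic frame back so the next step's
            # context interleaves semantics and acoustics.
            _, lm_state = lm.step(a_t, lm_state)
            yield codec.decode(a_t)        # emit converted audio immediately
```

Because every call sees only past frames, the loop can begin emitting converted audio as soon as the first semantic frame is available.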

Addressing Streaming Challenges

One main challenge in moving VC models from offline to streaming operation is the performance gap caused by incomplete contextual information. StreamVoice addresses this with two main strategies, sketched in code after the list:

  1. Teacher-Guided Context Foresight: This method employs a non-streaming automatic speech recognition (ASR) teacher model to predict current and future semantic contexts, guiding the streaming model in producing high-quality conversions despite incomplete inputs.
  2. Semantic Masking Strategy: It promotes context learning by masking portions of the semantic input during training, allowing the model to learn to predict acoustic features from incomplete or corrupted inputs.
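A minimal training-step sketch of both strategies follows. The module interfaces, the zero-masking scheme, the MSE foresight loss, and the unweighted loss sum are illustrative assumptions rather than the paper's exact formulation, and the LM is simplified to take only semantic input (the real model also interleaves acoustic context).

```python
import torch
import torch.nn.functional as F

def training_step(semantic, acoustic_tokens, lm, predictor, teacher,
                  mask_prob=0.15):
    """One sketched training step; `teacher` is assumed to be a frozen
    non-streaming model whose output at step t summarizes present and
    future semantics."""
    # Strategy 2, semantic masking: randomly zero out semantic frames so
    # acoustics must be predicted from corrupted preceding input.
    mask = torch.rand(semantic.shape[:2], device=semantic.device) < mask_prob
    masked = semantic.masked_fill(mask.unsqueeze(-1), 0.0)

    h = lm(masked)                         # causal hidden states, (B, T, D)

    # Strategy 1, teacher-guided context foresight: push each causal state
    # toward the teacher's full-context summary at the same step.
    with torch.no_grad():
        target_ctx = teacher(semantic)     # non-streaming, sees the future
    foresight_loss = F.mse_loss(h, target_ctx)

    # Standard acoustic prediction loss over codec tokens.
    logits = predictor(h)                  # (B, T, vocab)
    acoustic_loss = F.cross_entropy(logits.transpose(1, 2), acoustic_tokens)
    return acoustic_loss + foresight_loss
```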

Notably, these strategies enhance the underlying model's context-awareness, even when operating in a causal, streaming manner.

Empirical Results and Implications

Experiments show that StreamVoice matches non-streaming systems in speech naturalness and speaker similarity while running in real time. Both subjective and objective evaluations found its conversions comparable to those of the non-streaming LM-VC approach, even under the practical constraints of streaming. The full pipeline maintains a latency of 124 ms, demonstrating its suitability for real-time applications.
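As a rough sanity check on these numbers, streaming latency is approximately the time spent buffering one chunk of audio plus the time to process it. The arithmetic below uses the paper's reported speed; the chunk size is an assumed illustration, not a figure from the paper.

```python
SPEEDUP = 2.4    # paper: generation more than 2.4x faster than real time
CHUNK_MS = 80    # assumed audio buffered per streaming step (illustrative)

compute_ms = CHUNK_MS / SPEEDUP           # ~33 ms to process one chunk
latency_ms = CHUNK_MS + compute_ms        # wait for the chunk, then process
print(f"latency ~= {latency_ms:.0f} ms")  # ~113 ms, near the reported 124 ms
```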

Potential and Future Directions

The paper notes room for improvement in under-emphasized domains such as accented speech and highly emotional utterances, where current models, including StreamVoice, suffer performance declines. In addition, because the system's output quality depends heavily on its streaming ASR and speech codec, future work may focus on advancing these components to further reduce latency and improve accuracy.

Overall, StreamVoice is a significant contribution to the field of voice conversion, specifically addressing the need for real-time functionality in zero-shot scenarios. Its innovations in context-aware language modeling for streaming offer a promising avenue for broadening the applications of LMs in real-world speech conversion tasks. As research and technology continue to evolve, models like StreamVoice will likely form the foundation for further advancements in both streaming and offline voice conversion systems.

Authors (5)
  1. Zhichao Wang
  2. Yuanzhe Chen
  3. Xinsheng Wang
  4. Lei Xie
  5. Yuping Wang