StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion (2401.11053v5)
Abstract: Recent language model (LM) advancements have showcased impressive zero-shot voice conversion (VC) performance. However, existing LM-based VC models usually apply offline conversion from source semantics to acoustic features, demanding the complete source speech and preventing their deployment in real-time applications. In this paper, we introduce StreamVoice, a novel streaming LM-based model for zero-shot VC, facilitating real-time conversion given arbitrary speaker prompts and source speech. Specifically, to enable streaming capability, StreamVoice employs a fully causal context-aware LM with a temporal-independent acoustic predictor, while alternately processing semantic and acoustic features at each time step of autoregression, eliminating the dependence on complete source speech. To address the potential performance degradation from the incomplete context in streaming processing, we enhance the context-awareness of the LM through two strategies: 1) teacher-guided context foresight, using a teacher model to summarize the present and future semantic context during training to guide the model's forecasting of missing context; 2) a semantic masking strategy, promoting acoustic prediction from preceding corrupted semantic and acoustic input, enhancing context-learning ability. Notably, StreamVoice is the first LM-based streaming zero-shot VC model without any future look-ahead. Experiments demonstrate StreamVoice's streaming conversion capability while achieving zero-shot performance comparable to non-streaming VC systems.
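The streaming mechanism in the abstract, where a causal LM alternately consumes one semantic feature and emits one acoustic feature per autoregressive step, can be sketched as follows. This is an illustrative toy, not the paper's implementation: `lm_step`, the feature representations, and the speaker-prompt handling are all hypothetical stand-ins for the real model components.

```python
def lm_step(context, speaker_prompt):
    """Hypothetical causal LM step: map the running interleaved context
    (conditioned on a speaker prompt) to the next acoustic feature.
    Here it is a trivial placeholder returning a tagged tuple."""
    return ("acoustic", len(context), speaker_prompt)

def stream_convert(semantic_stream, speaker_prompt):
    """Alternate semantic input and acoustic output at each step of
    autoregression: each acoustic frame is emitted as soon as the
    corresponding semantic frame arrives, with no future look-ahead."""
    context = []
    for sem in semantic_stream:            # source frames arrive incrementally
        context.append(("semantic", sem))  # interleave source semantics
        ac = lm_step(context, speaker_prompt)
        context.append(ac)                 # interleave generated acoustics
        yield ac                           # emit immediately

# Usage: three streamed semantic frames yield three acoustic frames,
# each produced before any later source frame is seen.
out = list(stream_convert([0.1, 0.2, 0.3], speaker_prompt="spk"))
```

The key property the sketch shows is that the context grows by two entries (one semantic, one acoustic) per time step, so the generator never needs the complete source utterance before producing output.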
- Zhichao Wang
- Yuanzhe Chen
- Xinsheng Wang
- Lei Xie
- Yuping Wang