StreamVoice: Streamable Context-Aware Language Modeling for Real-time Zero-Shot Voice Conversion (2401.11053v5)
Abstract: Recent language model (LM) advancements have showcased impressive zero-shot voice conversion (VC) performance. However, existing LM-based VC models usually apply offline conversion from source semantics to acoustic features, demanding the complete source speech and preventing their deployment in real-time applications. In this paper, we introduce StreamVoice, a novel streaming LM-based model for zero-shot VC, facilitating real-time conversion given arbitrary speaker prompts and source speech. Specifically, to enable streaming capability, StreamVoice employs a fully causal context-aware LM with a temporal-independent acoustic predictor, while alternately processing semantic and acoustic features at each time step of autoregression, eliminating the dependence on complete source speech. To address the potential performance degradation from the incomplete context in streaming processing, we enhance the context-awareness of the LM through two strategies: 1) teacher-guided context foresight, using a teacher model to summarize the present and future semantic context during training to guide the model's forecasting of missing context; 2) a semantic masking strategy, promoting acoustic prediction from preceding corrupted semantic and acoustic input, enhancing context-learning ability. Notably, StreamVoice is the first LM-based streaming zero-shot VC model without any future look-ahead. Experiments demonstrate StreamVoice's streaming conversion capability while achieving zero-shot performance comparable to non-streaming VC systems.
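The streaming mechanism in the abstract, where a causal LM alternately consumes one semantic feature and emits one acoustic feature per autoregressive step, can be sketched as follows. This is an illustrative toy, not the paper's implementation: `lm_step`, the feature representations, and the speaker-prompt handling are all hypothetical stand-ins for the real model components.

```python
def lm_step(context, speaker_prompt):
    """Hypothetical causal LM step: map the running interleaved context
    (conditioned on a speaker prompt) to the next acoustic feature.
    Here it is a trivial placeholder returning a tagged tuple."""
    return ("acoustic", len(context), speaker_prompt)

def stream_convert(semantic_stream, speaker_prompt):
    """Alternate semantic input and acoustic output at each step of
    autoregression: each acoustic frame is emitted as soon as the
    corresponding semantic frame arrives, with no future look-ahead."""
    context = []
    for sem in semantic_stream:            # source frames arrive incrementally
        context.append(("semantic", sem))  # interleave source semantics
        ac = lm_step(context, speaker_prompt)
        context.append(ac)                 # interleave generated acoustics
        yield ac                           # emit immediately

# Usage: three streamed semantic frames yield three acoustic frames,
# each produced before any later source frame is seen.
out = list(stream_convert([0.1, 0.2, 0.3], speaker_prompt="spk"))
```

The key property the sketch shows is that the context grows by two entries (one semantic, one acoustic) per time step, so the generator never needs the complete source utterance before producing output.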
- Zhichao Wang
- Yuanzhe Chen
- Xinsheng Wang
- Lei Xie
- Yuping Wang