Cross-Lingual Neural Codec Language Modeling for Speech Synthesis
The paper under review addresses a significant challenge in speech synthesis: extending neural codec language models to cross-lingual text-to-speech (TTS) and speech-to-speech translation (S2ST) while retaining the original speaker's voice characteristics. The authors introduce VALL-E X, a cross-lingual neural codec language model that builds upon the previously established VALL-E framework designed for monolingual TTS.
Framework and Methodology
The VALL-E X framework is structured around two primary components: a multilingual autoregressive codec language model and a multilingual non-autoregressive codec language model. These components are trained on large-scale multilingual speech-transcription datasets, such as LibriLight (English) and WenetSpeech (Chinese), enabling the generation of high-quality speech that preserves the source speaker's voice, emotion, and acoustic environment across languages.
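To make the division of labor between the two stages concrete, the following is a minimal sketch of the layout described above. It is not the authors' implementation: the dimensions, codebook sizes, module names, and the use of plain Transformer encoders are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration; the real model's hyperparameters may differ.
NUM_CODEBOOKS = 8      # residual quantizer levels produced by the neural codec
CODEBOOK_SIZE = 1024   # entries per codebook
PHONEME_VOCAB = 512    # multilingual phoneme inventory
D_MODEL = 1024


class ARCodecLM(nn.Module):
    """Autoregressive stage: predicts first-codebook acoustic tokens from the
    phoneme sequence (plus an acoustic prompt at inference time)."""

    def __init__(self):
        super().__init__()
        self.phone_emb = nn.Embedding(PHONEME_VOCAB, D_MODEL)
        self.code_emb = nn.Embedding(CODEBOOK_SIZE, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=16, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=12)
        self.head = nn.Linear(D_MODEL, CODEBOOK_SIZE)

    def forward(self, phonemes, codes_level0):
        # Concatenate phoneme and acoustic-token embeddings into one sequence.
        x = torch.cat([self.phone_emb(phonemes), self.code_emb(codes_level0)], dim=1)
        T = x.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.backbone(x, mask=causal)
        # Score only the positions covering the acoustic tokens.
        return self.head(h[:, phonemes.size(1):])


class NARCodecLM(nn.Module):
    """Non-autoregressive stage: given all lower codebook levels, predicts the
    next residual level for every frame in parallel (no causal mask)."""

    def __init__(self):
        super().__init__()
        self.phone_emb = nn.Embedding(PHONEME_VOCAB, D_MODEL)
        self.code_embs = nn.ModuleList(
            nn.Embedding(CODEBOOK_SIZE, D_MODEL) for _ in range(NUM_CODEBOOKS)
        )
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=16, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=12)
        self.head = nn.Linear(D_MODEL, CODEBOOK_SIZE)

    def forward(self, phonemes, codes_so_far):
        # codes_so_far: (batch, levels_done, frames); sum embeddings over levels.
        acoustic = sum(self.code_embs[i](codes_so_far[:, i])
                       for i in range(codes_so_far.size(1)))
        x = torch.cat([self.phone_emb(phonemes), acoustic], dim=1)
        h = self.backbone(x)
        return self.head(h[:, phonemes.size(1):])
```

The split matters for efficiency: only the first codebook level is generated token by token, while the remaining levels are filled in parallel, which keeps inference cost manageable for long utterances.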
A core capability of VALL-E X is zero-shot cross-lingual TTS and S2ST. For cross-lingual TTS, the model predicts the target language's acoustic tokens while a short utterance in the source language serves as the acoustic prompt. This prompting scheme lets VALL-E X generate target-language speech in the prompt speaker's voice without requiring bilingual recordings from the same speaker, i.e., no parallel multilingual corpus is needed.
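The inference recipe can be summarized as follows. This is a hedged sketch that reuses the hypothetical ARCodecLM above; the token layout, sampling loop, and fixed-length stopping criterion are simplifications (the published model additionally conditions on a language ID to control the output language and reduce accent, and stops at an end-of-sequence token).

```python
import torch

@torch.no_grad()
def cross_lingual_tts(ar_lm, src_phonemes, src_codes_l0, tgt_phonemes, max_frames=1500):
    """Generate first-level acoustic tokens for target-language text while a
    source-language utterance supplies the speaker/acoustic prompt."""
    # Phoneme prompt: source transcript followed by the target-language transcript.
    phonemes = torch.cat([src_phonemes, tgt_phonemes], dim=1)
    # Acoustic prompt: codec tokens of the source utterance; continuing this
    # sequence is what carries the prompt speaker's voice into the output.
    codes = src_codes_l0.clone()
    for _ in range(max_frames):
        logits = ar_lm(phonemes, codes)            # (batch, frames, codebook)
        probs = torch.softmax(logits[:, -1], dim=-1)
        next_tok = torch.multinomial(probs, 1)     # sample the next acoustic token
        codes = torch.cat([codes, next_tok], dim=1)
    # The NAR stage would then fill in the remaining codebook levels, and the
    # codec decoder converts the full token stack back into a waveform.
    return codes[:, src_codes_l0.size(1):]
```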
The speech-to-speech translation task relies on a separate speech recognition and translation model, built on an improved version of SpeechUT, that converts the source-language speech into a target-language phoneme sequence. VALL-E X then turns these phonemes into acoustic tokens, with the original utterance serving as the acoustic prompt, so the translated output retains the source speaker's voice.
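As a rough illustration of how the stages fit together, here is a hedged wiring of the pipeline just described. The helper callables (recognize_and_translate, encode_codec, decode_codec) are placeholders rather than names from the paper, and cross_lingual_tts refers to the sketch above.

```python
def speech_to_speech(ar_lm, recognize_and_translate, encode_codec, decode_codec, src_wave):
    """Hedged S2ST wiring: recognition/translation yields target-language phonemes;
    the codec LM re-synthesizes them with the source utterance as the voice prompt."""
    src_phonemes, tgt_phonemes = recognize_and_translate(src_wave)  # SpeechUT-style stage
    src_codes = encode_codec(src_wave)                              # (batch, levels, frames)
    codes_l0 = cross_lingual_tts(ar_lm, src_phonemes, src_codes[:, 0], tgt_phonemes)
    # A complete pipeline would run the NAR stage to fill the remaining codebook
    # levels before decoding; only the first level is shown here for brevity.
    return decode_codec(codes_l0)
```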
Experimental Evaluation
In an extensive set of experiments, VALL-E X was evaluated on zero-shot cross-lingual TTS and zero-shot S2ST using datasets such as LibriSpeech and the EMIME corpus. Evaluation relied on speaker similarity, ASR word error rate (WER), ASR-BLEU, and speech naturalness ratings.
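For readers unfamiliar with these metrics, the sketch below shows how the ASR-based scores are typically computed: the synthesized or translated audio is transcribed with an off-the-shelf ASR system and the transcript is scored against the reference text, while speaker similarity is the cosine similarity between speaker-verification embeddings of the prompt and the output. The transcription and embedding steps are assumed to be handled elsewhere; only the jiwer and sacrebleu scoring calls are standard library usage.

```python
import numpy as np
import jiwer
import sacrebleu

def asr_wer(reference_texts, asr_transcripts):
    # Word error rate between reference transcripts and ASR output of the synthesized speech.
    return jiwer.wer(reference_texts, asr_transcripts)

def asr_bleu(reference_translations, asr_transcripts):
    # Corpus-level BLEU between reference translations and ASR output of the translated speech.
    return sacrebleu.corpus_bleu(asr_transcripts, [reference_translations]).score

def speaker_similarity(prompt_embedding, output_embedding):
    # Cosine similarity between speaker-verification embeddings of prompt and output audio.
    a, b = np.asarray(prompt_embedding), np.asarray(output_embedding)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```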
The results show that VALL-E X delivers substantial improvements over existing methods, particularly in preserving speaker similarity and in reducing word error rate and foreign-accent artifacts. For instance, VALL-E X achieved a marked WER reduction on cross-lingual English TTS and outperformed state-of-the-art baselines on S2ST in terms of BLEU score and speech naturalness.
Implications and Future Directions
The implications of VALL-E X are significant for both the practice and the theory of speech synthesis. Practically, the model enables applications such as personalized virtual assistants and multilingual communication aids, allowing messages to be conveyed across languages without losing individual speaker identity. Theoretically, this work pushes the boundaries of what neural codec language models can achieve in multilingual text and speech synthesis.
Moreover, this research sets the stage for future work that might expand VALL-E X to additional languages and explore more intricate tasks such as emotion expression and code-switching in synthesized speech. These extensions can lead to even more robust and versatile applications in AI-driven communication systems.
Overall, VALL-E X embodies a significant advancement in cross-lingual speech processing, providing a robust framework to achieve seamless and high-fidelity speech synthesis across languages while retaining speaker individuality. The potential for future exploration and application is vast, heralding new opportunities for research and development within the field.