Continuous Speech Synthesis Using Per-Token Latent Diffusion: An Overview
This work addresses continuous speech synthesis via per-token latent diffusion, introducing SALAD as a methodology for zero-shot text-to-speech (TTS). The approach builds on the recent success of autoregressive transformer models while shifting from discrete tokens to continuous representations, which improves reconstruction quality by avoiding quantization.
Key Contributions
- SALAD Framework: The paper introduces SALAD (Speech synthesis with Autoregressive LAtent Diffusion), a novel model that generates speech by modeling continuous distributions through per-token latent diffusion. This approach builds on multimodal representation techniques to produce variable-length outputs without explicit text-audio alignments.
- Semantic Token Utilization: It employs semantic tokens to provide contextual information and define generation stopping conditions, enhancing the model's adaptability to complex speech synthesis tasks.
- Continuous vs. Discrete Modeling: A thorough comparative analysis between continuous and discrete speech modeling techniques is conducted, with findings suggesting that SALAD consistently delivers superior intelligibility scores and maintains quality and speaker similarity comparable to ground truth audio.
- Variants of the Method: The paper presents three distinct methodologies within the SALAD framework:
  - T2A (Text2Acoustic): Directly predicts acoustic features from text, incorporating semantic token prediction as an auxiliary task.
  - S2A-AR (Semantic2Acoustic Autoregressive): Predicts acoustic features via next-token prediction, conditioned on semantic tokens.
  - S2A-NAR (Semantic2Acoustic Non-Autoregressive): Predicts acoustic features from semantic tokens using a MaskGIT masking schedule.
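All three variants share the per-token latent diffusion idea: rather than picking a discrete token, the model samples a continuous latent vector for each position by running a small denoising loop conditioned on the backbone's output. The sketch below is a hedged toy illustration only, not the paper's implementation; `diffusion_head`, the update rule, the step count, and the dimensions are all made-up stand-ins (in SALAD the denoiser is a learned network conditioned on the transformer's hidden state for that token).

```python
import numpy as np

rng = np.random.default_rng(0)

def diffusion_head(z_t, t, cond):
    """Stand-in for the small denoising network that predicts the noise in z_t.
    In the real model this is learned; here it just pulls z_t toward cond."""
    return z_t - cond

def sample_token_latent(cond, n_steps=10, dim=8):
    """Per-token latent diffusion: sample one continuous latent vector,
    conditioned on the autoregressive backbone's output for this token."""
    z = rng.standard_normal(dim)              # start from pure noise
    for step in reversed(range(n_steps)):
        t = step / n_steps
        eps_hat = diffusion_head(z, t, cond)
        z = z - eps_hat / n_steps             # simplified denoising update
        if step > 0:
            z = z + 0.01 * rng.standard_normal(dim)  # small stochastic term
    return z

cond = np.ones(8)                             # hypothetical backbone output
latent = sample_token_latent(cond)
print(latent.shape)  # → (8,)
```

At generation time this loop runs once per output token, which is why the number of diffusion steps directly affects inference cost.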
Empirical Evaluation
The research evaluates SALAD across multiple measures: UTMOS for audio quality, character error rate (CER) for intelligibility, and cosine similarity between speaker-verification embeddings for speaker similarity, using a substantial dataset drawn from MLS-English. The results underscore the efficacy of continuous representations and reinforce the potential of SALAD as a competitive solution for high-fidelity TTS tasks.
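Two of these metrics are simple enough to compute directly. The sketch below shows CER as Levenshtein edit distance over reference length, and speaker similarity as cosine similarity between embedding vectors; the 3-dimensional embeddings are made-up toy values, whereas a real evaluation would use embeddings from a pretrained speaker-verification model and transcripts from an ASR system.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein edit distance / reference length."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n] / m

# Toy 3-d embeddings; real systems use a speaker-verification model's outputs.
ref_emb = np.array([1.0, 0.0, 1.0])
syn_emb = np.array([1.0, 0.1, 0.9])
sim = cosine_similarity(ref_emb, syn_emb)
print(round(sim, 3))          # → 0.996
print(cer("hello", "hallo"))  # → 0.2  (one substitution over five characters)
```

UTMOS, by contrast, is itself a learned quality predictor and has no comparably compact closed form.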
Discussion on Practical and Theoretical Implications
From a practical standpoint, SALAD's use of continuous latent diffusion offers a promising alternative to traditional discrete systems, potentially leading to more robust speech synthesis in variable contexts. This model avoids the codebook quantization noise inherent in discrete systems, making it better suited to inherently continuous modalities like audio.
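The quantization-noise argument can be made concrete with a toy vector quantizer: a discrete codec must snap every continuous latent to its nearest codebook entry, and the gap between the two is irrecoverable reconstruction error that a continuous representation simply does not incur. This is a hypothetical sketch (the codebook size, dimensions, and random latent are invented), not the codec used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy codebook with 4 entries of dimension 2; a real neural codec uses
# thousands of learned entries across several residual codebooks.
codebook = rng.standard_normal((4, 2))

def quantize(x):
    """Map a continuous latent to its nearest codebook entry (Euclidean)."""
    idx = int(np.argmin(np.linalg.norm(codebook - x, axis=1)))
    return codebook[idx]

x = rng.standard_normal(2)         # a continuous latent
x_q = quantize(x)
err = float(np.linalg.norm(x - x_q))
print(err)  # nonzero unless x happens to lie exactly on a codebook entry
```

Scaling the codebook shrinks this error but never removes it, and larger codebooks are harder for an autoregressive model to predict; modeling the continuous latent directly sidesteps the trade-off.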
Theoretically, the exploration of diffusion processes in TTS opens avenues for expansive research into multimodal generation tasks, particularly where continuous distributions are involved. By addressing challenges in modeling complex distributions, this research could influence subsequent advancements in text-to-speech synthesis and related domains, potentially informing the development of more efficient and versatile AI-driven speech technologies.
Future Directions
Future work might focus on refining inference, such as reducing the number of diffusion steps for efficiency, or on determining stopping conditions without relying on discrete components. Further exploration of multimodal models that leverage continuous representations could also yield improved performance across diverse AI applications.
In conclusion, this paper presents a substantial advancement in speech synthesis by modeling continuous representations with per-token latent diffusion. SALAD demonstrates tangible improvements in intelligibility and audio quality, marking a significant contribution to both theoretical explorations and practical implementations of TTS systems.