CLEAR: Continuous Latent Autoregressive Model
- CLEAR is a unified autoregressive framework for speech synthesis that leverages continuous latent representations to generate high-quality and natural audio.
- It integrates an enhanced variational autoencoder with an MLP-based rectified flow head to efficiently predict denoising vector fields for compact latent sequences.
- CLEAR achieves state-of-the-art performance with low word error rates, high naturalness, and streaming capability, making it ideal for real-time and resource-constrained scenarios.
The Continuous Latent Autoregressive Model (CLEAR) is a unified autoregressive framework for speech synthesis that directly models continuous latent representations of audio waveforms using a hybrid of enhanced variational autoencoding and efficient flow-based generative modeling. CLEAR addresses key inefficiencies of discrete-token AR text-to-speech (TTS) systems, delivering high-quality, robust, and low-latency speech synthesis suitable for streaming and resource-constrained applications (Wu et al., 26 Aug 2025).
1. Architectural Overview
CLEAR consists of two principal components:
- Enhanced Variational Autoencoder (wav-VAE):
- Raw audio sampled at 16 kHz is mapped into a compact sequence of continuous latent vectors.
- The encoder and decoder are convolutional neural networks augmented with shortcut connections (space-to-channel reshaping and channel averaging). These shortcuts maintain reconstruction fidelity at high compression ratios, enabling latent sequences as short as 7.8 latent vectors per second (downsampling factor 2048), far shorter than conventional mel-spectrogram or discrete token sequences.
- Autoregressive Modeling via Rectified Flows:
- A unidirectional Transformer (AR LLM) produces a conditioning vector $c_i$ for each autoregressive timestep $i$.
- For each $i$, a lightweight MLP-based rectified flow head estimates the conditional distribution over continuous latents. The rectified flow head operates independently for each time step and is trained with a joint denoising loss:

$$\mathcal{L}_{\text{flow}} = \mathbb{E}_{t,\,x_0}\big[\,\|\,v_\theta(x_t, t \mid c_i) - (x_1 - x_0)\,\|^2\,\big], \qquad x_t = t\,x_1 + (1-t)\,x_0, \quad x_0 \sim \mathcal{N}(0, I)$$

- Here, $x_1 - x_0$ is the ground-truth displacement in latent space, $t$ indexes a noise schedule for regularizing intermediate steps, and $v_\theta(x_t, t \mid c_i)$ is the flow vector predicted by the head. An auxiliary cosine similarity loss aligns the predicted and true vector fields.
This single-stage joint architecture removes the need for multi-phase or cascaded systems found in discrete-token TTS (e.g., coarse token AR generation followed by diffusion refinement), eliminating error accumulation and supporting streaming synthesis.
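As a concrete sketch, the per-frame denoising objective can be written in a few lines. This is numpy-only illustration, not the paper's exact recipe: `predict_v` stands in for the trained MLP head, and the noise-sampling and loss-weighting details are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def rectified_flow_loss(x1, cond, predict_v, n_noise=4):
    """Joint denoising loss for one latent frame (illustrative sketch).

    x1 is the ground-truth latent; x0 is Gaussian noise. The head is
    trained so its velocity v(x_t, t | cond) matches the straight-line
    displacement x1 - x0, plus an auxiliary cosine-alignment term.
    """
    d = x1.shape[-1]
    losses = []
    for _ in range(n_noise):              # average over sampled (x0, t) pairs
        x0 = rng.standard_normal(d)       # noise endpoint of the path
        t = rng.uniform()                 # time on the noise schedule
        xt = t * x1 + (1.0 - t) * x0      # linear interpolation path
        v_true = x1 - x0                  # ground-truth displacement
        v_pred = predict_v(xt, t, cond)
        mse = np.mean((v_pred - v_true) ** 2)
        cos = np.dot(v_pred, v_true) / (
            np.linalg.norm(v_pred) * np.linalg.norm(v_true) + 1e-8)
        losses.append(mse + (1.0 - cos))  # MSE + cosine alignment
    return float(np.mean(losses))
```

An untrained head (e.g. one that outputs zeros) yields a strictly positive loss; training drives the predicted velocity toward the true displacement field.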
2. Continuous Latent Space: Compression and Efficiency
CLEAR's core contribution is its design around continuous latent spaces. Avoiding quantization yields several benefits:
- Compression Ratio: The enhanced VAE achieves a downsampling factor of 2048, yielding a latent sequence much shorter than mel-spectrogram or VQ-VAE token approaches.
- Information Preservation: Unlike discrete tokenization in VALL-E and similar models, which necessitate longer sequences and often lossy discretization, CLEAR maintains higher fidelity and naturalness with extremely concise sequences.
- Low-latency: The AR model need only process a handful of latent steps per utterance, so inference is fast, and real-time streaming becomes feasible.
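The compression arithmetic follows directly from the figures above (16 kHz input, downsampling factor 2048):

```python
SAMPLE_RATE = 16_000   # Hz, raw waveform input
DOWNSAMPLE = 2_048     # wav-VAE temporal downsampling factor

latents_per_second = SAMPLE_RATE / DOWNSAMPLE   # 7.8125, i.e. the ~7.8 reported
frame_ms = 1_000 * DOWNSAMPLE / SAMPLE_RATE     # each latent spans 128 ms of audio
```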
3. Generative Modeling with Rectified Flow
The MLP-based rectified flow head implements a continuous-density modeling approach. For each latent frame, it conditions on the AR Transformer output and predicts a velocity field defining a denoising ODE used for sample generation:
- The latent distribution can be sampled by integrating the predicted flow over time, initialized from noise, yielding high-quality reconstructions with few steps per latent.
- The independence of the flow head for each latent enables streaming synthesis: decoding can begin as soon as the first Transformer output is available, and subsequent latents are synthesized in parallel with AR progression.
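Sampling amounts to integrating the learned ODE from a noise initialization. A minimal Euler integrator (function and argument names are hypothetical; the real head is a small conditioned MLP) looks like:

```python
import numpy as np

def sample_latent(predict_v, cond, dim, n_steps=8, seed=0):
    """Euler-integrate dx/dt = v(x, t | cond) from t=0 (noise) to t=1 (data)."""
    x = np.random.default_rng(seed).standard_normal(dim)  # noise initialization
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * predict_v(x, t, cond)  # one Euler step along the flow
    return x

# Sanity check: an oracle velocity pointing along the straight rectified-flow
# path toward a fixed target is recovered exactly by the integrator.
target = np.full(4, 2.0)
oracle = lambda x, t, c: (target - x) / max(1.0 - t, 1e-8)
```

Because rectified flows learn near-straight transport paths, a small `n_steps` suffices in practice, which is what keeps per-latent sampling cheap.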
4. Quantitative Performance and Evaluation
CLEAR demonstrates strong empirical results on standard benchmarks:
- Robustness (Word Error Rate): On LibriSpeech test-clean, CLEAR-Large achieves a WER of 1.88%.
- Speaker Similarity: Evaluated with speaker embedding similarity (SIM-o), CLEAR provides competitive scores, with small gaps relative to discrete-token systems that inject explicit speaker information.
- Naturalness: UTMOS scores (4.21–4.26) signify natural-sounding, near-human speech.
- Inference Efficiency (RTF): CLEAR-Base reaches an RTF of 0.18, and CLEAR-Large 0.29, both substantially lower than prior state-of-the-art systems.
- Streaming Latency: A first-frame delay as low as 96 ms, enabled by the independent per-latent flow heads, supports streaming applications.
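Real-time factor (RTF) here denotes wall-clock synthesis time divided by audio duration, so the reported numbers translate directly into latency budgets:

```python
def rtf_synthesis_time(audio_s, rtf):
    """Wall-clock seconds needed to synthesize `audio_s` seconds of speech."""
    return audio_s * rtf

base_10s = rtf_synthesis_time(10.0, 0.18)   # CLEAR-Base: ~1.8 s for 10 s of audio
large_10s = rtf_synthesis_time(10.0, 0.29)  # CLEAR-Large: ~2.9 s for 10 s of audio
```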
5. Comparative Analysis with Discrete-Token TTS Systems
CLEAR contrasts with mainstream TTS systems such as VALL-E, CosyVoice, and F5-TTS:
- Sequence Length: Discrete-token models require hundreds of tokens per second, resulting in slow inference and error accumulation over long autoregressive sequences.
- Architecture Complexity: Cascaded or coarse-to-fine systems increase parameter count and technical debt; CLEAR's single-stage AR + flow paradigm sidesteps these complexities.
- Efficiency: Lower RTF and sequence length translate to reduced computation and memory footprint.
- Quality: While speaker similarity may appear marginally lower in objective metrics compared to models using explicit speaker tokens, subjective scores for intelligibility and naturalness are competitive with best-in-class models.
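To make the sequence-length gap concrete, compare autoregressive step counts for a 10-second utterance. The 300 tokens/s codec rate is an illustrative assumption standing in for "hundreds of tokens per second", not a figure from the paper:

```python
UTTERANCE_S = 10
clear_steps = round(UTTERANCE_S * 16_000 / 2_048)  # ~78 AR steps in total
codec_rate = 300                                   # tokens/s, illustrative only
codec_steps = UTTERANCE_S * codec_rate             # 3000 AR steps
step_ratio = codec_steps / clear_steps             # roughly 38x fewer AR steps
```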
6. Latency, Streaming Capabilities, and Applications
CLEAR is engineered for deployment scenarios requiring rapid response and scalability:
- Streaming Support: The short latent sequence and flow-based independence enable streaming TTS with negligible initial delay.
- Resource-Constrained Environments: The architectural efficiency, reduced computational demand, and memory savings facilitate real-time synthesis on minimal hardware.
- Applications: CLEAR suits interactive dialogue systems, virtual assistants, live broadcast environments, and accessibility technologies.
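A streaming decode loop is possible precisely because each flow head depends only on the current AR state. A hypothetical skeleton (all three callables are stand-ins for the real Transformer, flow head, and VAE decoder):

```python
from typing import Callable, Iterator
import numpy as np

def stream_tts(ar_step: Callable, flow_sample: Callable,
               vae_decode: Callable, n_frames: int) -> Iterator[np.ndarray]:
    """Emit audio chunk-by-chunk: each latent is sampled and decoded as
    soon as its AR conditioning vector becomes available."""
    state = None
    for _ in range(n_frames):
        cond, state = ar_step(state)   # one step of the AR Transformer
        latent = flow_sample(cond)     # few-step rectified-flow sampling
        yield vae_decode(latent)       # one ~128 ms chunk (2048 samples @ 16 kHz)
```

The first chunk can be played back while later latents are still being generated, which is the mechanism behind the low first-frame delay.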
7. Broader Implications and Future Directions
CLEAR's continuous latent autoregressive approach marks an inflection point in TTS modeling:
- Scalability: The simplified pipeline allows larger-scale models to be trained efficiently.
- Integrations: Future research may incorporate direct speaker embeddings within the AR LLM and rectified flow head to further enhance speaker fidelity.
- Generalization: The continuous paradigm is adaptable to multilingual, multi-style, and multi-speaker contexts.
- Ethics: The increased fidelity and speaker mimicking capabilities, while enabling positive applications (assistive technologies, education), necessitate mechanisms for watermarking, verification, and responsible usage policies.
In conclusion, CLEAR synthesizes essential advances in continuous representation learning, autoregressive modeling, and flow-based density estimation, achieving state-of-the-art robustness, naturalness, and efficiency in zero-shot TTS, with substantial headroom for further research in compact, low-latency and streaming speech generation (Wu et al., 26 Aug 2025).