
Chunk-Aware Causal Flow Matching

Updated 27 November 2025
  • The paper introduces chunk-aware causal flow matching, integrating optimal transport-based neural ODEs to achieve near-human TTS quality with low latency.
  • It employs chunk-based attention masking and a causal convolutional Transformer U-Net to seamlessly support both streaming and offline synthesis.
  • Evaluation on SEED benchmarks shows error rates as low as 1.45% CER in Chinese and 2.38% WER in English, demonstrating robust performance across diverse scenarios.

Chunk-aware causal flow matching (CFM) is a generative modeling framework central to the CosyVoice 2 text-to-speech (TTS) system, enabling high-quality, low-latency streaming speech synthesis by segmenting target mel-spectrograms into chunks and modeling their temporal progression with causally-masked optimal transport-based neural ordinary differential equations. CFM combines robust flow-matching objectives, chunk-based attention masking, and a causal convolutional Transformer U-Net architecture to harmonize streaming and offline TTS within a unified framework, achieving near-human naturalness and virtually lossless streaming fidelity (Du et al., 2024).

1. Flow Matching Objective and Mathematical Formulation

At its core, chunk-aware causal flow matching constructs a deterministic flow, parameterized by t ∈ [0, 1], between a standard Gaussian prior X_0 ~ N(0, I) and a data-sampled mel-spectrogram X_1 ∈ R^{T×D}, using the optimal transport (OT) interpolation:

\phi^{OT}_t(X_0,X_1) = (1-t)\,X_0 + t\,X_1\,.

The ground-truth time-dependent vector field is

\omega_t(\phi^{OT}_t(X_0,X_1)) = X_1 - X_0\,.

A neural flow matcher ν_θ is trained to predict ω_t, conditioned on interpolated states X_t = φ_t^OT(X_0, X_1) and auxiliary information Ψ = {μ_{1:L}, X̃_1, v}, where μ_{1:L} are upsampled speech tokens from a pretrained LLM, X̃_1 is a heavily masked version of X_1, and v is the speaker embedding. The L_1 flow-matching loss is

\theta = \arg\min_\theta \, \mathbb E_{X_0, X_1, t} \left\| \omega_t(\phi^{OT}_t(X_0,X_1)) - \nu_\theta(\phi^{OT}_t(X_0,X_1), t \mid \Psi) \right\|_1\,.
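This objective can be sketched in a few lines of NumPy. The `model` argument below is a stand-in for the conditioned U-Net ν_θ; the conditioning set Ψ and other training details are deliberately omitted from this minimal sketch:

```python
import numpy as np

def fm_training_step(x1, model, rng):
    """One flow-matching training step on a batch of mel targets x1: (B, T, D).
    `model(xt, t)` stands in for the conditional flow matcher nu_theta;
    the conditioning set Psi is omitted to keep the sketch minimal.
    """
    b = x1.shape[0]
    t = rng.uniform(size=(b, 1, 1))            # flow time t ~ U[0, 1]
    x0 = rng.standard_normal(x1.shape)         # Gaussian prior sample X0
    xt = (1.0 - t) * x0 + t * x1               # OT interpolation phi_t^OT
    target = x1 - x0                           # ground-truth field omega_t
    return np.abs(model(xt, t) - target).mean()   # L1 regression loss

rng = np.random.default_rng(0)
x1 = rng.standard_normal((2, 16, 4))           # toy "mel" batch
loss = fm_training_step(x1, lambda xt, t: np.zeros_like(xt), rng)
```

A zero-predicting model gives a strictly positive loss, which a trained ν_θ would drive toward zero.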

Inference discretizes the ODE into N Euler steps using the cosine time re-parameterization

t := 1 - \cos\left(\frac{\pi}{2}\, t\right)

and an Euler update

X_{t+\Delta t} = X_t + \Delta t \cdot \nu_\theta(X_t, t \mid \Psi)\,.

Classifier-free guidance is implemented by randomly dropping the conditions Ψ during training and assembling predictions at inference as

\tilde\nu_\theta(X_t, t \mid \Psi) = (1+\beta)\,\nu_\theta(X_t, t \mid \Psi) - \beta\,\nu_\theta(X_t, t)\,.
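The guided Euler sampler on a cosine time grid can be sketched as below; the step count and guidance strength are illustrative defaults, not values taken from the source text:

```python
import numpy as np

def cfg_euler_sample(v_cond, v_uncond, x0, n_steps=10, beta=0.7):
    """Integrate dX/dt = v(X, t) with Euler steps on a cosine time grid,
    blending conditional/unconditional field estimates (CFG).
    n_steps and beta are illustrative defaults, not values from the text.
    """
    s = np.linspace(0.0, 1.0, n_steps + 1)
    ts = 1.0 - np.cos(0.5 * np.pi * s)    # cosine re-parameterization of t
    x = x0
    for i in range(n_steps):
        v = (1.0 + beta) * v_cond(x, ts[i]) - beta * v_uncond(x, ts[i])
        x = x + (ts[i + 1] - ts[i]) * v   # Euler update
    return x

# With a constant field X1 - X0 (and cond == uncond, so guidance cancels),
# integration from t = 0 to t = 1 recovers X1.
x0, x1 = np.zeros(3), np.array([1.0, 2.0, 3.0])
field = lambda x, t: x1 - x0
out = cfg_euler_sample(field, field, x0)
```

The cosine grid concentrates Euler steps near t = 0, where the trajectory is noisiest.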

2. Chunking and Causal Masking Strategies

To facilitate streaming, the CFM processes only small contiguous segments (“chunks”) of output frames at a time, never seeing the full mel sequence. Each chunk consists of M frames at the 50 Hz frame rate (so M frames corresponds to M/50 seconds of audio). Four attention masks govern context access during training:

  • Non-causal mask: full context (offline)
  • Full-causal mask: only past frames (no look-ahead)
  • Chunk-M mask: past plus M future frames
  • Chunk-2M mask: past plus 2M future frames

Masking mode is selected uniformly at random per mini-batch sample, enforcing model robustness to variable look-ahead and chunk boundaries. At inference, a look-ahead buffer of P frames from the preceding chunk is prepended to each chunk, and the same masking scheme is applied, producing seamless synthesis across chunk boundaries.
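A minimal sketch of the mask construction follows. The exact boundary convention (each frame attending up to the end of its own chunk) is an assumption about how "past plus future frames" is realized, not a detail given in the text:

```python
import numpy as np

def chunk_attn_mask(n_frames, chunk=None):
    """Boolean (query, key) attention mask. chunk=None gives the
    non-causal (full-context) mask; chunk=1 degenerates to the fully
    causal mask; chunk=M (or 2M) lets each frame attend to all past
    frames plus future frames up to the end of its own chunk.
    """
    if chunk is None:
        return np.ones((n_frames, n_frames), dtype=bool)
    i = np.arange(n_frames)[:, None]     # query frame index
    j = np.arange(n_frames)[None, :]     # key frame index
    return j < (i // chunk + 1) * chunk  # see up to own chunk's end

causal = chunk_attn_mask(4, chunk=1)     # lower-triangular (no look-ahead)
chunk2 = chunk_attn_mask(4, chunk=2)     # block look-ahead within 2-frame chunks
```

Sampling `chunk` per training example reproduces the uniform mask selection described above.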

3. Network Architecture: Causal Convolutional Transformer U-Net

The flow matcher ν_θ adopts a causal convolutional Transformer U-Net architecture. The main modules include:

  • Input preprocessing: Semantic tokens μ_{1:L} are upsampled by a factor of two to reach the 50 Hz frame rate; a 1D look-ahead convolution with kernel size P+1 and right padding P provides limited future-frame access.
  • Chunk-aware causal Transformer blocks: Causal multi-head self-attention with current chunk’s mask, cross-attention to upsampled tokens, a local feed-forward network, residual/layer-norm—all strictly causal.
  • U-Net pathway: Down- and up-sampling paths with skip connections, housing identical causal conv-Transformer blocks at every resolution.
  • Conditioning mechanisms: Sinusoidal embeddings of the flow time t (injected at all layers), the speaker embedding v, and the masked mel X̃_1 (projected as bias terms).
  • Final projection: Outputs ν_θ(X_t, t | Ψ) as a D-dimensional mel frame.

Integration with the upstream LM (e.g., Qwen2.5) relies on a convolution + linear embedding mapping each semantic token to the Transformer feature dimension before cross-attention.
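The look-ahead convolution in the input path can be illustrated with right-only padding. This is a toy per-tap scalar kernel for a single feature stream; the real layer also mixes channels:

```python
import numpy as np

def lookahead_conv1d(x, taps):
    """Right-padded 1-D convolution over time: with P + 1 taps, output
    frame t depends on frames t .. t + P, i.e. exactly P future frames
    and no further. x: (T, D) features; taps: length P + 1 scalar kernel.
    """
    p = len(taps) - 1
    xp = np.pad(x, ((0, p), (0, 0)))     # pad only the future side
    return sum(w * xp[k:k + len(x)] for k, w in enumerate(taps))

x = np.arange(4.0).reshape(4, 1)
shifted = lookahead_conv1d(x, [0.0, 1.0])   # tap only on the +1 future frame
```

Putting all the weight on the future tap shifts the sequence left by one frame (with a zero filled in at the end), showing that each output position sees at most P frames ahead.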

4. Inference Procedures for Offline and Streaming Synthesis

CosyVoice 2 supports both offline and streaming inference regimes:

  • Offline (non-streaming):
  1. The full text is processed by the text-speech LM to produce the semantic tokens μ_{1:L}.
  2. Build the full conditioning set Ψ = {μ_{1:L}, X̃_1, v}.
  3. Initialize X_0 ~ N(0, I).
  4. For each step n = 1, …, N, compute ν_θ(X_t, t_n | Ψ), update X_t via the Euler step, and pass the final X_1 to the vocoder.
  • Streaming (chunk-by-chunk):
  1. For chunk k, query the LM for the next M tokens and upsample them.
  2. Prepend P look-ahead frames from the prior chunk.
  3. Form the chunk input (P + M frames) and apply the appropriate causal/chunk mask.
  4. Run N flow-matching steps restricted to the chunk frames.
  5. Discard the first P frames; send the remaining M frames to the vocoder and concatenate outputs.
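The chunk-stitching loop above can be sketched as follows, with `decode` standing in for the per-chunk flow-matching and vocoder call, and a hypothetical look-ahead of 2 frames:

```python
import numpy as np

def stream_decode(chunks, decode, lookahead=2):
    """Decode chunk by chunk, prepending `lookahead` context frames from
    the previous chunk and discarding their re-decoded outputs before
    concatenation, so chunk boundaries stay seamless.
    """
    outputs, prev_tail = [], None
    for c in chunks:
        padded = c if prev_tail is None else np.concatenate([prev_tail, c])
        drop = 0 if prev_tail is None else lookahead
        outputs.append(decode(padded)[drop:])   # keep only this chunk's frames
        prev_tail = c[-lookahead:]              # context for the next chunk
    return np.concatenate(outputs)

a, b = np.ones((4, 1)), 2 * np.ones((4, 1))
y = stream_decode([a, b], decode=lambda z: z)   # identity stand-in "decoder"
```

With an identity decoder the stitched output equals the plain concatenation, confirming that the discarded look-ahead frames introduce no duplication at chunk boundaries.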

Streaming TTS latency is formalized as

L_{TTS} = M \cdot d_{lm} + M \cdot d_{fm} + M \cdot d_{voc}\,,

with d_lm, d_fm, and d_voc denoting the per-token latencies of the LM, the flow matcher, and the vocoder, and empirical head-of-line streaming latency remaining under 40 ms.
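As a concrete arithmetic check, assuming the first-package latency decomposes as the first chunk's token count times the summed per-token costs of the LM, flow matcher, and vocoder (all values below are purely hypothetical, not measurements from the paper):

```python
# Hypothetical first-package latency: M tokens in the first chunk times
# assumed per-token costs of the LM, flow matcher, and vocoder.
M = 15                                        # tokens in first chunk (assumed)
d_lm, d_fm, d_voc = 1.0e-3, 0.5e-3, 0.3e-3    # seconds per token (assumed)

latency_ms = M * (d_lm + d_fm + d_voc) * 1e3
print(f"first-package latency ~= {latency_ms:.1f} ms")
```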

5. Training Procedures and Data Regimen

Training occurs on a corpus comprising approximately 130k hours of Chinese, 30k hours of English, and small amounts of Japanese/Korean speech, all at a 50 Hz frame rate with D mel channels. For each example:

  • 70–100% of the final frames in X_1 are randomly masked to yield X̃_1.
  • The attention mask (non-causal, full-causal, chunk-M, chunk-2M) is selected uniformly at random.
  • Batch size is 256 chunks per GPU, optimized with AdamW and standard learning-rate scheduling.
  • The loss is the L_1 flow-matching objective, with N Euler steps and classifier-free guidance strength β at inference.
  • Upstream semantic-token quantization uses finite-scalar quantization (FSQ) with a low-dimensional projection and a fixed codebook size.
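A minimal sketch of per-dimension finite-scalar quantization follows; the level count here is illustrative, not the system's actual setting, and the straight-through gradient used during training is omitted:

```python
import numpy as np

def fsq(z, levels=3):
    """Finite scalar quantization: squash each dimension into (-1, 1)
    with tanh, then round to one of `levels` uniform grid points. The
    implicit codebook then has levels ** dim entries, with no learned
    codebook vectors at all.
    """
    half = (levels - 1) / 2.0
    return np.round(np.tanh(z) * half) / half

codes = fsq(np.array([-3.0, -0.2, 0.0, 0.2, 3.0]))
```

Because the codebook is an implicit uniform grid, FSQ avoids the codebook-collapse issues of learned vector quantizers.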

6. Evaluation, Quantitative Results, and Ablation Studies

On the SEED benchmarks, CosyVoice 2 with chunk-aware CFM attains near-human performance in both offline and streaming scenarios:

Dataset     Mode        Error Rate    Speaker Sim. (SS)
test-zh     Offline     1.45% CER     0.806
test-zh     Streaming   1.45% CER     0.812
test-en     Offline     2.57% WER     0.736
test-en     Streaming   2.38% WER     0.743
test-hard   Offline     6.83%         0.776
test-hard   Streaming   8.08%         0.785

Ablation (Table 7 in paper):

  • Streaming LM only (+ offline CFM): CER on test-zh and test-hard stays close to the fully offline configuration.
  • Streaming CFM only (+ offline LM): likewise shows only marginal differences from offline.
  • Both streaming: error rates remain near offline levels on test-zh, with a modest increase on test-hard.

This demonstrates that chunk-aware CFM preserves quality and consistency across streaming/non-streaming settings and diverse benchmarks.

7. Advantages, Limitations, and Prospective Extensions

Advantages

  • A unified model supports both streaming and offline TTS synthesis.
  • Streaming quality is virtually lossless, with first-package latencies of tens of milliseconds.
  • Chunk-aware masking ensures robustness to varied look-ahead and encourages implicit self-distillation.
  • Decoupled modeling (semantic via LM, acoustic via CFM): streaming semantic token input does not degrade speaker fidelity.

Limitations

  • Languages with overlapping character sets (e.g., Japanese and Chinese) show elevated error rates.
  • No explicit control over timbre or pitch by text instruction.
  • Singing and highly rhythmic speech remain problematic.

Potential extensions

  • Application to fully non-autoregressive TTS (bypassing discrete tokens).
  • Variable chunk lengths and adaptive look-ahead schemes.
  • Hybrid samplers combining chunk-aware flow matching with diffusion models.
  • Enabling multi-modal (e.g., visual or gestural) streaming conditioning in generative agents (Du et al., 2024).