Shallow Flow Matching in TTS
- Shallow Flow Matching (SFM) is a generative modeling framework that refines flow matching by integrating shallow intermediate states along conditional optimal transport paths.
- It employs a dual-component system combining a coarse TTS generator with an SFM head to construct and align intermediate representations for efficient signal generation.
- SFM improves inference speed and synthesis quality as evidenced by enhanced PMOS and reduced WER, offering up to a 60% acceleration in computational performance.
Shallow Flow Matching (SFM) is a generative modeling framework that modifies the standard flow matching methodology by introducing intermediate ("shallow") states along deterministic or stochastic probability flow paths. Originating in the context of speech synthesis, SFM addresses inefficiencies and limitations inherent in conventional flow matching (FM) approaches by adaptively determining where to begin integration on the conditional optimal transport (CondOT) path, and constructing a principled single-segment piecewise flow. The approach generalizes to any CondOT-based FM configuration and is applicable to diverse domains utilizing coarse-to-fine generation paradigms, notably text-to-speech (TTS) synthesis (Yang et al., 18 May 2025).
1. Mathematical Formulation of Shallow Flow Matching
SFM extends conventional conditional flow matching by leveraging an intermediate state $x_{\hat t}$, constructed via projection from a coarse generator's output onto the CondOT trajectory. The conventional FM path between a standard Gaussian prior $x_0 \sim \mathcal{N}(0, I)$ and a data sample $x_1$ of mel-spectrograms is given by $\psi_t(x_0) = (1 - (1 - \sigma_{\min})t)\, x_0 + t\, x_1$, with $t \in [0, 1]$ and $\sigma_{\min} \ll 1$. The flow is $x_t = \psi_t(x_0)$, and the target vector field is $u_t(x_t \mid x_1) = x_1 - (1 - \sigma_{\min})\, x_0$.
SFM introduces a split at $t = \hat t \in (0, 1)$, with the intermediate point $x_{\hat t} = (1 - (1 - \sigma_{\min})\hat t)\, x_0 + \hat t\, x_1$. The remaining segment is rescaled to $[0, 1]$ via $s = (t - \hat t)/(1 - \hat t)$, so that for $s \in [0, 1]$, the "shallow" flow is

$$x_s = x_{\hat t} + s\left[(\sigma_{\min} x_0 + x_1) - x_{\hat t}\right],$$

with velocity field

$$v_s = (\sigma_{\min} x_0 + x_1) - x_{\hat t} = (1 - \hat t)\left(x_1 - (1 - \sigma_{\min})\, x_0\right).$$
This defines a single-segment, piecewise vector field utilized during both SFM training and inference (Yang et al., 18 May 2025).
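A minimal numpy sketch of these path definitions follows; the $80 \times 100$ mel shape and the split point $\hat t = 0.6$ are illustrative stand-ins, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_min = 1e-4                      # small-variance constant from standard CFM
x0 = rng.standard_normal((80, 100))   # Gaussian prior sample (mel bins x frames)
x1 = rng.standard_normal((80, 100))   # data sample (stand-in for a mel-spectrogram)

def condot_point(t):
    """Point on the conditional OT path at time t."""
    return (1 - (1 - sigma_min) * t) * x0 + t * x1

t_hat = 0.6                           # split point (predicted by the SFM head in practice)
x_that = condot_point(t_hat)

def shallow_flow(s):
    """Rescaled 'shallow' segment: s in [0, 1] covers t in [t_hat, 1]."""
    return condot_point(t_hat + s * (1 - t_hat))

# The velocity of the rescaled segment is constant along the path:
v = (1 - t_hat) * (x1 - (1 - sigma_min) * x0)

# Sanity check: the shallow flow is linear with slope v from x_that
assert np.allclose(shallow_flow(0.5), x_that + 0.5 * v)
```

Because the CondOT path is affine in $t$, the rescaled segment has the constant velocity $v$ above, which is what the learned vector field regresses against.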
2. Construction of Intermediate States
The mechanism for intermediate state construction involves a two-component system: a coarse TTS generator (conditioned on text or speaker embeddings) produces high-level hidden states $h$ and a coarse mel-spectrogram $\tilde y$, while a lightweight SFM head predicts $(\mu, \hat t, \sigma^2)$. Here, $\mu$ (intermediate state), $\hat t$ (temporal position), and $\sigma^2$ (variance) are inferred per frame and aggregated.
Projection of the coarse output $\tilde y$ onto the CondOT path is performed via orthogonal projection, yielding the time $\hat t$ at which the path point is closest to $\tilde y$. Then, employing Theorem 1 of (Yang et al., 18 May 2025), the exact CondOT-aligned state is recovered from the projection, and the intermediate state is sampled as $x_{\hat t} \sim \mathcal{N}(\mu, \sigma^2 I)$.
This intermediate state is used as the starting point for downstream ODE integration.
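The construction can be sketched as follows. Note the hedge: the paper's Theorem 1 projection is replaced here by a simplified orthogonal projection onto the line $t \mapsto t\,x_1$ (the path mean under a zero-mean prior), and the fixed `sigma2` stands in for the head's per-frame variance prediction:

```python
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.standard_normal((80, 100))                    # ground-truth mel (training time)
y_coarse = x1 + 0.3 * rng.standard_normal(x1.shape)    # coarse generator output (assumed noise level)

# Simplified orthogonal projection onto the path's mean trajectory t -> t * x1:
# the projection coefficient gives the temporal position, clipped into [0, 1].
t_hat = float(np.clip(np.vdot(y_coarse, x1) / np.vdot(x1, x1), 0.0, 1.0))
mu = t_hat * x1                                        # projected, CondOT-aligned mean state

# Sample the intermediate state around the aligned mean with the predicted variance.
sigma2 = 0.1                                           # stand-in for the head's prediction
x_that = mu + np.sqrt(sigma2) * rng.standard_normal(mu.shape)
```

The sampled `x_that` then replaces white noise as the starting point of ODE integration.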
3. Training Objective and Algorithmic Workflow
SFM's training loss consolidates both standard FM and auxiliary objectives to supervise the construction of the intermediate state and the estimation of its location:
- Coarse mel L2 loss: $\mathcal{L}_{\text{mel}} = \lVert \tilde y - x_1 \rVert^2$, supervising the coarse generator's spectrogram output.
- Orthogonal projection loss: penalizes the deviation of the predicted state $\mu$ from the orthogonal projection of $\tilde y$ onto the CondOT path.
- Time prediction loss: an L2 penalty between the predicted temporal position $\hat t$ and the projection-derived time.
- Variance prediction loss: an auxiliary penalty supervising the predicted per-frame variance $\sigma^2$.
- Shallow flow matching loss: for $s$ drawn from a scheduler over $[0, 1]$, $\mathcal{L}_{\text{SFM}} = \mathbb{E}\,\lVert v_\theta(x_s, s) - v_s \rVert^2$, where $x_s$ and $v_s$ are the rescaled shallow flow and velocity defined above.
The total loss is a weighted sum of these terms. Gradient-based optimization is performed on this objective, with detailed stepwise pseudocode enumerated in (Yang et al., 18 May 2025).
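A compact numpy sketch of the training objective follows. The projection is the simplified line-projection described above, the network predictions are stand-ins (set to their targets so the auxiliary losses vanish), the variance-loss form is an assumption, and the loss weights are assumed equal:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma_min = 1e-4
x1 = rng.standard_normal((80, 100))                    # target mel-spectrogram
x0 = rng.standard_normal(x1.shape)                     # Gaussian prior sample
y_coarse = x1 + 0.3 * rng.standard_normal(x1.shape)    # coarse generator output

# Projection targets (simplified: path mean taken as the line t -> t * x1)
t_star = float(np.clip(np.vdot(y_coarse, x1) / np.vdot(x1, x1), 0.0, 1.0))
mu_star = t_star * x1

# Stand-ins for SFM head predictions
mu_pred, t_pred, logvar_pred = mu_star, t_star, np.log(0.1)

loss_mel = np.mean((y_coarse - x1) ** 2)               # coarse mel L2 loss
loss_proj = np.mean((mu_pred - mu_star) ** 2)          # orthogonal projection loss
loss_time = (t_pred - t_star) ** 2                     # time prediction loss
loss_var = (np.exp(logvar_pred) - np.var(y_coarse - mu_star)) ** 2  # variance loss (assumed form)

# Shallow FM loss on the rescaled segment, s from a uniform scheduler
s = rng.uniform()
x_that = (1 - (1 - sigma_min) * t_star) * x0 + t_star * x1
v_s = (1 - t_star) * (x1 - (1 - sigma_min) * x0)       # constant target velocity
x_s = x_that + s * v_s
v_theta = v_s                                          # a trained field would approximate this
loss_sfm = np.mean((v_theta - v_s) ** 2)

total = loss_mel + loss_proj + loss_time + loss_var + loss_sfm
```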
4. Inference Procedure and Computational Advantages
Inference in SFM is characterized by initialization from the learned intermediate state $x_{\hat t}$, rather than from white noise, focusing computation on the latter segment of the CondOT path. The procedure is as follows:
- Generate the coarse output $\tilde y$ and the head predictions $(\mu, \hat t, \sigma^2)$.
- Compute the CondOT-aligned projection of $\tilde y$.
- For the chosen SFM strength, form the rescaled variables as above.
- Sample $x_{\hat t} \sim \mathcal{N}(\mu, \sigma^2 I)$ as the initial state.
- Solve the ODE over $s \in [0, 1]$ with the learned vector field, outputting the refined mel-spectrogram.
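The final ODE step can be sketched with a fixed-step Euler solver standing in for the adaptive solvers used in the paper; the vector field here is the exact constant target, standing in for the learned network:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma_min = 1e-4
x0 = rng.standard_normal((80, 100))
x1 = rng.standard_normal((80, 100))
t_hat = 0.6  # illustrative split point

# Intermediate state on the CondOT path (at inference, sampled from N(mu, sigma^2 I))
x_that = (1 - (1 - sigma_min) * t_hat) * x0 + t_hat * x1

# Constant target velocity of the rescaled segment (a learned field approximates it)
v_true = (1 - t_hat) * (x1 - (1 - sigma_min) * x0)

def euler_solve(v_fn, x_init, n_steps=8):
    """Fixed-step Euler integration of dx/ds = v_fn(x, s) over s in [0, 1]."""
    x, ds = x_init.copy(), 1.0 / n_steps
    for k in range(n_steps):
        x = x + ds * v_fn(x, k * ds)
    return x

y = euler_solve(lambda x, s: v_true, x_that)
```

With the exact constant field the solution lands on the CondOT endpoint $\sigma_{\min} x_0 + x_1$; starting from $x_{\hat t}$ rather than $x_0$ is what lets adaptive solvers spend far fewer function evaluations.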
This higher-SNR initialization dramatically reduces the number of function evaluations required by adaptive ODE solvers: SFM yields accelerations of 47.6%–60.8% compared to vanilla CFM using Dopri5, Bogacki–Shampine(3), and other solvers on LJ Speech (Yang et al., 18 May 2025).
5. Integration with TTS Architectures
The SFM head is integrated as a light module after the coarse generator. It consists of two 1D-convolutional layers with ReLU and LayerNorm, followed by a linear layer outputting three channels per frame. These correspond to $\mu$, $\hat t$, and $\sigma^2$, with $\hat t$ and $\sigma^2$ subsequently mean-pooled over time and post-processed. The coarse generator and the SFM head are jointly trained until convergence, but only the SFM head and learned vector field are required for inference (Yang et al., 18 May 2025).
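A rough numpy sketch of the head's shape flow is given below. The weights are random, the conv/LayerNorm layers are hand-rolled stand-ins for framework modules, and the sigmoid squashing of the pooled $\hat t$ channel is an assumption about the post-processing:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(4)

def conv1d(x, w, b):
    """'Same'-padded 1D convolution. x: (C_in, T), w: (C_out, C_in, K), b: (C_out,)."""
    k = w.shape[2]
    xp = np.pad(x, ((0, 0), (k // 2, k // 2)))
    win = sliding_window_view(xp, k, axis=1)           # (C_in, T, K)
    return np.einsum("ctk,ock->ot", win, w) + b[:, None]

def layer_norm(x, eps=1e-5):
    """Normalize over the channel axis, per frame."""
    return (x - x.mean(0)) / np.sqrt(x.var(0) + eps)

def sfm_head(h, params):
    """Two conv+ReLU+LayerNorm blocks, then a linear layer to 3 channels per frame."""
    for w, b in params["convs"]:
        h = layer_norm(np.maximum(conv1d(h, w, b), 0.0))
    out = params["lin_w"] @ h + params["lin_b"][:, None]   # (3, T)
    mu_chan, t_chan, logvar_chan = out[0], out[1], out[2]
    t_hat = 1.0 / (1.0 + np.exp(-t_chan.mean()))           # mean-pool, squash into (0, 1)
    sigma2 = np.exp(logvar_chan.mean())                    # mean-pool, exponentiate
    return mu_chan, t_hat, sigma2

# Illustrative sizes: 80-dim hidden states, 100 frames, 64 conv channels
params = {
    "convs": [(0.1 * rng.standard_normal((64, 80, 3)), np.zeros(64)),
              (0.1 * rng.standard_normal((64, 64, 3)), np.zeros(64))],
    "lin_w": 0.1 * rng.standard_normal((3, 64)),
    "lin_b": np.zeros(3),
}
mu, t_hat, sigma2 = sfm_head(rng.standard_normal((80, 100)), params)
```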
6. Empirical Results
Quantitative assessment on the LJ Speech, VCTK, and LibriTTS corpora demonstrates that SFM produces consistent improvements in synthesized speech naturalness, as measured by pseudo-MOS (PMOS) and word error rate (WER), across multiple TTS backbones (Matcha-TTS, StableTTS, CosyVoice). For example, on LJ Speech with Matcha-TTS, baseline PMOS is $4.217$ and SFM achieves $4.257$. Substantial inference speed improvements are reported for Heun(2)-based solvers. Subjective CMOS preference studies further corroborate the relative improvements (Yang et al., 18 May 2025).
7. Practical Considerations and Extensions
Appropriate tuning of the SFM strength is critical for optimal performance, typically achieved through grid search on a validation set. The method requires a coarse generator capable of yielding high-fidelity mel-spectrogram estimates as a foundation. Ablations indicate that using the coarse output directly as the intermediate state (SFM-c) results in collapse, and omitting speaker embeddings (SFM-t) impairs zero-shot speaker similarity. Training hyper-parameters and data flows for SFM generalize across architectures and modalities, and the SFM concept is extensible to other CondOT-based FM setups and potentially to diffusion or super-resolution tasks (Yang et al., 18 May 2025).
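The strength tuning amounts to a one-dimensional search over a validation metric; a minimal sketch follows, where the candidate grid and the toy validation curve (peaking at 0.6) are purely illustrative:

```python
def grid_search_strength(candidates, validate):
    """Pick the SFM strength maximizing a validation score (e.g. a PMOS proxy)."""
    scores = {a: validate(a) for a in candidates}
    best = max(scores, key=scores.get)
    return best, scores

# Toy validation curve standing in for real PMOS/WER evaluation runs
best, scores = grid_search_strength([0.2, 0.4, 0.6, 0.8],
                                    lambda a: -(a - 0.6) ** 2)
```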
| Architecture | PMOS (Baseline) | PMOS (SFM) | WER (Baseline) | WER (SFM) | Speed-up (RTF) |
|---|---|---|---|---|---|
| Matcha-TTS (LJ Speech) | 4.217 | 4.257 | 3.308% | 3.413% | +47.6% to +60.8% (by solver) |
| Matcha-TTS (VCTK) | 4.026 | 4.106 | 1.534% | 0.952% | |
| CosyVoice (LibriTTS) | 4.183 | 4.194 | 3.513% | 3.810% | |
The empirical evidence supports the utility of SFM in reducing computational cost and improving output quality in coarse-to-fine generative frameworks.
References
- "Shallow Flow Matching for Coarse-to-Fine Text-to-Speech Synthesis" (Yang et al., 18 May 2025)