
FM-Singer: Flow Matching for Singing Voices

Updated 8 January 2026
  • FM-Singer is a neural architecture employing conditional flow matching to align latent prior and posterior distributions for expressive singing synthesis and conversion.
  • It integrates cVAE-based and flow-matching decoders to mitigate prior–posterior mismatches, preserving detailed attributes like vibrato and micro-prosody.
  • Empirical evaluations show FM-Singer outperforms baseline models in metrics such as MCD, F0 RMSE, and MOS across multiple language corpora.

FM-Singer designates a class of neural architectures that employ Conditional Flow Matching (CFM) in the latent spaces of generative singing voice models, principally for expressive singing voice synthesis and high-fidelity singing voice conversion. FM-Singer integrates CFM with conditional variational autoencoder (cVAE) or conditional flow-matching decoders, aiming to mitigate prior–posterior mismatch and enhance the preservation of fine-grained expressive attributes such as vibrato and micro-prosody. The approach provides a statistically optimal transport framework for aligning prior and posterior distributions, thus improving the naturalness, expressiveness, and quality of synthesized singing voices (Yun et al., 1 Jan 2026, Chen et al., 8 Aug 2025).

1. Prior–Posterior Mismatch in Conventional Singing Voice Models

Contemporary singing voice synthesis systems, such as cVAE+GAN backbones, model the generative process by introducing a latent variable $z$ to capture variability that is not directly explained by the score-conditioned inputs $c$ (which encode phonetic, pitch, and duration information). During training, the model learns:

  • Posterior encoder: $q_\phi(z\mid x,c)$, where $x$ is the mel spectrogram of the singing recording.
  • Prior network: $p_\theta(z\mid c)$, modeling $z$ conditioned on the score.
  • Decoder: $p_\theta(x\mid z,c)$, reconstructing the mel spectrogram from $z$ and $c$.

The objective function is the cVAE loss,

$$L_{\mathrm{cVAE}} = -\,\mathbb{E}_{z\sim q_\phi(z\mid x,c)}\bigl[\log p_\theta(x\mid z,c)\bigr] + \mathrm{KL}\bigl(q_\phi(z\mid x,c)\,\|\,p_\theta(z\mid c)\bigr).$$

During inference, $z$ is sampled from the prior $p_\theta(z\mid c)$, which can be misaligned with the posterior sampled during training, leading to a degradation of expressive micro-attributes. This mismatch results in a loss of vibrato and over-smoothing of fine-grained dynamics (Yun et al., 1 Jan 2026).
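
As a concrete illustration, a minimal PyTorch-style sketch of this objective is given below; the diagonal-Gaussian parameterization, the L1 reconstruction term, and all tensor names are assumptions for exposition, not details of the FM-Singer implementation.

```python
import torch
import torch.nn.functional as F

def cvae_loss(x, x_hat, mu_q, logvar_q, mu_p, logvar_p):
    """Sketch of L_cVAE: reconstruction plus KL(q_phi(z|x,c) || p_theta(z|c)).

    Both distributions are assumed to be diagonal Gaussians with the given
    means and log-variances; x_hat is the decoder output from p_theta(x|z,c).
    """
    recon = F.l1_loss(x_hat, x)  # stand-in for the negative log-likelihood term
    kl = 0.5 * (
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0
    ).mean()
    return recon + kl
```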

2. Conditional Flow Matching in Latent Space

FM-Singer introduces Conditional Flow Matching to bridge the expressiveness gap by explicitly learning a continuous transport field in latent space from the prior to the posterior. The method defines a time-dependent vector field $f(z(t), t; c)$ governing the evolution of latent variables:

$$\frac{dz(t)}{dt} = f(z(t), t; c), \qquad t\in[0,1].$$

The target trajectory is a linear interpolation between a latent sampled from the prior, $z_0\sim p_\theta(z\mid c)$, and one from the posterior, $z_1\sim q_\phi(z\mid x,c)$:

$$z_t = (1-t)\,z_0 + t\,z_1,\qquad \dot{z}_t^{*} = z_1 - z_0.$$

The flow matching loss minimizes the squared error between the model vector field and the true velocity:

$$L_{\mathrm{FM}} = \mathbb{E}_{z_0, z_1, t}\Bigl[\bigl\| f(z_t, t; c) - (z_1 - z_0)\bigr\|^2\Bigr].$$

This conditional flow field enables explicit prior-to-posterior transport, improving the synthesis of expressive details.
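
A minimal sketch of this loss, assuming a generic PyTorch vector-field network $f(z_t, t; c)$ and batched latents of shape (batch, channels, frames), is shown below; the shapes and the sampling of $t$ are illustrative.

```python
import torch

def cfm_loss(vector_field, z0, z1, c):
    """L_FM = E || f(z_t, t; c) - (z1 - z0) ||^2 with z_t = (1 - t) z0 + t z1."""
    t = torch.rand(z0.shape[0], 1, 1, device=z0.device)  # one t ~ U[0, 1] per example
    z_t = (1.0 - t) * z0 + t * z1                         # linear prior-to-posterior path
    target_velocity = z1 - z0                             # constant velocity along the path
    pred_velocity = vector_field(z_t, t, c)
    return ((pred_velocity - target_velocity) ** 2).mean()
```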

3. Integrated Training and Inference Workflow

The full FM-Singer training objective fuses the cVAE loss, the flow-matching loss, and adversarial/generative losses:

$$L = L_{\mathrm{cVAE}} + \lambda L_{\mathrm{FM}} + L_{\mathrm{GAN/recon/aux}},$$

where $L_{\mathrm{GAN/recon/aux}}$ comprises adversarial, feature-matching, mel-reconstruction, and auxiliary (duration/pitch) regularizations. Empirically, $\lambda = 1$ balances this composite loss effectively.
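
A schematic of this weighting, with placeholder names for each term, might look as follows; the decomposition of the GAN/reconstruction/auxiliary block is illustrative, not the paper's exact breakdown.

```python
def total_loss(loss_cvae, loss_fm, loss_adv, loss_feat, loss_mel, loss_aux, lam=1.0):
    """Composite objective L = L_cVAE + lambda * L_FM + L_GAN/recon/aux (lambda = 1)."""
    loss_gan_recon_aux = loss_adv + loss_feat + loss_mel + loss_aux
    return loss_cvae + lam * loss_fm + loss_gan_recon_aux
```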

For inference, the model samples an initial $z(0)\sim p_\theta(z\mid c)$ and integrates the ODE governed by $f$ from $t=0$ to $t=1$, typically using an adaptive Dormand–Prince (DOPRI5) solver or a fixed-step Euler discretization. The result is a refined latent $z(1)$, rendered into a waveform by a HiFi-GAN-style generator or similar decoder. This preserves parallel decoding, eliminating autoregressive bottlenecks while restoring fine-grained expressiveness (Yun et al., 1 Jan 2026).
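
A fixed-step Euler version of this inference loop, with assumed interfaces for the prior network, vector field, and vocoder, is sketched below; an adaptive DOPRI5 solver (e.g., torchdiffeq.odeint) could replace the explicit loop.

```python
import torch

@torch.no_grad()
def synthesize(prior, vector_field, vocoder, c, num_steps=10):
    """Integrate dz/dt = f(z, t; c) from t = 0 to t = 1 and decode the refined latent."""
    z = prior(c).sample()                      # z(0) ~ p_theta(z | c)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((z.shape[0], 1, 1), i * dt, device=z.device)
        z = z + dt * vector_field(z, t, c)     # Euler step along the learned flow
    return vocoder(z, c)                       # HiFi-GAN-style rendering of z(1)
```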

4. Architectural Components and Hyperparameters

FM-Singer architectures typically exhibit:

  • Encoders: WaveNet-style convolutional–residual blocks for prior/posterior estimation, with hidden channel dimensionality of 1024 and 4–6 layers.
  • CFM Module: Deep depth-separable convolutional stacks (e.g., DDSConv) with kernel size 3, dilation rates [3,5,7,9], and dropout 0.1; time conditioning via FiLM or concatenation of sinusoidal t-embeddings.
  • Generator: HiFi-GAN-influenced upsampler with multi-rate transposed convolutions, distributed over multiple time and frequency scales.
  • Discriminators: Multi-Period, Multi-Scale, and Multi-Resolution Spectrogram adversaries, targeting distinct artifacts.
  • Solver: DOPRI5 (atol, rtol $= 10^{-5}$, max step $= 0.1$) or fixed-step Euler ($\Delta t = 1/N$).
  • Optimization: AdamW with learning rate $\sim 10^{-4}$ for the generator and $2\times 10^{-4}$ for the discriminator, $\beta_1 = 0.8$, $\beta_2 = 0.99$; typical batch size of 16 on A100-class hardware.

These settings have proven robust on both Korean and Chinese singing corpora (Yun et al., 1 Jan 2026).
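
As an illustration of the optimization settings above, a possible AdamW configuration is sketched below; `generator` and `discriminator` stand for the modules described in this section and are not actual FM-Singer class names.

```python
import torch

def build_optimizers(generator: torch.nn.Module, discriminator: torch.nn.Module):
    """AdamW optimizers matching the hyperparameters listed in Section 4."""
    opt_g = torch.optim.AdamW(generator.parameters(), lr=1e-4, betas=(0.8, 0.99))
    opt_d = torch.optim.AdamW(discriminator.parameters(), lr=2e-4, betas=(0.8, 0.99))
    return opt_g, opt_d
```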

5. Empirical Evaluation and Comparative Performance

FM-Singer has been rigorously evaluated against state-of-the-art baselines, notably VISinger2 (cVAE+GAN without flow), DiffSinger (diffusion-based), and two-stage FastSpeech + RefineGAN pipelines. Performance has been established on Korean and Chinese singing datasets using mel-cepstral distortion (MCD), F0 RMSE, and mean opinion scores (MOS):

| Model              | MCD (dB) | F0 RMSE (Hz) | MOS   |
|--------------------|----------|--------------|-------|
| VISinger2 NF (KR)  | 5.78     | 39.1         | ~3.57 |
| FM-Singer (KR)     | 4.82     | 35.8         | ~4.04 |
| VISinger2 NF (CN)  | ~2.94    | 25.5         | –     |
| FM-Singer (CN)     | ~2.70    | 25.2         | –     |

Qualitative analysis confirms greater vibrato depth, more consistent micro-prosody, and pitch trajectories matching target recordings in expressive regions (e.g., vibrato). Cleaner harmonic structure is also observed in generated outputs (Yun et al., 1 Jan 2026).

FM-Singer's conditional flow-matching also underpins advanced singing voice conversion systems, such as DAFMSVC, where dual cross-attention fusion (over SSL features, melody, speaker embeddings) and CFM-based decoders yield further improvements in timbre similarity (SSIM), pitch/loudness accuracy (F0CORR, Loudness RMSE), and subjective MOS compared to DDSP-SVC, So-VITS-SVC, and NeuCoSVC (Chen et al., 8 Aug 2025).

6. Broader Applications and Methodological Impact

The FM-Singer framework is applicable not only to synthesis but also to singing voice conversion and deepfake detection tasks. In the conversion setting, DAFMSVC demonstrates the extension of FM-Singer principles to timbre transfer, using SSL feature replacement and dual attention to fuse target timbre and source content prior to CFM-based waveform generation. In detection, foundation model studies leverage x-vector-based speech models, which, due to their sensitivity to pitch and micro-prosody, outperform large music foundation models for detecting synthesized singing vocals. Hybrid architectures (e.g., FIONA) fuse representations via kernel alignment methods, further improving discrimination between bona-fide and deepfake singing (Phukan et al., 2024).

7. Contributions and Prospective Developments

FM-Singer's main scientific contributions include:

  1. Explicit identification and treatment of prior–posterior expressive mismatch in cVAE-based singing synthesis.
  2. Introduction of latent-space conditional flow matching yielding ODE-driven transports that preserve expressiveness in synthesis.
  3. Lightweight, parallelizable module design capable of real-time inference and expressivity enhancement at minimal computational cost.
  4. Strong empirical performance across both singing voice synthesis and singing voice conversion benchmarks in multiple languages.
  5. Architectural extensibility to broader generative and detection pipelines in audio/music AI.

Future directions include scaling to larger, more diverse datasets, integration with broader self-supervised or cross-modal (lyrics, video) models, lightweight on-device inference via distillation, and adversarially robust detection of synthetic singing content under high-resemblance conditions (Yun et al., 1 Jan 2026, Chen et al., 8 Aug 2025, Phukan et al., 2024).
