Flow-Matching-Based JAM: Deterministic Generative Modeling

Updated 29 July 2025
  • The paper introduces flow matching using deterministic ODE trajectories to map noise to complex multimodal targets, improving stability and computational efficiency.
  • It enables fine-grained control in song generation by integrating word-level timing and duration bias, leading to superior alignment and musical coherence.
  • The framework incorporates direct preference optimization to achieve automated aesthetic alignment, enhancing both subjective quality and inference speed.

A flow-matching-based JAM refers to generative frameworks and multi-modal synthesis systems that employ flow matching, a method for learning deterministic mappings from a simple reference distribution to a complex target (such as song, speech, or joint audio-motion sequences), with an emphasis on joint modeling and alignment. The acronym "JAM" appears variously as "Joint Audio-Motion," "Joint Alignment Model," and, specifically in the context of lyrics-to-song generation, as "JAM: A Tiny Flow-based Song Generator with Fine-grained Controllability and Aesthetic Alignment." These systems use flow matching as the principal generative engine and often include mechanisms for fine-grained control or joint matching of multiple modalities.

1. Flow Matching as the Foundation of JAM

Flow matching is a simulation-free generative modeling approach in which a parameterized vector field $v_\theta(x_t, t)$ is trained to map a latent sample along a deterministic trajectory, typically modeled by an ordinary differential equation (ODE), from a source (e.g., noise) to a target (e.g., a song’s latent embedding or multimodal data). In the canonical conditional formulation, an intermediate latent $z_t$ is constructed as a linear interpolation:

$$z_t = (1 - t)\, z_1 + t\, z_0$$

where $z_1$ is the clean target sample, $z_0$ is a noise sample, and $t \in [0, 1]$. The reference velocity is given by $v_t = z_0 - z_1$, and the conditional flow matching loss minimized during training is

$$\mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{z_1, z_0, t, c} \left[ \left\| u(z_t, t, c; \theta) - (z_0 - z_1) \right\|^2 \right]$$

where $u$ is the learned velocity field (the conditional parameterization of $v_\theta$) and $c$ denotes conditioning information (such as lyrics, timing, or multimodal cues) (Liu et al., 28 Jul 2025, Kwon et al., 30 Jun 2025). This approach generalizes score-based diffusion methods by directly learning the transport vector field, bypassing the need to learn the score function (the gradient of the log-probability) or to simulate or invert stochastic differential equations, leading to improved stability and efficiency.
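To make the objective concrete, here is a minimal PyTorch sketch of conditional flow matching training and deterministic Euler sampling. It is an illustration of the equations above, not the published implementation: the `velocity_net` interface, tensor shapes, and the number of Euler steps are assumptions.

```python
import torch
import torch.nn as nn

def flow_matching_loss(velocity_net: nn.Module,
                       z1: torch.Tensor,      # clean target latents, e.g. (B, T, D)
                       cond: torch.Tensor) -> torch.Tensor:
    """Conditional flow matching loss: regress u(z_t, t, c) onto z0 - z1."""
    z0 = torch.randn_like(z1)                      # noise sample
    t = torch.rand(z1.shape[0], device=z1.device)  # one time per example in [0, 1]
    t_exp = t.view(-1, *([1] * (z1.dim() - 1)))    # broadcast over latent dims
    zt = (1.0 - t_exp) * z1 + t_exp * z0           # linear interpolation between target and noise
    v_ref = z0 - z1                                # reference velocity along the straight path
    u = velocity_net(zt, t, cond)                  # predicted velocity field
    return ((u - v_ref) ** 2).mean()

@torch.no_grad()
def sample_ode(velocity_net: nn.Module,
               cond: torch.Tensor,
               shape: tuple,
               steps: int = 32,
               device: str = "cpu") -> torch.Tensor:
    """Deterministic Euler integration from noise (t = 1) back to data (t = 0)."""
    z = torch.randn(shape, device=device)          # start at the noise end of the path
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), 1.0 - i * dt, device=device)
        u = velocity_net(z, t, cond)               # u approximates dz/dt = z0 - z1
        z = z - dt * u                             # step toward t = 0 (the clean sample)
    return z
```

In JAM the velocity network is a latent-space transformer conditioned on lyric, timing, and style information; the sketch only shows the straight-path interpolation, the regression target $z_0 - z_1$, and the deterministic ODE trajectory from noise to data.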

2. Fine-grained and Joint Controllability in Song Generation and Multi-Modal Synthesis

JAM distinguishes itself by incorporating conditional mechanisms supporting fine-grained (e.g., word- or phoneme-level) timing and duration control in song synthesis (Liu et al., 28 Jul 2025). Input lyrics are encoded alongside detailed timing annotations, and per-token timing embeddings are upsampled and fused into the latent representation (a minimal sketch of this fusion follows the list below). This layered control paradigm encompasses:

  • Word-level timing: Each word $w_i$ is associated with start and end times $(t_i^{\text{start}}, t_i^{\text{end}})$ and its corresponding phoneme sequence.
  • Global duration control: The song’s overall length is controlled by an encoded target duration input.
  • Token-Level Duration Control (TDC): Beyond a target duration, latent tokens receive a learnable bias to separate genuine musical content from silence or padding.
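As an illustration of how such conditioning can be fused into the latent sequence (the sketch referenced above), the code below upsamples per-word timing embeddings to the latent frame grid and adds a learnable bias marking frames that contain musical content. Module names, the frame rate, and all dimensions are assumptions for exposition rather than the published architecture.

```python
import torch
import torch.nn as nn

class TimingConditioner(nn.Module):
    """Illustrative fusion of word-level timing and a token-level duration bias."""
    def __init__(self, d_model: int = 512, frame_rate: float = 25.0):
        super().__init__()
        self.frame_rate = frame_rate                    # assumed latent frames per second
        self.word_proj = nn.Linear(2, d_model)          # embeds (start, end) times per word
        self.content_bias = nn.Parameter(torch.zeros(d_model))  # bias marking content vs. padding

    def forward(self, latents: torch.Tensor,            # (B, T, D) latent tokens
                word_times: torch.Tensor,               # (B, W, 2) start/end seconds per word
                song_duration: torch.Tensor) -> torch.Tensor:  # (B,) target duration in seconds
        T = latents.shape[1]
        word_emb = self.word_proj(word_times)           # (B, W, D)
        # Upsample word embeddings to the latent frame grid by assigning each frame
        # to the word whose [start, end) interval contains it.
        frame_times = torch.arange(T, device=latents.device) / self.frame_rate  # (T,)
        starts, ends = word_times[..., 0], word_times[..., 1]                   # (B, W)
        inside = (frame_times[None, :, None] >= starts[:, None, :]) & \
                 (frame_times[None, :, None] < ends[:, None, :])                # (B, T, W)
        timing = torch.einsum("btw,bwd->btd", inside.float(), word_emb)         # (B, T, D)
        # Token-level duration bias: frames before the target end-of-song get a content bias.
        content_mask = (frame_times[None, :] < song_duration[:, None]).float()  # (B, T)
        bias = content_mask[..., None] * self.content_bias                      # (B, T, D)
        return latents + timing + bias
```

Frames that fall outside every word interval receive no timing contribution, and frames beyond the target duration receive no content bias, loosely mirroring the separation of musical content from silence or padding.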

In the joint audio-motion case (JAM-Flow), multiple modalities (e.g., speech and facial motion for talking head generation) are synthesized in a mutually conditioned fashion. The architecture (MM-DiT) partitions transformer blocks into “modality-specific” and “selective joint attention” layers, enabling partial information fusion. Temporal alignment is ensured through scaled rotary positional embeddings and custom attention masking strategies (Kwon et al., 30 Jun 2025).
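The selective joint-attention idea can be pictured as a choice of attention mask over concatenated audio and motion tokens: full attention in joint layers, block-diagonal attention in modality-specific layers. The sketch below is a simplified stand-in for the MM-DiT blocks of (Kwon et al., 30 Jun 2025); head counts, dimensions, and the residual structure are illustrative assumptions, and the scaled rotary positional embeddings are omitted.

```python
import torch
import torch.nn as nn

class JointAttentionBlock(nn.Module):
    """Attention over concatenated [audio | motion] tokens.

    joint=True  -> full attention, so the two modalities condition each other.
    joint=False -> block-diagonal mask, so each modality attends only to itself.
    """
    def __init__(self, d_model: int = 512, n_heads: int = 8, joint: bool = True):
        super().__init__()
        self.joint = joint
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio: torch.Tensor, motion: torch.Tensor):
        # audio: (B, Ta, D), motion: (B, Tm, D)
        x = torch.cat([audio, motion], dim=1)          # (B, Ta + Tm, D)
        Ta, Tm = audio.shape[1], motion.shape[1]
        mask = None
        if not self.joint:
            # Disallow cross-modal attention: True entries are masked out.
            mask = torch.ones(Ta + Tm, Ta + Tm, dtype=torch.bool, device=x.device)
            mask[:Ta, :Ta] = False                     # audio -> audio allowed
            mask[Ta:, Ta:] = False                     # motion -> motion allowed
        h = self.norm(x)
        out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + out                                    # residual connection
        return x[:, :Ta], x[:, Ta:]
```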

3. Aesthetic Alignment and Preference Optimization

For generative song synthesis, technical alignment (timing, word error rate) does not automatically yield musically pleasing results. JAM incorporates Direct Preference Optimization (DPO) to achieve aesthetic alignment, iteratively refining the model based on synthetic preference pairs. Outputs are batch-generated, ranked via an automated aesthetic scorer (e.g., SongEval on vocal naturalness, enjoyment, structure), and the model is then updated using a ranking-based objective:

$$\mathcal{L}_{\text{DPO-FM}} = -\mathbb{E}_{t, x^W, x^L} \log \sigma \left( -\beta \left[ \left\| u(x_t^W, t; \theta) - v_t^W \right\|^2 - \left\| u(x_t^L, t; \theta) - v_t^L \right\|^2 - (\text{reference terms}) \right] \right)$$

where $x^W$ (winner) and $x^L$ (loser) are contrasting generations, and $\sigma$ is the sigmoid function (Liu et al., 28 Jul 2025). This bootstraps the preference alignment without requiring manual annotation, leading to improved subjective and musical quality.
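A hedged sketch of this objective, building on the flow-matching loss defined earlier: per-sample flow-matching errors for the winner and loser are compared under the current policy and a frozen reference model. The function names and the exact handling of the reference terms are assumptions.

```python
import torch
import torch.nn.functional as F

def fm_error(net, z1, z0, t, cond):
    """Per-example flow-matching error ||u(z_t, t, c) - (z0 - z1)||^2."""
    t_exp = t.view(-1, *([1] * (z1.dim() - 1)))
    zt = (1.0 - t_exp) * z1 + t_exp * z0
    u = net(zt, t, cond)
    return ((u - (z0 - z1)) ** 2).flatten(1).sum(dim=1)   # (B,)

def dpo_fm_loss(policy_net, ref_net, z1_win, z1_lose, cond, beta: float = 0.1):
    """DPO-style preference loss on flow-matching errors for winner/loser pairs."""
    z0 = torch.randn_like(z1_win)                          # shared noise for both branches
    t = torch.rand(z1_win.shape[0], device=z1_win.device)
    err_w = fm_error(policy_net, z1_win, z0, t, cond)
    err_l = fm_error(policy_net, z1_lose, z0, t, cond)
    with torch.no_grad():                                  # frozen reference model
        ref_w = fm_error(ref_net, z1_win, z0, t, cond)
        ref_l = fm_error(ref_net, z1_lose, z0, t, cond)
    # Lower error on the winner (relative to the reference) is rewarded.
    margin = (err_w - err_l) - (ref_w - ref_l)
    return -F.logsigmoid(-beta * margin).mean()
```

Preference pairs are produced by ranking batch-generated outputs with the automated aesthetic scorer, so the loop requires no manual annotation.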

4. Evaluation and Benchmarking

JAM’s performance is evaluated using the JAME dataset, which provides genre-clustered, contamination-free benchmarks for full-song generation. Metrics include:

  • Word and phoneme error rates (WER, PER): Improved by word-level alignment (a minimal WER computation is sketched after this list).
  • Style adherence and musical aesthetics: Measured via MuQ-MuLan similarity, SongEval, and Fréchet Audio Distance (FAD).
  • Inference efficiency: Achieved with a lightweight (530M parameter) architecture, outperforming or matching larger diffusion or transformer models on both accuracy and speed.
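For reference, word error rate is the word-level edit distance between the transcribed and reference lyrics, normalized by the number of reference words; in practice the hypothesis comes from transcribing the generated vocals with an ASR system. The snippet below is a minimal self-contained computation, not the paper's evaluation pipeline.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance (sub/ins/del) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# e.g. word_error_rate("shine on me tonight", "shine on tonight") == 0.25
```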

JAM demonstrates improved intelligibility, musical coherence, and content enjoyment compared to contemporaneous models such as DiffRhythm, ACE-Step, and LeVo. Performance is robust across music genres, establishing JAM as a new baseline in controllable song generation (Liu et al., 28 Jul 2025).

5. Broader Applications and Innovations

The flow-matching-based JAM paradigm extends beyond music:

  • Joint Audio-Motion Synthesis: JAM-Flow unifies talking head generation and TTS in a single model, supporting text-, audio-, or motion-driven conditioning. It achieves tight temporal alignment (lip motion and speech), generalized via inpainting-style training (Kwon et al., 30 Jun 2025).
  • Multi-modal transformers (MM-DiT): The architecture employs cross-modal fusion at early layers and modality specialization at deeper layers, attaining efficient mutual conditioning while preserving domain-specific representations.
  • Generalization to Other Tasks: The principles underlying flow-matching-based JAM—error-minimizing straight flow trajectories, deterministically controlled ODE-based mapping, and joint conditional objectives—are relevant for sequential recommendation, robotic policy synthesis, and broader multi-modal generation scenarios.

6. Future Directions

Key research avenues include:

  • Automatic duration/phoneme predictors: Reducing dependence on annotated timings by jointly training sequence duration predictors with the generator.
  • Expressive/nuanced control: Extending control mechanisms to pitch, vibrato, and dynamics, enabling richer expressive synthesis.
  • Hybrid preference optimization: Combining automated and limited manual feedback for further aesthetic alignment.
  • Cross-lingual and cross-modal extension: Supporting more languages and modalities (gesture, instrument, environmental context), potentially leveraging additional flow-matching variants for robust joint modeling.

7. Summary Table: Technical Innovations in Flow-Matching-Based JAM

| Feature | JAM (Song Generation) | JAM-Flow (Audio-Motion) |
|---|---|---|
| Foundation | Conditional flow matching ODE | Conditional flow matching ODE |
| Fine-grained control | Word/phoneme timing, duration bias | Temporal positional fusion, attention |
| Aesthetic alignment | DPO on automated preference pairs | Inpainting objective for cross-modality |
| Key architecture | Latent VAE + timing embedder | MM-DiT: fused transformer blocks |
| Benchmarking | JAME (multi-genre) | CelebV-Dub, HDTF, LibriSpeech-PC |
| Inference efficiency | Lightweight; simulation-free ODE | ODE-based; efficient cross-modal gen. |

Both JAM and JAM-Flow exemplify how flow matching enables deterministic, controllable joint generation of complex outputs, whether through explicit temporal alignment in music or cross-modal fusion of diverse data streams, offering a combination of sample quality, controllability, and computational efficiency that is difficult to obtain with traditional diffusion or autoregressive approaches.

References (2)