
CaLMFlow: VIE-Based Generative Modeling

Updated 23 February 2026
  • CaLMFlow is a generative modeling framework that reformulates flow matching as a Volterra integral equation, integrating causal language models for sequence-based continuous data generation.
  • It outperforms traditional ODE-based methods by reducing numerical instability and achieving improved performance metrics, such as lower 2-Wasserstein scores in high dimensions.
  • The framework tokenizes both spatial and temporal dimensions and supports conditional generation via textual prompts, enabling multi-trajectory context and flexible applications.

CaLMFlow is a generative modeling framework that formulates flow matching as a Volterra integral equation (VIE) and leverages causal language models (CLMs) for continuous data generation. It bridges the methodologies of discrete language modeling and continuous generative modeling by recasting flow matching as a sequence modeling problem and implementing tokenization across both space and time. CaLMFlow is designed for high-dimensional, context-aware generative tasks where conventional ODE solver-dependent methods such as conditional flow matching (CFM) exhibit limitations in scalability and flexibility. The framework utilizes LLMs as function approximators, enabling direct learning of complex flows and facilitating the incorporation of textual prompts for conditional generation (He et al., 2024).

1. Mathematical Foundations: Volterra Integral Equation Formulation

Classical flow matching seeks a time-dependent vector field $v(\phi, t)$ such that the ordinary differential equation

$$\frac{d\phi}{dt} = v(\phi(t), t), \quad \phi(0) = \phi_0$$

describes the evolution of states, which in its integral (second-kind Volterra) form is

$$\phi(t) = \phi_0 + \int_0^t v(\phi(s), s)\,ds. \tag{1}$$

CaLMFlow generalizes this dynamic by introducing an explicit inhomogeneous Volterra integral equation: $$z_t = f(z_t, t) + \int_0^t G(z_s, t, s)\,ds, \tag{2}$$ where $f(z_t, t)$ acts as an inhomogeneous term and $G(z_s, t, s)$ is a Urysohn-type kernel embedding dependence on the trajectory's past. The discretization of the temporal domain, coupled with spatial tokenization, enables CLMs (e.g., GPT-2 or Pythia) to approximate this integral operator autoregressively: the model predicts each successive state conditioned on the sequence of previous tokens, with the transformer's attention weights serving as a functional surrogate for the integral kernel and vector-field terms.
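To make the discretization concrete, the sketch below steps Eq. (2) forward with a left rectangle rule, so that the state at $t_{i+1}$ depends on the whole stored trajectory, as in the paper's autoregressive view. The functions `f` and `G` here are toy stand-ins, not CaLMFlow's learned networks.

```python
# Forward discretization of the second-kind Volterra equation (2) with a
# rectangle rule: z_{i+1} = f(z_i, t_{i+1}) + sum_j dt * G(z_j, t_{i+1}, t_j).
import numpy as np

def f(z, t):
    return np.ones_like(z)      # toy inhomogeneous term

def G(z_s, t, s):
    return 0.5 * z_s            # toy Urysohn-type kernel

def volterra_step(z_hist, ts, i):
    """Approximate z at ts[i+1] from the stored trajectory z_hist[0..i]."""
    t_next = ts[i + 1]
    dt = ts[i + 1] - ts[i]
    integral = sum(dt * G(z_hist[j], t_next, ts[j]) for j in range(i + 1))
    return f(z_hist[i], t_next) + integral

ts = np.linspace(0.0, 1.0, 11)
z = [np.zeros(2)]
for i in range(len(ts) - 1):
    z.append(volterra_step(z, ts, i))
```

Note the nested dependence on the entire history: this is precisely the structure that causal attention over past tokens can mimic.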

2. Objectives, Loss Functions, and Tokenization

Directly optimizing the Volterra flow-matching loss

$$\mathcal{L}_{\mathrm{VFM}} = \mathbb{E}_{p(z^N)} \left\| z^N - \hat z^N \right\|^2$$

is infeasible due to intractable marginal distributions. CaLMFlow adopts the conditional Volterra flow-matching (CVFM) strategy with linear interpolation (analogous to OT linear paths), $z^N_{z_0, z_N}(t_i) = (1 - t_i)\,z_0 + t_i\,z_N$, with the loss

$$\mathcal{L}_{\mathrm{CVFM}} = \mathbb{E}_{z_0 \sim p_0,\, z_N \sim q} \left\| z^N_{z_0, z_N} - \hat z^N \right\|^2. \tag{3}$$
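A minimal numeric sketch of how a CVFM training target and loss are formed: a pair $(z_0, z_N)$ is sampled, interpolated along the linear path, and the endpoint is compared to the model's prediction. The `model_pred` value below is a placeholder standing in for the CLM's output, not the actual network.

```python
# Sketch of the conditional Volterra flow-matching target of Eq. (3).
import numpy as np

rng = np.random.default_rng(0)
N, D = 8, 4                                  # timepoints, data dimension
z0 = rng.standard_normal(D)                  # z0 ~ p0 (source)
zN = rng.standard_normal(D) + 3.0            # zN ~ q  (target)

ts = np.linspace(0.0, 1.0, N + 1)
path = np.stack([(1 - t) * z0 + t * zN for t in ts])   # OT-style linear path

model_pred = path[-1] + 0.1                  # placeholder prediction of z^N
cvfm_loss = np.mean((path[-1] - model_pred) ** 2)      # Eq. (3) MSE
```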

The VIE-discretized next-state prediction is

$$\hat z^{i+1} = f_\theta(z^i, t_{i+1}) + \sum_{j=0}^{i} \Delta t_{i+1}\, G_\theta(z_j, t_{i+1}, t_j), \quad \Delta t_k = t_k - t_{k-1}. \tag{4}$$

Continuous variables are modeled with a VAE head, introducing a KL penalty to form the total objective: $$\mathcal{L}_{\mathrm{VCVFM}} = \mathcal{L}_{\mathrm{CVFM}} + \beta\,\mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right). \tag{5}$$ Tokenization occurs over temporal points (indexed by $N$), spatial subdivisions per timepoint (each of dimension $K$), and multi-trajectory context (with $M$ parallel trajectories), yielding tensorized inputs for efficient CLM processing.
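The spatiotemporal tokenization can be pictured as flattening a tensor of states into one causal sequence. The $(M, N, K, d)$ layout below is an assumption for illustration; the paper's exact tensorization may differ.

```python
# Sketch: M parallel trajectories, each with N timepoints split into K
# spatial sub-tokens of dimension d, flattened into one token sequence.
import numpy as np

M, N, K, d = 2, 4, 3, 8          # trajectories, timepoints, spatial tokens, token dim
states = np.zeros((M, N, K, d))  # placeholder continuous states

tokens = states.reshape(M * N * K, d)   # one flat sequence for the causal LM
```

The sequence length thus scales as $M \cdot N \cdot K$, which is the fidelity-versus-compute trade-off revisited in Section 7.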

3. Model Architecture and Training Procedures

The CLM backbone for CaLMFlow is a decoder-only transformer with causal self-attention, trained to map sequentially ordered tokens (optionally prefixed with textual prompts) to next-token predictions. For each token, a small VAE head predicts mean and variance parameters $(\mu_i, \sigma_i)$ for continuous token sampling. Reconstruction is trained with the ELBO; diversity at inference is regulated by the sampling temperature $\tau$. Temporal order is preserved by the attention mask, and position and learned token embeddings encode both time and space.

Key architectural features:

  • Causal self-attention (upper-triangular masking)
  • Variable sequence lengths depending on spatiotemporal granularity ($N$, $K$) and number of joint trajectories ($M$)
  • A small MLP ($S_\theta$) for projecting input instances into token embeddings
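The causal masking mentioned above can be sketched in a few lines; position $i$ may attend only to positions $j \le i$, so strictly upper-triangular entries are disallowed.

```python
# Sketch: the upper-triangular causal mask of a decoder-only transformer.
import numpy as np

def causal_mask(seq_len):
    # True marks disallowed (future) attention positions
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

mask = causal_mask(4)
```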

High-level training pseudocode is:

for each minibatch:
    sample (z0, zN) pairs: z0 ~ p0, zN ~ q
    form linear path: z_path(t_i) = (1 - t_i) z0 + t_i zN for i = 0..N
    tokenize into X = Tokenize(z_path)
    [optionally prepend text-prompt tokens]
    H = CLM.encode(X)
    for i = 0..N-1:
        mu_i, sigma_i = VAEHead(H[i])
        reconstruct x_i via p_psi(x_i | z_i)   # ELBO term
    build z_hat sequence via next-token predictions
    compute loss: MSE + beta * KL(q_phi || p)
    backpropagate and update theta_CLM, phi, psi, S_theta
Inference proceeds autoregressively by sampling from the VAE head conditioned on the previously generated tokens (He et al., 2024).
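A toy sketch of that autoregressive sampling loop is given below; `clm_hidden` and `vae_head` are placeholder stand-ins for the trained transformer and VAE head, and the temperature scaling of $\sigma$ is an illustrative assumption.

```python
# Sketch: autoregressive inference with a VAE head and temperature tau.
import numpy as np

rng = np.random.default_rng(1)
d, N, tau = 4, 5, 0.7            # token dim, generation steps, temperature

def clm_hidden(tokens):
    return np.mean(tokens, axis=0)            # toy summary of the context

def vae_head(h):
    return h, np.full_like(h, 0.1)            # toy (mu, sigma) prediction

tokens = [rng.standard_normal(d)]             # initial token from z0
for _ in range(N):
    mu, sigma = vae_head(clm_hidden(np.stack(tokens)))
    tokens.append(mu + tau * sigma * rng.standard_normal(d))  # sample next state
trajectory = np.stack(tokens)
```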

4. Comparative Analysis with ODE-based and Flow Matching Methods

Classical ODE-based generative modeling (including CNFs and neural ODEs) relies on solving differential equations of the form $\dot\phi = v(\phi, t)$ using adaptive solvers such as dopri5, necessitating either expensive adjoint methods or simulation-free scores as in conditional flow matching (CFM). CFM learns vector fields to match OT-linear reference paths but remains constrained by the ODE paradigm.

CaLMFlow eliminates the need for any black-box ODE solvers by reframing the objective as a Volterra integral equation and applying simulation-free, sequence-based prediction. This results in:

  • Reduced stiffness and numerical instability in high-dimensional settings
  • More stable training and inference, even at 1000D (where CFM's 2-Wasserstein metric deteriorates to $\sim 25$, while CaLMFlow achieves $8$–$11$)
  • Native support for multi-trajectory context and incorporation of auxiliary textual cues
  • Empirical improvements in MMD and 2-Wasserstein metrics relative to CFM and DDPM baselines

5. Experimental Evaluation and Empirical Findings

CaLMFlow has been empirically evaluated across three domains:

Synthetic distributions (Table 1):

  • Tasks: Gaussian→2 Gaussians, Gaussian→8 Gaussians, Gaussian→2 Moons in 100D and 1000D.
  • At 100D, CaLMFlow achieves 2-Wasserstein scores of $2.3$–$3.1$ vs. CFM's $\sim 5.0$.
  • At 1000D, CFM fails ($\sim 25$); CaLMFlow maintains $8$–$11$.
  • Multi-trajectory context (e.g., $M = 8$) further reduces 2-Wasserstein scores.

Spatiotemporal MNIST (Table 7):

  • Modeling image patches as spatiotemporal tokens (varying $K$, $N$); CaLMFlow achieves Inception Scores up to $9.43$ (with 8 tokens), outperforming DDPM and CFM.

Single-cell generation (Section 5.2):

  • Dataset: immune-tissue scRNA-seq (1,000 PCs, 7 cell types × 10 perturbations × 2 chronicities).
  • Metrics: MMD, 2-Wass, Leiden-KLD, adMMD.
  • Unconditional: CaLMFlow (1 trajectory: MMD 0.0060; 5 trajectories: 0.0031) vs. CFM variants ($\sim 0.08$–$0.10$).
  • Conditional (Tables 4–5): Holding out 5 unseen combinatorial labels and conditioning on text prompts, CaLMFlow (NL-pretrained) achieves best-in-class metrics (MMD 0.0181 vs. CFM 0.1105; 2-Wass 0.0150 vs. 0.0435; $R^2$ correlation $\approx 0.99$ vs. CFM's $\approx 0.41$).
  • UMAP visualizations show CaLMFlow tightly matching ground-truth clusters.

| Domain | CFM | CaLMFlow (best) |
| --- | --- | --- |
| 100D 2-Wass (synthetic) | $\sim 5.0$ | $2.3$–$3.1$ |
| 1000D 2-Wass | $\sim 25$ | $8$–$11$ |
| Single-cell MMD | $\sim 0.08$–$0.10$ | $0.0031$–$0.0060$ |
| $R^2$ (conditional) | $\sim 0.41$ | $\sim 0.99$ |
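For reference, a minimal sketch of a kernel MMD estimate, the kind of two-sample metric reported above; this uses a plain biased RBF-kernel estimator with a fixed bandwidth, which is an assumption and may differ from the paper's exact protocol.

```python
# Sketch: biased RBF-kernel MMD^2 between two samples X and Y.
import numpy as np

def mmd2_rbf(X, Y, bandwidth=1.0):
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
same = mmd2_rbf(rng.standard_normal((64, 2)), rng.standard_normal((64, 2)))
far = mmd2_rbf(rng.standard_normal((64, 2)), rng.standard_normal((64, 2)) + 5.0)
```

Samples from matching distributions give a near-zero estimate; shifted samples give a much larger one.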

6. Integration of Textual Context and Generalization

In conditional tasks, CaLMFlow enables context-aware generation by allowing natural-language prompts to condition the generative process. Textual conditions (e.g., “Generate a CD4 T cell stimulated with IL-6 and exposure acute:”) are tokenized and prepended to spatiotemporal input tokens. Two configurations exist:

  • CaLMFlow(R.I.): randomly-initialized CLM
  • CaLMFlow(N.L.): CLM initialized from a pretrained natural-LLM (e.g., Pythia)

Both configurations generalize to compositional conditions not observed during training; the NL-pretrained model demonstrates quantitatively superior performance, indicating that transfer of language understanding to conditioning improves data-driven generalization.
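Mechanically, the conditioning step amounts to prepending embedded prompt tokens to the spatiotemporal token sequence before it enters the causal LM. The embedding matrices below are zero/one placeholders for illustration only.

```python
# Sketch: prepending text-prompt embeddings to spatiotemporal state tokens.
import numpy as np

d = 8
prompt_emb = np.zeros((5, d))        # placeholder embedded prompt tokens
state_tokens = np.ones((12, d))      # placeholder tokenized trajectory states

sequence = np.concatenate([prompt_emb, state_tokens], axis=0)  # prompt first
```

Because the mask is causal, every state token can attend to the full prompt, while prompt tokens never see the states.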

7. Limitations and Future Directions

Critical limitations and proposed directions include:

  • Fidelity vs. computational efficiency: Higher temporal ($N$) and spatial ($K$) resolution enhances sample quality but increases memory and compute requirements.
  • Mathematical formalism: Rigorous Banach-space foundations for multi-trajectory integral-solver variants remain undeveloped.
  • VAE decoding and temperature $\tau$: Hyperparameter tuning is necessary to balance sample diversity and reconstruction accuracy.
  • Model scale: CLM size bounds current applicability in ultra-high-dimensional spaces; scaling up the CLM backbone is anticipated to yield further improvements.
  • Research avenues: Iterative VIE solvers for full-trajectory refinement, alternative kernel parameterizations ($G_\theta$) with cross-attention, extension to multimodal data, and hybrid ODE–IE architectures to unify flow- and diffusion-matching approaches are open problems (He et al., 2024).