
CaLMFlow: VIE-Based Generative Modeling

Updated 23 February 2026
  • CaLMFlow is a generative modeling framework that reformulates flow matching as a Volterra integral equation, integrating causal language models for sequence-based continuous data generation.
  • It outperforms traditional ODE-based methods by reducing numerical instability and achieving improved performance metrics, such as lower 2-Wasserstein scores in high dimensions.
  • The framework tokenizes both spatial and temporal dimensions and supports conditional generation via textual prompts, enabling multi-trajectory context and flexible applications.

CaLMFlow is a generative modeling framework that formulates flow matching as a Volterra integral equation (VIE) and leverages causal language models (CLMs) for continuous data generation. It bridges the methodologies of discrete language modeling and continuous generative modeling by recasting flow matching as a sequence modeling problem and implementing tokenization across both space and time. CaLMFlow is designed for high-dimensional, context-aware generative tasks where conventional ODE solver-dependent methods such as conditional flow matching (CFM) exhibit limitations in scalability and flexibility. The framework utilizes LLMs as function approximators, enabling direct learning of complex flows and facilitating the incorporation of textual prompts for conditional generation (He et al., 2024).

1. Mathematical Foundations: Volterra Integral Equation Formulation

Classical flow matching seeks a time-dependent vector field $v(\phi, t)$ such that the ordinary differential equation

$$\frac{d\phi}{dt} = v(\phi(t), t), \quad \phi(0) = \phi_0$$

describes the evolution of states, which in its integral (second-kind Volterra) form is

$$\phi(t) = \phi_0 + \int_0^t v(\phi(s), s)\,ds. \tag{1}$$

CaLMFlow generalizes this dynamic by introducing an explicit inhomogeneous Volterra integral equation: $$z_t = f(z_t, t) + \int_0^t G(z_s, t, s)\,ds, \tag{2}$$ where $f(z_t, t)$ acts as an inhomogeneous term and $G(z_s, t, s)$ is a Urysohn-type kernel embedding dependence on the trajectory's past. The discretization of the temporal domain, coupled with spatial tokenization, enables CLMs (e.g., GPT-2 or Pythia) to approximate this integral operator autoregressively: the model predicts each successive state conditioned on the sequence of previous tokens, with the transformer's attention weights serving as a functional surrogate for the integral kernel and vector-field terms.
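To make the discretization concrete, the sketch below steps Eq. (2) forward with a left rectangle rule, so that the state at $t_{i+1}$ depends on the whole stored trajectory, as in the paper's autoregressive view. The functions `f` and `G` here are toy stand-ins, not CaLMFlow's learned networks.

```python
# Forward discretization of the second-kind Volterra equation (2) with a
# rectangle rule: z_{i+1} = f(z_i, t_{i+1}) + sum_j dt * G(z_j, t_{i+1}, t_j).
import numpy as np

def f(z, t):
    return np.ones_like(z)      # toy inhomogeneous term

def G(z_s, t, s):
    return 0.5 * z_s            # toy Urysohn-type kernel

def volterra_step(z_hist, ts, i):
    """Approximate z at ts[i+1] from the stored trajectory z_hist[0..i]."""
    t_next = ts[i + 1]
    dt = ts[i + 1] - ts[i]
    integral = sum(dt * G(z_hist[j], t_next, ts[j]) for j in range(i + 1))
    return f(z_hist[i], t_next) + integral

ts = np.linspace(0.0, 1.0, 11)
z = [np.zeros(2)]
for i in range(len(ts) - 1):
    z.append(volterra_step(z, ts, i))
```

Note the nested dependence on the entire history: this is precisely the structure that causal attention over past tokens can mimic.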

2. Objectives, Loss Functions, and Tokenization

Directly optimizing the Volterra flow-matching loss

$$\mathcal{L}_{\mathrm{VFM}} = \mathbb{E}_{p(z^N)} \left\| z^N - \hat z^N \right\|^2$$

is infeasible due to intractable marginal distributions. CaLMFlow adopts the conditional Volterra flow-matching (CVFM) strategy with linear interpolation (analogous to OT linear paths), $z^N_{z_0, z_N}(t_i) = (1 - t_i)\,z_0 + t_i\,z_N$, with the loss

$$\mathcal{L}_{\mathrm{CVFM}} = \mathbb{E}_{z_0 \sim p_0,\, z_N \sim q} \left\| z^N_{z_0, z_N} - \hat z^N \right\|^2. \tag{3}$$
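A minimal numeric sketch of how a CVFM training target and loss are formed: a pair $(z_0, z_N)$ is sampled, interpolated along the linear path, and the endpoint is compared to the model's prediction. The `model_pred` value below is a placeholder standing in for the CLM's output, not the actual network.

```python
# Sketch of the conditional Volterra flow-matching target of Eq. (3).
import numpy as np

rng = np.random.default_rng(0)
N, D = 8, 4                                  # timepoints, data dimension
z0 = rng.standard_normal(D)                  # z0 ~ p0 (source)
zN = rng.standard_normal(D) + 3.0            # zN ~ q  (target)

ts = np.linspace(0.0, 1.0, N + 1)
path = np.stack([(1 - t) * z0 + t * zN for t in ts])   # OT-style linear path

model_pred = path[-1] + 0.1                  # placeholder prediction of z^N
cvfm_loss = np.mean((path[-1] - model_pred) ** 2)      # Eq. (3) MSE
```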

The VIE-discretized next-state prediction is

$$\hat z^{i+1} = f_\theta(z^i, t_{i+1}) + \sum_{j=0}^{i} \Delta t_{i+1}\, G_\theta(z_j, t_{i+1}, t_j), \quad \Delta t_k = t_k - t_{k-1}. \tag{4}$$

Continuous variables are modeled with a VAE head, introducing a KL penalty to form the total objective: $$\mathcal{L}_{\mathrm{VCVFM}} = \mathcal{L}_{\mathrm{CVFM}} + \beta\,\mathrm{KL}\!\left(q_\phi(z \mid x)\,\|\,p(z)\right). \tag{5}$$ Tokenization occurs over temporal points (indexed by $N$), spatial subdivisions per timepoint (each of dimension $K$), and multi-trajectory context (with $M$ parallel trajectories), yielding tensorized inputs for efficient CLM processing.
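The spatiotemporal tokenization can be pictured as flattening a tensor of states into one causal sequence. The $(M, N, K, d)$ layout below is an assumption for illustration; the paper's exact tensorization may differ.

```python
# Sketch: M parallel trajectories, each with N timepoints split into K
# spatial sub-tokens of dimension d, flattened into one token sequence.
import numpy as np

M, N, K, d = 2, 4, 3, 8          # trajectories, timepoints, spatial tokens, token dim
states = np.zeros((M, N, K, d))  # placeholder continuous states

tokens = states.reshape(M * N * K, d)   # one flat sequence for the causal LM
```

The sequence length thus scales as $M \cdot N \cdot K$, which is the fidelity-versus-compute trade-off revisited in Section 7.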

3. Model Architecture and Training Procedures

The CLM backbone for CaLMFlow is a decoder-only transformer with causal self-attention, trained to map sequentially ordered tokens (optionally prefixed with textual prompts) to next-token predictions. For each token, a small VAE head predicts mean and variance parameters $(\mu_i, \sigma_i)$ for continuous token sampling. Reconstruction is trained with the ELBO; diversity at inference is regulated by the sampling temperature $\tau$. Temporal order is preserved by the attention mask, and position and learned token embeddings encode both time and space.

Key architectural features:

  • Causal self-attention (upper-triangular masking)
  • Variable sequence lengths depending on spatiotemporal granularity ($N$, $K$) and number of joint trajectories ($M$)
  • A small MLP ($S_\theta$) for projecting input instances into token embeddings
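The causal masking mentioned above can be sketched in a few lines; position $i$ may attend only to positions $j \le i$, so strictly upper-triangular entries are disallowed.

```python
# Sketch: the upper-triangular causal mask of a decoder-only transformer.
import numpy as np

def causal_mask(seq_len):
    # True marks disallowed (future) attention positions
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

mask = causal_mask(4)
```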

High-level training pseudocode is:

for each minibatch:
    sample (z0, zN) pairs: z0 ~ p0, zN ~ q
    form linear path: z_path(t_i) = (1 - t_i) z0 + t_i zN for i = 0..N
    tokenize into X = Tokenize(z_path)
    [optionally prepend text-prompt tokens]
    H = CLM.encode(X)
    for i = 0..N-1:
        mu_i, sigma_i = VAEHead(H[i])
        reconstruct x_i via p_psi(x_i | z_i)   # ELBO term
    build z_hat sequence via next-token predictions
    compute loss: MSE + beta * KL(q_phi || p)
    backpropagate and update theta_CLM, phi, psi, S_theta
Inference proceeds autoregressively by sampling from the VAE head conditioned on the previously generated tokens (He et al., 2024).
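A toy sketch of that autoregressive sampling loop is given below; `clm_hidden` and `vae_head` are placeholder stand-ins for the trained transformer and VAE head, and the temperature scaling of $\sigma$ is an illustrative assumption.

```python
# Sketch: autoregressive inference with a VAE head and temperature tau.
import numpy as np

rng = np.random.default_rng(1)
d, N, tau = 4, 5, 0.7            # token dim, generation steps, temperature

def clm_hidden(tokens):
    return np.mean(tokens, axis=0)            # toy summary of the context

def vae_head(h):
    return h, np.full_like(h, 0.1)            # toy (mu, sigma) prediction

tokens = [rng.standard_normal(d)]             # initial token from z0
for _ in range(N):
    mu, sigma = vae_head(clm_hidden(np.stack(tokens)))
    tokens.append(mu + tau * sigma * rng.standard_normal(d))  # sample next state
trajectory = np.stack(tokens)
```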

4. Comparative Analysis with ODE-based and Flow Matching Methods

Classical ODE-based generative modeling (including CNFs and neural ODEs) relies on solving differential equations of the form $\dot\phi = v(\phi, t)$ using adaptive solvers such as dopri5, necessitating either expensive adjoint methods or simulation-free scores as in conditional flow matching (CFM). CFM learns vector fields to match OT-linear reference paths but remains constrained by the ODE paradigm.

CaLMFlow eliminates the need for any black-box ODE solvers by reframing the objective as a Volterra integral equation and applying simulation-free, sequence-based prediction. This results in:

  • Reduced stiffness and numerical instability in high-dimensional settings
  • More stable training and inference, even at 1000D (where CFM's 2-Wasserstein metric deteriorates to $\sim 25$, while CaLMFlow achieves $8$–$11$)
  • Native support for multi-trajectory context and incorporation of auxiliary textual cues
  • Empirical improvements in MMD and 2-Wasserstein metrics relative to CFM and DDPM baselines

5. Experimental Evaluation and Empirical Findings

CaLMFlow has been empirically evaluated across three domains:

Synthetic distributions (Table 1):

  • Tasks: Gaussian→2 Gaussians, Gaussian→8 Gaussians, Gaussian→2 Moons in 100D and 1000D.
  • At 100D, CaLMFlow achieves 2-Wasserstein scores of $2.3$–$3.1$ vs. CFM's $\sim 5.0$.
  • At 1000D, CFM fails ($\sim 25$); CaLMFlow maintains $8$–$11$.
  • Multi-trajectory context (e.g., $M = 8$) further reduces 2-Wasserstein scores.

Spatiotemporal MNIST (Table 7):

  • Modeling image patches as spatiotemporal tokens (varying $K$, $N$); CaLMFlow achieves Inception Scores up to $9.43$ (with 8 tokens), outperforming DDPM and CFM.

Single-cell generation (Section 5.2):

  • Dataset: immune-tissue scRNA-seq (1,000 PCs, 7 cell types × 10 perturbations × 2 chronicities).
  • Metrics: MMD, 2-Wass, Leiden-KLD, adMMD.
  • Unconditional: CaLMFlow (1 trajectory: MMD 0.0060; 5 trajectories: 0.0031) vs. CFM variants ($\sim 0.08$–$0.10$).
  • Conditional (Tables 4–5): Holding out 5 unseen combinatorial labels and conditioning on text prompts, CaLMFlow (NL-pretrained) achieves best-in-class metrics (MMD 0.0181 vs. CFM 0.1105; 2-Wass 0.0150 vs. 0.0435; $R^2$ correlation $\approx 0.99$ vs. CFM's $\approx 0.41$).
  • UMAP visualizations show CaLMFlow tightly matching ground-truth clusters.

| Domain | CFM | CaLMFlow (best) |
| --- | --- | --- |
| 100D 2-Wass (synthetic) | $\sim 5.0$ | $2.3$–$3.1$ |
| 1000D 2-Wass | $\sim 25$ | $8$–$11$ |
| Single-cell MMD | $\sim 0.08$–$0.10$ | $0.0031$–$0.0060$ |
| $R^2$ (conditional) | $\sim 0.41$ | $\sim 0.99$ |
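For reference, a minimal sketch of a kernel MMD estimate, the kind of two-sample metric reported above; this uses a plain biased RBF-kernel estimator with a fixed bandwidth, which is an assumption and may differ from the paper's exact protocol.

```python
# Sketch: biased RBF-kernel MMD^2 between two samples X and Y.
import numpy as np

def mmd2_rbf(X, Y, bandwidth=1.0):
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
same = mmd2_rbf(rng.standard_normal((64, 2)), rng.standard_normal((64, 2)))
far = mmd2_rbf(rng.standard_normal((64, 2)), rng.standard_normal((64, 2)) + 5.0)
```

Samples from matching distributions give a near-zero estimate; shifted samples give a much larger one.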

6. Integration of Textual Context and Generalization

In conditional tasks, CaLMFlow enables context-aware generation by allowing natural-language prompts to condition the generative process. Textual conditions (e.g., “Generate a CD4 T cell stimulated with IL-6 and exposure acute:”) are tokenized and prepended to spatiotemporal input tokens. Two configurations exist:

  • CaLMFlow(R.I.): randomly-initialized CLM
  • CaLMFlow(N.L.): CLM initialized from a pretrained natural-LLM (e.g., Pythia)

Both configurations generalize to compositional conditions not observed during training; the NL-pretrained model demonstrates quantitatively superior performance, indicating that transfer of language understanding to conditioning improves data-driven generalization.
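Mechanically, the conditioning step amounts to prepending embedded prompt tokens to the spatiotemporal token sequence before it enters the causal LM. The embedding matrices below are zero/one placeholders for illustration only.

```python
# Sketch: prepending text-prompt embeddings to spatiotemporal state tokens.
import numpy as np

d = 8
prompt_emb = np.zeros((5, d))        # placeholder embedded prompt tokens
state_tokens = np.ones((12, d))      # placeholder tokenized trajectory states

sequence = np.concatenate([prompt_emb, state_tokens], axis=0)  # prompt first
```

Because the mask is causal, every state token can attend to the full prompt, while prompt tokens never see the states.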

7. Limitations and Future Directions

Critical limitations and proposed directions include:

  • Fidelity vs. computational efficiency: Higher temporal ($N$) and spatial ($K$) resolution enhances sample quality but increases memory and compute requirements.
  • Mathematical formalism: Rigorous Banach-space foundations for multi-trajectory integral-solver variants remain undeveloped.
  • VAE decoding and temperature $\tau$: Hyperparameter tuning is necessary to balance sample diversity and reconstruction accuracy.
  • Model scale: CLM size bounds current applicability in ultra-high-dimensional spaces; scaling up the CLM backbone is anticipated to yield further improvements.
  • Research avenues: Iterative VIE solvers for full-trajectory refinement, alternative kernel parameterizations ($G_\theta$) with cross-attention, extension to multimodal data, and hybrid ODE–IE architectures to unify flow- and diffusion-matching approaches are open problems (He et al., 2024).