Warm-Start Flow Matching for Guaranteed Fast Text/Image Generation

Published 19 Mar 2026 in cs.LG | (2603.19360v1)

Abstract: Current auto-regressive (AR) LLMs, diffusion-based text/image generative models, and recent flow matching (FM) algorithms are capable of generating high-quality text/image samples. However, inference (sample generation) in these models is often very time-consuming and computationally demanding, mainly due to the large number of function evaluations, which scales with the number of tokens or the number of diffusion steps. This also necessitates heavy GPU resources, time, and electricity. In this work we propose a novel solution to reduce the sample generation time of flow matching algorithms by a guaranteed speed-up factor, without sacrificing the quality of the generated samples. Our key idea is to utilize computationally lightweight generative models whose generation time is negligible compared to that of the target AR/FM models. The draft samples from a lightweight model, which are fast to generate but of unsatisfactory quality, are regarded as an initial distribution for an FM algorithm. Unlike conventional usage of FM that takes a pure noise (e.g., Gaussian or uniform) initial distribution, the draft samples are already of decent quality, so we can set the starting time to be closer to the end time rather than 0 in the pure noise FM case. This will significantly reduce the number of time steps to reach the target data distribution, and the speed-up factor is guaranteed. Our idea, dubbed Warm-Start FM or WS-FM, can essentially be seen as a learning-to-refine generative model from low-quality draft samples to high-quality samples. As a proof of concept, we demonstrate the idea on some synthetic toy data as well as real-world text and image generation tasks, illustrating that our idea offers guaranteed speed-up in sample generation without sacrificing the quality of the generated samples.


Authors (1)

Minyoung Kim

Summary

The paper presents Warm-Start Flow Matching (WS-FM) which uses lightweight draft models to initialize generative flows, significantly reducing required time steps.
It refines initial outputs via LLMs for text and nearest neighbor searches for images, ensuring high-quality generation results.
Empirical evaluations on synthetic, text, and image domains demonstrate up to 10× speed-ups with maintained or improved sample fidelity.

Warm-Start Flow Matching for Efficient and High-Quality Generative Modeling

Motivation and Problem Statement

Generative modeling with AR LLMs, diffusion, and flow matching (FM) architectures achieves state-of-the-art quality in text and image synthesis but typically incurs substantial computational cost and inference latency. For FM and diffusion processes, this cost largely originates from the reliance on low-information initial distributions (e.g., pure noise), which require many function evaluations over numerous time steps to produce good samples. The prevailing approach places heavy demands on GPU resources and power, which impedes scalability and real-time deployment.

Proposed Methodology: Warm-Start Flow Matching (WS-FM)

The paper introduces Warm-Start Flow Matching (WS-FM), a principled technique that guarantees fast sample generation for FM-based models. WS-FM uses computationally lightweight generative models (e.g., small LSTM or GAN architectures) to produce draft samples of decent quality, which serve as the initial distribution for refinement via FM. This contrasts with conventional FM methods, which start from pure noise. Because the drafts are already closer to the target distribution, WS-FM can start the FM time index at $t_0$, closer to the terminal point $t=1$, rather than at $t=0$. This reduces the required number of time steps by a factor of $1/(1-t_0)$. The speed-up is not only theoretical but also empirically verified, with no degradation in sample quality.
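As a worked illustration of this factor, suppose the sampler uses a uniform step size $h$ on the time axis (an assumption made here for illustration; this summary does not state the discretization). A cold start covers $[0, 1]$ and a warm start covers only $[t_0, 1]$, so the step counts $N_{\text{cold}}$ and $N_{\text{warm}}$ (names introduced here) are:

$$N_{\text{cold}} = \frac{1}{h}, \qquad N_{\text{warm}} = \frac{1 - t_0}{h}, \qquad \frac{N_{\text{cold}}}{N_{\text{warm}}} = \frac{1}{1 - t_0}.$$

For example, with $h = 0.001$ and $t_0 = 0.8$, a cold start needs 1000 steps and a warm start needs 200, a factor of 5.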

A refinement strategy is adopted to establish an efficient coupling between draft samples and target-distribution samples. In text domains, off-the-shelf LLMs refine draft texts when prompted for natural, grammatically correct outputs, while for images, refinement is provided by selecting the nearest neighbor from the dataset. The paired data $(x_{t_0}, x_1)$ are then used during FM training, so that interpolation and velocity-field learning are confined to the span $t_0 \leq t \leq 1$.
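The image-side pairing can be sketched as a brute-force nearest-neighbor lookup. This is a minimal illustration, not the paper's implementation: the function name `pair_with_nearest_neighbors`, the squared Euclidean metric on raw pixels, and the random stand-in data are all assumptions, since this summary does not specify the distance or feature space.

```python
import numpy as np

def pair_with_nearest_neighbors(drafts, dataset):
    """Return (indices, targets): for each draft, its closest dataset item.

    Assumption: closeness is squared Euclidean distance on flattened pixels.
    """
    # Flatten every sample to a vector so arrays of any image shape work.
    d = drafts.reshape(len(drafts), -1).astype(np.float64)
    x = dataset.reshape(len(dataset), -1).astype(np.float64)
    # Pairwise squared distances via ||d||^2 - 2 d.x + ||x||^2.
    sq_dists = (d ** 2).sum(1)[:, None] - 2.0 * d @ x.T + (x ** 2).sum(1)[None, :]
    idx = sq_dists.argmin(axis=1)
    return idx, dataset[idx]

rng = np.random.default_rng(0)
# Stand-in for a training set of 8x8 grayscale images.
dataset = rng.integers(0, 256, size=(100, 8, 8))
# Stand-in for draft-model outputs: three dataset images with small perturbations.
drafts = dataset[[3, 41, 77]] + rng.integers(-2, 3, size=(3, 8, 8))

idx, targets = pair_with_nearest_neighbors(drafts, dataset)
# Each (drafts[i], targets[i]) plays the role of a training pair (x_{t_0}, x_1).
```

For a large dataset the full distance matrix would not fit in memory, so a batched or approximate search would be the natural replacement.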

Theoretical Guarantees and Implementation Details

The WS-FM framework operates within discrete-state FM paradigms, applicable to both text and image generation tasks. The methodology encompasses:

Training lightweight draft models (LSTM or GAN) on the available data distribution $P_1$.
Forming refined pairings $(x_{t_0}, x_1)$ via LLM prompting or nearest-neighbor strategies.
Learning the velocity network $u(t, x_t)$ only over $t \in [t_0, 1]$, normalizing interpolation and generator update steps accordingly.
Implementing velocity time-warping in inference to guarantee consistent transformation from $P_{t_0}$ to $P_1$ at reduced computational cost.
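The restriction of training to $[t_0, 1]$ can be sketched for token sequences as follows. This is an illustration under stated assumptions rather than the paper's procedure: it assumes a mixture-style discrete path in which each position independently switches from the draft token to the target token, and a linear schedule after rescaling $t$ to $s = (t - t_0)/(1 - t_0)$. The function name `warm_start_interpolate` and the token values are hypothetical.

```python
import numpy as np

def warm_start_interpolate(x_t0, x_1, t, t0, rng):
    """Sample x_t on an assumed mixture path from draft x_t0 to target x_1."""
    if not (t0 <= t <= 1.0):
        raise ValueError("t must lie in [t0, 1]")
    # Rescale t from [t0, 1] to s in [0, 1] (assumed normalization).
    s = (t - t0) / (1.0 - t0)
    # Each position independently takes the target token with probability s.
    take_target = rng.random(x_1.shape) < s
    return np.where(take_target, x_1, x_t0)

rng = np.random.default_rng(0)
draft = np.array([1, 2, 3, 4, 5, 6])         # hypothetical draft token ids
target = np.array([11, 12, 13, 14, 15, 16])  # hypothetical refined token ids

x_start = warm_start_interpolate(draft, target, t=0.8, t0=0.8, rng=rng)  # equals draft
x_mid = warm_start_interpolate(draft, target, t=0.9, t0=0.8, rng=rng)    # mix of both
x_end = warm_start_interpolate(draft, target, t=1.0, t0=0.8, rng=rng)    # equals target
```

With this rescaling, the path starts exactly at the draft when $t = t_0$ and ends exactly at the refined target when $t = 1$, so no training signal is spent on $t < t_0$.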

The start time $t_0$ is selected empirically to maximize speed-up without compromising sample fidelity. The training loss is the cross-entropy between the generator logits and the refined targets, computed on the paired data to optimize the FM velocity network.
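A token-level cross-entropy of this kind can be sketched in NumPy. The shapes, the function name `cross_entropy`, and the 27-symbol vocabulary are assumptions for illustration; in practice the logits would come from the velocity network evaluated at $(t, x_t)$ and the targets from the refined samples $x_1$.

```python
import numpy as np

def cross_entropy(logits, targets):
    """Mean cross-entropy; logits has shape (N, V), integer targets shape (N,)."""
    # Subtract the row-wise max before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability assigned to each target token and average.
    return -log_probs[np.arange(len(targets)), targets].mean()

# Uniform logits over a hypothetical 27-symbol vocabulary: loss equals log(27).
logits = np.zeros((4, 27))
targets = np.array([0, 5, 26, 13])
loss = cross_entropy(logits, targets)

# Logits strongly favoring the correct token drive the loss toward zero.
confident = np.full((4, 27), -20.0)
confident[np.arange(4), targets] = 20.0
small_loss = cross_entropy(confident, targets)
```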

Numerical Results: Synthetic, Text, and Image Domains

Extensive experiments demonstrate WS-FM's efficacy:

Synthetic Two Moons Dataset: WS-FM achieves sample quality equal to or better than conventional Discrete Flow Matching (DFM) [Gat et al., 2024], with speed-up factors (e.g., ×10, ×2, ×1.5) that depend on draft model quality.
Text Generation (Text-8 and Wikitext-103): In downstream evaluation using GPT-J-6B as the oracle, WS-DFM (the warm-start variant of DFM) outperforms the original DFM in next-token NLL and perplexity, with ×5 and ×2 speed-up factors. The compute cost of LSTM draft generation is negligible, and WS-DFM yields competitive results even when compared to the LLM used for refinement.
Image Generation (CIFAR-10 grayscale and color): WS-DFM achieves better FID scores than the DFM and DC-GAN baselines, with a substantial reduction in per-sample generation time (e.g., ×5 speed-up for $t_0 = 0.8$), confirming that low-quality GAN outputs are refined into high-quality images.

Notably, the speed-up factors match the theoretical prediction, and sample quality is preserved or improved, which contradicts the assumption that reducing the number of time steps inevitably sacrifices fidelity.
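The relation between a reported speed-up factor and the start time can be inverted directly from $k = 1/(1-t_0)$. The sketch below does this for the factors quoted above. Only the pairing of ×5 with $t_0 = 0.8$ is stated in this summary; the other start times are derived from the formula, and the helper names and the 1000-step cold-start budget are assumptions for illustration.

```python
def start_time_for_speedup(k):
    """Start time t0 in [0, 1) giving a k-fold step reduction, from k = 1 / (1 - t0)."""
    if k < 1:
        raise ValueError("speed-up factor must be at least 1")
    return 1.0 - 1.0 / k

def warm_steps(cold_steps, t0):
    """Steps left on [t0, 1] when [0, 1] is covered by cold_steps uniform steps."""
    return round(cold_steps * (1.0 - t0))

# Speed-up factors quoted in the results; 1000 cold steps is a hypothetical budget.
for k in (10, 5, 2, 1.5):
    t0 = start_time_for_speedup(k)
    print(f"x{k}: t0 = {t0:.3f}, 1000 cold steps -> {warm_steps(1000, t0)} warm steps")
```

The guarantee follows from the step count alone: once $t_0$ is fixed, the number of function evaluations is fixed, independent of the data.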

Implications and Future Directions

WS-FM provides a general mechanism for accelerating FM-based generative modeling, establishing a learning-to-refine paradigm that is compatible with both discrete and continuous domains. Practical implications include:

Reduced latency for LLM-based chatbots and real-time image generation/editing, enabling deployment on user devices with limited hardware resources.
Applicability in speech enhancement and voice conversion, facilitating real-time operation via rapid draft-to-target refinement.
Potential for improving FM efficiency in domains where expensive cold-start initialization imposes a computational bottleneck.

From a theoretical perspective, WS-FM underscores the importance of informative initial distributions in generative processes and offers a prescriptive approach to coupling in FM training. The performance gains observed counter canonical assumptions regarding the trade-offs in quality vs. speed for generative architectures.

Possible future developments include extending WS-FM to multi-modal and non-discrete domains, more sophisticated refinement procedures, and integration with scalable foundation models for draft generation.

Conclusion

Warm-Start Flow Matching (WS-FM) offers a guaranteed and empirically validated speed-up in flow matching-based text and image generation, by leveraging lightweight draft models and efficient refinement strategies without sacrificing generation quality. The method presents both practical and theoretical advances, enabling efficient deployment and highlighting the role of informative initialization and pairing in FM generative modeling (2603.19360).
