- The paper introduces Transition Matching as a unified paradigm that generalizes diffusion, flow matching, and autoregressive models for generative tasks.
- It presents three variants (DTM, ARTM, FHTM) that achieve state-of-the-art image quality while substantially reducing sampling steps, with DTM offering up to a 7x inference speedup.
- The framework’s flexible supervision and kernel design enable scalable, multimodal generation compatible with standard LLM architectures.
Transition Matching: A Unified and Scalable Paradigm for Generative Modeling
"Transition Matching: Scalable and Flexible Generative Modeling" (2506.23589) introduces Transition Matching (TM), a discrete-time, continuous-state generative modeling framework that unifies and generalizes diffusion, flow matching, and continuous autoregressive (AR) models. The paper systematically explores the design space of TM, presenting new algorithmic variants that achieve state-of-the-art performance in text-to-image generation, with significant improvements in sample quality, prompt alignment, and sampling efficiency.
Framework Overview
Transition Matching is formulated as learning a Markov process with parameterized transition kernels $p^{\theta}_{t+1\mid t}(x_{t+1}\mid x_t)$, mapping from a simple source distribution $p_0$ to a complex target distribution $p_T$ over $T$ discrete steps. The process is supervised by a flexible "supervising process" $q_{0,\dots,T}$, which can be arbitrarily chosen as long as its marginal at $t=0$ matches $p_0$. This generality enables the use of non-continuous, non-Gaussian, and highly expressive transition kernels, in contrast to the restrictive choices in standard diffusion or flow models.
The TM loss is defined as a divergence between the model transition kernel and the supervising process kernel, with the requirement that the divergence admits an empirical, sample-based form. This allows for efficient training via stochastic optimization.
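To make this training structure concrete, here is a minimal, self-contained sketch of the TM recipe under an assumed linear supervising process, with a toy diagonal-Gaussian transition kernel trained by negative log-likelihood as one possible sample-based divergence. The kernel choice, module names, and dimensions are illustrative assumptions; the paper's variants replace this kernel with the flow-matching and AR parameterizations described below.

```python
# Illustrative Transition Matching (TM) training skeleton; a sketch, not the authors' code.
# Structure: draw (X_t, X_{t+1}) pairs from a supervising process q, then minimize an
# empirical, sample-based divergence between the learned kernel and the supervising kernel.
import torch
import torch.nn as nn

class GaussianKernel(nn.Module):
    """Toy transition kernel p_theta(x_{t+1} | x_t, t) with a diagonal Gaussian (placeholder)."""
    def __init__(self, dim: int, n_steps: int):
        super().__init__()
        self.n_steps = n_steps
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(),
                                 nn.Linear(256, 2 * dim))

    def nll(self, x_t, x_next, t):
        # Negative log-likelihood: one empirical, per-sample form of the TM divergence.
        h = torch.cat([x_t, (t.float() / self.n_steps).unsqueeze(-1)], dim=-1)
        mean, log_var = self.net(h).chunk(2, dim=-1)
        return 0.5 * ((x_next - mean) ** 2 / log_var.exp() + log_var).sum(-1).mean()

def supervising_pair(x_0, x_T, t, n_steps):
    # Linear supervising process: X_s = (s/T) X_T + (1 - s/T) X_0, evaluated at s = t and s = t + 1.
    a = (t.float() / n_steps).unsqueeze(-1)
    b = ((t.float() + 1) / n_steps).unsqueeze(-1)
    return a * x_T + (1 - a) * x_0, b * x_T + (1 - b) * x_0

T, dim, batch = 16, 8, 64
kernel = GaussianKernel(dim, T)
opt = torch.optim.Adam(kernel.parameters(), lr=1e-3)
for _ in range(200):
    x_T = torch.randn(batch, dim) * 2.0 + 1.0   # stand-in for target samples (p_T)
    x_0 = torch.randn(batch, dim)               # source samples (p_0)
    t = torch.randint(0, T, (batch,))
    x_t, x_next = supervising_pair(x_0, x_T, t, T)
    loss = kernel.nll(x_t, x_next, t)
    opt.zero_grad(); loss.backward(); opt.step()
```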
Key TM Variants
The paper introduces three principal TM instantiations, each corresponding to a different modeling paradigm:
1. Difference Transition Matching (DTM)
- Supervising Process: Standard linear (conditional optimal transport) process.
- Parameterization: The model predicts the difference $Y = X_T - X_0$ given $X_t$.
- Modeling: The transition kernel is learned via flow matching over $Y$, modeling its full conditional distribution rather than only a point estimate.
- Architecture: A large backbone (e.g., DiT) produces token-wise features, with a lightweight head generating all tokens in parallel.
DTM generalizes flow matching to discrete time and stochastic transitions. Notably, as the number of steps increases, DTM converges to deterministic flow matching, but with finite steps, its stochastic kernel provides greater expressiveness. Empirically, DTM achieves superior image quality and prompt alignment compared to flow matching, while requiring significantly fewer sampling steps (e.g., 32 vs. 128), resulting in a 7x speedup in inference.
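Below is a minimal sketch of a DTM training step based on the description above: a backbone encodes $X_t$, and a lightweight flow-matching head models the conditional distribution of the difference $Y = X_T - X_0$. The MLP modules, dimensions, and single-vector (non-tokenized) state are simplifying assumptions, not the paper's architecture.

```python
# Minimal Difference Transition Matching (DTM) training step; an illustrative sketch,
# not the released implementation. `backbone` and `head` are placeholder modules.
import torch
import torch.nn as nn

dim, T = 8, 32
backbone = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU())          # stand-in for the DiT backbone
head = nn.Sequential(nn.Linear(256 + dim + 1, 256), nn.SiLU(),
                     nn.Linear(256, dim))                              # lightweight flow-matching head
opt = torch.optim.Adam(list(backbone.parameters()) + list(head.parameters()), lr=1e-3)

def dtm_loss(x_data, x_0, t):
    x_t = (t / T) * x_data + (1 - t / T) * x_0        # linear (conditional OT) supervising process
    y = x_data - x_0                                   # difference target Y = X_T - X_0
    feat = backbone(torch.cat([x_t, t / T], dim=-1))
    # Flow matching over Y: regress the velocity of the linear path z -> y, conditioned on X_t.
    s = torch.rand(x_data.shape[0], 1)
    z = torch.randn_like(y)
    y_s = s * y + (1 - s) * z
    v_pred = head(torch.cat([feat, y_s, s], dim=-1))
    return ((v_pred - (y - z)) ** 2).mean()

for _ in range(200):
    x_data = torch.randn(64, dim) * 2 + 1              # stand-in for data samples (X_T)
    x_0 = torch.randn(64, dim)                         # source noise (X_0)
    t = torch.randint(0, T, (64, 1)).float()
    loss = dtm_loss(x_data, x_0, t)
    opt.zero_grad(); loss.backward(); opt.step()
```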
2. Autoregressive Transition Matching (ARTM)
- Supervising Process: Independent linear process, in which a fresh source sample $X_{0,t}$ is drawn independently at each time step.
- Parameterization: The model predicts $Y = X_{t+1}$ autoregressively, i.e., each token $Y^i$ is conditioned on $X_t$ and the previous tokens $Y^{<i}$.
- Modeling: The transition kernel is modeled as an AR process, trained via flow matching.
- Architecture: Similar to DTM, but the head operates autoregressively over tokens.
ARTM extends continuous AR modeling to multi-step transitions, overcoming the limitations of single-step AR diffusion. The independent linear supervising process is critical for regularizing the conditionals and avoiding degenerate solutions.
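The sketch below illustrates my reading of the ARTM objective from this summary: an independent linear supervising process (fresh source noise per step) combined with a per-token flow-matching loss, teacher-forced on the ground-truth previous tokens of $Y = X_{t+1}$. The cumulative "prefix summary" standing in for the AR head, and all module shapes, are crude placeholder assumptions.

```python
# Rough Autoregressive Transition Matching (ARTM) loss sketch; not the paper's code.
import torch
import torch.nn as nn

N, d, T = 16, 8, 8                       # tokens per image, token dim, TM steps
backbone = nn.Linear(d + 1, 64)          # per-token features of X_t (placeholder for the backbone)
head = nn.Sequential(nn.Linear(64 + d + d + 1, 128), nn.SiLU(), nn.Linear(128, d))
opt = torch.optim.Adam(list(backbone.parameters()) + list(head.parameters()), lr=1e-3)

def artm_loss(x_data, t):
    B = x_data.shape[0]
    # Independent linear supervising process: X_t and X_{t+1} use independent draws of the source.
    x_t = (t / T) * x_data + (1 - t / T) * torch.randn_like(x_data)
    y = ((t + 1) / T) * x_data + (1 - (t + 1) / T) * torch.randn_like(x_data)   # Y = X_{t+1}
    feat = backbone(torch.cat([x_t, (t / T).expand(B, N, 1)], dim=-1))
    # Causal running mean of ground-truth previous tokens (teacher forcing);
    # a deliberately crude stand-in for the AR head's conditioning on Y^{<i}.
    prefix = torch.cumsum(y, dim=1).roll(1, dims=1)
    prefix[:, 0] = 0.0
    denom = torch.arange(N).clamp(min=1).view(1, N, 1)
    prefix = prefix / denom
    # Per-token flow matching on Y^i, conditioned on X_t features and the causal prefix.
    s = torch.rand(B, N, 1)
    z = torch.randn_like(y)
    y_s = s * y + (1 - s) * z
    v_pred = head(torch.cat([feat, prefix, y_s, s], dim=-1))
    return ((v_pred - (y - z)) ** 2).mean()

for _ in range(100):
    x_data = torch.randn(32, N, d)                   # stand-in for tokenized data (X_T)
    t = torch.randint(0, T, (32, 1, 1)).float()
    loss = artm_loss(x_data, t)
    opt.zero_grad(); loss.backward(); opt.step()
```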
3. Full History Transition Matching (FHTM)
- Supervising Process: The full history of states $(X_0, \dots, X_t)$ from the independent linear process.
- Parameterization: The model predicts $Y = X_{t+1}$ autoregressively, conditioned on the entire state history and the previously generated tokens.
- Modeling: Fully causal AR kernel, enabling teacher-forcing during training.
- Architecture: Fully causal transformer, compatible with standard LLM architectures.
FHTM is the first fully causal continuous-state generative model to match or surpass non-causal flow-based methods in image generation quality. Its causal structure allows seamless integration with multimodal LLMs, facilitating unified text and image generation.
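A rough illustration of how FHTM's causal structure permits LLM-style teacher forcing is given below: all states from the independent linear process are flattened into one token sequence, a causal transformer produces features, and each position's feature conditions a small flow-matching head on the following token. This layout, including supervising every next-token position, is an assumed simplification rather than the released implementation.

```python
# Full History Transition Matching (FHTM) training sketch; an illustrative reconstruction.
import torch
import torch.nn as nn

N, d, T, D = 4, 8, 4, 64                 # tokens per state, token dim, TM steps, model width
embed = nn.Linear(d, D)
enc_layer = nn.TransformerEncoderLayer(D, nhead=4, dim_feedforward=128, batch_first=True)
trunk = nn.TransformerEncoder(enc_layer, num_layers=2)   # stand-in for an LLM-style causal stack
head = nn.Sequential(nn.Linear(D + d + 1, 128), nn.SiLU(), nn.Linear(128, d))
params = list(embed.parameters()) + list(trunk.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

def fhtm_loss(x_data):
    B = x_data.shape[0]
    # Independent linear process: fresh source noise at every step; states X_0..X_T, each N tokens.
    states = [(t / T) * x_data + (1 - t / T) * torch.randn_like(x_data) for t in range(T + 1)]
    seq = torch.cat(states, dim=1)                         # (B, (T+1)*N, d)
    L = seq.shape[1]
    causal_mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
    feat = trunk(embed(seq), mask=causal_mask)             # fully causal features over the history
    # Teacher forcing: the feature at position j conditions a flow-matching head on the token
    # at position j + 1 (for simplicity, every next-token position is supervised here).
    cond, target = feat[:, :-1], seq[:, 1:]
    s = torch.rand(B, L - 1, 1)
    z = torch.randn_like(target)
    y_s = s * target + (1 - s) * z
    v_pred = head(torch.cat([cond, y_s, s], dim=-1))
    return ((v_pred - (target - z)) ** 2).mean()

for _ in range(50):
    x_data = torch.randn(8, N, d)                          # stand-in for tokenized data (X_T)
    loss = fhtm_loss(x_data)
    opt.zero_grad(); loss.backward(); opt.step()
```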
Empirical Results
The paper conducts large-scale, controlled experiments on text-to-image generation using a fixed DiT backbone (1.7B parameters), identical data (350M Shutterstock image-caption pairs), and consistent training hyperparameters. Evaluation is performed on PartiPrompts and MS-COCO, using metrics such as CLIPScore, PickScore, ImageReward, UnifiedReward, Aesthetics, and DeQA.
Key findings:
- DTM achieves the best overall performance across most metrics, with strong prompt alignment and image quality, and a substantial reduction in sampling steps and inference time compared to flow matching.
- ARTM and FHTM (with 3 TM steps) match or exceed the performance of non-causal methods on prompt alignment and image quality, with FHTM enabling fully causal generation.
- FHTM with LLM architecture matches or surpasses DiT-based models, demonstrating the viability of integrating TM with standard LLM architectures.
- Ablations show that the independent linear supervising process is essential for AR kernels, and that DTM's performance saturates with relatively few transition and head steps.
Implementation Considerations
- Sampling Efficiency: DTM requires only 16–32 backbone forward passes for high-quality generation, compared to 128 for flow matching, yielding significant speedups (see the sampling sketch after this list).
- Kernel Expressiveness: DTM's parallel token generation is efficient but may limit expressiveness; larger patch sizes can improve performance for few steps.
- Causal Modeling: FHTM's causal structure is compatible with LLMs, enabling unified multimodal generation and reasoning.
- Scalability: All TM variants are demonstrated at scale, with robust performance across large datasets and model sizes.
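The hypothetical sampling loop below makes the cost split from the Sampling Efficiency item explicit: one backbone pass per TM step, plus a few cheap Euler steps of the flow-matching head to draw $Y$, followed by the update $X_{t+1} = X_t + Y/T$ implied by the linear process. The step counts, Euler integration, and module shapes are assumptions carried over from the earlier DTM sketch, not the paper's settings.

```python
# Hypothetical DTM sampling loop; untrained placeholder modules, illustrative only.
import torch
import torch.nn as nn

dim, T, head_steps = 8, 32, 4
backbone = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU())
head = nn.Sequential(nn.Linear(256 + dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

@torch.no_grad()
def dtm_sample(batch: int = 16) -> torch.Tensor:
    x = torch.randn(batch, dim)                            # X_0 ~ p_0
    for t in range(T):                                     # one backbone pass per TM step
        t_feat = torch.full((batch, 1), t / T)
        feat = backbone(torch.cat([x, t_feat], dim=-1))
        # Sample Y ~ p_theta(Y | X_t) by integrating the head's flow with a few Euler steps.
        y = torch.randn(batch, dim)
        for k in range(head_steps):
            s = torch.full((batch, 1), k / head_steps)
            v = head(torch.cat([feat, y, s], dim=-1))
            y = y + v / head_steps
        x = x + y / T                                      # X_{t+1} = X_t + Y / T
    return x

samples = dtm_sample()
```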
Theoretical and Practical Implications
Transition Matching provides a unifying framework that subsumes diffusion, flow, and AR models as special cases, while enabling new, more expressive generative processes. The flexibility in supervising process, kernel parameterization, and modeling paradigm opens avenues for designing tailored generative models for diverse modalities and tasks.
Practically, TM enables:
- Faster and higher-quality image generation with fewer sampling steps.
- Causal, autoregressive generation in continuous state spaces, facilitating integration with LLMs and multimodal systems.
- Flexible supervision and kernel design, allowing adaptation to new data types and tasks.
Future Directions
Potential research directions include:
- Improved time schedulers and distillation for further acceleration and quality gains.
- Multimodal integration, leveraging FHTM's compatibility with LLMs for unified text, image, and audio generation.
- Extension to other domains, such as video, audio, and structured data, exploiting TM's generality.
- Exploration of alternative supervising processes and kernel parameterizations for domain-specific generative modeling.
Conclusion
Transition Matching establishes a principled, scalable, and flexible foundation for generative modeling, bridging and extending the capabilities of diffusion, flow, and autoregressive models. Its empirical and theoretical contributions provide a robust basis for future advances in high-fidelity, efficient, and unified generative systems.