- The paper introduces Transition Matching as a unified paradigm that generalizes diffusion, flow matching, and autoregressive models for generative tasks.
- It presents three variants (DTM, ARTM, FHTM) that achieve state-of-the-art image quality while substantially reducing sampling steps, with DTM offering up to a 7x inference speedup.
- The framework’s flexible supervision and kernel design enable scalable, multimodal generation compatible with standard LLM architectures.
Transition Matching: A Unified and Scalable Paradigm for Generative Modeling
"Transition Matching: Scalable and Flexible Generative Modeling" (2506.23589) introduces Transition Matching (TM), a discrete-time, continuous-state generative modeling framework that unifies and generalizes diffusion, flow matching, and continuous autoregressive (AR) models. The paper systematically explores the design space of TM, presenting new algorithmic variants that achieve state-of-the-art performance in text-to-image generation, with significant improvements in sample quality, prompt alignment, and sampling efficiency.
Framework Overview
Transition Matching is formulated as learning a Markov process with parameterized transition kernels $p^{\theta}_{t+1\mid t}(x_{t+1}\mid x_t)$, mapping from a simple source distribution $p_0$ to a complex target distribution $p_T$ over $T$ discrete steps. The process is supervised by a flexible "supervising process" $q_{0,\dots,T}$, which can be arbitrarily chosen as long as its marginal at $t=0$ matches $p_0$. This generality enables the use of non-continuous, non-Gaussian, and highly expressive transition kernels, in contrast to the restrictive choices in standard diffusion or flow models.
The TM loss is defined as a divergence between the model transition kernel and the supervising process kernel, with the requirement that the divergence admits an empirical, sample-based form. This allows for efficient training via stochastic optimization.
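To make this training structure concrete, here is a minimal, self-contained sketch of the TM recipe under an assumed linear supervising process, with a toy diagonal-Gaussian transition kernel trained by negative log-likelihood as one possible sample-based divergence. The kernel choice, module names, and dimensions are illustrative assumptions; the paper's variants replace this kernel with the flow-matching and AR parameterizations described below.

```python
# Illustrative Transition Matching (TM) training skeleton; a sketch, not the authors' code.
# Structure: draw (X_t, X_{t+1}) pairs from a supervising process q, then minimize an
# empirical, sample-based divergence between the learned kernel and the supervising kernel.
import torch
import torch.nn as nn

class GaussianKernel(nn.Module):
    """Toy transition kernel p_theta(x_{t+1} | x_t, t) with a diagonal Gaussian (placeholder)."""
    def __init__(self, dim: int, n_steps: int):
        super().__init__()
        self.n_steps = n_steps
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(),
                                 nn.Linear(256, 2 * dim))

    def nll(self, x_t, x_next, t):
        # Negative log-likelihood: one empirical, per-sample form of the TM divergence.
        h = torch.cat([x_t, (t.float() / self.n_steps).unsqueeze(-1)], dim=-1)
        mean, log_var = self.net(h).chunk(2, dim=-1)
        return 0.5 * ((x_next - mean) ** 2 / log_var.exp() + log_var).sum(-1).mean()

def supervising_pair(x_0, x_T, t, n_steps):
    # Linear supervising process: X_s = (s/T) X_T + (1 - s/T) X_0, evaluated at s = t and s = t + 1.
    a = (t.float() / n_steps).unsqueeze(-1)
    b = ((t.float() + 1) / n_steps).unsqueeze(-1)
    return a * x_T + (1 - a) * x_0, b * x_T + (1 - b) * x_0

T, dim, batch = 16, 8, 64
kernel = GaussianKernel(dim, T)
opt = torch.optim.Adam(kernel.parameters(), lr=1e-3)
for _ in range(200):
    x_T = torch.randn(batch, dim) * 2.0 + 1.0   # stand-in for target samples (p_T)
    x_0 = torch.randn(batch, dim)               # source samples (p_0)
    t = torch.randint(0, T, (batch,))
    x_t, x_next = supervising_pair(x_0, x_T, t, T)
    loss = kernel.nll(x_t, x_next, t)
    opt.zero_grad(); loss.backward(); opt.step()
```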
Key TM Variants
The paper introduces three principal TM instantiations, each corresponding to a different modeling paradigm:
1. Difference Transition Matching (DTM)
- Supervising Process: Standard linear (conditional optimal transport) process.
- Parameterization: The model predicts the difference $Y = X_T - X_0$ given $X_t$.
- Modeling: The transition kernel is learned via flow matching over $Y$, modeling its full conditional distribution rather than only a point estimate.
- Architecture: A large backbone (e.g., DiT) produces token-wise features, with a lightweight head generating all tokens in parallel.
DTM generalizes flow matching to discrete time and stochastic transitions. Notably, as the number of steps increases, DTM converges to deterministic flow matching, but with finite steps, its stochastic kernel provides greater expressiveness. Empirically, DTM achieves superior image quality and prompt alignment compared to flow matching, while requiring significantly fewer sampling steps (e.g., 32 vs. 128), resulting in a 7x speedup in inference.
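Below is a minimal sketch of a DTM training step based on the description above: a backbone encodes $X_t$, and a lightweight flow-matching head models the conditional distribution of the difference $Y = X_T - X_0$. The MLP modules, dimensions, and single-vector (non-tokenized) state are simplifying assumptions, not the paper's architecture.

```python
# Minimal Difference Transition Matching (DTM) training step; an illustrative sketch,
# not the released implementation. `backbone` and `head` are placeholder modules.
import torch
import torch.nn as nn

dim, T = 8, 32
backbone = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU())          # stand-in for the DiT backbone
head = nn.Sequential(nn.Linear(256 + dim + 1, 256), nn.SiLU(),
                     nn.Linear(256, dim))                              # lightweight flow-matching head
opt = torch.optim.Adam(list(backbone.parameters()) + list(head.parameters()), lr=1e-3)

def dtm_loss(x_data, x_0, t):
    x_t = (t / T) * x_data + (1 - t / T) * x_0        # linear (conditional OT) supervising process
    y = x_data - x_0                                   # difference target Y = X_T - X_0
    feat = backbone(torch.cat([x_t, t / T], dim=-1))
    # Flow matching over Y: regress the velocity of the linear path z -> y, conditioned on X_t.
    s = torch.rand(x_data.shape[0], 1)
    z = torch.randn_like(y)
    y_s = s * y + (1 - s) * z
    v_pred = head(torch.cat([feat, y_s, s], dim=-1))
    return ((v_pred - (y - z)) ** 2).mean()

for _ in range(200):
    x_data = torch.randn(64, dim) * 2 + 1              # stand-in for data samples (X_T)
    x_0 = torch.randn(64, dim)                         # source noise (X_0)
    t = torch.randint(0, T, (64, 1)).float()
    loss = dtm_loss(x_data, x_0, t)
    opt.zero_grad(); loss.backward(); opt.step()
```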
2. Autoregressive Transition Matching (ARTM)
- Supervising Process: Independent linear process, in which a fresh source sample $X_{0,t}$ is drawn independently at each time step.
- Parameterization: The model predicts $Y = X_{t+1}$ autoregressively, i.e., each token $Y^i$ is conditioned on $X_t$ and the previous tokens $Y^{<i}$.
- Modeling: The transition kernel is modeled as an AR process, trained via flow matching.
- Architecture: Similar to DTM, but the head operates autoregressively over tokens.
ARTM extends continuous AR modeling to multi-step transitions, overcoming the limitations of single-step AR diffusion. The independent linear supervising process is critical for regularizing the conditionals and avoiding degenerate solutions.
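The sketch below illustrates my reading of the ARTM objective from this summary: an independent linear supervising process (fresh source noise per step) combined with a per-token flow-matching loss, teacher-forced on the ground-truth previous tokens of $Y = X_{t+1}$. The cumulative "prefix summary" standing in for the AR head, and all module shapes, are crude placeholder assumptions.

```python
# Rough Autoregressive Transition Matching (ARTM) loss sketch; not the paper's code.
import torch
import torch.nn as nn

N, d, T = 16, 8, 8                       # tokens per image, token dim, TM steps
backbone = nn.Linear(d + 1, 64)          # per-token features of X_t (placeholder for the backbone)
head = nn.Sequential(nn.Linear(64 + d + d + 1, 128), nn.SiLU(), nn.Linear(128, d))
opt = torch.optim.Adam(list(backbone.parameters()) + list(head.parameters()), lr=1e-3)

def artm_loss(x_data, t):
    B = x_data.shape[0]
    # Independent linear supervising process: X_t and X_{t+1} use independent draws of the source.
    x_t = (t / T) * x_data + (1 - t / T) * torch.randn_like(x_data)
    y = ((t + 1) / T) * x_data + (1 - (t + 1) / T) * torch.randn_like(x_data)   # Y = X_{t+1}
    feat = backbone(torch.cat([x_t, (t / T).expand(B, N, 1)], dim=-1))
    # Causal running mean of ground-truth previous tokens (teacher forcing);
    # a deliberately crude stand-in for the AR head's conditioning on Y^{<i}.
    prefix = torch.cumsum(y, dim=1).roll(1, dims=1)
    prefix[:, 0] = 0.0
    denom = torch.arange(N).clamp(min=1).view(1, N, 1)
    prefix = prefix / denom
    # Per-token flow matching on Y^i, conditioned on X_t features and the causal prefix.
    s = torch.rand(B, N, 1)
    z = torch.randn_like(y)
    y_s = s * y + (1 - s) * z
    v_pred = head(torch.cat([feat, prefix, y_s, s], dim=-1))
    return ((v_pred - (y - z)) ** 2).mean()

for _ in range(100):
    x_data = torch.randn(32, N, d)                   # stand-in for tokenized data (X_T)
    t = torch.randint(0, T, (32, 1, 1)).float()
    loss = artm_loss(x_data, t)
    opt.zero_grad(); loss.backward(); opt.step()
```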
3. Full History Transition Matching (FHTM)
- Supervising Process: The full history of states $(X_0, \dots, X_t)$ from the independent linear process.
- Parameterization: The model predicts $Y = X_{t+1}$ autoregressively, conditioned on the entire state history and the previously generated tokens.
- Modeling: Fully causal AR kernel, enabling teacher-forcing during training.
- Architecture: Fully causal transformer, compatible with standard LLM architectures.
FHTM is the first fully causal continuous-state generative model to match or surpass non-causal flow-based methods in image generation quality. Its causal structure allows seamless integration with multimodal LLMs, facilitating unified text and image generation.
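A rough illustration of how FHTM's causal structure permits LLM-style teacher forcing is given below: all states from the independent linear process are flattened into one token sequence, a causal transformer produces features, and each position's feature conditions a small flow-matching head on the following token. This layout, including supervising every next-token position, is an assumed simplification rather than the released implementation.

```python
# Full History Transition Matching (FHTM) training sketch; an illustrative reconstruction.
import torch
import torch.nn as nn

N, d, T, D = 4, 8, 4, 64                 # tokens per state, token dim, TM steps, model width
embed = nn.Linear(d, D)
enc_layer = nn.TransformerEncoderLayer(D, nhead=4, dim_feedforward=128, batch_first=True)
trunk = nn.TransformerEncoder(enc_layer, num_layers=2)   # stand-in for an LLM-style causal stack
head = nn.Sequential(nn.Linear(D + d + 1, 128), nn.SiLU(), nn.Linear(128, d))
params = list(embed.parameters()) + list(trunk.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

def fhtm_loss(x_data):
    B = x_data.shape[0]
    # Independent linear process: fresh source noise at every step; states X_0..X_T, each N tokens.
    states = [(t / T) * x_data + (1 - t / T) * torch.randn_like(x_data) for t in range(T + 1)]
    seq = torch.cat(states, dim=1)                         # (B, (T+1)*N, d)
    L = seq.shape[1]
    causal_mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
    feat = trunk(embed(seq), mask=causal_mask)             # fully causal features over the history
    # Teacher forcing: the feature at position j conditions a flow-matching head on the token
    # at position j + 1 (for simplicity, every next-token position is supervised here).
    cond, target = feat[:, :-1], seq[:, 1:]
    s = torch.rand(B, L - 1, 1)
    z = torch.randn_like(target)
    y_s = s * target + (1 - s) * z
    v_pred = head(torch.cat([cond, y_s, s], dim=-1))
    return ((v_pred - (target - z)) ** 2).mean()

for _ in range(50):
    x_data = torch.randn(8, N, d)                          # stand-in for tokenized data (X_T)
    loss = fhtm_loss(x_data)
    opt.zero_grad(); loss.backward(); opt.step()
```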
Empirical Results
The paper conducts large-scale, controlled experiments on text-to-image generation using a fixed DiT backbone (1.7B parameters), identical data (350M Shutterstock image-caption pairs), and consistent training hyperparameters. Evaluation is performed on PartiPrompts and MS-COCO, using metrics such as CLIPScore, PickScore, ImageReward, UnifiedReward, Aesthetics, and DeQA.
Key findings:
- DTM achieves the best overall performance across most metrics, with strong prompt alignment and image quality, and a substantial reduction in sampling steps and inference time compared to flow matching.
- ARTM and FHTM (with 3 TM steps) match or exceed the performance of non-causal methods on prompt alignment and image quality, with FHTM enabling fully causal generation.
- FHTM with LLM architecture matches or surpasses DiT-based models, demonstrating the viability of integrating TM with standard LLM architectures.
- Ablations show that the independent linear supervising process is essential for AR kernels, and that DTM's performance saturates with relatively few transition and head steps.
Implementation Considerations
- Sampling Efficiency: DTM requires only 16–32 backbone forward passes for high-quality generation, compared to 128 for flow matching, yielding significant speedups (see the sampling sketch after this list).
- Kernel Expressiveness: DTM's parallel token generation is efficient but may limit expressiveness; larger patch sizes can improve performance for few steps.
- Causal Modeling: FHTM's causal structure is compatible with LLMs, enabling unified multimodal generation and reasoning.
- Scalability: All TM variants are demonstrated at scale, with robust performance across large datasets and model sizes.
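The hypothetical sampling loop below makes the cost split from the Sampling Efficiency item explicit: one backbone pass per TM step, plus a few cheap Euler steps of the flow-matching head to draw $Y$, followed by the update $X_{t+1} = X_t + Y/T$ implied by the linear process. The step counts, Euler integration, and module shapes are assumptions carried over from the earlier DTM sketch, not the paper's settings.

```python
# Hypothetical DTM sampling loop; untrained placeholder modules, illustrative only.
import torch
import torch.nn as nn

dim, T, head_steps = 8, 32, 4
backbone = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU())
head = nn.Sequential(nn.Linear(256 + dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

@torch.no_grad()
def dtm_sample(batch: int = 16) -> torch.Tensor:
    x = torch.randn(batch, dim)                            # X_0 ~ p_0
    for t in range(T):                                     # one backbone pass per TM step
        t_feat = torch.full((batch, 1), t / T)
        feat = backbone(torch.cat([x, t_feat], dim=-1))
        # Sample Y ~ p_theta(Y | X_t) by integrating the head's flow with a few Euler steps.
        y = torch.randn(batch, dim)
        for k in range(head_steps):
            s = torch.full((batch, 1), k / head_steps)
            v = head(torch.cat([feat, y, s], dim=-1))
            y = y + v / head_steps
        x = x + y / T                                      # X_{t+1} = X_t + Y / T
    return x

samples = dtm_sample()
```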
Theoretical and Practical Implications
Transition Matching provides a unifying framework that subsumes diffusion, flow, and AR models as special cases, while enabling new, more expressive generative processes. The flexibility in supervising process, kernel parameterization, and modeling paradigm opens avenues for designing tailored generative models for diverse modalities and tasks.
Practically, TM enables:
- Faster and higher-quality image generation with fewer sampling steps.
- Causal, autoregressive generation in continuous state spaces, facilitating integration with LLMs and multimodal systems.
- Flexible supervision and kernel design, allowing adaptation to new data types and tasks.
Future Directions
Potential research directions include:
- Improved time schedulers and distillation for further acceleration and quality gains.
- Multimodal integration, leveraging FHTM's compatibility with LLMs for unified text, image, and audio generation.
- Extension to other domains, such as video, audio, and structured data, exploiting TM's generality.
- Exploration of alternative supervising processes and kernel parameterizations for domain-specific generative modeling.
Conclusion
Transition Matching establishes a principled, scalable, and flexible foundation for generative modeling, bridging and extending the capabilities of diffusion, flow, and autoregressive models. Its empirical and theoretical contributions provide a robust basis for future advances in high-fidelity, efficient, and unified generative systems.