- The paper introduces Transition Matching, a unified discrete-time framework that generalizes diffusion, flow matching, and continuous autoregressive generative models.
- It presents three variants—Difference, Autoregressive, and Full History Transition Matching—that explore different transition-kernel parameterizations and supervising processes to improve generation quality and sampling speed.
- Empirical results on large-scale text-to-image tasks show significant speedups and superior prompt alignment, highlighting the framework’s practical scalability.
Transition Matching: A Unified and Scalable Paradigm for Generative Modeling
The paper "Transition Matching: Scalable and Flexible Generative Modeling" (2506.23589) introduces Transition Matching (TM), a discrete-time, continuous-state generative modeling framework that unifies and generalizes diffusion, flow matching, and continuous autoregressive (AR) models. TM is designed to address limitations in the design space of current generative models, offering new flexibility in kernel parameterization, supervision processes, and modeling paradigms. The authors present three concrete TM variants—Difference Transition Matching (DTM), Autoregressive Transition Matching (ARTM), and Full History Transition Matching (FHTM)—and demonstrate their effectiveness on large-scale text-to-image generation tasks.
Framework Overview
Transition Matching models the data generation process as a Markov chain, parameterized by a transition kernel $p^{\theta}_{t+1\mid t}(x_{t+1}\mid x_t)$, which maps an initial easy-to-sample distribution $p_0$ to a target data distribution $p_T$ over $T$ discrete steps. The key innovation is the use of a flexible supervising process $q_{0,\dots,T}$, which allows for arbitrary (including non-continuous) processes, and the ability to learn expressive, potentially non-deterministic transition kernels.
The training objective is to match the model's transition kernel to the supervising process's conditional transitions, using a divergence D that admits an empirical (sample-based) form. This enables efficient training via stochastic optimization, even when the true conditional distributions are intractable.
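In this notation, the objective can be written schematically as a per-step divergence between the supervising transitions and the learned kernel; the display below is a sketch consistent with the description above rather than the paper's exact formulation:

$$
\mathcal{L}(\theta) \;=\; \sum_{t=0}^{T-1} \mathbb{E}_{x_t \sim q_t}\, D\big(\, q_{t+1\mid t}(\cdot \mid x_t),\; p^{\theta}_{t+1\mid t}(\cdot \mid x_t) \,\big),
$$

where $q_{t+1\mid t}$ is the conditional transition of the supervising process and $q_t$ its time-$t$ marginal. Because $D$ admits an empirical form, each term can be estimated from sampled pairs $(x_t, x_{t+1}) \sim q$ without evaluating $q_{t+1\mid t}$ itself.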
TM Variants and Design Choices
The paper systematically explores the TM design space along three axes: the supervising process, kernel parameterization, and kernel modeling. The main variants are:
- Difference Transition Matching (DTM): Generalizes flow matching to discrete time by learning the full transition distribution of the difference $Y = X_T - X_0$ given $X_t$. DTM uses a linear supervising process (conditional optimal transport) and a parallelized architecture for efficient sampling (see the sampling sketch below). Notably, DTM's expected transition coincides with the Euler step of flow matching, and as the number of steps increases, DTM converges to deterministic flow matching.
- Autoregressive Transition Matching (ARTM): Extends continuous AR models by using an independent linear supervising process, where each $X_t$ is constructed from independent noise and the target $X_T$. The transition kernel is modeled autoregressively over image tokens, allowing for causal generation and integration with text AR models.
- Full History Transition Matching (FHTM): Further generalizes ARTM by conditioning each transition on the full history of previous states, enabling full teacher-forcing during training. FHTM is the first fully causal model to match or surpass flow-based methods in continuous domains.
The authors provide detailed algorithmic descriptions and code-level pseudocode for each variant, facilitating practical implementation.
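As a rough illustration of the DTM transition described above (not the paper's pseudocode), a minimal outer sampling loop might look as follows; the helper that samples the per-step difference from the lightweight flow head, along with all names and signatures, is an assumption for exposition:

```python
import torch

def sample_difference(flow_head, x, t_frac, inner_steps=8):
    """Draw Y ~ p_theta(Y | x_t) by integrating a small flow in the head's own
    time from Gaussian noise to a difference sample. `flow_head` is a placeholder
    for a module returning a velocity given (y, s, conditioning state, outer time)."""
    y = torch.randn_like(x)
    for k in range(inner_steps):
        s = k / inner_steps
        y = y + flow_head(y, s, x, t_frac) / inner_steps  # Euler step in head time
    return y

def dtm_sample(flow_head, x0, num_steps=32):
    """Outer DTM loop: under the linear supervising process,
    X_{t+1} - X_t = (X_T - X_0) / T, so each step adds a sampled difference / T."""
    x = x0
    T = num_steps
    for t in range(T):
        y = sample_difference(flow_head, x, t / T)
        x = x + y / T
    return x

# Toy usage with a stand-in velocity field (the real head is conditioned on DiT features).
toy_head = lambda y, s, x, t: -y
out = dtm_sample(toy_head, torch.randn(2, 3, 8, 8))
```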
Empirical Results
Extensive experiments are conducted on large-scale text-to-image generation benchmarks (PartiPrompts, MS-COCO, GenEval), using a fixed DiT backbone (1.7B parameters) and standardized training protocols. Key findings include:
- DTM achieves state-of-the-art image quality and text alignment, outperforming both flow matching and AR baselines across most metrics, while requiring significantly fewer sampling steps (e.g., 32 vs. 256 for FM), resulting in a 7x speedup in inference time.
- ARTM and FHTM match or exceed the performance of non-causal methods in terms of image quality and prompt adherence, with FHTM enabling seamless integration with LLM architectures for multimodal generation.
- Kernel expressiveness and supervising process design are critical: The independent linear process is essential for ARTM/FHTM performance, and increasing the expressiveness of the DTM kernel (e.g., larger patch sizes) improves performance for low step counts.
- Sampling efficiency: DTM allows for aggressive reduction in the number of backbone and head evaluations without significant loss in quality, a property not shared by ARTM/FHTM due to their sequential nature.
Implementation Considerations
- Architecture: All variants leverage a DiT backbone with a lightweight flow head for velocity prediction. For ARTM/FHTM, the head operates autoregressively over tokens, while DTM uses parallel prediction for efficiency.
- Training: The loss is based on flow matching, with the empirical divergence computed from sample pairs drawn from the supervising process. Classifier-free guidance is supported for improved text alignment (a generic sketch follows after this list).
- Sampling: DTM supports efficient parallel sampling, while ARTM/FHTM require sequential token generation, impacting inference speed.
- Resource Requirements: Large-scale training (350M image-caption pairs, 1.7B parameter models) is necessary to achieve reported results. However, the modularity of TM allows adaptation to smaller scales or other modalities.
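For the classifier-free guidance mentioned above, a generic recipe applied at the level of the head's predicted velocity is sketched below; the guidance formula is the standard conditional/unconditional extrapolation, and the function and argument names are illustrative rather than taken from the paper:

```python
def cfg_velocity(flow_head, y, s, feats_text, feats_null, guidance_scale=5.0):
    """Classifier-free guidance on the flow-head velocity.

    `flow_head(y, s, feats)` is a placeholder for the head conditioned on backbone
    features; `feats_text` / `feats_null` come from text-conditioned and
    null-prompt passes, respectively."""
    v_cond = flow_head(y, s, feats_text)    # text-conditioned velocity
    v_uncond = flow_head(y, s, feats_null)  # unconditional (null-prompt) velocity
    # Extrapolate toward the conditional prediction; scale = 1 recovers v_cond.
    return v_uncond + guidance_scale * (v_cond - v_uncond)

# Toy usage with scalar stand-ins for tensors and features.
toy_head = lambda y, s, feats: 0.5 * y + feats
v = cfg_velocity(toy_head, y=1.0, s=0.25, feats_text=0.2, feats_null=0.0)
```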
Theoretical and Practical Implications
Transition Matching provides a unifying framework that subsumes diffusion, flow, and AR models as special cases, offering a principled approach to designing new generative models with tailored trade-offs between quality, speed, and causality. The empirical results demonstrate that more expressive transition kernels and flexible supervision processes can yield substantial gains in both sample quality and efficiency.
The causal TM variants (ARTM, FHTM) open the door to fully integrated multimodal generative models, where text and image generation can be handled within a single, unified AR framework. This is particularly relevant for next-generation foundation models that require seamless cross-modal reasoning and generation.
Future Directions
Potential avenues for further research include:
- Time scheduling and distillation: Optimizing the transition schedule or distilling multi-step TM models into single-step or fewer-step models for even faster inference.
- Multimodal integration: Leveraging FHTM within large-scale multimodal systems, enabling joint text, image, and potentially audio/video generation.
- Kernel expressiveness: Exploring richer kernel parameterizations, including attention-based or hierarchical kernels, to further improve sample diversity and fidelity.
- Resource scaling: Adapting TM to resource-constrained settings or alternative modalities (e.g., video, 3D, audio).
Conclusion
Transition Matching represents a significant step in the evolution of generative modeling, providing a flexible, scalable, and theoretically grounded framework that unifies and extends existing paradigms. The demonstrated improvements in sample quality, prompt alignment, and sampling efficiency, along with the potential for seamless multimodal integration, position TM as a foundational approach for future generative systems.