Transition Matching: Scalable and Flexible Generative Modeling (2506.23589v1)

Published 30 Jun 2025 in cs.LG and cs.AI

Abstract: Diffusion and flow matching models have significantly advanced media generation, yet their design space is well-explored, somewhat limiting further improvements. Concurrently, autoregressive (AR) models, particularly those generating continuous tokens, have emerged as a promising direction for unifying text and media generation. This paper introduces Transition Matching (TM), a novel discrete-time, continuous-state generative paradigm that unifies and advances both diffusion/flow models and continuous AR generation. TM decomposes complex generation tasks into simpler Markov transitions, allowing for expressive non-deterministic probability transition kernels and arbitrary non-continuous supervision processes, thereby unlocking new flexible design avenues. We explore these choices through three TM variants: (i) Difference Transition Matching (DTM), which generalizes flow matching to discrete-time by directly learning transition probabilities, yielding state-of-the-art image quality and text adherence as well as improved sampling efficiency. (ii) Autoregressive Transition Matching (ARTM) and (iii) Full History Transition Matching (FHTM) are partially and fully causal models, respectively, that generalize continuous AR methods. They achieve continuous causal AR generation quality comparable to non-causal approaches and potentially enable seamless integration with existing AR text generation techniques. Notably, FHTM is the first fully causal model to match or surpass the performance of flow-based methods on text-to-image task in continuous domains. We demonstrate these contributions through a rigorous large-scale comparison of TM variants and relevant baselines, maintaining a fixed architecture, training data, and hyperparameters.

Summary

  • The paper introduces Transition Matching, a unified framework that generalizes diffusion, flow matching, and autoregressive models to enhance generative modeling.
  • It presents three variants—Difference, Autoregressive, and Full History Transition Matching—that optimize kernel parameterization and supervision for improved quality and speed.
  • Empirical results on large-scale text-to-image tasks show significant speedups and superior prompt alignment, highlighting the framework’s practical scalability.

Transition Matching: A Unified and Scalable Paradigm for Generative Modeling

The paper "Transition Matching: Scalable and Flexible Generative Modeling" (2506.23589) introduces Transition Matching (TM), a discrete-time, continuous-state generative modeling framework that unifies and generalizes diffusion, flow matching, and continuous autoregressive (AR) models. TM is designed to address limitations in the design space of current generative models, offering new flexibility in kernel parameterization, supervision processes, and modeling paradigms. The authors present three concrete TM variants—Difference Transition Matching (DTM), Autoregressive Transition Matching (ARTM), and Full History Transition Matching (FHTM)—and demonstrate their effectiveness on large-scale text-to-image generation tasks.

Framework Overview

Transition Matching models the data generation process as a Markov chain, parameterized by a transition kernel $p_{t+1|t}^\theta(x_{t+1}|x_t)$, which maps an initial easy-to-sample distribution $p_0$ to a target data distribution $p_T$ over $T$ discrete steps. The key innovation is the use of a flexible supervising process $q_{0,\ldots,T}$, which allows for arbitrary (including non-continuous) processes, and the ability to learn expressive, potentially non-deterministic transition kernels.

The training objective is to match the model's transition kernel to the supervising process's conditional transitions, using a divergence $D$ that admits an empirical (sample-based) form. This enables efficient training via stochastic optimization, even when the true conditional distributions are intractable.
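For intuition, the following is a minimal sketch of such a training step, assuming a linear (conditional optimal transport) supervising process and a plain mean-squared error as a stand-in for the empirical divergence $D$; the `model` callable and `num_steps` argument are illustrative, and the paper's actual kernels are non-deterministic rather than a single regressed prediction.

```python
import torch

def tm_training_step(model, x_T, num_steps=32):
    """Match the learned transition kernel to the supervising process:
    draw a coupled (x_t, x_{t+1}) pair and penalize the model's prediction
    of x_{t+1} given x_t with a sample-based (empirical) loss."""
    batch = x_T.shape[0]
    x_0 = torch.randn_like(x_T)                     # easy-to-sample source p_0
    t = torch.randint(0, num_steps, (batch,))
    s = (t.float() / num_steps).view(-1, *([1] * (x_T.dim() - 1)))
    x_t = (1 - s) * x_0 + s * x_T                   # state at step t of the supervising process
    x_next = (1 - s - 1 / num_steps) * x_0 + (s + 1 / num_steps) * x_T  # coupled next state
    pred = model(x_t, t)                            # model's transition prediction
    return ((pred - x_next) ** 2).mean()            # MSE stand-in for the divergence D
```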

TM Variants and Design Choices

The paper systematically explores the TM design space along three axes: the supervising process, kernel parameterization, and kernel modeling. The main variants are:

  • Difference Transition Matching (DTM): Generalizes flow matching to discrete time by learning the full transition distribution of the difference $Y = X_T - X_0$ given $X_t$. DTM uses a linear supervising process (conditional optimal transport) and a parallelized architecture for efficient sampling. Notably, DTM's expected transition coincides with the Euler step of flow matching, and as the number of steps increases, DTM converges to deterministic flow matching.
  • Autoregressive Transition Matching (ARTM): Extends continuous AR models by using an independent linear supervising process, where each $X_t$ is constructed from independent noise and the target $X_T$ (a minimal construction is sketched after this list). The transition kernel is modeled autoregressively over image tokens, allowing for causal generation and integration with text AR models.
  • Full History Transition Matching (FHTM): Further generalizes ARTM by conditioning each transition on the full history of previous states, enabling full teacher-forcing during training. FHTM is the first fully causal model to match or surpass flow-based methods in continuous domains.
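
The independent linear supervising process behind ARTM and FHTM can be illustrated with a short sketch. The construction below is an assumption-laden reading of the description above (fresh noise at every step, mixed linearly with the target), not the paper's reference code.

```python
import torch

def independent_linear_chain(x_T, T=8):
    """Build a supervising chain X_0, ..., X_T where each state mixes the
    target x_T with freshly drawn, independent noise, so consecutive states
    are not deterministically coupled as in the conditional-OT path."""
    chain = []
    for t in range(T + 1):
        eps_t = torch.randn_like(x_T)              # independent noise for step t
        s = t / T
        chain.append((1 - s) * eps_t + s * x_T)    # X_t = (t/T) X_T + (1 - t/T) eps_t
    return chain
```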

The authors provide detailed algorithmic descriptions and code-level pseudocode for each variant, facilitating practical implementation.
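
As a concrete illustration of the DTM case, the sketch below pairs the linear supervising process with a lightweight flow-matching head that learns the full distribution of the difference $Y = X_T - X_0$ given $X_t$. Here `backbone` and `flow_head` are hypothetical modules standing in for the DiT trunk and its head, and the exact conditioning and loss weighting in the paper may differ.

```python
import torch

def dtm_training_step(backbone, flow_head, x_T, T=32):
    """One DTM training step: form X_t from the linear supervising process,
    then train the head to model the distribution of Y = X_T - X_0 given X_t
    via an inner flow-matching regression."""
    batch = x_T.shape[0]
    x_0 = torch.randn_like(x_T)
    t = torch.randint(0, T, (batch,))
    s = (t.float() / T).view(-1, *([1] * (x_T.dim() - 1)))
    x_t = (1 - s) * x_0 + s * x_T                  # current state from the supervising process
    y = x_T - x_0                                  # supervision target: the difference Y

    h = backbone(x_t, t)                           # shared context for the transition kernel

    # Inner flow matching: regress the velocity of a linear path from noise to Y,
    # which lets the head represent a full (non-deterministic) transition kernel.
    r = torch.rand(batch).view(-1, *([1] * (x_T.dim() - 1)))
    y_0 = torch.randn_like(y)
    y_r = (1 - r) * y_0 + r * y
    v_pred = flow_head(y_r, r, h)
    return ((v_pred - (y - y_0)) ** 2).mean()
```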

Empirical Results

Extensive experiments are conducted on large-scale text-to-image generation benchmarks (PartiPrompts, MS-COCO, GenEval), using a fixed DiT backbone (1.7B parameters) and standardized training protocols. Key findings include:

  • DTM achieves state-of-the-art image quality and text alignment, outperforming both flow matching and AR baselines across most metrics, while requiring significantly fewer sampling steps (e.g., 32 vs. 256 for FM), resulting in a 7x speedup in inference time.
  • ARTM and FHTM match or exceed the performance of non-causal methods in terms of image quality and prompt adherence, with FHTM enabling seamless integration with LLM architectures for multimodal generation.
  • Kernel expressiveness and supervising process design are critical: The independent linear process is essential for ARTM/FHTM performance, and increasing the expressiveness of the DTM kernel (e.g., larger patch sizes) improves performance for low step counts.
  • Sampling efficiency: DTM allows for aggressive reduction in the number of backbone and head evaluations without significant loss in quality, a property not shared by ARTM/FHTM due to their sequential nature.

Implementation Considerations

  • Architecture: All variants leverage a DiT backbone with a lightweight flow head for velocity prediction. For ARTM/FHTM, the head operates autoregressively over tokens, while DTM uses parallel prediction for efficiency.
  • Training: The loss is based on flow matching, with empirical divergence computed via sample pairs from the supervising process. Classifier-free guidance is supported for improved text alignment.
  • Sampling: DTM supports efficient parallel sampling, while ARTM/FHTM require sequential token generation, impacting inference speed.
  • Resource Requirements: Large-scale training (350M image-caption pairs, 1.7B parameter models) is necessary to achieve reported results. However, the modularity of TM allows adaptation to smaller scales or other modalities.
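
To make the efficiency argument concrete, here is a hedged sketch of DTM sampling: $T$ outer Markov steps, each calling the backbone once, drawing a difference $Y$ by integrating the flow head for a few Euler steps, and advancing $x_{t+1} = x_t + Y/T$. The module names and the Euler integration mirror the assumptions of the training sketch above and are not the reference sampler.

```python
import torch

@torch.no_grad()
def dtm_sample(backbone, flow_head, shape, T=32, head_steps=4):
    """Generate a sample with T outer transitions; each transition samples
    Y ~ p(Y | x_t) by integrating the learned flow head from Gaussian noise."""
    x = torch.randn(shape)                          # x_0 drawn from the source p_0
    for t in range(T):
        h = backbone(x, torch.full((shape[0],), t))
        y = torch.randn(shape)                      # start the inner flow from noise
        for k in range(head_steps):
            r = torch.full((shape[0], *([1] * (len(shape) - 1))), k / head_steps)
            y = y + flow_head(y, r, h) / head_steps  # Euler step of the inner flow
        x = x + y / T                               # one Markov transition of the outer chain
    return x
```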

Theoretical and Practical Implications

Transition Matching provides a unifying framework that subsumes diffusion, flow, and AR models as special cases, offering a principled approach to designing new generative models with tailored trade-offs between quality, speed, and causality. The empirical results demonstrate that more expressive transition kernels and flexible supervision processes can yield substantial gains in both sample quality and efficiency.

The causal TM variants (ARTM, FHTM) open the door to fully integrated multimodal generative models, where text and image generation can be handled within a single, unified AR framework. This is particularly relevant for next-generation foundation models that require seamless cross-modal reasoning and generation.

Future Directions

Potential avenues for further research include:

  • Time scheduling and distillation: Optimizing the transition schedule or distilling multi-step TM models into single-step or fewer-step models for even faster inference.
  • Multimodal integration: Leveraging FHTM within large-scale multimodal systems, enabling joint text, image, and potentially audio/video generation.
  • Kernel expressiveness: Exploring richer kernel parameterizations, including attention-based or hierarchical kernels, to further improve sample diversity and fidelity.
  • Resource scaling: Adapting TM to resource-constrained settings or alternative modalities (e.g., video, 3D, audio).

Conclusion

Transition Matching represents a significant step in the evolution of generative modeling, providing a flexible, scalable, and theoretically grounded framework that unifies and extends existing paradigms. The demonstrated improvements in sample quality, prompt alignment, and sampling efficiency, along with the potential for seamless multimodal integration, position TM as a foundational approach for future generative systems.
