CMT: Mid-Training for Efficient Learning of Consistency, Mean Flow, and Flow Map Models (2509.24526v1)
Abstract: Flow map models such as Consistency Models (CM) and Mean Flow (MF) enable few-step generation by learning the long jump of the ODE solution of diffusion models, yet training remains unstable, sensitive to hyperparameters, and costly. Initializing from a pre-trained diffusion model helps, but still requires converting infinitesimal steps into a long-jump map, leaving instability unresolved. We introduce mid-training, the first concept and practical method that inserts a lightweight intermediate stage between the (diffusion) pre-training and the final flow map training (i.e., post-training) for vision generation. Concretely, Consistency Mid-Training (CMT) is a compact and principled stage that trains a model to map points along a solver trajectory from a pre-trained model, starting from a prior sample, directly to the solver-generated clean sample. It yields a trajectory-consistent and stable initialization. This initializer outperforms random and diffusion-based baselines and enables fast, robust convergence without heuristics. Initializing post-training with CMT weights further simplifies flow map learning. Empirically, CMT achieves state of the art two step FIDs: 1.97 on CIFAR-10, 1.32 on ImageNet 64x64, and 1.84 on ImageNet 512x512, while using up to 98% less training data and GPU time, compared to CMs. On ImageNet 256x256, CMT reaches 1-step FID 3.34 while cutting total training time by about 50% compared to MF from scratch (FID 3.43). This establishes CMT as a principled, efficient, and general framework for training flow map models.
Explain it Like I'm 14
What is this paper about?
This paper is about making AI image generators much faster and easier to train without losing image quality. It introduces a simple extra step, called Consistency Mid-Training (CMT), that sits between using an existing model and training a new, super-fast generator. With CMT, the authors get top-quality images in just 1–2 steps, while using far less compute and time than before.
Why was this needed?
Many of today’s best image generators are “diffusion models.” They make pictures by slowly improving random noise into a detailed image, step by step. This looks great, but it’s slow because it often needs tens or hundreds of steps.
A newer idea is to learn “shortcuts” that jump most of the way in just 1–2 steps, but training these shortcut models has been unstable and expensive. Even starting from a good diffusion model didn’t fully fix the problem because diffusion models learn tiny steps, while shortcut models must learn big jumps. That mismatch makes learning shaky.
What are the main questions the paper asks?
- Can we make training of few-step (shortcut) image generators stable, fast, and cheap?
- Can we start from a good place so the shortcut model doesn’t wobble or collapse during training?
- Can one idea work for different shortcut model families (like Consistency Models and Mean Flow)?
How does their method (CMT) work?
Think of image generation like traveling from a messy starting point (random noise) to a clean destination (a sharp image). A diffusion model is like walking the path with many small steps. A “flow map” or “shortcut” model tries to leap most of the path in 1–2 big jumps.
CMT adds a short “practice” stage in the middle:
- Stage 1: Pre-training (Teacher)
- Use a good existing model (the teacher) to plot a path from noise to a clean image. This path is a sequence of states that starts noisy and gets cleaner.
- Stage 2: Mid-training (CMT)
- Train a new model (the student) to do something simple but powerful: from any point on the teacher’s path, jump straight to the final clean image (for Consistency Models), or learn the average “speed” between two points (for Mean Flow).
- Important: the targets are fixed and come from the teacher, so training is stable—no moving targets or tricky rules.
- Stage 3: Post-training (Final Shortcut Model)
- Now that the student already understands the path, finish training the final 1–2 step shortcut model quickly and reliably. Because the student is already “path-aware,” it learns faster and better.
Two versions of CMT:
- For Consistency Models (CM): the student learns to jump from any intermediate point directly to the clean image the teacher reaches.
- For Mean Flow (MF): the student learns the average motion between two points on the path (like distance divided by time), using simple differences between teacher states.
In everyday terms: the teacher traces the route; the student practices jumping to the destination from any waypoint on that exact route. Later, the student becomes great at fast travel with very little extra training.
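For readers who want to see what Stage 2 looks like in code, here is a minimal PyTorch-style sketch of one mid-training step. It is an illustration under assumptions rather than the authors' implementation: `student`, `teacher_solver`, `times`, `prior_shape`, `loss_fn`, and `optimizer` are hypothetical names, and the teacher solver is assumed to return the list of states along one trajectory with index 0 being the solver-generated clean sample.

```python
import torch

def cmt_mid_training_step(student, teacher_solver, times, prior_shape,
                          loss_fn, optimizer):
    """One illustrative CMT mid-training step (hypothetical interface)."""
    # 1) Draw a prior (noise) sample and run the frozen teacher solver once.
    x_T = torch.randn(prior_shape)
    with torch.no_grad():
        traj = teacher_solver(x_T, times)   # fixed solver states, no gradients
    x_clean = traj[0]                       # teacher's clean endpoint

    # 2) Pick a random intermediate point (x_{t_n}, t_n) on that trajectory.
    n = torch.randint(1, len(times), (1,)).item()
    x_tn, t_n = traj[n], times[n]

    # 3) CM-style target: regress the jump straight to the clean endpoint.
    #    The target comes from the teacher and is fixed before the update,
    #    so there is no stop-gradient bootstrapping and no moving target.
    #    loss_fn could be squared L2 or a perceptual loss such as LPIPS.
    loss = loss_fn(student(x_tn, t_n), x_clean)

    # MF-style alternative (commented out): regress the average velocity
    # between two trajectory states, approximated by a finite difference.
    # m = n - 1
    # u_target = (traj[n] - traj[m]) / (times[n] - times[m])
    # loss = loss_fn(student(x_tn, t_n, times[m]), u_target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point the sketch tries to convey is that the regression target (the teacher's clean endpoint, or a finite difference between teacher states) is fixed before each update, which is what "no moving targets" means in practice.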
What did they find?
The authors show big gains in both quality and efficiency:
- State-of-the-art 2-step image quality (lower FID is better):
- CIFAR-10: 1.97
- ImageNet 64×64: 1.32
- ImageNet 512×512: 1.84
- Huge savings in training:
- Up to 98% less training data and GPU time than previous methods.
- On ImageNet 512×512, they cut training time by about 91% and still beat previous 2-step results.
- On ImageNet 256×256, they reached similar or better quality while using about half the total training time compared to training from scratch.
Why this matters:
- Faster training: Saves money, energy, and time.
- More stable: Less likely to crash or need tricky tuning.
- General: Works for multiple shortcut model types (Consistency Models and Mean Flow), and even allows smaller or different teachers.
How is this different from previous approaches?
- Previous shortcut training often used “moving targets” (the goal changes as the model trains), which can confuse learning and require many special tricks.
- CMT uses fixed, high-quality targets from a teacher’s path, which is simpler and much more stable.
- Starting from a diffusion model alone (without CMT) still leaves a mismatch: diffusion models learn tiny steps, but shortcut models must learn big jumps. CMT bridges that gap.
The authors also provide theory showing that CMT puts the model closer to the “right direction” for learning, so gradients (the updates during training) are more accurate and less biased.
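To make the claim about "less biased gradients" concrete, here is a hedged sketch of the kind of quantity involved, written in generic notation rather than the paper's exact symbols: f_theta is the student flow map, f* the ideal (oracle) target it should match, and f-hat the fixed teacher/solver target actually used for training.

```latex
% Generic sketch, not the paper's exact formulation.
\mathcal{L}^{\mathrm{oracle}}(\theta)
  = \mathbb{E}\,\bigl\| f_\theta(x_t, t) - f^{*}(x_t, t) \bigr\|_2^2,
\qquad
\mathcal{L}^{\mathrm{surrogate}}(\theta)
  = \mathbb{E}\,\bigl\| f_\theta(x_t, t) - \hat{f}(x_t, t) \bigr\|_2^2,
\qquad
B(\theta)
  = \bigl\| \nabla_\theta \mathcal{L}^{\mathrm{surrogate}}(\theta)
          - \nabla_\theta \mathcal{L}^{\mathrm{oracle}}(\theta) \bigr\|.
```

In this notation, the argument summarized above is that starting post-training from a trajectory-consistent initializer (i.e., with small mid-training error) keeps the gradient bias B(theta) small, so the updates point roughly in the oracle direction.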
What’s the potential impact?
- Practical speed-ups: High-quality images in 1–2 steps make fast image generation more affordable and accessible.
- Lower compute footprint: Less GPU time and training data needed can reduce environmental and financial costs.
- Broad use: The idea is architecture-agnostic and can help many ODE-based generative models, not just one specific setup.
- Better reliability: Easier training with fewer hacks makes research and production workflows smoother.
In short
CMT is like giving a student driver a GPS playback of a perfect route and having them practice jumping to the destination from any point on that route. After this practice, the student can drive directly and confidently with just one or two moves. This makes training faster, cheaper, and more stable—while keeping image quality at the top tier.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise list of concrete gaps, uncertainties, and open questions left unresolved by the paper that future research could address:
- Scope beyond images:
- Does CMT transfer to other modalities (audio, video, 3D, molecules) and conditioning settings (e.g., text-to-image)? No experiments or adaptations are reported for SDE-based diffusion samplers, multimodal conditioning, or cross-modal tasks.
- Teacher dependence and bias propagation:
- How sensitive is CMT to teacher quality and biases? Under what conditions can CMT-enabled post-training surpass the teacher’s distribution support, diversity, or artifact profile rather than inherit them?
- What are the failure modes when the teacher is extremely weak, miscalibrated, or trained on a slightly different data distribution (domain shift)?
- How does mismatch in noise schedules or parameterizations between teacher and student (e.g., EDM vs FM schedules) affect mid-training alignment and final performance?
- Trajectory and solver design:
- How do the discretization schedule, number of solver steps, and solver choice (e.g., DPM-Solver++ vs. adaptive ODE solvers) affect mid-training stability, bias, and compute cost?
- Can adaptive or learned discretization improve trajectory coverage and reduce teacher-solver error without increasing cost?
- What is the impact of using stochastic SDE trajectories (vs. deterministic PF-ODE) as mid-training targets?
- Loss design and metric alignment:
- The paper mixes perceptual losses (LPIPS/ELatentLPIPS) and L2 across settings without a principled selection rule. How do these choices affect diversity, mode coverage, and overfitting to the teacher’s visual characteristics?
- Does mid-training with perceptual losses degrade likelihood calibration or increase mode dropping relative to pixel/feature L2?
- Are there better target spaces (e.g., multi-scale features, diffusion feature spaces) that improve generalization while limiting bias to the teacher?
- Post-training necessity and minimality:
- Can the mid-trained model alone (without post-training) achieve competitive few-step generation? What is the minimal post-training required for different datasets/scales to reach target FIDs?
- Is there a unified post-training objective that consistently benefits from CMT initialization across CM/MF variants without dataset-specific heuristics?
- Theory–practice gaps:
- The gradient-bias analysis is given for CM with uniform weighting, squared-L2 loss, small step sizes, and near-zero mid-training error. How tight are these bounds in practice with finite steps, non-uniform weightings, perceptual losses, and large models?
- There is no non-asymptotic sample-complexity or generalization bound linking mid-training error to final generation quality; constants and dependence on model capacity, solver error, and dataset complexity remain unspecified.
- The theoretical treatment focuses on CM; analogous formal results for MF (and other flow-map parameterizations) are not provided.
- Generalization and robustness:
- How robust is CMT to distribution shift (e.g., domain adaptation, long-tail classes in ImageNet)? No out-of-distribution or robustness tests are reported.
- Does CMT affect robustness to perturbations or adversarial noise compared to standard CM/MF baselines?
- Compute accounting and efficiency:
- The paper measures “data cost” in processed images and reports GPU hours, but the overhead for generating teacher trajectories (e.g., wall-clock and energy for 16-step solvers) is not fully disentangled from training time. A fuller accounting (including I/O and caching strategies) is missing.
- How does CMT scale with model size and resolution beyond 512×512, multi-node training, or pipeline parallelism? Are there emerging bottlenecks (memory, communication) at larger scales?
- Architectural agnosticism:
- Claims of architecture-agnostic mid-training are not tested beyond U-Net/DiT-like backbones and CM/MF objectives. Does CMT benefit GAN-style generators, rectified flows, or newer one-step transformer-based models?
- Can CMT bridge heterogeneous teacher–student pairs (e.g., teacher with DiT + EDM, student with U-Net + FM) reliably, and what alignment tools (time warping, feature adapters) are necessary?
- Trajectory coverage and diversity:
- The method trains on teacher-generated trajectories starting from prior samples. How many unique trajectories are required for coverage, and how does the trajectory sampling policy affect sample quality/diversity?
- Are there benefits to curriculum strategies (e.g., emphasizing hard time intervals or difficult classes) during mid-training?
- Inheritance vs improvement:
- Under what conditions does CMT enable students to outperform teacher FID while avoiding teacher’s systematic errors (e.g., texture biases, color shifts)? Formal criteria or diagnostics to detect and correct teacher-induced biases are not provided.
- Compatibility with SDE-based samplers:
- CMT assumes deterministic PF-ODE trajectories; extending to stochastic samplers (variance-preserving or variance-exploding SDEs) and analyzing noise-averaged targets remains unexplored.
- Evaluation breadth:
- The paper primarily reports FID. There is no analysis of precision–recall trade-offs, coverage metrics, density calibration, memorization/nearest-neighbor checks, or human preference studies, leaving open the effects on diversity and fidelity beyond FID.
- Privacy and safety:
- Regressing to teacher outputs could amplify training-data leakage or bias; privacy leakage analyses (e.g., membership inference) and fairness audits are absent.
- Latent-autoencoder interactions:
- For high-resolution experiments, mid-training occurs in SD latent space. The impact of VAE reconstruction error and latent perceptual loss on downstream fidelity and artifacts is not analyzed; no ablation across autoencoders is provided.
- Hyperparameter sensitivity:
- Although CMT reduces heuristic dependence, key choices (e.g., number of mid-training steps, time grid, solver steps, loss type, feature networks for perceptual loss) lack systematic sensitivity analyses.
- Weak-teacher regime:
- While a small MF-B/4 teacher helps XL/2, it is unclear how weak the teacher can be before CMT harms convergence or caps final quality. Thresholds and diagnostics for acceptable teacher strength are not given.
- Schedule and parameterization mismatch:
- When teacher and student use different time schedules/noise parameterizations, what is the best way to align them (e.g., reparameterization, time warping)? The paper does not study alignment strategies or their errors.
- Data regimes:
- Performance under low-data or class-imbalanced regimes is untested. Does CMT still stabilize training and retain data-efficiency when labeled or unlabeled data is scarce?
- Long-horizon tasks:
- Are there benefits of CMT for very long integration horizons (e.g., starting from extremely high noise levels, or in tasks requiring very large “jumps” such as one-step generation at megapixel scales)? Stability and quality at such extremes are not demonstrated.
- Continual and domain-adaptive training:
- Can CMT be used for continual learning or rapid domain adaptation by reusing teacher trajectories from prior domains and mixing them with new-domain data? No experiments or methodology are provided.
- Open-source reproducibility:
- While code is released, the paper lacks detailed reproducibility reports for all settings (random seeds, hardware variability, exact configs), making it hard to assess stability across runs and labs.
Practical Applications
Immediate Applications
The following applications can be deployed now by integrating the paper’s mid-training (CMT) stage into existing diffusion or flow-map pipelines, leveraging the released code and established ODE solvers.
- Cost-efficient upgrade for existing image-generation pipelines (software, media, advertising)
- Use case: Add CMT between pre-trained diffusion weights and final flow-map post-training (e.g., ECT/ECD/MF) to obtain 1–2 step generators with comparable or better quality.
- Tools/products/workflows: “CMT initializer” for PyTorch; DPM-Solver++ teacher with 16 steps; LPIPS/ELatentLPIPS regression losses; simplified post-training (no time-weighting or stop-gradients); see the trajectory-caching sketch after this list.
- Impact: 90%+ reduction in GPU hours and up to 98% reduction in training images; faster convergence; fewer heuristics.
- Assumptions/dependencies: Availability and licensing of a high-quality teacher sampler (EDM/EDM2/SD or small MF); teacher-solver scheduling alignment; dataset fit to target domain.
- On-device, low-latency image synthesis and editing (consumer software, mobile, AR)
- Use case: Deploy 1–2 step generators for photo filters, background removal, local stylization on phones/tablets.
- Tools/products/workflows: Latent-space CMT with SD autoencoders; quantization/distillation for mobile; caching or precomputation of embeddings.
- Impact: Millisecond-level inference with minimal battery drain; privacy by local generation.
- Assumptions/dependencies: Model size and memory constraints; on-device acceleration libraries; end-user latency requirements.
- Rapid domain adaptation with small data budgets (healthcare, retail, automotive)
- Use case: Fine-tune CMT-initialized models to specialized imagery (e.g., medical modalities, catalog products, driving scenes) with significantly fewer images and GPU hours.
- Tools/products/workflows: CMT initialization + LPIPS losses for domain fidelity; trajectory generation from general-purpose teachers; lightweight post-training to target FIDs.
- Impact: Stable convergence, reduced compute, faster deployment of domain-specific models.
- Assumptions/dependencies: Quality/availability of teacher trajectories in the target modality; regulatory compliance (especially for medical data); validation beyond FID (clinical or task-specific metrics).
- Synthetic data generation at scale for training downstream models (robotics, CV, e-commerce)
- Use case: Produce large volumes of diverse images quickly for augmentation, simulation, and A/B testing.
- Tools/products/workflows: 1–2 step generators powered by CMT; automated pipelines for sampling, labeling, and distribution-shift checks; dataset curation dashboards.
- Impact: Accelerates dataset creation; reduces cost and energy; improves iteration speed.
- Assumptions/dependencies: Representativeness of synthetic data; propagation of teacher biases; guardrails for content quality and safety.
- Cloud AI training services offering “mid-trained initializers” (cloud providers, MLOps)
- Use case: Managed service that delivers CMT initializers, trajectory caching, and turnkey post-training for customer models.
- Tools/products/workflows: APIs for uploading teacher checkpoints; trajectory generation and storage formats; automated post-training orchestration.
- Impact: Lowers customer compute bills; standardizes efficient few-step training.
- Assumptions/dependencies: Secure handling of customer data/models; legal/contractual terms around teacher licensing; multi-tenant scheduling.
- Bootstrapping larger MF models with small teachers (research/engineering)
- Use case: Use weak MF-B/4 teachers to mid-train and then post-train MF-XL/2, halving total training time while improving final FID.
- Tools/products/workflows: Teacher-agnostic mid-training; fixed discretization; regression to finite differences (MF).
- Impact: Practical path to scale MF without lengthy pre-training of large teachers.
- Assumptions/dependencies: Scheduler alignment; robustness to low-quality teachers; consistent loss parameterization.
- Sustainability gains and reporting (energy, corporate sustainability)
- Use case: Adopt CMT to reduce energy consumption and emissions in training pipelines.
- Tools/products/workflows: Compute cost trackers; carbon accounting dashboards; sustainability KPIs for model training.
- Impact: Immediate energy savings (e.g., 90%+ GPU-hour reduction in large-scale training).
- Assumptions/dependencies: Accurate energy measurement; data-center reporting; differing baseline compute footprints.
- Lower compute barrier for academic research and teaching (academia)
- Use case: Reproducible, stable experiments on flow-map models with minimal GPUs; course projects using CMT initializers.
- Tools/products/workflows: Open-source CMT code/models; standardized recipes; trajectory generation with DPM-Solver++.
- Impact: Wider access to SOTA generative modeling; fewer heuristics to debug; consistent baselines.
- Assumptions/dependencies: Access to teacher checkpoints; curriculum integration; shared evaluation datasets.
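As referenced in the first item above (“Cost-efficient upgrade for existing image-generation pipelines”), here is a minimal sketch of the trajectory-caching step that several of these workflows mention. The function and argument names are assumptions for illustration; `teacher_solver` stands in for a frozen sampler such as the 16-step DPM-Solver++ configuration the paper reports.

```python
import torch

def cache_teacher_trajectories(teacher_solver, times, prior_shape,
                               num_trajectories, out_path):
    """Precompute and store teacher solver trajectories for reuse during
    mid-training (hypothetical interface, not the authors' code)."""
    trajs = []
    with torch.no_grad():
        for _ in range(num_trajectories):
            x_T = torch.randn(prior_shape)      # prior sample
            traj = teacher_solver(x_T, times)   # list of solver states
            trajs.append(torch.stack(traj))     # (num_steps + 1, C, H, W)
    torch.save(torch.stack(trajs), out_path)    # reload across epochs/runs

# Mid-training then reads fixed (state, time, clean-endpoint) tuples from this
# cache, keeping teacher forward passes out of the training loop itself.
```

Caching is optional; trajectories can also be generated on the fly, trading storage for extra solver compute during training.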
Long-Term Applications
These applications require further research, scaling, or adaptation of CMT to new modalities, objectives, or deployment contexts.
- Multimodal and video generative modeling with few steps (entertainment, AR/VR, robotics simulation)
- Use case: Extend CMT to audio, video, 3D, and text-conditioned generative tasks; enable near real-time video synthesis with few-step flows.
- Tools/products/workflows: PF-ODE formulations for non-image modalities; trajectory losses compatible with perceptual metrics in audio/video; teacher-specific solvers.
- Dependencies: New scheduler designs; robust temporal consistency losses; high-quality multimodal teachers; large-scale validation beyond FID.
- Efficient generative world models for robotics and autonomous systems (robotics, automotive)
- Use case: Real-time scene generation or simulation for planning/perception, leveraging few-step flows for speed.
- Tools/products/workflows: World-model trajectories; integration with control loops; closed-loop evaluation frameworks.
- Dependencies: Safety/reliability guarantees; adaptation to sequential/causal dynamics rather than independent frames; robust metrics.
- Privacy-preserving medical data synthesis pipelines (healthcare, public health)
- Use case: Build regulated workflows where CMT reduces compute while producing clinically useful synthetic images for rare conditions or data sharing.
- Tools/products/workflows: Differential privacy or auditing layers; clinical utility and bias evaluation; governance-ready documentation.
- Dependencies: Regulatory validation; rigorous utility/security trade-offs; institution buy-in; domain-specific quality metrics.
- Standardized, compute-efficient training policies and carbon labeling (policy, government, standards bodies)
- Use case: Encourage or mandate efficiency-first training protocols; establish carbon labels for generative models trained with CMT-like stages.
- Tools/products/workflows: Policy frameworks; reporting templates; independent audits.
- Dependencies: Broad stakeholder consensus; standard metrics for quality per compute; compliance infrastructure.
- Turnkey tooling for CMT adoption (software tooling ecosystem)
- Use case: “CMT-as-a-library” with automatic teacher selection, trajectory caching formats, and post-training optimizers; SDKs for PyTorch/TF and Stable Diffusion ecosystems.
- Tools/products/workflows: Workflow orchestrators; dataset/trajectory registries; monitoring and early-stopping based on trajectory alignment.
- Dependencies: Community adoption; maintenance and versioning across teachers; compatibility with emerging model backbones.
- Personalized on-device generative assistants (consumer, education)
- Use case: Private, fast content generation tailored to user data on-device (avatars, illustrations, study aids).
- Tools/products/workflows: Lightweight latent models; personalization layers; safety filters.
- Dependencies: Memory footprint constraints; user-level customization while avoiding memorization; fairness controls.
- Better evaluation and benchmarking beyond FID (academia, industry)
- Use case: Develop task-aware, human-perception-aligned metrics for few-step generators; standardized benchmarks for trajectory consistency.
- Tools/products/workflows: Metric suites for fidelity/diversity/utilitarian value; open leaderboards for flow-map training efficiency.
- Dependencies: Consensus on evaluation protocols; large-scale community datasets.
- Automated curriculum mid-training and solver adaptation (ML research)
- Use case: Adaptive selection of trajectory points, step sizes, and teachers during mid-training to optimize convergence.
- Tools/products/workflows: Meta-learning over solvers; online difficulty estimation; dynamic loss weighting.
- Dependencies: Reliable signals for curriculum design; stability across architectures; reproducibility.
- Extending mid-training beyond ODE-based generators (ML research)
- Use case: Apply the mid-training concept to normalizing flows, invertible networks, or reinforcement learning policy flows.
- Tools/products/workflows: Analogous trajectory targets and fixed regressors for non-ODE frameworks; hybrid training recipes.
- Dependencies: Theoretical mapping to non-ODE regimes; compatible teacher signals; empirical validation across tasks.
Cross-cutting assumptions and dependencies
- Teacher quality and licensing: CMT relies on fixed, explicit regression targets from a teacher sampler (e.g., EDM/EDM2 or small MF). Teacher biases and licenses affect feasibility and downstream model behavior.
- Scheduler alignment and solver choice: Deterministic PF-ODE trajectories (e.g., DPM-Solver++) and compatible schedules are required; misalignment reduces training stability.
- Modalities and metrics: The paper validates on vision using FID. Other domains need domain-appropriate losses and metrics (e.g., perceptual/audio/video measures, clinical utility in healthcare).
- Hardware and deployment constraints: Few-step inference reduces latency but model size, memory, and accelerators still matter for edge devices.
- Data governance and safety: Synthetic content must respect privacy, safety, and fairness; teacher-driven biases can propagate; policies and filters may be required.
Glossary
- Auto-Guidance: A guidance technique that augments diffusion sampling or distillation with automatically learned guidance signals to improve sample quality. "we follow AYF~\citep{sabour2025align} to distill a strong EDM2 with Auto-Guidance to surpass the vanilla flow map model"
- Average drift: In flow-map learning, the time-averaged velocity (drift) over an interval, used as a target in Mean Flow. "MF aims to learn the average drift, defined as"
- Conditional velocity: The expected instantaneous velocity of the perturbed state conditioned on the current noised sample and time. "fits a vector field to the conditional velocity of the perturbation:"
- Consistency Distillation (CD): A training approach where a student model learns consistency by matching one-step solver outputs from a pre-trained diffusion teacher. "Consistency Distillation (CD): a one-step solver with a pre-trained diffusion teacher"
- Consistency Mid-Training (CMT): A lightweight intermediate training stage that learns a trajectory-aligned proxy map between pre-training and post-training for flow maps. "Consistency Mid-Training (CMT) is a compact and principled stage"
- Consistency Models (CM): Flow-map models that learn few-step generation by enforcing cross-noise-level consistency to approximate the PF-ODE solution. "Flow map models such as Consistency Models (CM) and Mean Flow (MF) enable few-step generation"
- Consistency Trajectory Model (CTM): A model that learns the general flow map for arbitrary time pairs by enforcing consistency along PF-ODE trajectories. "Consistency Trajectory Model (CTM) \citep{kim2023ctm} was the first to learn the general flow map"
- Consistency Training (CT): A consistency objective that uses analytic noising/denoising relations to train without calling a teacher. "(ii) Consistency Training (CT): the analytic estimate"
- Cross-noise-level self-consistency: A constraint requiring predictions at different noise levels along the same trajectory to agree at the clean origin. "impose cross-noise-level self-consistency"
- Denoiser: A network that predicts the clean signal (or related quantity) from a noised input and timestep in diffusion training. "EDM~\citep{karras2022edm} trains a denoiser"
- Deterministic sampling: Sampling paths generated without stochasticity by integrating a deterministic ODE defined by the learned vector field. "since MF supports deterministic sampling"
- Diffusion models: Generative models that learn to reverse a progressive noising process, typically sampled by integrating a (probability flow) ODE or SDE. "Diffusion models~\citep{ho2020denoising,song2019generative} have become a cornerstone"
- DPM-Solver++: A high-order fast ODE solver tailored for diffusion model sampling. "DPM-Solver++ uses 16 solver steps"
- Drift: The deterministic component of the ODE that drives the state’s evolution over time in the PF-ODE. "Either or can be used to realize the drift."
- EDM: Elucidated Diffusion Models, a training and sampling framework for diffusion with specific parameterization/schedules. "EDM~\citep{karras2022edm} trains a denoiser"
- EDM2: A stronger, updated variant of EDM used as a teacher and baseline in experiments. "For ImageNet 64×64 and 512×512, we adopt EDM2~\citep{karras2024edm2}"
- ELatentLPIPS: A perceptual similarity metric computed in a latent space, used as a training loss for alignment. "and ELatentLPIPS~\citep{kang2024distilling} in latent space."
- EMA (Exponential Moving Average): A parameter-averaging technique for stabilizing training and evaluation of neural networks. "eliminating ad hoc tricks such as annealing, loss reweighting, custom time sampling, EMA variants, or nonlinear learning-rate schedules."
- Excess risk: The gap between the achieved objective value and its optimum under stochastic optimization analysis. "achieves the smallest excess risk and the lowest final error"
- FID: Fréchet Inception Distance, a standard metric comparing generated and real data distributions via feature statistics. "achieves state of the art two step FIDs: 1.97 on CIFAR-10"
- Finite differences: A numerical approximation of derivatives or averages using discrete differences between sampled states. "align with the finite differences between successive reference states"
- Flow map: The integral solution map of the PF-ODE that directly transports a state from time t to s in a single jump. "the flow map"
- Flow Matching (FM): A training method that fits a vector field to the conditional velocity induced by a forward noising process. "Flow Matching~\citep{lipman2022flow}"
- Gradient bias: The discrepancy between gradients of the surrogate training objective and the oracle objective. "We define the gradient bias as"
- Jacobian-Vector Product (JVP): A computational primitive for efficiently multiplying a Jacobian by a vector, often used in implicit differentiation. "sCD requires costly JVP computations per iteration."
- Latent space: A compressed representation space (e.g., produced by an autoencoder) where models are trained or sampled. "in the latent space of Stable Diffusion (SD) autoencoders."
- LPIPS: Learned Perceptual Image Patch Similarity, a perceptual metric used to compare images and guide training. "specifically using LPIPS~\citep{zhang2018lpips} in pixel space"
- Marginals: The distributions p_t(x_t) at each time t induced by the forward noising process. "which induces marginals"
- Mean Flow (MF): A flow-map approach that models the average drift over an interval to learn few-step generation. "More recently, MF~\citep{geng2025mean} builds on the flow matching formulation by modeling the average drift"
- Mid-training: An intermediate training stage between pre-training and post-training that bridges the objectives and stabilizes learning. "We introduce mid-training, the first concept and practical method that inserts a lightweight intermediate stage"
- Mimgs (Millions of images): A data-budget measure counting the total number of training images processed. "measured under millions of training images (Mimgs)."
- NFE (Number of function evaluations): The count of model evaluations needed by a solver during sampling, reflecting inference cost. "with 63 NFEs."
- Numerical ODE solver: An algorithm that approximates ODE solutions via discrete steps (e.g., DPM-Solver++). "by running a numerical ODE solver with the pre-trained diffusion model"
- Oracle loss: The ideal objective using true (but inaccessible) flow-map targets, used for theoretical reference. "the oracle loss can be expressed as"
- PF-ODE (Probability Flow ODE): The deterministic ODE that tracks the evolution of the data distribution under score-based diffusion. "probability flow ordinary differential equation (PF-ODE)"
- Post-training: The final stage that trains the target few-step flow-map model using the initializer from pre-/mid-training. "the final flow map training (i.e., post-training)"
- Preconditioned parametrization: A reparameterization that scales network outputs/targets to improve conditioning and training stability. "with a preconditioned parametrization"
- Pre-training: The initial stage (e.g., diffusion training) that provides a teacher sampler or weights for subsequent stages. "between the (diffusion) pre-training and the final flow map training (i.e., post-training)"
- Prior (distribution): The initial distribution at maximal noise time from which sampling trajectories are started. "starting from $x_T \sim p_{\mathrm{prior}}$"
- Regression target: The explicit target used to supervise a model via a regression loss. "the regression target is applied with stop-gradient as"
- Reverse time generative perspective: Viewing training along backward trajectories from the prior to data, indexing states by their terminal noise-time. "from a reverse time generative perspective"
- Scheduler: The functions (e.g., α_t, σ_t) defining the forward noising schedule and relating different parameterizations. "given the scheduler."
- Solver trajectory: The discrete sequence of states produced by a chosen ODE solver along a teacher’s PF-ODE path. "map points along a solver trajectory from a pre-trained model"
- Stop-gradient: A training trick where target tensors are detached to prevent gradient flow, yielding fixed pseudo-targets. "supervise against stop-gradient, network-dependent pseudo-targets"
- Surrogate loss: A practical training objective that replaces inaccessible oracle targets with approximate, fixed targets. "the CM surrogate loss is"
- Teacher sampler: The reference sampler (e.g., a pre-trained diffusion or small flow-map model) that provides trajectories/targets for CMT. "We refer to these variants collectively as the teacher sampler."
- Time discretization: A partition of the continuous time interval into discrete steps used for solver trajectories and training. "We fix a decreasing time discretization"
- Trajectory-aligned initializer: An initialization whose outputs are consistent with the teacher’s ODE trajectories, stabilizing post-training. "This trajectory-aligned initializer provides a better starting point"
- Vector field: A function assigning velocities to states at each time, whose integration defines the PF-ODE dynamics. "fits a vector field"