
Control-Rectify Flow Matching (CRFM)

Updated 25 December 2025
  • CRFM is a task-driven sampling strategy for diffusion models that injects semantic supervision during early data synthesis to mitigate instability and drift.
  • It integrates a rectification mechanism using segmentation loss feedback with a Multimodal Diffusion Transformer backbone to guide semantic alignment.
  • Empirical results demonstrate improved segmentation accuracy, with optimal correction in the initial diffusion steps producing measurable performance gains.

Control-Rectify Flow Matching (CRFM) is a task-driven sampling strategy for diffusion-based generative models, designed to inject direct semantic supervision into the early stages of synthetic data generation. It was introduced as a critical component for bridging the gap between synthetic and real data in remote sensing semantic segmentation, specifically within the Task-Oriented Data Synthesis (TODSynth) framework leveraging a Multimodal Diffusion Transformer (MM-DiT) backbone (Yang et al., 18 Dec 2025). The method addresses the instability and semantic drift common in mask-conditioned diffusion models, yielding improvements in both segmentation accuracy and alignment between generated samples and downstream tasks.

1. Conceptual Overview and Motivation

CRFM operates as an auxiliary correction mechanism during the diffusion sampling process. Standard flow-matching or rectified-flow generative paradigms transform noise vectors into plausible samples via ODE integration, leveraging model-predicted velocities. However, conventional sampling procedures lack closed-loop, task-driven correction for semantic targets, resulting in occasional misalignment—especially problematic for instance-level semantic mask generation, where pixel-level consistency is critical.

CRFM overcomes this by guiding the sampling trajectory toward semantically valid solutions, introducing feedback computed from a segmentation loss applied to partially generated outputs. The approach is motivated by the high plasticity observed at early diffusion steps—small modifications in latent space during these steps produce substantial changes in the synthesized output, making them ideal for effective rectification.

2. Mathematical Formulation of CRFM

CRFM is instantiated in diffusion models formalized via rectified flow. Let $\mathbf{z}_1 \sim \mathcal{N}(0, I)$ be the noise vector, $\mathbf{z}_0$ the target data latent, and $\mathbf{v}_\theta$ the predicted flow field. The forward ODE satisfies

$$\frac{d\mathbf{z}_t}{dt} = \mathbf{v}_\theta(\mathbf{z}_t, t, \mathbf{C}^t, \mathbf{C}^m)$$

where $\mathbf{C}^t$ and $\mathbf{C}^m$ are the text and mask conditioning tokens, respectively.
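In practice this ODE is integrated numerically. The following is a minimal sketch of a plain (uncorrected) Euler sampler, assuming the same interfaces as the pseudocode in Section 5 (model.predict_velocity, a discrete time grid t, and step size delta_t); it is given for orientation only:

def euler_sample(model, z1, t, delta_t, C_text, C_mask, N):
    # Integrate dz/dt = v_theta(z, t, C_text, C_mask) from noise z_1 toward the data latent z_0.
    z = z1
    for i in range(N, 0, -1):
        v = model.predict_velocity(z, t[i], C_text, C_mask)  # predicted flow field
        z = z + delta_t * v                                   # Euler step
    return z

CRFM modifies only the velocity used in this update, and only during the first few steps, as formalized below.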

At each time step $t_i$, the predicted velocity $\mathbf{v}^P_i$ is computed as

$$\mathbf{v}^P_i = \mathbf{v}_\theta(\mathbf{z}_{t_i}, t_i, \mathbf{C}^t, \mathbf{C}^m)$$

CRFM augments this with a rectifying gradient $\mathbf{v}_{\mathrm{rec},i}$, obtained by:

  1. Synthesizing an estimate of the final latent:

$$\mathbf{z}_{0,i}^t = \mathbf{z}_{t_i} - \sigma_{t_i}\,\mathbf{v}^P_i$$

  2. Decoding the latent to image space and evaluating a cross-entropy loss between the inferred segmentation $\mathcal{S}(\mathcal{D}(\mathbf{z}_{0,i}^t))$ and the ground-truth mask $\mathbf{C}^m$:

$$\mathcal{L}_{\mathrm{CE},i} = \mathrm{CE}\big(\mathcal{S}(\mathcal{D}(\mathbf{z}_{0,i}^t)),\, \mathbf{C}^m\big)$$

  3. Computing the gradient with respect to the predicted velocity:

$$\mathbf{g}_i = \nabla_{\mathbf{v}^P_i}\,\mathcal{L}_{\mathrm{CE},i} \approx \mathbf{v}_{\mathrm{rec},i}$$

  4. Updating the velocity for the Euler step:

$$\mathbf{v}'_i = \mathbf{v}^P_i + \alpha\,\mathbf{v}_{\mathrm{rec},i}$$

with $\alpha$ a scalar hyperparameter.

Sampling then proceeds via

$$\mathbf{z}_{t_{i-1}} = \mathbf{z}_{t_i} + \Delta t\,\mathbf{v}'_i$$

This correction is applied only in the initial $N_{\mathrm{CRFM}}$ steps (typically 4 out of 23), after which the standard update rule resumes.
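Substituting steps 1 through 4 into the Euler update, a single rectified step can be written compactly as

$$\mathbf{z}_{t_{i-1}} = \mathbf{z}_{t_i} + \Delta t\left(\mathbf{v}^P_i + \alpha\,\nabla_{\mathbf{v}^P_i}\,\mathrm{CE}\big(\mathcal{S}(\mathcal{D}(\mathbf{z}_{t_i} - \sigma_{t_i}\mathbf{v}^P_i)),\,\mathbf{C}^m\big)\right), \qquad i > N - N_{\mathrm{CRFM}},$$

while the remaining steps use the uncorrected update $\mathbf{z}_{t_{i-1}} = \mathbf{z}_{t_i} + \Delta t\,\mathbf{v}^P_i$.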

3. Integration into TODSynth and MM-DiT

CRFM is fully integrated into the Task-Oriented Data Synthesis workflow for remote sensing (Yang et al., 18 Dec 2025):

  • Generator backbone: MM-DiT, a diffusion transformer with unified triple-attention (image, mask, and text tokens), is employed as the generative backbone.
  • Sampling strategy: During generation, the first $N_{\mathrm{CRFM}}$ steps of the sampling ODE are corrected using CRFM, with the semantic loss computed via a pretrained segmentation network.
  • Fine-tuning: All cross-attention weights in the image and mask streams are fully fine-tuned for maximal pixel-wise and global semantic control.

Ablation studies establish that (1) triple-attention, (2) full fine-tuning, and (3) CRFM in the early sampling stages jointly deliver the highest segmentation fidelity and stability.
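For concreteness, the reported sampling configuration can be grouped as in the sketch below; the field names are illustrative stand-ins, and only the numeric values (23 total steps, 4 corrected steps, $\alpha$ = 0.25) come from the reported setup:

# Illustrative grouping of the reported CRFM sampling hyperparameters.
# Field names are hypothetical; only the numeric values are from the paper.
crfm_config = {
    "num_steps": 23,       # total sampling steps N
    "num_crfm_steps": 4,   # N_CRFM: early, high-plasticity steps that receive correction
    "alpha": 0.25,         # correction weight applied to the rectifying gradient
}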

4. Empirical Results and Comparative Analysis

Extensive benchmarking on both few-shot and complex-scene semantic segmentation datasets—FUSU-4k and LoveDA-5k—demonstrates the impact of CRFM:

  • On FUSU-4k, the TODSynth system (with CRFM) achieves OA = 75.66%, mIoU = 49.41%, and mAcc = 63.27%, surpassing both SD v3.5+FM baseline and all ablated variants.
  • The inclusion of CRFM yields measurable gains over identical architectures using standard flow matching, adding +0.84 percentage points (pp) mIoU and +1.60 pp mAcc.
  • Ablation confirms that applying CRFM for the first four steps (of 23 total) is optimal; applying it for longer causes mode collapse, as indicated by degraded FID scores.

These results verify that task-guided correction during early diffusion steps stabilizes sample quality, secures semantic alignment, and enhances downstream segmentation performance.

5. Algorithmic Summary and Pseudocode

The CRFM sampling procedure can be instantiated as follows (see (Yang et al., 18 Dec 2025)):

import torch
import torch.nn.functional as F

def crfm_sample(model, VAE, seg_network, C_text, C_mask, z1, t, sigma, delta_t,
                N, N_crfm, alpha):
    z = z1  # initialize with noise z_1 ~ N(0, I)
    for i in range(N, 0, -1):
        v_pred = model.predict_velocity(z, t[i], C_text, C_mask)
        if i > N - N_crfm:
            # CRFM stage: rectify the velocity during the first N_crfm (high-plasticity) steps
            with torch.enable_grad():
                v_pred = v_pred.detach().requires_grad_(True)
                z0_hat = z - sigma[t[i]] * v_pred             # one-step estimate of the clean latent
                x_hat = VAE.decode(z0_hat)                    # decode the estimate to image space
                loss_CE = F.cross_entropy(seg_network(x_hat), C_mask)
                grad, = torch.autograd.grad(loss_CE, v_pred)  # rectifying gradient v_rec
            v = v_pred.detach() + alpha * grad
        else:
            v = v_pred
        z = z + delta_t * v  # Euler step of the (rectified) flow ODE
    return VAE.decode(z)
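A hypothetical invocation might look as follows; the backbone, VAE, teacher segmentation network, conditioning tokens, and schedule objects are placeholders for whichever pretrained components and sampler settings are in use:

# Hypothetical usage; all objects below are assumed to expose the interfaces used in crfm_sample.
z1 = torch.randn(1, 16, 64, 64)  # initial noise latent (shape is illustrative)
image = crfm_sample(model=mm_dit, VAE=vae, seg_network=teacher_seg,
                    C_text=text_tokens, C_mask=mask_tokens,
                    z1=z1, t=timesteps, sigma=sigmas, delta_t=dt,
                    N=23, N_crfm=4, alpha=0.25)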

6. Impact, Limitations, and Future Work

CRFM directly addresses the volatility and semantic drift of synthetic mask generation under diffusion models, imposing task-aligned correction at the sampling level without re-training the diffusion backbone. It enables cost-effective, automated generation of training data in settings with sparse or expensive annotations, with demonstrated gains in rare-class, high-complexity, and few-shot scenarios.

Nevertheless, the method as described is limited to injecting feedback via fixed segmentation teachers. Future work is suggested to:

  • Replace static teachers with self-supervised or foundation models in the feedback loop.
  • Extend CRFM to layout-to-image, object detection, counting, and anomaly identification tasks.
  • Incorporate additional modalities (e.g., SAR, multispectral) and more powerful, modality-specific feedback networks.

7. Summary Table: CRFM Parameters and Empirical Outcomes

| Aspect | CRFM Value/Setting | Context |
|---|---|---|
| Correction steps ($N_{\mathrm{CRFM}}$) | 4 (of 23 total) | Applied during the initial high-plasticity diffusion steps |
| Correction weight ($\alpha$) | 0.25 | Tuned on validation |
| Performance gain (FUSU-4k) | +0.84 pp mIoU, +1.60 pp mAcc | Over SD v3.5+FM with identical architecture |
| Segmentation feedback | Cross-entropy loss on teacher segmentation | Backpropagated w.r.t. the predicted velocity |
| Integration | MM-DiT with triple attention, full fine-tuning | Required for maximal effect |

CRFM provides a conceptually and computationally tractable solution for integrating semantic supervision into the generative trajectory of diffusion models, resulting in more reliable, task-relevant synthetic data in semantic segmentation (Yang et al., 18 Dec 2025).
