DiffusionNFT: Reinforcement Tuning for Diffusion Models

Updated 28 April 2026

DiffusionNFT is a reinforcement-driven fine-tuning method that re-parameterizes diffusion model training as a flow-matching problem to optimize both generation and editing tasks.
It integrates continuous, normalized scalar rewards into a contrastive supervised loss, enabling implicit policy improvement without reverse process likelihood estimation.
Empirical results show that DiffusionNFT achieves significant efficiency gains and improved performance across tasks by eliminating classifier-free guidance and supporting various solvers.

Diffusion Negative-aware FineTuning (DiffusionNFT) is a reinforcement-driven, flow-matching-based fine-tuning method for diffusion models that enables online optimization using scalar rewards, without probability estimation or reverse-process likelihoods. The technique addresses challenges in applying reinforcement learning (RL) to diffusion models, notably for instruction-based image generation and editing, by operating entirely on the forward diffusion process and using a contrastive supervised loss to encode policy improvement. DiffusionNFT is solver-agnostic, supports high-order samplers, does not require classifier-free guidance (CFG), and is reported to achieve significant efficiency and quality gains across multiple generative and editing tasks (Zheng et al., 19 Sep 2025, Li et al., 19 Oct 2025, Team et al., 12 Feb 2026).

1. Mathematical Foundation and Algorithmic Structure

DiffusionNFT re-parameterizes the optimization of diffusion models as a flow-matching problem, targeting the velocity field $v_\theta(x_t, t)$ along the forward noising trajectory rather than attempting to optimize trajectory likelihoods:

Noising Process: For data sample $x_0 \sim \pi_0$ and $t \in [0,1]$, define the noised state $x_t = \alpha_t x_0 + \sigma_t \epsilon$, with $\epsilon \sim \mathcal{N}(0,I)$.
Flow Matching Loss: Train $v_\theta(x_t, t)$ by minimizing:

$L_{FM}(\theta) = \mathbb{E}_{t, x_0, \epsilon} \left[ w(t) \|v_\theta(x_t, t) - v^*(x_t, t)\|_2^2 \right]$

with ground-truth velocity $v^*(x_t, t)$ computed analytically given the schedule $(\alpha_t, \sigma_t)$.

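Contrastive Policy Improvement: At each iteration, obtain $K$ generated samples $x_0^{1:K}$ from the previous policy $\pi^{\text{old}}$, compute normalized rewards $r \in [0,1]$, and define two policy interpolants for each sample, with interpolation weight $\beta > 0$ and $v^{\text{old}}$ the velocity predictor of $\pi^{\text{old}}$:

$v_\theta^{+}(x_t, t) = (1-\beta)\, v^{\text{old}}(x_t, t) + \beta\, v_\theta(x_t, t), \qquad v_\theta^{-}(x_t, t) = (1+\beta)\, v^{\text{old}}(x_t, t) - \beta\, v_\theta(x_t, t)$

The overall training loss is:

$L(\theta) = \mathbb{E}_{t, x_0 \sim \pi^{\text{old}}, \epsilon} \left[ r\, \|v_\theta^{+}(x_t, t) - v^*(x_t, t)\|_2^2 + (1-r)\, \|v_\theta^{-}(x_t, t) - v^*(x_t, t)\|_2^2 \right]$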

This setup encourages policy improvement in the direction prescribed by reward feedback without explicit likelihoods or reverse-time rollouts (Zheng et al., 19 Sep 2025, Li et al., 19 Oct 2025).
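The forward noising step and the contrastive loss can be sketched on plain Python lists. This is a minimal illustration, not the authors' implementation. It assumes a linear schedule $\alpha_t = 1 - t$, $\sigma_t = t$ (so that $v^* = \epsilon - x_0$), and it assumes the interpolants $v^{+} = (1-\beta) v^{\text{old}} + \beta v_\theta$ and $v^{-} = (1+\beta) v^{\text{old}} - \beta v_\theta$, weighted by $r$ and $1-r$ respectively. All function names are hypothetical.

```python
import random

def noised_state(x0, eps, t):
    """x_t = alpha_t * x0 + sigma_t * eps with the assumed linear schedule."""
    return [(1.0 - t) * a + t * e for a, e in zip(x0, eps)]

def target_velocity(x0, eps):
    """v* = d(alpha_t)/dt * x0 + d(sigma_t)/dt * eps = eps - x0 (linear schedule)."""
    return [e - a for a, e in zip(x0, eps)]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nft_loss(v_theta, v_old, v_star, r, beta):
    """Reward-weighted contrastive loss for one sample.

    v_theta, v_old: current and previous velocity predictions at (x_t, t)
    v_star: flow-matching target; r: normalized reward in [0, 1]
    """
    v_pos = [(1.0 - beta) * o + beta * c for o, c in zip(v_old, v_theta)]
    v_neg = [(1.0 + beta) * o - beta * c for o, c in zip(v_old, v_theta)]
    return r * sq_dist(v_pos, v_star) + (1.0 - r) * sq_dist(v_neg, v_star)

rng = random.Random(0)
x0 = [1.0, -2.0, 0.5]                      # a "generated" clean sample
eps = [rng.gauss(0.0, 1.0) for _ in x0]    # forward-process noise
t = 0.3
x_t = noised_state(x0, eps, t)             # a real model would be evaluated at (x_t, t)
v_star = target_velocity(x0, eps)
v_old = [0.0, 0.0, 0.0]                    # stand-in for the old model's output

# For a high-reward sample, moving v_theta toward v_star lowers the loss;
# for a low-reward sample, the same move raises it.
stay = nft_loss(v_old, v_old, v_star, r=1.0, beta=0.5)
toward_hi = nft_loss(v_star, v_old, v_star, r=1.0, beta=0.5)
toward_lo = nft_loss(v_star, v_old, v_star, r=0.0, beta=0.5)
print(toward_hi < stay < toward_lo)
```

Note that the loss never differentiates through a sampler or evaluates a likelihood: it needs only a clean sample, fresh noise, and two velocity predictions.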

2. Reward Signal Design and Integration

In DiffusionNFT, reward signals are integrated directly into the supervised flow matching objective as sample-wise weighting:

Reward Normalization: Raw scores $r^{\text{raw}}$ are transformed into $r \in [0,1]$ via centering and soft clipping:

$r = \frac{1}{2} + \frac{1}{2}\, \mathrm{clip}\!\left( \frac{r^{\text{raw}} - \mathbb{E}_{\pi^{\text{old}}}[r^{\text{raw}}]}{Z},\, -1,\, 1 \right)$

with normalization $Z$ (e.g. standard deviation per group) (Li et al., 19 Oct 2025, Zheng et al., 19 Sep 2025).

Reward Composition: In multi-reward setups, rewards may be aggregated across multiple automated evaluators (e.g., model-based VLMs, OCR, CLIP, human-like preference scores) via logit-weighted ensembles (Team et al., 12 Feb 2026).
Continuous Gradient Signal: Rewards are retained as continuous values rather than thresholded, so each sample contributes a graded training signal.
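The centering-and-clipping normalization described above can be sketched as follows. This is an illustrative version that assumes the form $r = \tfrac{1}{2} + \tfrac{1}{2}\,\mathrm{clip}((r^{\text{raw}} - \text{mean})/Z, -1, 1)$ with $Z$ taken as the population standard deviation of the group. The function name and the small `eps` guard against a zero spread are assumptions added for the example.

```python
import statistics

def normalize_rewards(raw, eps=1e-8):
    """Map a group of raw scores to [0, 1] by centering, scaling, and clipping."""
    mean = statistics.fmean(raw)
    z = statistics.pstdev(raw) + eps       # assumed choice of normalizer Z
    out = []
    for x in raw:
        centered = (x - mean) / z
        clipped = max(-1.0, min(1.0, centered))
        out.append(0.5 + 0.5 * clipped)
    return out

group = [0.12, 0.59, 0.66, 0.91]           # hypothetical raw scores for one prompt
print(normalize_rewards(group))
```

A group with identical scores maps every sample to 0.5, so the positive and negative terms of the loss receive equal weight and no sample is preferred.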

For specialized editing applications (notably text editing), reward design is further enhanced:

Layout-aware OCR: Composite rewards include character accuracy plus spatial penalties for character placement and over-scaling, masked by content correctness (Team et al., 12 Feb 2026).

3. Policy Improvement Dynamics and Optimization

DiffusionNFT encodes RL-style policy improvement by exploiting the geometry of the forward process:

Implicit Policy Update: Given unlimited data and model capacity, the equilibrium predictor is guaranteed (by theorem) to take the form

$v_{\theta^*}(x_t, t) = v^{\text{old}}(x_t, t) + \frac{2}{\beta}\, \Delta(x_t, t)$

where $\Delta(x_t, t)$ encodes the “improvement direction” revealed by the contrast between positive (high-reward) and negative (low-reward) trajectories (Zheng et al., 19 Sep 2025).

No Trajectory Storage or Likelihoods: Only clean ($x_0$) samples and current velocity predictions are required for policy optimization. There is no need to reconstruct complete reverse diffusion paths or to estimate likelihoods.
Solver Invariance: As the method does not differentiate through sampling, any ODE/SDE solver (DDIM, DPM, higher-order methods) may be used for candidate generation (Li et al., 19 Oct 2025).
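The implicit policy update can be checked numerically in a scalar toy case. The sketch below makes several assumptions: the loss weights the interpolants $(1-\beta) v^{\text{old}} + \beta v$ and $(1+\beta) v^{\text{old}} - \beta v$ by $a$ and $1-a$, where `a` is the reward-weighted share of positive samples at a given $x_t$; `v_pos` and `v_neg` are the mean target velocities of the positive and negative parts; and the improvement direction is $\Delta = a\,(v^{+} - v^{\text{old}})$. All names are hypothetical. A ternary search over the convex loss is compared with the closed form $v^{\text{old}} + \tfrac{2}{\beta}\Delta$.

```python
def expected_loss(v, v_pos, v_neg, a, beta):
    """Expected contrastive loss at one x_t, up to a constant independent of v."""
    v_old = a * v_pos + (1.0 - a) * v_neg
    pos = (1.0 - beta) * v_old + beta * v      # implicit positive predictor
    neg = (1.0 + beta) * v_old - beta * v      # implicit negative predictor
    return a * (pos - v_pos) ** 2 + (1.0 - a) * (neg - v_neg) ** 2

def predicted_optimum(v_pos, v_neg, a, beta):
    """Closed form v_old + (2 / beta) * Delta under the stated assumptions."""
    v_old = a * v_pos + (1.0 - a) * v_neg
    delta = a * (v_pos - v_old)                # assumed improvement direction
    return v_old + (2.0 / beta) * delta

def numeric_optimum(v_pos, v_neg, a, beta, lo=-10.0, hi=10.0, iters=200):
    """Ternary search; valid because the loss is a convex quadratic in v."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if expected_loss(m1, v_pos, v_neg, a, beta) < expected_loss(m2, v_pos, v_neg, a, beta):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)

closed = predicted_optimum(v_pos=1.0, v_neg=-1.0, a=0.25, beta=0.5)
found = numeric_optimum(v_pos=1.0, v_neg=-1.0, a=0.25, beta=0.5)
print(closed, found)
```

In this example $v^{\text{old}} = -0.5$ and the minimizer moves to the positive side, which illustrates how a smaller $\beta$ yields a larger step away from the old policy.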

Optimization strategies such as semi-hard sample mining (selecting "on-the-margin" samples) maximize the training signal and stabilize convergence (Team et al., 12 Feb 2026).
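Because training consumes only the final samples, the sampler used for candidate generation can be treated as a black box. The sketch below is a toy example rather than a diffusion sampler: it plugs two interchangeable ODE steppers (Euler and Heun's second-order method) into the same generation routine for a hypothetical scalar velocity field $v(x, t) = x - \mu$. All names are illustrative.

```python
import math

MU = 2.0

def velocity(x, t):
    """Toy stand-in for a learned velocity field (not a trained model)."""
    return x - MU

def euler_step(v, x, t, h):
    """First-order step."""
    return x + h * v(x, t)

def heun_step(v, x, t, h):
    """Second-order step: average the slope at both ends of the interval."""
    k1 = v(x, t)
    k2 = v(x + h * k1, t + h)
    return x + 0.5 * h * (k1 + k2)

def generate(v, x_start, step_fn, n_steps=10):
    """Integrate dx/dt = v(x, t) from t = 1 (noise) down to t = 0 (sample)."""
    h = -1.0 / n_steps
    x, t = x_start, 1.0
    for _ in range(n_steps):
        x = step_fn(v, x, t, h)
        t += h
    return x

x_noise = 5.0
exact = MU + (x_noise - MU) * math.exp(-1.0)   # closed-form solution at t = 0
x_euler = generate(velocity, x_noise, euler_step)
x_heun = generate(velocity, x_noise, heun_step)
print(abs(x_euler - exact), abs(x_heun - exact))
```

Swapping `euler_step` for `heun_step` changes only how candidates are produced; nothing downstream of `generate` needs to know which solver was used.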

4. Empirical Results and Benchmarks

Empirical studies demonstrate that DiffusionNFT achieves marked improvements over prior RL and supervised-finetuning approaches, both in training efficiency and final task metrics:

Model/Setting	GenEval	OCR	PickScore	ClipScore	HPSv2.1	Speedup
SD3.5-M (init)	0.24	0.12	–	–	–	–
SD3.5-M + CFG	0.63	0.59	–	–	–	–
FlowGRPO + CFG, >5k iters	0.95	0.66	–	–	–	Reference
DiffusionNFT, 1k–1.7k iters	0.98	0.91	23.80	0.293	0.331	3–25× wall-clock
FireRed-Image-Edit (OCR)	–	0.983	–	–	–	–

Typical improvements include reaching GenEval 0.98 within 1k iterations (vs. 0.95 in >5k with FlowGRPO+CFG) and superior results on text-editing benchmarks (OCR = 0.983) (Zheng et al., 19 Sep 2025, Team et al., 12 Feb 2026).

5. Integration with Diffusion Architectures and Applications

DiffusionNFT is agnostic to the specific diffusion model architecture and applies broadly:

Generic Integration: The method operates at the velocity-prediction head of any flow-matching or transformer-based diffusion model (e.g., SD3.5, FireRed DiT, UniWorld-V2, Qwen-Image-Edit, FLUX-Kontext).
Online RL Stage: Typically introduced after pre-training, SFT, or DPO, as a final online RL fine-tuning phase (Team et al., 12 Feb 2026).
Critical for Text Editing: In instruction-based editing benchmarks, notably tasks involving fine-grained text manipulation, the method is reported to mitigate failure modes such as glyph collapse, over-scaling, and reward hacking (Team et al., 12 Feb 2026).
Reward Model Variations: Can directly utilize VLMs or multimodal LLMs as automated reward providers, with logit-based and group-filtered scoring to reduce noise and stabilize optimization (Li et al., 19 Oct 2025, Team et al., 12 Feb 2026).

6. Implementation Details and Hyperparameterization

Model training with DiffusionNFT observes the following practices:

Key Hyperparameters: $\beta$ (interpolation, typically 0.1–1), learning rate, batch size (3–64), steps (500–1700), soft-update schedule for the old policy, and EMA for inference stability (Zheng et al., 19 Sep 2025, Li et al., 19 Oct 2025, Team et al., 12 Feb 2026).
Reward Ensemble: Number of reward passes used for logit aggregation.
Semi-Hard Mining: Candidate selection by reward.
KL-Regularization: Optionally added between current and old policy to constrain optimization (Li et al., 19 Oct 2025).
Resource Utilization: Distributed strategies (e.g., FSDP, gradient checkpointing) for efficient training at large scale (Li et al., 19 Oct 2025).
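The soft update of the old (sampling) policy listed among the hyperparameters can be illustrated as an exponential moving average over parameters. This sketch assumes the update $\theta^{\text{old}} \leftarrow \eta\, \theta^{\text{old}} + (1-\eta)\, \theta$; the coefficient name `eta` and the function name are hypothetical, and a real schedule may vary `eta` over training.

```python
def soft_update(old_params, new_params, eta):
    """theta_old <- eta * theta_old + (1 - eta) * theta, elementwise."""
    return [eta * o + (1.0 - eta) * n for o, n in zip(old_params, new_params)]

theta_old = [0.0, 1.0]     # parameters of the sampling policy
theta = [1.0, 3.0]         # parameters currently being trained
for step in range(3):
    theta_old = soft_update(theta_old, theta, eta=0.5)
print(theta_old)
```

With `eta` close to 1 the sampling policy changes slowly, which keeps the generated samples close to on-policy for longer. With `eta = 0` it is overwritten by the trained model at every step.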

7. Impact, Limitations, and Outlook

DiffusionNFT establishes a new likelihood-free paradigm for online RL in diffusion models:

Efficiency and Flexibility: Achieves up to $25\times$ speedup in wall-clock time, eliminates dependence on sampler type, and scales to multi-reward settings without architectural modification (Zheng et al., 19 Sep 2025, Li et al., 19 Oct 2025, Team et al., 12 Feb 2026).
Generalization: Demonstrates robust transfer across instruction-based editing tasks, and is model-agnostic (Li et al., 19 Oct 2025).
Reward Model Quality: Effectiveness is contingent on the expressivity and reliability of the external reward model (e.g., VLMs, OCR). Emerging challenges involve reward hacking and the calibration of ensemble or layout-aware scores.
Extensions: With increasing availability of pretrained MLLMs and higher-order black-box solvers, the methodology is expected to see continued adoption for both generation and editing applications.

A plausible implication is that the forward-process, negative-aware contrastive loss structure introduced by DiffusionNFT may serve as a foundation for RL-based tuning of other generative models where explicit likelihoods are intractable or sampling-based training dominates (Zheng et al., 19 Sep 2025, Li et al., 19 Oct 2025, Team et al., 12 Feb 2026).