DiffusionNFT: Negative-Aware Finetuning

Updated 4 July 2026

DiffusionNFT is a paradigm that integrates negative signals such as forbidden concepts and low-reward samples into the finetuning process of diffusion models.
It employs diverse formulations including bilevel concept suppression and online reinforcement learning to enhance model reliability and efficiency.
Empirical studies indicate that DiffusionNFT improves generation quality, reduces negative behaviors, and achieves efficient adaptation across various diffusion tasks.

Diffusion Negative-aware Finetuning (DiffusionNFT) denotes a class of diffusion-model adaptation procedures in which training is explicitly informed by negative evidence—such as forbidden concepts, low-reward generations, or human-dispreferred outputs—rather than relying only on positive reconstruction or preference signals. In the literature covered here, the term spans multiple closely related but technically distinct formulations: a supervised bilevel scheme that unifies fine-tuning and concept suppression for pruned diffusion models, an online reinforcement-learning formulation that optimizes flow-matching models directly on the forward process, and later extensions for value estimation, preference alignment, and instruction-based image editing (Shirkavand et al., 2024, Zheng et al., 19 Sep 2025, Go et al., 19 May 2026, Wang et al., 16 May 2025, Li et al., 19 Oct 2025). This suggests an umbrella usage centered on negative-aware post-training rather than a single canonical loss.

1. Terminological scope and conceptual background

Across these works, “negative-aware” consistently refers to training rules that do not treat all generated or supervised targets symmetrically. Instead, they incorporate explicit avoidance signals: a concept to suppress, a low-reward sample to push away from, or a negative-preference branch used to sharpen classifier-free guidance. What changes across papers is the object being optimized: a noise predictor in conditional diffusion, a velocity field in flow matching, or a separate negative branch used during guidance (Shirkavand et al., 2024, Zheng et al., 19 Sep 2025, Wang et al., 16 May 2025).

A broader conceptual backdrop is the multi-task view of diffusion training. “Addressing Negative Transfer in Diffusion Models” treats denoising across timesteps as multi-task learning, reports that task affinity decreases as the timestep or signal-to-noise-ratio gap widens, and shows that negative transfer can occur even in standard diffusion training. Its interval-clustering and cluster-level MTL machinery were presented as a finetuning-friendly way to protect vulnerable timesteps during adaptation. This is not the same algorithmic family as later DiffusionNFT papers, but it provides an important interpretation of what “negative-aware” can mean in diffusion optimization: not only suppressing unwanted content, but also preventing harmful interference across denoising tasks (Go et al., 2023).

Use in the literature	Object updated	Negative signal
Controlled fine-tuning for pruned diffusion models	Pruned student noise predictor	Forbidden concept and anchor concept
Online diffusion RL on the forward process	Velocity field / denoiser	Low-reward generations via optimality probability
Negative-preference optimization for CFG	Negative-conditional branch	Reversed preference pairs or $1-R$ reward
Instruction-based editing post-training	Flow-matching image editor	MLLM-derived reward with group filtering

2. Bilevel controlled fine-tuning and concept suppression

A prominent early instantiation appears in “Efficient Fine-Tuning and Concept Suppression for Pruned Diffusion Models,” which maps directly onto DiffusionNFT even though the paper itself refers to “our bilevel method” and “controlled fine-tuning.” The motivation is specific: pruning reduces compute and improves deployability, but the subsequent distillation step can reintroduce undesirable behaviors from the teacher, including copyrighted styles or NSFW concepts, even when such instances are absent from the fine-tuning dataset. The paper argues that a naïve two-stage pipeline—first restoring quality, then running concept unlearning—is both inefficient and suboptimal because parameters good for fine-tuning need not be good for unlearning, and unlearning can damage generation quality (Shirkavand et al., 2024).

The proposed solution is a bilevel optimization in which the lower level restores the pruned model and the upper level suppresses a specified concept. With teacher noise predictor $\epsilon_T$, pruned student $\epsilon_S$, positive data $D_f$, forbidden concept $c$, and anchor concept $c'$, the lower-level objective combines diffusion regression with output-level and feature-level distillation:

$L_{\mathrm{ft}}(\theta)=L_{\mathrm{Diff}}(\theta)+\lambda_{\mathrm{OutKD}}L_{\mathrm{OutKD}}(\theta)+\lambda_{\mathrm{FeatKD}}L_{\mathrm{FeatKD}}(\theta).$

The upper-level suppression term uses anchor distillation:

$L_{\mathrm{unlearn}}(\theta)=\mathbb{E}\big[\|\epsilon_T(x_t,t,c')-\epsilon_S(x_t,t,c)\|^2\big].$

The constrained problem is then relaxed into a penalty-based minimax objective and solved with a double-loop first-order algorithm rather than implicit differentiation or truncated backpropagation (Shirkavand et al., 2024).
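The two objectives above can be illustrated on toy vectors. The following is a minimal sketch, not the paper's implementation: the stand-in predictors, the concept names, the offsets, and the default weights of 1.0 are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Toy sketch only: the predictors, concept names, offsets, and default
# loss weights below are hypothetical, not values from the paper.

def sq_norm_diff(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def finetune_loss(l_diff, l_out_kd, l_feat_kd, lam_out_kd=1.0, lam_feat_kd=1.0):
    """Lower level: L_ft = L_Diff + lam_OutKD * L_OutKD + lam_FeatKD * L_FeatKD."""
    return l_diff + lam_out_kd * l_out_kd + lam_feat_kd * l_feat_kd

def unlearn_loss(eps_teacher, eps_student, x_t, t, forbidden, anchor):
    """Upper level: ||eps_T(x_t, t, anchor) - eps_S(x_t, t, forbidden)||^2."""
    return sq_norm_diff(eps_teacher(x_t, t, anchor), eps_student(x_t, t, forbidden))

# Stand-in predictors: each concept simply adds a fixed offset to the input.
OFFSETS = {"forbidden style": 1.0, "art": 0.25}

def toy_teacher(x_t, t, concept):
    return [x + OFFSETS[concept] for x in x_t]

def make_toy_student(forbidden_offset):
    """Student that matches the teacher except on the forbidden concept."""
    def student(x_t, t, concept):
        off = forbidden_offset if concept == "forbidden style" else OFFSETS[concept]
        return [x + off for x in x_t]
    return student

x_t = [0.0, 0.0]
# Student still reproduces the forbidden concept: large suppression loss.
before = unlearn_loss(toy_teacher, make_toy_student(1.0), x_t, 0.5, "forbidden style", "art")
# Student redirects the forbidden prompt to the anchor: loss vanishes.
after = unlearn_loss(toy_teacher, make_toy_student(0.25), x_t, 0.5, "forbidden style", "art")
```

In this toy setting the suppression loss is zero exactly when the student's output under the forbidden prompt coincides with the teacher's output under the anchor prompt, which is the redirection behavior the upper level is meant to encourage.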

This formulation is negative-aware in a precise sense. The student is not merely asked to fit $D_f$ well; it is also trained so that conditioning on the forbidden concept $c$ moves its prediction toward the teacher’s prediction for the anchor concept, $\epsilon_T(x_t,t,c')$. The paper also evaluates an ESD-style upper-level loss in which the target is derived from classifier-free guidance structure:

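$L_{\mathrm{unlearn}}^{\mathrm{ESD}}(\theta)=\mathbb{E}\big[\|\epsilon_S(x_t,t,c)-\big(\epsilon_T(x_t,t)-\eta\,[\epsilon_T(x_t,t,c)-\epsilon_T(x_t,t)]\big)\|^2\big],$

where $\epsilon_T(x_t,t)$ is the teacher’s unconditional prediction and $\eta$ is a guidance scale. This replaces simple omission of unwanted data with active redirection away from the target concept (Shirkavand et al., 2024).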

The implementation details reinforce the intended deployment setting. The teacher is Stable Diffusion 2.1; the student is a pruned SD2.1 U-Net expert obtained with APTP; the text encoder remains frozen; the fine-tuning set is MS-COCO-2017; and the main experiments use an expert at 80% MAC budget. The reported weights are diffusion loss $\epsilon_T$ 2, output KD $\epsilon_T$ 3, and feature KD $\epsilon_T$ 4, with $\epsilon_T$ 5, $\epsilon_T$ 6 lower steps per upper step, AdamW, lower learning rate $\epsilon_T$ 7, upper learning rate $\epsilon_T$ 8, and about $\epsilon_T$ 9 total iterations (Shirkavand et al., 2024).

3. Forward-process online RL formulation

A later and more explicit use of the name appears in “DiffusionNFT: Online Diffusion Reinforcement with Forward Process,” which defines Diffusion Negative-aware FineTuning as an online RL paradigm for diffusion and flow-matching models. The core departure from GRPO-style reverse-process training is that optimization is performed directly on the forward process, without stepwise reverse likelihoods, and with arbitrary black-box solvers for sampling. The method is framed in velocity space under flow matching, with forward path

$x_t=\alpha_t x_0+\sigma_t\epsilon,\qquad \epsilon\sim\mathcal{N}(0,I),$

and target vector field

$v=\dot{\alpha}_t x_0+\dot{\sigma}_t\epsilon.$

Under rectified flow, this reduces to the familiar linear interpolation regime used by later editing work as well (Zheng et al., 19 Sep 2025, Li et al., 19 Oct 2025).
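As a small illustration of the rectified-flow case, the sketch below assumes the common convention in which $t=0$ is the clean sample and $t=1$ is pure noise, so that $x_t=(1-t)x_0+t\epsilon$ and the target velocity is $\epsilon-x_0$. The function names are hypothetical, and other papers may use the reversed time direction.

```python
# Minimal sketch of the rectified-flow forward path and its target velocity.
# Assumed convention: t = 0 is the clean sample x0, t = 1 is the noise eps.

def forward_sample(x0, eps, t):
    """Linear interpolation x_t = (1 - t) * x0 + t * eps."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, eps)]

def target_velocity(x0, eps):
    """Time derivative of the path above: v = eps - x0 (constant in t)."""
    return [b - a for a, b in zip(x0, eps)]

x0 = [1.0, 2.0]
eps = [3.0, 5.0]
x_early = forward_sample(x0, eps, 0.25)
x_late = forward_sample(x0, eps, 0.75)
# Finite difference along the path; it matches the target velocity
# because the path is a straight line.
finite_diff = [(b - a) / 0.5 for a, b in zip(x_early, x_late)]
```

Because the path is linear, only the clean sample and a fresh noise draw are needed to form a training pair, which is consistent with the statement that training requires final clean images rather than stored trajectories.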

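The negative-aware mechanism is implemented by splitting samples into implicit positive and negative branches using an optimality probability $r\in[0,1]$ derived from reward-normalized generations. Let $v^{\mathrm{old}}$ be the frozen data-collection policy and $v_\theta$ the current model. For a coefficient $\beta>0$, the method defines

$v_\theta^{+}=(1-\beta)\,v^{\mathrm{old}}+\beta\,v_\theta,\qquad v_\theta^{-}=(1+\beta)\,v^{\mathrm{old}}-\beta\,v_\theta,$

and optimizes

$L(\theta)=\mathbb{E}_{c,\,x_0\sim\pi^{\mathrm{old}}(\cdot\mid c),\,t}\big[\,r\,\|v_\theta^{+}(x_t,c,t)-v\|^2+(1-r)\,\|v_\theta^{-}(x_t,c,t)-v\|^2\big].$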

The paper’s theory interprets this as an implicit policy-improvement operator in forward vector-field space: high-reward samples pull the model toward the positive branch, while low-reward samples create an explicit repulsive term through the negative branch. No reverse-process likelihood estimation is required, and training needs only final clean images rather than stored sampling trajectories (Zheng et al., 19 Sep 2025).
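The pull-and-push behavior described above can be sketched for a single sample. This is an illustrative sketch under stated assumptions: the positive and negative branches are taken to be $(1-\beta)v^{\mathrm{old}}+\beta v_\theta$ and $(1+\beta)v^{\mathrm{old}}-\beta v_\theta$, weighted by $r$ and $1-r$, and the function and argument names are hypothetical.

```python
# Per-sample sketch of a negative-aware two-branch loss (assumed form).

def sq_err(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nft_loss(v_theta, v_old, v_target, r, beta=1.0):
    """r * ||v_pos - v||^2 + (1 - r) * ||v_neg - v||^2 for one sample.

    v_theta:  current model's velocity prediction
    v_old:    frozen data-collection policy's prediction
    v_target: flow-matching target built from the sampled clean image
    r:        optimality probability in [0, 1]
    beta:     assumed mixing coefficient between old and current model
    """
    v_pos = [(1 - beta) * o + beta * n for o, n in zip(v_old, v_theta)]
    v_neg = [(1 + beta) * o - beta * n for o, n in zip(v_old, v_theta)]
    return r * sq_err(v_pos, v_target) + (1 - r) * sq_err(v_neg, v_target)

# High-reward sample (r = 1): loss is zero when the model matches the target.
pulled = nft_loss(v_theta=[1.0], v_old=[0.0], v_target=[1.0], r=1.0)
# Low-reward sample (r = 0): loss is zero when the model sits on the
# opposite side of the old policy from the target, i.e. it is repelled.
pushed = nft_loss(v_theta=[1.0], v_old=[0.0], v_target=[-1.0], r=0.0)
```

With `beta=1.0` the negative branch is the reflection of the current prediction about the old policy, so fitting that branch to a low-reward target moves the current model away from the target.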

This formulation is technically distinct from the bilevel concept-suppression variant. It is CFG-free, solver-agnostic, and naturally off-policy through an EMA-style update of the data-collection policy. It also shifts the meaning of “negative-aware” from targeted erasure to reward-weighted avoidance of bad generations. In head-to-head comparisons, the paper reports that DiffusionNFT is up to $\epsilon_S$ 7 more efficient than FlowGRPO, reaches GenEval $\epsilon_S$ 8 within $\epsilon_S$ 9 steps from a CFG-free SD3.5-Medium baseline of $D_f$ 0, and improves OCR from $D_f$ 1 to $D_f$ 2, PickScore from $D_f$ 3 to $D_f$ 4, CLIPScore from $D_f$ 5 to $D_f$ 6, HPSv2.1 from $D_f$ 7 to $D_f$ 8, Aesthetics from $D_f$ 9 to $c$ 0, ImageReward from $c$ 1 to $c$ 2, and UnifiedReward from $c$ 3 to $c$ 4 (Zheng et al., 19 Sep 2025).
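The EMA-style update of the data-collection policy mentioned above can be sketched as a simple parameter blend. The decay value and the flat list-of-floats parameter representation are assumptions for illustration only.

```python
# Sketch of an EMA-style refresh of the data-collection policy.
# The decay value and the flat parameter list are illustrative assumptions.

def ema_update(old_params, new_params, decay=0.9):
    """Return decay * old + (1 - decay) * new, element-wise."""
    return [decay * o + (1.0 - decay) * n for o, n in zip(old_params, new_params)]

sampler = [0.0, 2.0]   # parameters of the frozen sampling policy
trained = [1.0, 4.0]   # parameters of the model being optimized
sampler = ema_update(sampler, trained, decay=0.5)
```

A decay of 1.0 leaves the sampling policy frozen, while a decay of 0.0 copies the trained model at every update and is therefore fully on-policy; intermediate values trade off between the two.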

4. Negative signals, reward estimation, and guidance design

One major axis of variation in DiffusionNFT is the source of the negative signal. In the bilevel concept-suppression formulation, negativity is specified symbolically through forbidden prompts, artist names, NSFW phrases, and anchor concepts such as “art,” “person,” or the unconditional branch. The upper-level loss is therefore concept-directed and local to a specified unlearning target (Shirkavand et al., 2024).

In forward-process RL, negativity is implicit in reward. The model samples images, receives scalar evaluations, and converts them into an optimality probability $r$ by per-prompt centering, scaling, clipping, and affine remapping to $[0,1]$. The negative branch is then weighted by $1-r$ rather than by a manually curated prompt list. “Stitched Value Model for Diffusion Alignment” extends this setup by addressing the central difficulty that rewards are naturally defined on clean images, while DiffusionNFT training and guidance require value estimates at noisy latents. That paper contrasts two existing approximations (Tweedie-style posterior-mean evaluation and Monte Carlo rollouts) and proposes StitchVM, which stitches a frozen diffusion backbone head to a truncated pixel-space reward model tail so that a value function can be amortized directly on noisy latents. In the reported SD3.5-Medium setup, DiffusionNFT requires $c$ 8 GPU-hours, whereas DiffusionNFT plus StitchVM requires $c$ 9 GPU-hours, with nearly identical DrawBench metrics such as HPSv2 $c'$ 0 versus $c'$ 1, DFN $c'$ 2 versus $c'$ 3, ImageReward $c'$ 4 versus $c'$ 5, PickScore $c'$ 6 versus $c'$ 7, and GenEval $c'$ 8 versus $c'$ 9 (Go et al., 19 May 2026).
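The reward-to-probability conversion described at the start of this paragraph (centering, scaling, clipping, affine remapping) can be sketched as follows. The specific choices here, namely clipping to $[-1,1]$ and mapping with $0.5+0.5z$, and the fallback of scaling by the group's standard deviation, are assumptions for illustration rather than confirmed settings.

```python
import statistics

# Sketch: map the raw rewards of one prompt's group to values in [0, 1].
# Clipping range, affine map, and default scale are illustrative assumptions.

def optimality_probs(rewards, scale=None, eps=1e-8):
    """Center per prompt, divide by a scale, clip to [-1, 1], remap to [0, 1]."""
    mean = statistics.fmean(rewards)
    z = scale if scale is not None else statistics.pstdev(rewards) + eps
    probs = []
    for reward in rewards:
        normed = max(-1.0, min(1.0, (reward - mean) / z))
        probs.append(0.5 + 0.5 * normed)
    return probs

group = [0.0, 1.0, 2.0]                   # rewards for samples of one prompt
probs = optimality_probs(group, scale=1.0)
neg_weights = [1.0 - p for p in probs]    # weight on the negative branch
```

Samples above the per-prompt mean receive a probability above 0.5 and mostly feed the positive branch, while samples below the mean mostly feed the negative branch.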

A third variant appears in “Diffusion-NPO: Negative Preference Optimization for Better Preference Aligned Generation of Diffusion Models.” Here the negative signal is not a forbidden concept or a scalar penalty on generated samples, but a dedicated negative-preference model used as the unconditional or negative-conditional branch in classifier-free guidance. The method keeps the base preference-optimization objective unchanged and trains a separate offset by reversing preference pairs for DPO/SPO-style methods or by replacing the reward $R$ with $1-R$ for RL/DR methods. Inference uses

$L_{\mathrm{ft}}(\theta)=L_{\mathrm{Diff}}(\theta)+\lambda_{\mathrm{OutKD}}L_{\mathrm{OutKD}}(\theta)+\lambda_{\mathrm{FeatKD}}L_{\mathrm{FeatKD}}(\theta).$ 2

with $L_{\mathrm{ft}}(\theta)=L_{\mathrm{Diff}}(\theta)+\lambda_{\mathrm{OutKD}}L_{\mathrm{OutKD}}(\theta)+\lambda_{\mathrm{FeatKD}}L_{\mathrm{FeatKD}}(\theta).$ 3. This makes the negative branch itself preference-aware, rather than merely unconditional (Wang et al., 16 May 2025).
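To illustrate how a trained negative branch enters guidance, the sketch below uses the standard classifier-free guidance combination, in which the guided prediction extrapolates from the negative branch toward the positive branch. This is a generic sketch and not necessarily the exact inference rule of the paper; the function and argument names are hypothetical.

```python
# Generic classifier-free guidance with a separately trained negative branch.
# Assumed convention: w = 1 returns the positive prediction unchanged.

def guided_prediction(eps_pos, eps_neg, w):
    """Return eps_neg + w * (eps_pos - eps_neg), element-wise.

    eps_pos: prediction of the preference-aligned (positive) model
    eps_neg: prediction of the negative-preference model, used in place
             of the usual unconditional branch
    w:       guidance scale
    """
    return [n + w * (p - n) for p, n in zip(eps_pos, eps_neg)]

eps_pos = [1.0, 0.0]
eps_neg = [0.0, 0.0]
guided = guided_prediction(eps_pos, eps_neg, w=2.0)
```

The further the negative branch is from the positive one, the stronger the extrapolation for a fixed scale, which is why a preference-aware negative branch can sharpen guidance compared with a purely unconditional one.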

Instruction-based editing introduces yet another reward source. In “Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback,” DiffusionNFT is combined with a training-free MLLM reward model. Candidate edits are scored through constrained score-token logits over $L_{\mathrm{ft}}(\theta)=L_{\mathrm{Diff}}(\theta)+\lambda_{\mathrm{OutKD}}L_{\mathrm{OutKD}}(\theta)+\lambda_{\mathrm{FeatKD}}L_{\mathrm{FeatKD}}(\theta).$ 4, normalized to $L_{\mathrm{ft}}(\theta)=L_{\mathrm{Diff}}(\theta)+\lambda_{\mathrm{OutKD}}L_{\mathrm{OutKD}}(\theta)+\lambda_{\mathrm{FeatKD}}L_{\mathrm{FeatKD}}(\theta).$ 5, and converted into the optimality probabilities used by the flow-matching NFT loss. Because near-saturated groups can create small standard deviations and unstable normalization, the paper discards groups satisfying $L_{\mathrm{ft}}(\theta)=L_{\mathrm{Diff}}(\theta)+\lambda_{\mathrm{OutKD}}L_{\mathrm{OutKD}}(\theta)+\lambda_{\mathrm{FeatKD}}L_{\mathrm{FeatKD}}(\theta).$ 6 and $L_{\mathrm{ft}}(\theta)=L_{\mathrm{Diff}}(\theta)+\lambda_{\mathrm{OutKD}}L_{\mathrm{OutKD}}(\theta)+\lambda_{\mathrm{FeatKD}}L_{\mathrm{FeatKD}}(\theta).$ 7. This low-STD group filtering is presented as a variance-reduction mechanism rather than a new reward function (Li et al., 19 Oct 2025).
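The group filtering step can be sketched as below. The rule assumed here is that a group is discarded when its mean reward is high and its standard deviation is small; both threshold values and the function names are hypothetical placeholders, not the paper's settings.

```python
import statistics

# Sketch of low-STD group filtering; thresholds are hypothetical.

def keep_group(rewards, mean_thresh=0.9, std_thresh=0.05):
    """Keep a group unless it is near-saturated (high mean, tiny spread)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return not (mean > mean_thresh and std < std_thresh)

def filter_groups(groups, mean_thresh=0.9, std_thresh=0.05):
    """Return only the groups that pass keep_group."""
    return [g for g in groups if keep_group(g, mean_thresh, std_thresh)]

groups = [
    [0.95, 0.96, 0.97],  # near-saturated: normalization would amplify noise
    [0.20, 0.80, 0.50],  # informative spread: kept
]
kept = filter_groups(groups)
```

Dropping the near-saturated group avoids dividing tiny reward differences by a tiny standard deviation, at the cost of training on fewer samples.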

5. Empirical behavior across tasks

For controlled fine-tuning and concept suppression, the strongest quantitative evidence comes from artist-style erasure and NSFW removal on pruned SD2.1. On artist removal, the bilevel method reports CLIP $L_{\mathrm{ft}}(\theta)=L_{\mathrm{Diff}}(\theta)+\lambda_{\mathrm{OutKD}}L_{\mathrm{OutKD}}(\theta)+\lambda_{\mathrm{FeatKD}}L_{\mathrm{FeatKD}}(\theta).$ 8, CP $L_{\mathrm{ft}}(\theta)=L_{\mathrm{Diff}}(\theta)+\lambda_{\mathrm{OutKD}}L_{\mathrm{OutKD}}(\theta)+\lambda_{\mathrm{FeatKD}}L_{\mathrm{FeatKD}}(\theta).$ 9, CSD $L_{\mathrm{unlearn}}(\theta)=\mathbb{E}\big[\|\epsilon_T(x_t,t,c')-\epsilon_S(x_t,t,c)\|^2\big].$ 0, FID $L_{\mathrm{unlearn}}(\theta)=\mathbb{E}\big[\|\epsilon_T(x_t,t,c')-\epsilon_S(x_t,t,c)\|^2\big].$ 1, and COCO CLIP $L_{\mathrm{unlearn}}(\theta)=\mathbb{E}\big[\|\epsilon_T(x_t,t,c')-\epsilon_S(x_t,t,c)\|^2\big].$ 2, compared with Distilled + ConceptPrune at CLIP $L_{\mathrm{unlearn}}(\theta)=\mathbb{E}\big[\|\epsilon_T(x_t,t,c')-\epsilon_S(x_t,t,c)\|^2\big].$ 3, CP $L_{\mathrm{unlearn}}(\theta)=\mathbb{E}\big[\|\epsilon_T(x_t,t,c')-\epsilon_S(x_t,t,c)\|^2\big].$ 4, CSD $L_{\mathrm{unlearn}}(\theta)=\mathbb{E}\big[\|\epsilon_T(x_t,t,c')-\epsilon_S(x_t,t,c)\|^2\big].$ 5, FID $L_{\mathrm{unlearn}}(\theta)=\mathbb{E}\big[\|\epsilon_T(x_t,t,c')-\epsilon_S(x_t,t,c)\|^2\big].$ 6, and COCO CLIP $L_{\mathrm{unlearn}}(\theta)=\mathbb{E}\big[\|\epsilon_T(x_t,t,c')-\epsilon_S(x_t,t,c)\|^2\big].$ 7. On NSFW removal, the paper states that nudity reduction is comparable to baselines while retaining quality, and reports adversarial robustness scores of $L_{\mathrm{unlearn}}(\theta)=\mathbb{E}\big[\|\epsilon_T(x_t,t,c')-\epsilon_S(x_t,t,c)\|^2\big].$ 8 on MMA and $L_{\mathrm{unlearn}}(\theta)=\mathbb{E}\big[\|\epsilon_T(x_t,t,c')-\epsilon_S(x_t,t,c)\|^2\big].$ 9 on Ring-A-Bell (Shirkavand et al., 2024).

For online diffusion RL, the central empirical claim is efficiency under solver-agnostic, CFG-free training. The reported comparison with FlowGRPO emphasizes both speed and sample efficiency: DiffusionNFT reaches GenEval $D_f$ 0 within $D_f$ 1 steps, whereas FlowGRPO reaches $D_f$ 2 only after more than $D_f$ 3 steps and with additional CFG employment. The method is also reported to outperform or match larger CFG-based models while using a single conditional model at inference (Zheng et al., 19 Sep 2025).

For preference alignment, Diffusion-NPO presents DiffusionNFT as negative-preference finetuning in weight space. On SD1.5, Diff.-SPO improves over the base model, but adding NPO further improves several metrics; for example, Diff.-SPO + NPO reports PickScore $D_f$ 4, HPSv2 $D_f$ 5, ImageReward $D_f$ 6, and LAION-Aesthetic $D_f$ 7. On DreamShaper, NPO improves PickScore from $D_f$ 8 to $D_f$ 9 and HPSv2 from $c$ 0 to $c$ 1 at $c$ 2, and increases LAION-Aesthetic to $c$ 3 at $c$ 4. The same paper also reports gains on SDXL and VideoCrafter2/VADER, including out-of-domain human videos (Wang et al., 16 May 2025).

For image editing, Edit-R1 applies flow-matching DiffusionNFT with MLLM reward to FLUX.1-Kontext, Qwen-Image-Edit, and UniWorld-V2. The reported final scores are $c$ 5 on ImgEdit and $c$ 6 on GEdit-Bench for UniWorld-V2. The same framework improves FLUX.1-Kontext [Dev] from $c$ 7 to $c$ 8 on ImgEdit and from $c$ 9 to $\epsilon_T$ 00 on GEdit-Bench, and improves Qwen-Image-Edit from $\epsilon_T$ 01 to $\epsilon_T$ 02 on ImgEdit and from $\epsilon_T$ 03 to $\epsilon_T$ 04 on GEdit-Bench (Li et al., 19 Oct 2025).

A plausible implication is that DiffusionNFT is especially useful when the adaptation problem is under-specified by positive supervision alone. In pruned-model deployment, this appears as selective unlearning without a second stage; in online RL, as avoidance of low-reward modes without reverse likelihoods; in CFG alignment, as an explicitly trained negative branch; and in editing, as reward-driven correction beyond supervised instruction pairs.

6. Limitations, failure modes, and adjacent debates

The literature also makes clear that negative-aware finetuning is not automatically stable or irreversible. In the bilevel concept-suppression setting, reported risks include over-suppression, style bleeding, reduced diversity, catastrophic forgetting, prompt sensitivity, and residual leakage under adversarial prompts. The recommended mitigations are balanced inner/outer cadence, conservative guidance strength, careful choice of anchor concepts, prompt augmentation, and runtime safety filters (Shirkavand et al., 2024).

The online RL lineage introduces a different set of sensitivities. DiffusionNFT depends on reward calibration, off-policy update scheduling, and the choice of $\epsilon_T$ 05. The paper reports that smaller $\epsilon_T$ 06 can accelerate reward improvement but may destabilize training, while overly on-policy updates can be fast early and unstable later. Reward hacking remains a general issue when reward models are imperfect (Zheng et al., 19 Sep 2025). The value-modeling literature adds that Tweedie-style estimators are biased at high noise, Monte Carlo rollouts are expensive and high-variance, and even stitched value models can weaken at very high noise levels, motivating intermediate stopping windows such as $\epsilon_T$ 07 on a $\epsilon_T$ 08-step schedule (Go et al., 19 May 2026).
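A Tweedie-style value estimate can be sketched for the rectified-flow case. Assuming $x_t=(1-t)x_0+t\epsilon$ and a velocity prediction approximating $\epsilon-x_0$, the posterior-mean estimate of the clean sample is $x_t-t\,v$, and the reward is evaluated on that estimate. The reward function and all names below are hypothetical stand-ins.

```python
# Sketch of a Tweedie-style value estimate under rectified flow.
# Assumed convention: x_t = (1 - t) * x0 + t * eps and v ~ eps - x0.

def posterior_mean_x0(x_t, v_pred, t):
    """One-step clean-sample estimate x0_hat = x_t - t * v."""
    return [x - t * v for x, v in zip(x_t, v_pred)]

def tweedie_value(x_t, v_pred, t, reward_fn):
    """Approximate the value at a noisy latent by the reward of x0_hat."""
    return reward_fn(posterior_mean_x0(x_t, v_pred, t))

def toy_reward(x):
    """Hypothetical reward: negative squared norm of the sample."""
    return -sum(a * a for a in x)

x_t = [2.0, 3.0]      # equals 0.5 * [1, 2] + 0.5 * [3, 4]
v_exact = [2.0, 2.0]  # eps - x0 for that pair
x0_hat = posterior_mean_x0(x_t, v_exact, 0.5)
value = tweedie_value(x_t, v_exact, 0.5, toy_reward)
```

With the exact velocity of a single pair the clean sample is recovered exactly. A learned velocity instead yields an average over plausible clean samples, and the reward of that average need not equal the average reward, which is one way to read the reported bias at high noise.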

Instruction-based editing makes the reward problem especially visible. Edit-R1 reports that small MLLM evaluators can collapse reward variance and encourage reward hacking, whereas larger evaluators such as $\epsilon_T$ 09 models maintain higher variance and more stable optimization. Low-STD group filtering improves stability, but it also discards training data and therefore trades sample efficiency for variance control (Li et al., 19 Oct 2025).

A separate critique concerns robustness of finetuning-based unlearning itself. “Towards Irreversible Machine Unlearning for Diffusion Models” proposes DiMRA, an auxiliary-data relearning attack that can reverse finetuning-based unlearning methods such as ESD, CA, SalUn, SHS, and Ediff, and argues that many such methods remain near the original optimum because of non-convergent or mismatched unlearning terms. Its defense, DiMUM, replaces forgetting-style objectives with convergent memorization of alternative content under the unlearned condition. This does not refute DiffusionNFT as a training principle, but it does qualify a common assumption: negative-aware fine-tuning is not necessarily irreversible unless the objective settles into a genuinely new stable optimum (Yuan et al., 3 Dec 2025).

Finally, the term’s breadth can itself be a source of confusion. Some papers use DiffusionNFT for concept suppression in conditional diffusion, others for online RL with flow matching, and others for negative-preference guidance or editing post-training. The common thread is explicit negative supervision inside the finetuning objective. Beyond that shared principle, the concrete optimization geometry, target parameterization, and evaluation protocol differ substantially across subfields (Shirkavand et al., 2024, Zheng et al., 19 Sep 2025, Wang et al., 16 May 2025).