Edit-R1: Instruction-Based Image Editing

Updated 4 July 2026

Edit-R1 is a post-training, instruction-based image editing framework that leverages diffusion models and policy optimization to overcome overfitting in supervised fine-tuning.
It utilizes DiffusionNFT for negative-aware velocity regression and contrastive reward weighting, ensuring model-agnostic optimization across diverse base models.
The framework integrates a training-free MLLM reward model with low-variance group filtering to stabilize optimization and improve the accuracy and diversity of edited images.

Edit-R1 is a post-training framework for instruction-based image editing based on policy optimization. It is introduced within UniWorld-V2 and combines Diffusion Negative-aware Finetuning (DiffusionNFT), a Multimodal LLM (MLLM) used as a unified, training-free reward model, and a low-variance group filtering mechanism for reducing MLLM scoring noise and stabilizing optimization. The framework is presented as model-agnostic, is applied to diverse base models including Qwen-Image-Edit and FLUX-Kontext, and attains state-of-the-art results on ImgEdit and GEdit-Bench, with UniWorld-V2 scoring 4.49 and 7.83, respectively (Li et al., 19 Oct 2025).

1. Motivation and problem setting

Edit-R1 is motivated by the claim that instruction-based image editing models trained purely with supervised fine-tuning tend to overfit to annotation patterns and fall into shortcut learning, which hurts instruction following, exploration, and generalization (Li et al., 19 Oct 2025). The paper states that supervised fine-tuning can cause models to ignore complex instructions and revert to merely reconstructing the input, and that reliance on large but insufficiently diverse datasets exacerbates overfitting. In this formulation, post-training alignment is not an auxiliary refinement but the central mechanism for moving beyond the limitations of supervised regression.

The framework is positioned against several diffusion-model alignment paradigms. Conventional RLHF- and DPO-style methods for diffusion models rely on log-likelihoods or policy gradients and often lock training to specific first-order SDE samplers, introducing bias, limiting solver flexibility, and harming the quality/diversity trade-off. SDS variants are described as optimizing surrogate likelihood-related objectives that can be brittle and tightly coupled to specific noise parameterizations. Edit-R1 instead adopts DiffusionNFT as the policy optimization core, emphasizing a likelihood-free objective that directly optimizes the forward process of a flow-matching model with a contrastive, reward-weighted loss (Li et al., 19 Oct 2025).

A second problem addressed by Edit-R1 is reward specification. The paper identifies the absence of a universal reward model for editing, attributing this to the diverse nature of editing instructions and tasks. To bridge this gap, Edit-R1 uses an MLLM as a unified, training-free reward model and leverages its output logits to provide fine-grained feedback. This design choice makes reward modeling part of the alignment loop without requiring a separately trained editing-specific reward model (Li et al., 19 Oct 2025).

2. Flow-matching formulation and negative-aware policy optimization

Edit-R1 is built on a flow-matching, rectified-flow editor whose forward interpolation is defined as

$x_t = (1-t)x_0 + tx_1,$

with $t \in [0,1]$, $x_0$ a data sample, and $x_1$ Gaussian noise. The associated flow-matching objective is

$\mathcal{L}_{FM}(\theta) = \mathbb{E}_{t,x_0 \sim X_0,x_1 \sim X_1}\left[\|v - v_{\theta}(x_t, t, c)\|_2^2\right],$

where the target velocity is $v = x_1 - x_0$ and $c$ is the instruction/text embedding. Inference follows the ODE

$dx_t = v_{\theta}(x_t, t, c)dt.$

These equations define the editor’s policy as the velocity predictor $v_\theta(x_t,t,c)$ (Li et al., 19 Oct 2025).
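The interpolation, target velocity, and inference ODE above can be illustrated with a minimal pure-Python sketch. It is only a toy under stated assumptions: vectors are plain lists, the conditioning $c$ is omitted, and the "model" is a hypothetical oracle that returns the exact target velocity, so none of the names below come from the paper.

```python
import random

def interpolate(x0, x1, t):
    """Forward interpolation x_t = (1 - t) * x0 + t * x1, elementwise."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Flow-matching target v = x1 - x0 (constant along the straight path)."""
    return [b - a for a, b in zip(x0, x1)]

def fm_loss(v_pred, x0, x1):
    """Squared L2 error between a predicted velocity and the target velocity."""
    v = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v))

def euler_sample(velocity_fn, x_start, t_start, t_end, n_steps):
    """Integrate dx/dt = velocity_fn(x, t) from t_start to t_end with Euler steps."""
    x = list(x_start)
    dt = (t_end - t_start) / n_steps
    t = t_start
    for _ in range(n_steps):
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        t += dt
    return x

rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]                     # toy "data" sample
x1 = [rng.gauss(0.0, 1.0) for _ in x0]    # Gaussian noise sample
x_half = interpolate(x0, x1, 0.5)         # midpoint of the straight path

def oracle(x, t):
    """Hypothetical perfect velocity predictor (ignores x and t)."""
    return target_velocity(x0, x1)

# Integrating the oracle field from noise (t = 1) back to t = 0 recovers x0
# up to floating-point error, and the oracle has zero flow-matching loss.
recovered = euler_sample(oracle, x1, 1.0, 0.0, 6)
oracle_loss = fm_loss(oracle(x_half, 0.5), x0, x1)
```

Because the target velocity is constant along the straight path, even a coarse Euler integration of the oracle field returns to the data sample; a learned $v_\theta$ only approximates this behavior.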

Edit-R1 uses DiffusionNFT to optimize this velocity field with reward-weighted positive and negative implicit policies. The central objective is

$\mathcal{L}(\theta) =\mathbb{E}_{c,\pi^{\mathrm{old}}(x_0\mid c),t}\Big[ r\, \|v^+_\theta(x_t,c,t)-v\|_2^2 + (1-r)\, \|v^-_\theta(x_t,c,t)-v\|_2^2\Big],$

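where

$v^+_\theta(x_t,c,t) = (1-\beta)\,v^{\mathrm{old}}(x_t,c,t) + \beta\,v_\theta(x_t,c,t)$

and

$v^-_\theta(x_t,c,t) = (1+\beta)\,v^{\mathrm{old}}(x_t,c,t) - \beta\,v_\theta(x_t,c,t).$

Here $v^{\mathrm{old}}$ is the velocity predictor of the sampling policy $\pi^{\mathrm{old}}$, $\beta$ is a mixing hyperparameter, and $r \in [0,1]$ is the optimality probability assigned to the sample. The paper characterizes this as a negative-aware design: the term weighted by $r$ pulls the model toward high-reward behavior, while the term weighted by $1-r$ pushes it away from low-reward behavior (Li et al., 19 Oct 2025).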
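A small numerical sketch may help make the negative-aware objective concrete. It assumes the DiffusionNFT parameterization of the implicit policies, $v^+_\theta = (1-\beta)v^{\mathrm{old}} + \beta v_\theta$ and $v^-_\theta = (1+\beta)v^{\mathrm{old}} - \beta v_\theta$, and evaluates the loss for a single sample. Function names, the list-based vectors, and the default `beta=1.0` are illustrative choices, not values taken from the paper.

```python
def implicit_policies(v_theta, v_old, beta):
    """Return (v_plus, v_minus) under the assumed DiffusionNFT parameterization."""
    v_plus = [(1.0 - beta) * o + beta * n for n, o in zip(v_theta, v_old)]
    v_minus = [(1.0 + beta) * o - beta * n for n, o in zip(v_theta, v_old)]
    return v_plus, v_minus

def sq_err(a, b):
    """Squared L2 distance between two vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def nft_loss(v_theta, v_old, v_target, r, beta=1.0):
    """Per-sample loss: r * ||v_plus - v||^2 + (1 - r) * ||v_minus - v||^2."""
    v_plus, v_minus = implicit_policies(v_theta, v_old, beta)
    return r * sq_err(v_plus, v_target) + (1.0 - r) * sq_err(v_minus, v_target)

v_old = [0.0, 0.0]      # old policy's velocity at (x_t, c, t)
v_target = [1.0, 0.0]   # flow velocity v of the sampled trajectory

# High-reward sample (r = 1): the loss vanishes when v_theta matches v.
loss_pos = nft_loss([1.0, 0.0], v_old, v_target, r=1.0)
# Low-reward sample (r = 0): the loss vanishes when v_theta is the reflection
# of v around v_old, i.e. the model is pushed away from this sample.
loss_neg = nft_loss([-1.0, 0.0], v_old, v_target, r=0.0)
# With v_theta = v_old both implicit policies coincide with the old policy.
loss_same = nft_loss(v_old, v_old, v_target, r=0.5)
```

Both terms regress onto the same target velocity; only the implicit policy being regressed changes, which is why the objective needs neither likelihoods nor policy gradients.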

Two properties are emphasized. First, the method is likelihood-free: the loss is a pathwise $\ell_2$ regression to the target velocity $v$ and does not require $\log p_\theta$ or policy gradients. Second, it is flow-consistent: the target in each term remains the flow velocity used by the underlying forward process. The paper presents these properties as the reason Edit-R1 can use higher-order black-box ODE solvers during online rollout collection while keeping training aligned with the same flow-matching semantics (Li et al., 19 Oct 2025). This matches the broader DiffusionNFT formulation introduced for diffusion post-training, where the method is described as optimizing diffusion models directly on the forward process via flow matching and as being solver-agnostic, likelihood-free, and off-policy by design (Zheng et al., 19 Sep 2025).

3. MLLM reward modeling and low-variance group filtering

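Edit-R1 uses training-free MLLM scoring rather than chain-of-thought evaluation. The MLLM is queried on an input sequence $q$ built from the original image, the edited image, and the instruction, and responds token by token according to its autoregressive distribution, written here as

$y_j \sim p_{\mathrm{MLLM}}(\cdot \mid q, y_{<j}).$

The paper then defines logit-based scoring as a weighted sum over score tokens, which in the same notation reads

$s = \sum_{k \in \mathcal{K}} w(k)\, p_{\mathrm{MLLM}}(k \mid q),$

where $w(k)$ is the numeric value of token $k$ and $\mathcal{K}$ is the set of candidate score tokens. The resulting score is normalized to $[0,1]$ by

$r^{\mathrm{raw}} = \frac{s - \min_{k \in \mathcal{K}} w(k)}{\max_{k \in \mathcal{K}} w(k) - \min_{k \in \mathcal{K}} w(k)}.$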

These normalized scores become raw rewards for policy optimization (Li et al., 19 Oct 2025).
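The logit-based scoring step can be sketched as follows. This is a hedged illustration, not the paper's implementation: it assumes the score is the expected numeric value of the score token under a softmax restricted to the score tokens, followed by min-max normalization. The token set `"0"`..`"5"` and the logit values are hypothetical placeholders.

```python
import math

def expected_score(logits_by_token, token_values):
    """Softmax over the score tokens only, then the expected numeric value."""
    tokens = list(token_values)
    m = max(logits_by_token[k] for k in tokens)        # for numerical stability
    exps = {k: math.exp(logits_by_token[k] - m) for k in tokens}
    z = sum(exps.values())
    return sum(token_values[k] * exps[k] / z for k in tokens)

def normalize(score, token_values):
    """Min-max normalize a score to [0, 1] using the range of token values."""
    lo, hi = min(token_values.values()), max(token_values.values())
    return (score - lo) / (hi - lo)

# Hypothetical score vocabulary: tokens "0".."5" with their numeric values.
token_values = {str(k): float(k) for k in range(6)}

# An undecided scorer: equal logits on every score token.
uniform_logits = {k: 0.0 for k in token_values}
uniform_reward = normalize(expected_score(uniform_logits, token_values), token_values)

# A confident scorer: almost all probability mass on token "5".
confident_logits = dict(uniform_logits)
confident_logits["5"] = 50.0
confident_reward = normalize(expected_score(confident_logits, token_values), token_values)
```

Compared with taking only the most likely token, an expectation over logits yields a continuous reward, so two candidates that would both receive the same discrete score can still be ranked.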

The raw MLLM scores are further transformed into the optimality probability used in the DiffusionNFT loss:

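$r(x_0,c) = \frac{1}{2} + \frac{1}{2}\,\mathrm{clip}\!\left[\frac{r^{\mathrm{raw}}(x_0,c) - \mathbb{E}_{\pi^{\mathrm{old}}(\cdot\mid c)}\, r^{\mathrm{raw}}(x_0,c)}{Z_c},\,-1,\,1\right].$

Here $r^{\mathrm{raw}}$ denotes the normalized MLLM score and $Z_c$ is a normalizing factor, described in the paper as, for example, the global standard deviation of rewards. This clipped affine transformation maps group-relative reward differences into a bounded scalar that determines the balance between positive and negative terms in the policy objective (Li et al., 19 Oct 2025).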
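As a worked illustration of the clipped affine transformation, the sketch below assumes the form $r = \tfrac{1}{2} + \tfrac{1}{2}\,\mathrm{clip}[(r^{\mathrm{raw}} - \text{group mean})/Z,\,-1,\,1]$, with the group mean standing in for the expectation under the old policy. The function name and the numeric inputs are illustrative only.

```python
def optimality_probs(raw_rewards, z):
    """Map one group's raw rewards to optimality probabilities in [0, 1]."""
    mean = sum(raw_rewards) / len(raw_rewards)
    probs = []
    for r in raw_rewards:
        centered = (r - mean) / z                 # group-relative, scaled by Z
        clipped = max(-1.0, min(1.0, centered))   # clip to [-1, 1]
        probs.append(0.5 + 0.5 * clipped)
    return probs

group = [0.0, 0.5, 1.0]

# A small normalizer saturates the extremes at 0 and 1.
sharp = optimality_probs(group, z=0.25)
# A larger normalizer keeps every sample strictly between 0 and 1.
soft = optimality_probs(group, z=1.0)
```

A sample at the group mean always maps to 0.5, so it contributes equally to the positive and negative terms of the policy objective.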

A notable component of Edit-R1 is low-STD group filtering. The paper identifies a failure mode in which all candidates in a group receive very similar scores, so normalization amplifies tiny differences into noisy training signals. To mitigate this, Edit-R1 discards gradients from groups whose raw-reward mean exceeds a mean threshold and whose variance falls below a variance threshold, with both threshold values fixed in the reported configuration (Li et al., 19 Oct 2025). The stated purpose is variance reduction and stabilization of optimization, especially when the MLLM reward distribution within a candidate set has nearly collapsed.
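The filtering rule can be expressed in a few lines. The sketch below is an assumption-laden illustration: it uses population variance as the dispersion statistic (the mechanism's name suggests standard deviation may be used in practice), and the threshold values `0.9` and `0.01` are placeholders chosen for the example rather than the paper's settings.

```python
from statistics import mean, pvariance

def keep_group(raw_rewards, mean_threshold, var_threshold):
    """Return False when a group is both high-mean and low-variance."""
    high_mean = mean(raw_rewards) > mean_threshold
    low_var = pvariance(raw_rewards) < var_threshold
    return not (high_mean and low_var)

# Placeholder thresholds for illustration only.
MEAN_T, VAR_T = 0.9, 0.01

saturated = [0.95, 0.96, 0.94, 0.95]   # all candidates score high and alike
diverse = [0.2, 0.9, 0.5, 0.7]         # informative spread of rewards
uniformly_low = [0.30, 0.31, 0.30, 0.29]  # low variance but low mean

kept_saturated = keep_group(saturated, MEAN_T, VAR_T)
kept_diverse = keep_group(diverse, MEAN_T, VAR_T)
kept_low = keep_group(uniformly_low, MEAN_T, VAR_T)
```

Note that the rule as described removes only groups that satisfy both conditions; a low-variance group with a low mean is kept.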

4. Training pipeline and implementation profile

The Edit-R1 training loop is organized as an online cycle of sampling, MLLM scoring, filtering, and DiffusionNFT update. For each input pair consisting of a source image and an instruction, the current policy $\pi^{\mathrm{old}}$ generates a group of candidate edited images using a black-box higher-order sampler. The paper uses DPM-Solver with 6 steps for this stage. Each candidate is scored by the MLLM on the tuple of original image, edited image, and instruction; the resulting scores are normalized into optimality probabilities; low-variance groups are filtered; and the remaining samples are used to optimize the negative-aware velocity-regression loss (Li et al., 19 Oct 2025).
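The control flow of one such cycle can be sketched with injected callables. Everything here is a hypothetical stand-in: `sample_group`, `score`, and `update` are placeholders for the sampler, the MLLM scorer, and the DiffusionNFT optimizer step, "images" are plain floats, and the thresholds and normalizer are example values.

```python
from statistics import mean, pvariance

def training_step(batch, sample_group, score, update, z,
                  mean_threshold, var_threshold):
    """One online cycle: sample, score, filter, transform rewards, update."""
    kept = 0
    for source_image, instruction in batch:
        candidates = sample_group(source_image, instruction)
        raw = [score(source_image, c, instruction) for c in candidates]
        mu = mean(raw)
        if mu > mean_threshold and pvariance(raw) < var_threshold:
            continue  # high-mean, low-variance group: contribute no gradient
        probs = [0.5 + 0.5 * max(-1.0, min(1.0, (r - mu) / z)) for r in raw]
        update(candidates, probs)  # stand-in for the velocity-regression step
        kept += 1
    return kept

# Toy stand-ins (all hypothetical): an "image" is a float.
def toy_sampler(source_image, instruction):
    return [source_image + d for d in (0.0, 0.25, 0.5, 0.75)]

def toy_scorer(source_image, candidate, instruction):
    return min(1.0, candidate)

updates = []
n_kept = training_step(
    batch=[(0.0, "make it brighter"), (1.0, "remove the hat")],
    sample_group=toy_sampler,
    score=toy_scorer,
    update=lambda cands, probs: updates.append((cands, probs)),
    z=0.5, mean_threshold=0.9, var_threshold=0.01,
)
```

In this toy run the second group is skipped because every candidate receives the maximum score, so only the first group reaches the update callable.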

Several implementation settings are explicitly reported. The training configuration specifies the learning rate, batch size, EMA decay, number of sampling inference steps, resolution, images per prompt, number of groups, KL loss weight, and guidance strength (Li et al., 19 Oct 2025). The paper also notes that a small auxiliary KL regularization is used in practice but does not provide an explicit KL formula.

The framework is described as model-agnostic. It is applied to FLUX.1-Kontext [Dev], Qwen-Image-Edit [2509], and UniWorld-V2, with consistent improvements over the corresponding bases (Li et al., 19 Oct 2025). On the systems side, the reported fine-tuning setup uses 3 nodes for FLUX.1-Kontext [Dev], 6 nodes for Qwen-Image-Edit [2509], and 9 nodes for UniWorld-V2, each node equipped with A100 GPUs. MLLM scoring is served via vLLM on a single node, and FSDP together with gradient checkpointing is applied for large models (Li et al., 19 Oct 2025).

The choice of DPM-Solver is not incidental. Because the trained object is the ODE velocity field itself, the paper argues that sampling is numerically decoupled from the optimization objective. This is presented as a practical advantage over likelihood-based or reverse-process methods that are tied to specific samplers. In the broader diffusion-alignment literature, DiffusionNFT is likewise formulated so that data collection can use arbitrary black-box samplers and only clean samples with rewards need to be stored (Zheng et al., 19 Sep 2025).
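The decoupling of solver and objective can be illustrated on a toy problem. The sketch below integrates one fixed velocity field with a first-order Euler step and a second-order Heun step; Heun is used purely as a simple stand-in for a higher-order solver and is not DPM-Solver. The scalar field $v(x,t) = -x$ is an illustrative choice with a known exact solution.

```python
import math

def euler_step(f, x, t, h):
    """First-order step for dx/dt = f(x, t)."""
    return x + h * f(x, t)

def heun_step(f, x, t, h):
    """Second-order (Heun) step: average the slopes at both ends."""
    k1 = f(x, t)
    k2 = f(x + h * k1, t + h)
    return x + 0.5 * h * (k1 + k2)

def integrate(step, f, x, t0, t1, n_steps):
    """Apply the given step rule n_steps times from t0 to t1."""
    h = (t1 - t0) / n_steps
    t = t0
    for _ in range(n_steps):
        x = step(f, x, t, h)
        t += h
    return x

def velocity(x, t):
    """Toy velocity field; the exact solution from x(0) = 1 is exp(-t)."""
    return -x

exact = math.exp(-1.0)
x_euler = integrate(euler_step, velocity, 1.0, 0.0, 1.0, 6)
x_heun = integrate(heun_step, velocity, 1.0, 0.0, 1.0, 6)
err_euler = abs(x_euler - exact)
err_heun = abs(x_heun - exact)
```

Both solvers consume the same `velocity` function, so swapping one for the other changes only the rollout accuracy at a given step budget, not the quantity being trained.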

5. Benchmark performance and ablation evidence

Edit-R1 is evaluated primarily on ImgEdit and GEdit-Bench. The reported quantitative results show improvements for both FLUX.1-Kontext [Dev] and Qwen-Image-Edit [2509], and UniWorld-V2 reaches the best overall scores on both benchmarks (Li et al., 19 Oct 2025).

Benchmark	Model / variant	Score
ImgEdit	FLUX.1-Kontext [Dev]	3.71
ImgEdit	UniWorld-FLUX.1-Kontext	4.02
ImgEdit	Qwen-Image-Edit [2509]	4.35
ImgEdit	UniWorld-Qwen-Image-Edit	4.48
ImgEdit	UniWorld-V2	4.49
GEdit-Bench	FLUX.1-Kontext [Dev]	6.00
GEdit-Bench	UniWorld-FLUX.1-Kontext	6.74
GEdit-Bench	Qwen-Image-Edit [2509]	7.54
GEdit-Bench	UniWorld-Qwen-Image-Edit	7.76
GEdit-Bench	UniWorld-V2	7.83

The ImgEdit results include two comparative claims emphasized in the paper: UniWorld-FLUX.1-Kontext surpasses FLUX.1 Kontext [Pro] at 4.00, and UniWorld-Qwen-Image-Edit surpasses GPT-Image-1 [High] at 4.20 (Li et al., 19 Oct 2025). On GEdit-Bench, UniWorld-FLUX.1-Kontext also surpasses the Pro version, whose reported score is 6.56 (Li et al., 19 Oct 2025).

Ablation results are used to isolate the contribution of DiffusionNFT and the reward pipeline. On Qwen-Image-Edit [2509] evaluated on GEdit-Bench, the paper reports a progression from the baseline score of 7.54 to 7.66 with “+ NFT (7B),” to 7.74 with “+ 32B,” and to 7.76 with “+ Group Filtering” (Li et al., 19 Oct 2025). This sequence supports three specific claims made by the authors: DiffusionNFT alone improves the base model, larger reward models provide further gains, and low-variance group filtering adds an additional improvement.

Setting on Qwen-Image-Edit [2509]	GEdit-Bench score
Baseline	7.54
+ NFT (7B)	7.66
+ 32B	7.74
+ Group Filtering	7.76
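The incremental contribution of each ablation stage follows directly from the table. The short calculation below uses only the four reported scores; the variable names are illustrative.

```python
# GEdit-Bench scores from the ablation table, in the order reported.
ablation = [
    ("Baseline", 7.54),
    ("+ NFT (7B)", 7.66),
    ("+ 32B", 7.74),
    ("+ Group Filtering", 7.76),
]

# Gain of each stage over the previous one, rounded to two decimals.
gains = [
    (name, round(score - prev_score, 2))
    for (_, prev_score), (name, score) in zip(ablation, ablation[1:])
]

# Total gain of the full recipe over the baseline.
total_gain = round(ablation[-1][1] - ablation[0][1], 2)
```

The stage gains are 0.12, 0.08, and 0.02, for a total of 0.22, so roughly half of the improvement comes from DiffusionNFT with the 7B reward model alone.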

The paper also compares policy optimization methods on FLUX.1-Kontext [Dev] and states that DiffusionNFT outperforms Flow-GRPO and its local-std variant on ImgEdit (Li et al., 19 Oct 2025). Qualitatively, the aligned models are described as executing more precise edits, preserving content better, and generalizing to diverse instructions such as Adjust, Extract, Remove, and Hybrid. In the reported human preference studies, participants prefer UniWorld-FLUX.1-Kontext over both its base model and the stronger official Pro variant for instruction alignment (Li et al., 19 Oct 2025).

6. Relation to broader diffusion alignment research

Edit-R1 inherits its optimization core from DiffusionNFT, originally formulated as online reinforcement learning for diffusion models on the forward process via flow matching (Zheng et al., 19 Sep 2025). In that broader formulation, DiffusionNFT contrasts positive and negative generations to define an implicit policy improvement direction, operates without reverse-process likelihoods, and is compatible with arbitrary black-box solvers. Edit-R1 imports that machinery into instruction-based image editing, replacing generic reward models with an MLLM-based reward loop and adapting the online sampling process to edited-image candidate groups (Li et al., 19 Oct 2025, Zheng et al., 19 Sep 2025).

This places Edit-R1 in contrast with FlowGRPO-style methods that couple training to reverse-process likelihood estimation and particular samplers. The Edit-R1 paper explicitly presents likelihood-free optimization and higher-order solver compatibility as key reasons why DiffusionNFT is better suited to instruction-based editing, where online exploration and sampler flexibility are central (Li et al., 19 Oct 2025). That framing is consistent with the original DiffusionNFT paper, which reports that the method is up to $25\times$ more efficient than FlowGRPO in head-to-head comparisons and improves the GenEval score from 0.24 to 0.98 within 1k steps while remaining CFG-free (Zheng et al., 19 Sep 2025).

A later line of work extends DiffusionNFT through learned value estimation on noisy latents. “Stitched Value Model for Diffusion Alignment” reports that StitchVM lifts pretrained pixel-space reward models to latent space and that, in this setting, DiffusionNFT becomes faster by a reported multiplicative factor while preserving quality metrics (Go et al., 19 May 2026). This suggests a possible acceleration path for Edit-R1-like training regimes that repeatedly evaluate reward proxies on intermediate states, although that integration is not reported in the Edit-R1 paper itself.

7. Limitations, failure modes, and open directions

The central limitation acknowledged by Edit-R1 is the absence of a universal reward model for editing. The MLLM is introduced precisely because editing instructions and tasks are diverse, but the paper also notes that the risk of reward hacking is higher with small MLLMs, such as a 3B model. Scaling the MLLM from 7B to 32B is reported to maintain higher reward variance and sustained exploration, reducing hacking (Li et al., 19 Oct 2025). A plausible implication is that reward-model scale is not merely a quality knob but part of the optimization stability mechanism.

Another limitation concerns reward variance collapse within groups. The low-STD filter is motivated by the observation that when candidates receive nearly identical scores, normalization can magnify noise rather than signal. Edit-R1 addresses this with fixed thresholds on the group mean and variance, but this is a procedural mitigation rather than a full solution to reward uncertainty (Li et al., 19 Oct 2025). The paper also advises keeping rollout steps low for efficiency while verifying that candidate diversity remains adequate for learning.

A further technical caveat is that the practical training recipe includes a small KL term with a reported loss weight, yet the exact KL form is not detailed in the paper (Li et al., 19 Oct 2025). This leaves a gap between the clean theoretical presentation of the likelihood-free objective and the full optimization stack used in the reported experiments. The same section recommends monitoring stability and gradually annealing if needed, which suggests that regularization details remain operationally important even though the main contribution is framed through the negative-aware velocity-regression loss.

Finally, Edit-R1’s strengths should not be conflated with a claim that any negative-aware diffusion finetuning automatically transfers across domains. In diffusion alignment more broadly, later work shows that value-estimation components can substantially reduce DiffusionNFT compute (Go et al., 19 May 2026), while other work on diffusion unlearning argues that purely negative-aware or repulsive finetuning objectives can be vulnerable to relearning attacks if they do not converge to a stable new optimum (Yuan et al., 3 Dec 2025). This suggests that Edit-R1’s empirical success depends on the specific combination of flow-consistent optimization, training-free MLLM reward modeling, and low-variance filtering rather than on negative-aware weighting alone.