
MotionNFT: Fine Tuning for Motion Editing

Updated 17 December 2025
  • MotionNFT is a post-training framework that improves motion-centric image editing by incorporating explicit optical flow-based motion alignment rewards.
  • It employs negative-aware policy updates to fine-tune pre-trained diffusion and flow-matching models without modifying the original generator architecture.
  • Experimental results show enhanced fidelity, coherence, and motion accuracy, with significant improvements over the unmodified base editors.

MotionNFT (Motion-guided Negative-aware Fine Tuning) is a post-training framework designed to improve motion-centric image editing by incorporating explicit motion alignment rewards within existing diffusion or flow-matching editors. The methodology specifically targets the challenge of generating pixel-accurate motion transformations—modifying actions, postures, and interactions—while maintaining strict preservation of identity, structural integrity, and background consistency. Unlike prior techniques that emphasize semantic or style compliance, MotionNFT introduces optical-flow-based supervision and negative-aware policy-gradient updates to ensure edits yield physically plausible and well-aligned motion relative to ground truth (Wan et al., 11 Dec 2025).

1. Motivation and Problem Definition

Motion-centric image editing requires inference and application of coherent pixel displacement fields, representing the movement direction and amplitude of objects or subjects, in response to editing instructions. Existing image editing paradigms, mainly diffusion-based or instruction-tuned editors, are proficient at style or semantic attribute changes ("what" to change) but lack mechanisms for orchestrating the spatial "how"—that is, the explicit movement of pixels to match real physical motion. Furthermore, constructing datasets that capture such realistic motion transitions is nontrivial because accurate motion ground truth must be mined and validated from high-fidelity video sequences, rather than being synthesized or hand-drawn. Training or joint-training baseline models from scratch remains computationally prohibitive for these specialized datasets.

MotionNFT addresses these bottlenecks by introducing a lightweight, post-training fine-tuning regime. Initiating from a pre-trained diffusion or flow-matching base (e.g., FLUX.1 Kontext or Qwen-Image-Edit), the framework augments the training objective with motion-aware rewards grounded in pixel-flow consistency, using only minimal update steps and without requiring modification of the generator architecture (Wan et al., 11 Dec 2025).

2. Motion Alignment Reward Formulation

MotionNFT operationalizes a motion alignment reward ($r_{motion}$) based on optical-flow discrepancies between (1) the input and the model-generated edited output $(I_{orig}, I_{edited})$ and (2) the input and the target ground-truth motion edit $(I_{orig}, I_{gt})$. A pre-trained optical-flow estimator $\mathcal{K}$ provides flow fields $V_{pred}$ and $V_{gt}$ across image coordinates, normalized by the image diagonal $\Delta = \sqrt{H^2 + W^2}$. The reward constructs three distinct measures:

  • Magnitude Discrepancy ($D_{mag}$): Mean $L_1$ discrepancy (possibly with exponent $q \in (0,1)$) between predicted and target pixel-wise motion vectors.
  • Directional Discrepancy ($D_{dir}$): Weighted average cosine-based mismatch of the direction between predicted and target motion vectors, focusing on pixels with sufficient motion intensity.
  • Movement Regularization ($M_{move}$): Penalty term for underestimating the average movement magnitude relative to ground truth.

The composite distance is $D_{comb} = \alpha D_{mag} + \beta D_{dir} + \lambda_{move} M_{move}$, normalized to $[0,1]$, inverted, and quantized to six levels to form $r_{motion} \in \{0.0, 0.2, \ldots, 1.0\}$. This reward provides a graded supervisory signal measuring how closely the model's edits replicate real motion dynamics (Wan et al., 11 Dec 2025).
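
A minimal NumPy sketch of this reward, assuming flow fields of shape (H, W, 2) and illustrative choices for the exponent $q$, the motion-intensity threshold, and the clipping of the combined distance into $[0,1]$ (none of which are taken from the paper's reference code), could look as follows:

```python
import numpy as np

def motion_reward(v_pred, v_gt, H, W,
                  alpha=0.7, beta=0.2, lam_move=0.1,
                  q=0.5, min_motion=1e-3):
    """Sketch of the motion alignment reward from flow fields of shape (H, W, 2)."""
    diag = np.sqrt(H ** 2 + W ** 2)
    v_pred = v_pred / diag            # normalize by the image diagonal
    v_gt = v_gt / diag

    # Magnitude discrepancy: mean L1 gap, with an assumed exponent q in (0, 1)
    d_mag = np.mean(np.abs(v_pred - v_gt).sum(axis=-1) ** q)

    # Directional discrepancy: cosine mismatch, weighted toward pixels that actually move
    m_gt = np.linalg.norm(v_gt, axis=-1)
    m_pred = np.linalg.norm(v_pred, axis=-1)
    cos = (v_pred * v_gt).sum(axis=-1) / (m_pred * m_gt + 1e-8)
    w = (m_gt > min_motion).astype(np.float32)
    d_dir = np.sum(w * (1.0 - cos)) / (np.sum(w) + 1e-8)

    # Movement regularization: penalize underestimating the average motion magnitude
    m_move = max(0.0, m_gt.mean() - m_pred.mean())

    d_comb = alpha * d_mag + beta * d_dir + lam_move * m_move
    d_comb = np.clip(d_comb, 0.0, 1.0)   # assumes distances are already scaled into [0, 1]
    r = 1.0 - d_comb                     # invert: smaller distance -> larger reward
    return np.round(r * 5) / 5           # quantize to {0.0, 0.2, ..., 1.0}
```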

3. Negative-Aware Fine-Tuning and Training Objective

The framework employs negative-aware policy updates embedded within the DiffusionNFT regime. For each batch and diffusion timestep $t$, a velocity field $v_\theta(z_t, c, t)$ is predicted. Two implicit policy decodings are defined:

  • $v_\theta^+ = (1-\beta)\, v_{old} + \beta\, v_\theta$
  • $v_\theta^- = (1+\beta)\, v_{old} - \beta\, v_\theta$

The NFT loss combines these through the reward:

$$L_{NFT}(\theta) = \mathbb{E}_{t, x_0, \epsilon}\left[\, r\,\|v_\theta^+ - v\|_2^2 + (1 - r)\,\|v_\theta^- - v\|_2^2 \,\right]$$

The overall objective preserves the original editing loss $L_{edit}$ and introduces a motion-guided term $L_{motion} = -R_{motion}$: $L_{total} = L_{edit} + \lambda L_{motion}$. This construction ensures motion alignment is enforced without compromising base editing performance (Wan et al., 11 Dec 2025).
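
A minimal PyTorch sketch of the NFT loss above, assuming per-sample rewards in $[0,1]$, velocity tensors of matching shape, and using the reported NFT guidance strength of 1.0 as the default mixing coefficient, is:

```python
import torch

def nft_loss(v_theta, v_old, v_target, reward, beta=1.0):
    """Negative-aware NFT loss for one diffusion timestep (illustrative sketch).

    v_theta:  velocity predicted by the current model, shape (B, ...)
    v_old:    velocity from the frozen pre-update (old) model
    v_target: the flow-matching target velocity v
    reward:   per-sample reward r in [0, 1], shape (B,)
    beta:     NFT guidance strength mixing old and new policies
    """
    # Implicit positive and negative policy decodings
    v_pos = (1.0 - beta) * v_old + beta * v_theta
    v_neg = (1.0 + beta) * v_old - beta * v_theta

    # Squared errors reduced over all non-batch dimensions
    dims = tuple(range(1, v_theta.dim()))
    err_pos = ((v_pos - v_target) ** 2).sum(dim=dims)
    err_neg = ((v_neg - v_target) ** 2).sum(dim=dims)

    # High-reward samples pull toward the positive policy, low-reward toward the negative one
    return (reward * err_pos + (1.0 - reward) * err_neg).mean()
```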

4. Implementation Protocol and Hyperparameters

MotionNFT requires no architectural modification to the generator. During each training iteration, the framework:

  • Samples a batch with editing prompts and inputs.
  • Generates multiple candidate edits per prompt using the current model.
  • Computes $r_{motion}$ for each candidate using the optical-flow estimator.
  • Optionally combines $r_{motion}$ with a semantic reward from a 32B multimodal LLM (MLLM); empirically, a 50% mixture of the MLLM and motion rewards yields optimal performance.
  • Backpropagates using the DiffusionNFT objective.

Key hyperparameters include a learning rate of $3 \times 10^{-4}$, batch size $2 \times 8$, NFT sampling groups of $24 \times 8$, reward weights $\alpha = 0.7$, $\beta = 0.2$, $\lambda_{move} = 0.1$, NFT guidance strength $1.0$, and KL weight $10^{-4}$. No additional trainable parameters are introduced (Wan et al., 11 Dec 2025).
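
The overall iteration can be sketched as follows. The editor interface (`edit`, `velocity`, `sample_flow_matching_pair`) and the reward callables are hypothetical placeholders standing in for the actual pipeline, and the loss reuses the `nft_loss` sketch above; this is an illustration of the protocol, not the released training code.

```python
import torch

def motionnft_step(model, old_model, optimizer, batch,
                   flow_reward_fn, mllm_reward_fn,
                   num_candidates=4, mix=0.5, beta=1.0):
    """One illustrative MotionNFT training iteration (interfaces are assumptions)."""
    images, prompts, targets = batch

    # 1. Generate multiple candidate edits per prompt with the current model
    with torch.no_grad():
        candidates = [model.edit(images, prompts) for _ in range(num_candidates)]

    losses = []
    for edited in candidates:
        # 2. Score each candidate: blended motion + semantic reward (50/50 reported as best)
        r_motion = flow_reward_fn(images, edited, targets)
        r_sem = mllm_reward_fn(images, edited, prompts)
        reward = mix * r_motion + (1.0 - mix) * r_sem

        # 3. Negative-aware NFT objective at a sampled timestep (see nft_loss above)
        t, z_t, v_target = model.sample_flow_matching_pair(images, edited)
        v_theta = model.velocity(z_t, prompts, t)
        with torch.no_grad():
            v_old = old_model.velocity(z_t, prompts, t)
        losses.append(nft_loss(v_theta, v_old, v_target, reward, beta=beta))

    # 4. Backpropagate the averaged objective and update the generator in place
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```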

5. Dataset and Benchmarking Protocols

The MotionEdit dataset underpins the framework's empirical evaluation, comprising 10,157 $(I_{orig}, \text{instruction}, I_{gt})$ triplets. These are obtained from high-quality text-to-video outputs and span categories such as pose/posture changes, locomotion and spatial shifts, object state deformation, orientation/viewpoint adjustment, subject-object interactions, and inter-subject interactions. All samples are $512 \times 512$ and split into 9,142 training and 1,015 evaluation entries (a 90/10 split).

For evaluation, the MotionEdit-Bench protocol incorporates:

  • Generative Metrics: MLLM-based scores for fidelity, preservation, and coherence, each on a 0–5 scale, with their mean ("Overall") as the final score.
  • Discriminative Metric: Motion Alignment Score (MAS), 0–100, reflecting the geometric correspondence between ground-truth and predicted motion fields (set to zero if $\mathbb{E}[m_{pred}]/\mathbb{E}[m_{gt}] < 0.01$).
  • Preference Metric: Win Rate, the proportion of head-to-head victories in blind A vs. B comparisons judged by the generative evaluator (Wan et al., 11 Dec 2025).
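
For clarity, a small sketch of how the Overall score and the MAS degenerate-output gate described above could be applied is shown below; the underlying MAS computation itself is not reproduced here, and the helper names are illustrative.

```python
import numpy as np

def overall_score(fidelity, preservation, coherence):
    """'Overall' generative score: mean of the three 0-5 MLLM ratings."""
    return (fidelity + preservation + coherence) / 3.0

def gated_mas(raw_mas, m_pred, m_gt, eps=0.01):
    """Apply the MAS gate: score zero if the edit produces almost no motion.

    raw_mas:       MAS in [0, 100] before gating (computation not shown here).
    m_pred, m_gt:  per-pixel motion magnitudes of predicted and ground-truth flow.
    """
    if np.mean(m_pred) / (np.mean(m_gt) + 1e-12) < eps:
        return 0.0
    return raw_mas
```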

6. Experimental Results and Analysis

Applying MotionNFT to FLUX.1 Kontext and Qwen-Image-Edit produced notable performance improvements:

| Model | Overall ↑ | Fidelity ↑ | Preservation ↑ | Coherence ↑ | MAS ↑ | Win Rate ↑ |
|---|---|---|---|---|---|---|
| FLUX.1 Kontext | 3.84 → 4.25 (+10.7%) | 3.89 → 4.33 (+11.3%) | 3.79 → 4.16 (+9.8%) | 3.83 → 4.25 (+11.0%) | 53.73 → 55.45 (+3.2%) | 57.97% → 65.16% |
| Qwen-Image-Edit | 4.65 → 4.72 (+1.5%) | 4.70 → 4.79 (+1.9%) | 4.59 → 4.63 (+0.9%) | 4.66 → 4.74 (+1.7%) | 56.46 → 57.23 (+1.3%) | 72.99% → 73.87% |

These gains reflect enhanced fidelity, coherence, and geometric plausibility for motion-centric edits, such as displacing vehicles, rotating limbs, and executing viewpoint transformations with minimal identity loss and background distortion (Wan et al., 11 Dec 2025).

Ablation studies confirm the necessity of blending semantic (MLLM) and geometric (flow-based) rewards: a pure $r_{motion}$ reward degrades overall quality, while a pure MLLM-based reward impairs motion accuracy (MAS); the 50/50 mixture yields a robust compromise. Furthermore, integrating explicit flow supervision through MotionNFT mitigates overfitting to semantics and sustains motion-alignment improvements throughout optimization.

7. Remaining Challenges and Future Directions

Notwithstanding its advances, MotionNFT faces persistent limitations concerning scenes with multiple interacting or partially occluded subjects, execution of intricate or multi-step motions, and the handling of 3D occlusion phenomena. The methodology suggests potential for incorporating physics-informed priors, multi-view supervision, or video-based temporal coherence mechanisms to further augment motion fidelity. These areas present future research avenues for enhancing the physical and perceptual realism of motion-centric image editing systems (Wan et al., 11 Dec 2025).
