MotionNFT: Motion-Guided Negative-Aware Fine-Tuning
- MotionNFT is a framework that enhances motion-centric image editing by combining optical-flow-based rewards with negative-aware fine tuning.
- It employs dual policy sampling to optimize spatial displacement accuracy while preserving non-motion features such as subject identity and structural integrity.
- Experimental evaluations on FLUX.1 Kontext and Qwen-Image-Edit show significant improvements in motion fidelity, with notable gains in MAS and overall generative quality.
MotionNFT (Motion-guided Negative-aware Fine Tuning) is a post-training framework specifically developed for enhancing motion-centric image editing tasks. It operates by integrating an optical-flow-based motion alignment reward within a negative-aware fine-tuning objective applied to established flow-matching diffusion architectures. The principal goal of MotionNFT is to improve the accuracy of spatial displacements in edited images, ensuring both high-fidelity motion transformations and preservation of non-motion features—including subject identity, structural integrity, and physical plausibility—without compromising general editing capabilities (Wan et al., 11 Dec 2025).
1. Motion-Centric Image Editing: Problem Formulation
Motion-centric image editing addresses the challenge of transforming the depicted actions or interactions in an image, rather than static attributes such as color or texture. Each edit instance is represented by a triplet $(I_{\text{src}}, c, I_{\text{gt}})$: the original image $I_{\text{src}}$, a natural-language instruction $c$ specifying the desired motion, and a ground-truth edited image $I_{\text{gt}}$ embodying the intended motion transformation. The output image $\hat{I}$ must satisfy three core constraints: (1) the pixelwise spatial displacement between $I_{\text{src}}$ and $\hat{I}$ closely matches that from $I_{\text{src}}$ to $I_{\text{gt}}$; (2) non-motion characteristics are preserved; and (3) the result remains physically plausible.
Optical flow is central to the formulation: a pretrained estimator (UniMatch) produces the ground-truth flow $F_{\text{gt}}$ (from $I_{\text{src}}$ to $I_{\text{gt}}$) and the predicted flow $F_{\text{pred}}$ (from $I_{\text{src}}$ to $\hat{I}$), enabling quantitative evaluation of the edited motion.
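To connect the triplet to this flow-based evaluation, a minimal sketch is given below; `estimate_flow` is a stand-in for a pretrained estimator such as UniMatch (its actual API is not reproduced here), and the function and variable names are illustrative.

```python
import numpy as np

def estimate_flow(image_a: np.ndarray, image_b: np.ndarray) -> np.ndarray:
    """Stand-in for a pretrained optical-flow model (e.g., UniMatch).
    Expected to return a dense flow field of shape (H, W, 2) from image_a to image_b."""
    raise NotImplementedError("wrap the actual flow estimator here")

def flows_for_instance(source: np.ndarray,
                       target_gt: np.ndarray,
                       edited: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Ground-truth and predicted flows, both measured from the source image."""
    flow_gt = estimate_flow(source, target_gt)   # F_gt: source -> ground-truth edit
    flow_pred = estimate_flow(source, edited)    # F_pred: source -> model output
    return flow_gt, flow_pred
```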
2. Algorithmic Framework and Fine-Tuning Objective
2.1 Base Architecture
MotionNFT operates atop existing flow-matching models (FMMs), including FLUX.1 Kontext and Qwen-Image-Edit. These models employ a flow-matching approach wherein a neural predictor $v_\theta$ estimates the velocity field that transports a noisy latent $z_t$ toward the clean image latent $z_0$, conditioned on the instruction $c$.
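For orientation, the following sketch shows the generic rectified-flow interpolation and velocity target that such FMMs regress onto; the linear interpolation convention is an assumption and the code is not specific to either backbone.

```python
import torch

def flow_matching_target(z0: torch.Tensor, noise: torch.Tensor, t: torch.Tensor):
    """Generic rectified-flow interpolation (assumed convention, not paper-specific).

    z0:    clean image latent, shape (B, C, H, W)
    noise: Gaussian noise of the same shape
    t:     timesteps in [0, 1], shape (B,)
    """
    t = t.view(-1, 1, 1, 1)
    z_t = (1.0 - t) * z0 + t * noise   # noisy latent on the straight path
    v_target = noise - z0              # constant velocity from data toward noise
    return z_t, v_target

# Training regresses v_theta(z_t, t, c) onto v_target, conditioned on the instruction c.
```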
2.2 Negative-Aware Fine-Tuning (NFT)
Core to MotionNFT is its negative-aware policy-gradient update. Each training instance contrasts "positive" and "negative" policy samples obtained through dual policy sampling. The NFT loss, parameterized by a scalar optimality reward, regresses the model's predicted velocity toward the target deterministic velocity while using the reward to modulate the relative contribution of positive and negative samples.
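The exact NFT objective is defined in the paper; the sketch below shows one plausible reward-weighted simplification, in which samples above a reward baseline reinforce the flow-matching regression and samples below it contribute with negative weight. Treat it as an illustrative stand-in, not the published loss.

```python
import torch

def negative_aware_fm_loss(v_pred: torch.Tensor,
                           v_target: torch.Tensor,
                           reward: torch.Tensor,
                           baseline: float = 0.5) -> torch.Tensor:
    """Illustrative negative-aware weighting of the flow-matching regression
    (a simplification; the paper defines the exact NFT objective).

    v_pred:   model velocity for each sampled edit, shape (B, C, H, W)
    v_target: target deterministic velocity, same shape
    reward:   scalar optimality reward per sample in [0, 1], shape (B,)
    baseline: reward level separating "positive" from "negative" samples
    """
    per_sample_mse = (v_pred - v_target).pow(2).mean(dim=(1, 2, 3))  # (B,)
    weight = reward - baseline  # > 0 for positive samples, < 0 for negative ones
    # Positives are pulled toward the target velocity; negatives enter with
    # negative weight, i.e., a repulsive (contrastive) term.
    return (weight * per_sample_mse).mean()
```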
3. Motion Alignment Reward: Optical Flow-Based Scoring
The motion alignment reward quantitatively assesses the fidelity of motion transformation using optical flow analysis at the pixel level.
- Magnitude Consistency: compares the per-pixel magnitudes of the normalized flows $F_{\text{gt}}$ and $F_{\text{pred}}$, rewarding displacements of matching strength.
- Direction Consistency: projects the predicted flow onto the unit direction of the ground-truth flow at each pixel, weighted by the ground-truth magnitude, so that directional agreement in strongly moving regions counts most.
- Movement Regularization: discourages trivial edits in which the output exhibits little or no motion.
- Composite Reward Score: the three components are combined, normalized, and quantized into a motion alignment reward $R_{\text{motion}}$. The final reward mixes motion alignment with semantic fidelity (scored by an MLLM, e.g., Gemini) through a balance parameter $\lambda$; a sketch of this computation follows below.
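The paper specifies exact formulas and weights for these terms; the sketch below is an approximation assembled only from the descriptions above, so the normalization scheme, the cosine-style direction term, the equal component weights, and the convex mixing with the MLLM score are all assumptions.

```python
import numpy as np

def motion_alignment_reward(flow_gt: np.ndarray,
                            flow_pred: np.ndarray,
                            mllm_score: float,
                            lam: float = 0.5,
                            eps: float = 1e-6) -> float:
    """Illustrative motion reward from dense flows of shape (H, W, 2).

    mllm_score: semantic fidelity score rescaled to [0, 1] (assumed range).
    lam:        balance between motion and semantic rewards (assumed convex mix).
    """
    mag_gt = np.linalg.norm(flow_gt, axis=-1)
    mag_pred = np.linalg.norm(flow_pred, axis=-1)

    # Normalize magnitudes to [0, 1] per image (assumed normalization scheme).
    norm_gt = mag_gt / (mag_gt.max() + eps)
    norm_pred = mag_pred / (mag_pred.max() + eps)

    # Magnitude consistency: 1 minus the mean absolute difference of normalized magnitudes.
    s_mag = 1.0 - np.abs(norm_pred - norm_gt).mean()

    # Direction consistency: cosine between flows, weighted by ground-truth magnitude.
    cos = (flow_gt * flow_pred).sum(-1) / (mag_gt * mag_pred + eps)
    s_dir = (norm_gt * np.clip(cos, 0.0, 1.0)).sum() / (norm_gt.sum() + eps)

    # Movement regularization: ratio of mean predicted to mean ground-truth motion,
    # capped at 1, to penalize near-static outputs.
    s_move = min(1.0, mag_pred.mean() / (mag_gt.mean() + eps))

    # Composite motion score (equal weights assumed), mixed with the MLLM fidelity score.
    r_motion = (s_mag + s_dir + s_move) / 3.0
    return lam * r_motion + (1.0 - lam) * mllm_score
```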
4. Implementation, Training, and Hyperparameters
MotionNFT is instantiated on two principal FMM backbones: FLUX.1 Kontext and Qwen-Image-Edit.
- Optimization: AdamW optimizer (learning rate and KL-loss weight as reported in the paper), batch size 2, guidance 1.0.
- Training Regime: 300 steps (FLUX.1 Kontext) and 210 steps (Qwen-Image-Edit); NFT sampling uses 6 diffusion steps, 8 samples per prompt, and 24 groups.
- Reward Parameters: the component weights, the motion/semantic balance $\lambda$, and the flow thresholds follow the values reported in the paper.
- Inference: 28 steps with CFG guidance 4.0 for both architectures (a configuration sketch follows below).
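The sketch below gathers the reported settings into a single configuration object; the container itself is illustrative, and fields left as `None` (learning rate, KL weight) are placeholders rather than values restated here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MotionNFTConfig:
    """Reported training/inference settings (illustrative container)."""
    backbone: str                          # "FLUX.1 Kontext" or "Qwen-Image-Edit"
    train_steps: int                       # 300 for FLUX.1 Kontext, 210 for Qwen-Image-Edit
    batch_size: int = 2
    train_guidance: float = 1.0
    nft_diffusion_steps: int = 6           # diffusion steps during NFT sampling
    samples_per_prompt: int = 8
    num_groups: int = 24
    inference_steps: int = 28
    inference_cfg: float = 4.0
    learning_rate: Optional[float] = None  # placeholder: value as reported in the paper
    kl_weight: Optional[float] = None      # placeholder: value as reported in the paper

flux_cfg = MotionNFTConfig(backbone="FLUX.1 Kontext", train_steps=300)
qwen_cfg = MotionNFTConfig(backbone="Qwen-Image-Edit", train_steps=210)
```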
5. Experimental Benchmarks and Results
5.1 Datasets
MotionEdit dataset: 10,157 image-instruction-target triplets, mined from continuous video sequences. Training split: 9,142; evaluation split (MotionEdit-Bench): 1,015 samples. Edits primarily involve pose, locomotion, orientation, state changes, subject–object and inter-subject interactions.
5.2 Evaluation Metrics
- Generative (MLLM-based): Fidelity, Preservation, Coherence, Overall (0–5 scale).
- Discriminative: Motion Alignment Score (MAS, 0–100), determined by the optical flow metric.
- Preference: Win rates from pairwise head-to-head MLLM evaluations.
5.3 Quantitative Outcomes
| Base Model | Overall (base → +MotionNFT) | MAS (base → +MotionNFT) | Win % (base → +MotionNFT) |
|---|---|---|---|
| FLUX.1 Kontext | 3.84 → 4.25 (+10.7%) | 53.73 → 55.45 | 57.97 → 65.16 |
| Qwen-Image-Edit | 4.65 → 4.72 | 56.46 → 57.23 | 72.99 → 73.87 |
MotionNFT yields consistent improvements in motion fidelity and MLLM-based generative quality. Baseline models frequently "freeze" subjects or generate erroneous poses, while models fine-tuned with MotionNFT closely align predicted flows with ground-truth motion, as documented in visual comparisons.
6. Ablation Analyses
6.1 Reward Balancing
Varying the balance parameter $\lambda$ between the motion and MLLM rewards demonstrates that exclusive reliance on the motion reward degrades output quality (Overall = 3.60 for FLUX.1 Kontext). Conversely, MLLM-only training (as in UniWorld-V2) provides only modest gains (Overall ≈ 4.20); the best results are obtained at an intermediate $\lambda$, matching the reported figures.
6.2 Motion Alignment Score Trends
MLLM-only fine-tuning often stagnates or reduces MAS mid-training, whereas MotionNFT ensures progressive MAS improvement across training epochs. This suggests that enforcing explicit motion alignment reward is critical for continual gains in motion fidelity.
6.3 Preservation of General Editing Ability
On ImgEdit-Bench (static, non-motion editing tasks), MotionNFT maintains or slightly improves generative scores compared to unfine-tuned base architectures, indicating no trade-off with general editing capacity.
7. Conceptual Significance and Outlook
MotionNFT represents a rigorous framework for enhancing motion-centric edits in generative diffusion models. By fusing optical-flow-based motion alignment evaluation with negative-aware fine-tuning, it addresses deficiencies in baseline models that often fail to capture dynamic spatial transformations. Integration with both FLUX.1 Kontext and Qwen-Image-Edit demonstrates its architecture-agnostic potential. A plausible implication is the extension of this reward paradigm to downstream tasks such as controllable video synthesis, animation, and multi-person interaction modeling, where accurate representation of spatial dynamics is essential.