MotionNFT: Motion-guided Negative-aware Fine Tuning

Updated 12 December 2025
  • MotionNFT is a framework that enhances motion-centric image editing by combining optical-flow-based rewards with negative-aware fine tuning.
  • It employs dual policy sampling to optimize spatial displacement accuracy while preserving non-motion features such as subject identity and structural integrity.
  • Experimental evaluations on FLUX.1 Kontext and Qwen-Image-Edit show significant improvements in motion fidelity, with notable gains in Motion Alignment Score (MAS) and overall generative quality.

MotionNFT (Motion-guided Negative-aware Fine Tuning) is a post-training framework specifically developed for enhancing motion-centric image editing tasks. It operates by integrating an optical-flow-based motion alignment reward within a negative-aware fine-tuning objective applied to established flow-matching diffusion architectures. The principal goal of MotionNFT is to improve the accuracy of spatial displacements in edited images, ensuring both high-fidelity motion transformations and preservation of non-motion features—including subject identity, structural integrity, and physical plausibility—without compromising general editing capabilities (Wan et al., 11 Dec 2025).

1. Motion-Centric Image Editing: Problem Formulation

Motion-centric image editing addresses the challenge of transforming the depicted actions or interactions in an image, rather than static attributes such as color or texture. Each edit instance is represented by a triplet: the original image $I_{\text{orig}} \in \mathbb{R}^{H \times W \times 3}$, a natural-language instruction $c$ specifying the desired motion, and a ground-truth edited image $I_{\text{gt}} \in \mathbb{R}^{H \times W \times 3}$ embodying the intended motion transformation. The output image $I_{\text{edit}}$ must satisfy three core constraints: (1) the pixelwise spatial displacement between $I_{\text{orig}}$ and $I_{\text{edit}}$ closely matches that from $I_{\text{orig}}$ to $I_{\text{gt}}$; (2) non-motion characteristics are preserved; and (3) the result remains physically plausible.

Optical flow is central to the formulation: a pretrained estimator $\mathcal{F}(\cdot, \cdot)$ (UniMatch) produces the ground-truth flow $V_{\text{gt}} = \mathcal{F}(I_{\text{orig}}, I_{\text{gt}})$ and the predicted flow $V_{\text{pred}} = \mathcal{F}(I_{\text{orig}}, I_{\text{edit}})$, enabling quantitative evaluation of the edited motion.
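
To make the flow extraction concrete, the sketch below computes $V_{\text{gt}}$ and $V_{\text{pred}}$ for one edit triplet. It uses torchvision's pretrained RAFT model as a readily available stand-in for UniMatch (whose exact interface is not reproduced here); `estimate_flow` and the tensor names are illustrative, not the authors' code.

```python
# Sketch: extract ground-truth and predicted motion flows for one edit triplet.
# RAFT is used as a stand-in flow estimator; the paper itself uses UniMatch.
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
flow_net = raft_large(weights=weights).eval()
preprocess = weights.transforms()           # resizes/normalizes an image pair for RAFT

@torch.no_grad()
def estimate_flow(img_src: torch.Tensor, img_dst: torch.Tensor) -> torch.Tensor:
    """Return a (B, 2, H, W) flow field mapping img_src pixels toward img_dst."""
    src, dst = preprocess(img_src, img_dst)
    return flow_net(src, dst)[-1]           # RAFT returns iterative refinements; keep the last

# I_orig, I_gt, I_edit: (B, 3, H, W) float tensors in [0, 1]
# V_gt   = estimate_flow(I_orig, I_gt)      # motion implied by the dataset triplet
# V_pred = estimate_flow(I_orig, I_edit)    # motion actually realized by the edited output
```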

2. Algorithmic Framework and Fine-Tuning Objective

2.1 Base Architecture

MotionNFT operates atop existing flow-matching models (FMMs), including FLUX.1 Kontext and Qwen-Image-Edit. These models employ a flow-matching approach wherein a neural predictor $v_\theta(x_t, t, c)$ estimates the velocity field that reverses stochastic noise in the latent $x_t$ toward the clean image $x_0$, conditioned on the instruction $c$.
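
As a point of reference, a generic flow-matching sampler can be written as a simple Euler integration of the predicted velocity field. This is an illustrative sketch, not the FLUX.1 Kontext or Qwen-Image-Edit implementation; `v_theta`, `cond`, and `num_steps` are placeholder names.

```python
# Minimal flow-matching sampling sketch: integrate dx/dt = v_theta(x_t, t, cond)
# from t = 1 (pure noise) to t = 0 (clean latent) with Euler steps.
import torch

def flow_matching_sample(v_theta, x_T: torch.Tensor, cond, num_steps: int = 28) -> torch.Tensor:
    x = x_T
    ts = torch.linspace(1.0, 0.0, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        t_batch = torch.full((x.shape[0],), float(t_cur), device=x.device)
        v = v_theta(x, t_batch, cond)                 # predicted velocity at the current time
        x = x + (float(t_next) - float(t_cur)) * v    # Euler step toward the clean latent
    return x
```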

2.2 Negative-Aware Fine-Tuning (NFT)

Core to MotionNFT is its negative-aware policy-gradient update. Each training instance contrasts "positive" ($v_\theta^+$) and "negative" ($v_\theta^-$) policy samples:

$$v_\theta^+(x_t,c,t) = (1-\beta)\, v^{\text{old}}(x_t,c,t) + \beta\, v_\theta(x_t,c,t),$$

$$v_\theta^-(x_t,c,t) = (1+\beta)\, v^{\text{old}}(x_t,c,t) - \beta\, v_\theta(x_t,c,t).$$

The NFT loss, parameterized by a scalar optimality reward $r \in [0,1]$, is

$$\mathcal{L}_{\text{NFT}}(\theta) = \mathbb{E}_{c,\, x_0 \sim \pi^{\text{old}}(\cdot \mid c),\, t}\left[\, r\, \| v_\theta^+(x_t, c, t) - v \|_2^2 + (1 - r)\, \| v_\theta^-(x_t, c, t) - v \|_2^2 \,\right],$$

where $v$ denotes the target deterministic velocity.
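
The objective can be expressed compactly in PyTorch-style code. This is a minimal sketch assuming per-sample scalar rewards and precomputed velocity outputs; `nft_loss`, `v_old_out`, and the tensor layout are illustrative, not the authors' implementation.

```python
# Sketch of the negative-aware fine-tuning (NFT) loss from the equation above.
# v_theta_out: output of the trainable model; v_old_out: output of a frozen copy
# of the pre-fine-tuning model (computed under torch.no_grad()); v_target: the
# deterministic target velocity; reward: per-sample r in [0, 1], shape (B,).
import torch

def nft_loss(v_theta_out, v_old_out, v_target, reward, beta: float) -> torch.Tensor:
    v_pos = (1.0 - beta) * v_old_out + beta * v_theta_out   # implicit positive policy
    v_neg = (1.0 + beta) * v_old_out - beta * v_theta_out   # implicit negative policy
    dims = tuple(range(1, v_target.dim()))
    pos_err = ((v_pos - v_target) ** 2).sum(dim=dims)       # ||v+ - v||_2^2 per sample
    neg_err = ((v_neg - v_target) ** 2).sum(dim=dims)       # ||v- - v||_2^2 per sample
    return (reward * pos_err + (1.0 - reward) * neg_err).mean()
```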

3. Motion Alignment Reward: Optical Flow-Based Scoring

The motion alignment reward $r_{\text{motion}}$ quantitatively assesses the fidelity of the motion transformation using optical flow analysis at the pixel level.

  • Magnitude Consistency: For normalized flows $\tilde V_{\text{pred}}(i,j)$ and $\tilde V_{\text{gt}}(i,j)$ at pixel $(i,j)$,

$$\mathcal{D}_{\text{mag}} = \frac{1}{HW} \sum_{i,j} \left( \|\tilde V_{\text{pred}}(i,j) - \tilde V_{\text{gt}}(i,j)\|_1 + \varepsilon \right)^q.$$

  • Direction Consistency: Project onto unit vectors, weighted by ground-truth magnitude,

$$e_{\text{dir}}(i,j) = \tfrac{1}{2}\left(1 - \hat v_{\text{pred}}(i,j)^\top \hat v_{\text{gt}}(i,j)\right),$$

$$w(i,j) = \frac{m_{\text{gt}}(i,j)}{\max_{u,v} m_{\text{gt}}(u,v) + \varepsilon}\, \mathbf{1}\!\left[m_{\text{gt}}(i,j) > \tau_m\right],$$

$$\mathcal{D}_{\text{dir}} = \frac{\sum_{i,j} w(i,j)\, e_{\text{dir}}(i,j)}{\sum_{i,j} w(i,j) + \varepsilon}.$$

  • Movement Regularization: Controls for trivial (near-static) edits via

$$\bar m_{\text{gt}} = \tfrac{1}{HW} \sum_{i,j} m_{\text{gt}}(i,j), \qquad \bar m_{\text{pred}} = \tfrac{1}{HW} \sum_{i,j} m_{\text{pred}}(i,j),$$

$$M_{\text{move}} = \max\left\{0,\; \tau + \tfrac{1}{2}\left(\bar m_{\text{gt}} - \bar m_{\text{pred}}\right)\right\}.$$

  • Composite Reward Score: The components are combined as

$$\mathcal{D}_{\text{comb}} = \alpha\, \mathcal{D}_{\text{mag}} + \beta\, \mathcal{D}_{\text{dir}} + \lambda_{\text{move}}\, M_{\text{move}},$$

then normalized and quantized so that $r_{\text{motion}} \in \{0.0, 0.2, 0.4, 0.6, 0.8, 1.0\}$ (this $\beta$ is a reward weight, distinct from the NFT mixing coefficient of Section 2). The final reward combines motion alignment with semantic fidelity as judged by an MLLM (e.g., Gemini): $r_{\text{raw}} = \lambda\, r_{\text{motion}} + (1-\lambda)\, r_{\text{mllm}}$, with $\lambda=0.5$. A NumPy sketch of this scoring pipeline follows the list.
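
The sketch below assembles the scoring pipeline in NumPy. The flow-normalization scheme (here, division by the image diagonal) and the final monotone mapping from $\mathcal{D}_{\text{comb}}$ to the quantized reward are assumptions where the text leaves details open; the constants follow Section 4.

```python
# NumPy sketch of the composite motion-alignment reward described above.
import numpy as np

def motion_alignment_reward(V_pred, V_gt, alpha=0.7, beta=0.2, lam_move=0.1,
                            q=0.4, eps=1e-6, tau=0.01, tau_m=0.01):
    """V_pred, V_gt: (H, W, 2) optical-flow fields; returns r_motion in {0, 0.2, ..., 1}."""
    H, W, _ = V_gt.shape
    scale = np.hypot(H, W)                          # assumed normalization: image diagonal
    Vp, Vg = V_pred / scale, V_gt / scale

    # Magnitude consistency D_mag
    D_mag = np.mean((np.abs(Vp - Vg).sum(axis=-1) + eps) ** q)

    # Direction consistency D_dir, weighted by ground-truth flow magnitude
    m_pred = np.linalg.norm(Vp, axis=-1)
    m_gt = np.linalg.norm(Vg, axis=-1)
    vp_hat = Vp / (m_pred[..., None] + eps)
    vg_hat = Vg / (m_gt[..., None] + eps)
    e_dir = 0.5 * (1.0 - (vp_hat * vg_hat).sum(axis=-1))
    w = (m_gt / (m_gt.max() + eps)) * (m_gt > tau_m)
    D_dir = (w * e_dir).sum() / (w.sum() + eps)

    # Movement regularization M_move (penalizes "frozen" edits)
    M_move = max(0.0, tau + 0.5 * (m_gt.mean() - m_pred.mean()))

    D_comb = alpha * D_mag + beta * D_dir + lam_move * M_move
    r = 1.0 / (1.0 + D_comb)                        # assumed monotone map to (0, 1]
    return np.round(r * 5.0) / 5.0                  # quantize to {0.0, 0.2, ..., 1.0}

# Final reward (Section 3): r_raw = 0.5 * r_motion + 0.5 * r_mllm
```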

4. Implementation, Training, and Hyperparameters

MotionNFT is instantiated on two principal FMM backbones: FLUX.1 Kontext and Qwen-Image-Edit.

  • Optimization: AdamW optimizer, learning rate $3 \times 10^{-4}$, batch size 2, KL-loss weight $1 \times 10^{-4}$, guidance 1.0.
  • Training Regime: 300 steps (FLUX.1 Kontext), 210 steps (Qwen-Image-Edit); NFT sampling uses 6 diffusion steps, 8 samples per prompt, and 24 groups.
  • Reward Parameters: $\alpha=0.7$, $\beta=0.2$, $\lambda_{\text{move}}=0.1$, $q=0.4$, $\varepsilon=10^{-6}$, thresholds $\tau=0.01$, $\tau_m=0.01$.
  • Inference: 28 steps, CFG guidance 4.0 for both architectures (these settings are collected in the config sketch below).
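
For convenience, the reported settings can be gathered in a single configuration object; the key names below are illustrative and do not reflect the authors' actual configuration schema.

```python
# Hypothetical config collecting the reported MotionNFT hyperparameters.
MOTIONNFT_CONFIG = {
    "optimizer": {"name": "AdamW", "lr": 3e-4, "batch_size": 2,
                  "kl_weight": 1e-4, "train_guidance": 1.0},
    "train_steps": {"flux_kontext": 300, "qwen_image_edit": 210},
    "nft_sampling": {"diffusion_steps": 6, "samples_per_prompt": 8, "groups": 24},
    "reward": {"alpha": 0.7, "beta": 0.2, "lambda_move": 0.1, "q": 0.4,
               "eps": 1e-6, "tau": 0.01, "tau_m": 0.01, "lambda_mix": 0.5},
    "inference": {"steps": 28, "cfg_guidance": 4.0},
}
```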

5. Experimental Benchmarks and Results

5.1 Datasets

MotionEdit dataset: 10,157 image-instruction-target triplets, mined from continuous video sequences. Training split: 9,142; evaluation split (MotionEdit-Bench): 1,015 samples. Edits primarily involve pose, locomotion, orientation, state changes, subject–object and inter-subject interactions.

5.2 Evaluation Metrics

  • Generative (MLLM-based): Fidelity, Preservation, Coherence, Overall (0–5 scale).
  • Discriminative: Motion Alignment Score (MAS, 0–100), determined by the optical flow metric.
  • Preference: Win rates from pairwise head-to-head MLLM evaluations.

5.3 Quantitative Outcomes

| Model | Overall (Before → After) | MAS (Before → After) | Win % (Before → After) |
| --- | --- | --- | --- |
| FLUX.1 Kontext + MotionNFT | 3.84 → 4.25 (+10.7%) | 53.73 → 55.45 | 57.97 → 65.16 |
| Qwen-Image-Edit + MotionNFT | 4.65 → 4.72 | 56.46 → 57.23 | 72.99 → 73.87 |

MotionNFT yields consistent improvements in motion fidelity and MLLM-based generative quality. Baseline models frequently "freeze" subjects or generate erroneous poses, while models fine-tuned with MotionNFT closely align predicted flows with ground-truth motion, as documented in visual comparisons.

6. Ablation Analyses

6.1 Reward Balancing

Varying the reward balance parameter $\lambda$ in $r_{\text{raw}}$ shows that exclusive reliance on the motion reward ($\lambda=1.0$) degrades output quality (Overall = 3.60 for FLUX.1 Kontext). Conversely, MLLM-only training ($\lambda=0.0$, the "UniWorld-V2" setting) yields only modest gains (Overall ≈ 4.20). The best results are observed at $\lambda=0.5$, the value used for the reported figures.

6.2 Motion Reward and Training Dynamics

MLLM-only fine-tuning often stagnates or even reduces MAS mid-training, whereas MotionNFT yields progressive MAS improvement across training epochs. This suggests that an explicit motion alignment reward is critical for continual gains in motion fidelity.

6.3 Preservation of General Editing Ability

On ImgEdit-Bench (static, non-motion editing tasks), MotionNFT maintains or slightly improves generative scores relative to the base models without fine-tuning, indicating no trade-off with general editing capability.

7. Conceptual Significance and Outlook

MotionNFT represents a rigorous framework for enhancing motion-centric edits in generative diffusion models. By fusing optical-flow-based motion alignment evaluation with negative-aware fine-tuning, it addresses deficiencies in baseline models, which often fail to capture dynamic spatial transformations. Integration with both FLUX.1 Kontext and Qwen-Image-Edit demonstrates its architecture-agnostic potential. A plausible implication is the extension of this reward paradigm to downstream tasks such as controllable video synthesis, animation, and multi-person interaction modeling, where accurate representation of spatial dynamics is essential.

References

  • Wan et al. (11 Dec 2025). MotionNFT: Motion-guided Negative-aware Fine Tuning.
