MotionNFT: Motion-Guided Negative-Aware Fine-Tuning
- MotionNFT is a framework that enhances motion-centric image editing by combining optical-flow-based rewards with negative-aware fine tuning.
- It employs dual policy sampling to optimize spatial displacement accuracy while preserving non-motion features such as subject identity and structural integrity.
- Experimental evaluations on FLUX.1 Kontext and Qwen-Image-Edit show significant improvements in motion fidelity, with notable gains in MAS and overall generative quality.
MotionNFT (Motion-guided Negative-aware Fine Tuning) is a post-training framework specifically developed for enhancing motion-centric image editing tasks. It operates by integrating an optical-flow-based motion alignment reward within a negative-aware fine-tuning objective applied to established flow-matching diffusion architectures. The principal goal of MotionNFT is to improve the accuracy of spatial displacements in edited images, ensuring both high-fidelity motion transformations and preservation of non-motion features—including subject identity, structural integrity, and physical plausibility—without compromising general editing capabilities (Wan et al., 11 Dec 2025).
1. Motion-Centric Image Editing: Problem Formulation
Motion-centric image editing addresses the challenge of transforming the depicted actions or interactions in an image, rather than static attributes such as color or texture. Each edit instance is represented by a triplet $(I_{\text{src}}, c, I_{\text{gt}})$: the original image $I_{\text{src}}$, a natural-language instruction $c$ specifying the desired motion, and a ground-truth edited image $I_{\text{gt}}$ embodying the intended motion transformation. The output image $\hat{I}$ must satisfy three core constraints: (1) the pixelwise spatial displacement between $I_{\text{src}}$ and $\hat{I}$ closely matches that from $I_{\text{src}}$ to $I_{\text{gt}}$; (2) non-motion characteristics are preserved; and (3) the result remains physically plausible.
Optical flow is central to the formulation: a pretrained estimator (UniMatch) produces the ground-truth flow $F_{\text{gt}}$ (from $I_{\text{src}}$ to $I_{\text{gt}}$) and the predicted flow $F_{\text{pred}}$ (from $I_{\text{src}}$ to $\hat{I}$), enabling quantitative evaluation of the edited motion.
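To connect the triplet to this flow-based evaluation, a minimal sketch is given below; `estimate_flow` is a stand-in for a pretrained estimator such as UniMatch (its actual API is not reproduced here), and the function and variable names are illustrative.

```python
import numpy as np

def estimate_flow(image_a: np.ndarray, image_b: np.ndarray) -> np.ndarray:
    """Stand-in for a pretrained optical-flow model (e.g., UniMatch).
    Expected to return a dense flow field of shape (H, W, 2) from image_a to image_b."""
    raise NotImplementedError("wrap the actual flow estimator here")

def flows_for_instance(source: np.ndarray,
                       target_gt: np.ndarray,
                       edited: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Ground-truth and predicted flows, both measured from the source image."""
    flow_gt = estimate_flow(source, target_gt)   # F_gt: source -> ground-truth edit
    flow_pred = estimate_flow(source, edited)    # F_pred: source -> model output
    return flow_gt, flow_pred
```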
2. Algorithmic Framework and Fine-Tuning Objective
2.1 Base Architecture
MotionNFT operates atop existing flow-matching models (FMMs), including FLUX.1 Kontext and Qwen-Image-Edit. These models employ a flow-matching approach wherein a neural predictor $v_\theta$ estimates the velocity field that transports a noisy latent $z_t$ toward the clean image latent $z_0$, conditioned on the instruction $c$.
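For orientation, the following sketch shows the generic rectified-flow interpolation and velocity target that such FMMs regress onto; the linear interpolation convention is an assumption and the code is not specific to either backbone.

```python
import torch

def flow_matching_target(z0: torch.Tensor, noise: torch.Tensor, t: torch.Tensor):
    """Generic rectified-flow interpolation (assumed convention, not paper-specific).

    z0:    clean image latent, shape (B, C, H, W)
    noise: Gaussian noise of the same shape
    t:     timesteps in [0, 1], shape (B,)
    """
    t = t.view(-1, 1, 1, 1)
    z_t = (1.0 - t) * z0 + t * noise   # noisy latent on the straight path
    v_target = noise - z0              # constant velocity from data toward noise
    return z_t, v_target

# Training regresses v_theta(z_t, t, c) onto v_target, conditioned on the instruction c.
```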
2.2 Negative-Aware Fine-Tuning (NFT)
Core to MotionNFT is its negative-aware policy-gradient update. Each training instance contrasts "positive" and "negative" policy samples obtained through dual policy sampling. The NFT loss, parameterized by a scalar optimality reward, regresses the model's predicted velocity toward the target deterministic velocity while using the reward to modulate the relative contribution of positive and negative samples.
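The exact NFT objective is defined in the paper; the sketch below shows one plausible reward-weighted simplification, in which samples above a reward baseline reinforce the flow-matching regression and samples below it contribute with negative weight. Treat it as an illustrative stand-in, not the published loss.

```python
import torch

def negative_aware_fm_loss(v_pred: torch.Tensor,
                           v_target: torch.Tensor,
                           reward: torch.Tensor,
                           baseline: float = 0.5) -> torch.Tensor:
    """Illustrative negative-aware weighting of the flow-matching regression
    (a simplification; the paper defines the exact NFT objective).

    v_pred:   model velocity for each sampled edit, shape (B, C, H, W)
    v_target: target deterministic velocity, same shape
    reward:   scalar optimality reward per sample in [0, 1], shape (B,)
    baseline: reward level separating "positive" from "negative" samples
    """
    per_sample_mse = (v_pred - v_target).pow(2).mean(dim=(1, 2, 3))  # (B,)
    weight = reward - baseline  # > 0 for positive samples, < 0 for negative ones
    # Positives are pulled toward the target velocity; negatives enter with
    # negative weight, i.e., a repulsive (contrastive) term.
    return (weight * per_sample_mse).mean()
```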
3. Motion Alignment Reward: Optical Flow-Based Scoring
The motion alignment reward quantitatively assesses the fidelity of motion transformation using optical flow analysis at the pixel level.
- Magnitude Consistency: compares the per-pixel magnitudes of the normalized flows $F_{\text{gt}}$ and $F_{\text{pred}}$, rewarding displacements of matching strength.
- Direction Consistency: projects the predicted flow onto the unit direction of the ground-truth flow at each pixel, weighted by the ground-truth magnitude, so that directional agreement in strongly moving regions counts most.
- Movement Regularization: discourages trivial edits in which the output exhibits little or no motion.
- Composite Reward Score: the three components are combined, normalized, and quantized into a motion alignment reward $R_{\text{motion}}$. The final reward mixes motion alignment with semantic fidelity (scored by an MLLM, e.g., Gemini) through a balance parameter $\lambda$; a sketch of this computation follows below.
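The paper specifies exact formulas and weights for these terms; the sketch below is an approximation assembled only from the descriptions above, so the normalization scheme, the cosine-style direction term, the equal component weights, and the convex mixing with the MLLM score are all assumptions.

```python
import numpy as np

def motion_alignment_reward(flow_gt: np.ndarray,
                            flow_pred: np.ndarray,
                            mllm_score: float,
                            lam: float = 0.5,
                            eps: float = 1e-6) -> float:
    """Illustrative motion reward from dense flows of shape (H, W, 2).

    mllm_score: semantic fidelity score rescaled to [0, 1] (assumed range).
    lam:        balance between motion and semantic rewards (assumed convex mix).
    """
    mag_gt = np.linalg.norm(flow_gt, axis=-1)
    mag_pred = np.linalg.norm(flow_pred, axis=-1)

    # Normalize magnitudes to [0, 1] per image (assumed normalization scheme).
    norm_gt = mag_gt / (mag_gt.max() + eps)
    norm_pred = mag_pred / (mag_pred.max() + eps)

    # Magnitude consistency: 1 minus the mean absolute difference of normalized magnitudes.
    s_mag = 1.0 - np.abs(norm_pred - norm_gt).mean()

    # Direction consistency: cosine between flows, weighted by ground-truth magnitude.
    cos = (flow_gt * flow_pred).sum(-1) / (mag_gt * mag_pred + eps)
    s_dir = (norm_gt * np.clip(cos, 0.0, 1.0)).sum() / (norm_gt.sum() + eps)

    # Movement regularization: ratio of mean predicted to mean ground-truth motion,
    # capped at 1, to penalize near-static outputs.
    s_move = min(1.0, mag_pred.mean() / (mag_gt.mean() + eps))

    # Composite motion score (equal weights assumed), mixed with the MLLM fidelity score.
    r_motion = (s_mag + s_dir + s_move) / 3.0
    return lam * r_motion + (1.0 - lam) * mllm_score
```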
4. Implementation, Training, and Hyperparameters
MotionNFT is instantiated on two principal FMM backbones: FLUX.1 Kontext and Qwen-Image-Edit.
- Optimization: AdamW optimizer (learning rate and KL-loss weight as reported in the paper), batch size 2, guidance 1.0.
- Training Regime: 300 steps (FLUX.1 Kontext) and 210 steps (Qwen-Image-Edit); NFT sampling uses 6 diffusion steps, 8 samples per prompt, and 24 groups.
- Reward Parameters: the component weights, the motion/semantic balance $\lambda$, and the flow thresholds follow the values reported in the paper.
- Inference: 28 steps with CFG guidance 4.0 for both architectures (a configuration sketch follows below).
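The sketch below gathers the reported settings into a single configuration object; the container itself is illustrative, and fields left as `None` (learning rate, KL weight) are placeholders rather than values restated here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MotionNFTConfig:
    """Reported training/inference settings (illustrative container)."""
    backbone: str                          # "FLUX.1 Kontext" or "Qwen-Image-Edit"
    train_steps: int                       # 300 for FLUX.1 Kontext, 210 for Qwen-Image-Edit
    batch_size: int = 2
    train_guidance: float = 1.0
    nft_diffusion_steps: int = 6           # diffusion steps during NFT sampling
    samples_per_prompt: int = 8
    num_groups: int = 24
    inference_steps: int = 28
    inference_cfg: float = 4.0
    learning_rate: Optional[float] = None  # placeholder: value as reported in the paper
    kl_weight: Optional[float] = None      # placeholder: value as reported in the paper

flux_cfg = MotionNFTConfig(backbone="FLUX.1 Kontext", train_steps=300)
qwen_cfg = MotionNFTConfig(backbone="Qwen-Image-Edit", train_steps=210)
```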
5. Experimental Benchmarks and Results
5.1 Datasets
MotionEdit dataset: 10,157 image-instruction-target triplets, mined from continuous video sequences. Training split: 9,142; evaluation split (MotionEdit-Bench): 1,015 samples. Edits primarily involve pose, locomotion, orientation, state changes, subject–object and inter-subject interactions.
5.2 Evaluation Metrics
- Generative (MLLM-based): Fidelity, Preservation, Coherence, Overall (0–5 scale).
- Discriminative: Motion Alignment Score (MAS, 0–100), determined by the optical flow metric.
- Preference: Win rates from pairwise head-to-head MLLM evaluations.
5.3 Quantitative Outcomes
| Base Model | Overall (base → +MotionNFT) | MAS (base → +MotionNFT) | Win % (base → +MotionNFT) |
|---|---|---|---|
| FLUX.1 Kontext | 3.84 → 4.25 (+10.7%) | 53.73 → 55.45 | 57.97 → 65.16 |
| Qwen-Image-Edit | 4.65 → 4.72 | 56.46 → 57.23 | 72.99 → 73.87 |
MotionNFT yields consistent improvements in motion fidelity and MLLM-based generative quality. Baseline models frequently "freeze" subjects or generate erroneous poses, while models fine-tuned with MotionNFT closely align predicted flows with ground-truth motion, as documented in visual comparisons.
6. Ablation Analyses
6.1 Reward Balancing
Varying the balance parameter $\lambda$ between the motion and MLLM rewards demonstrates that exclusive reliance on the motion reward degrades output quality (Overall = 3.60 for FLUX.1 Kontext). Conversely, MLLM-only training (as in UniWorld-V2) provides only modest gains (Overall ≈ 4.20); the best results are obtained at an intermediate $\lambda$, matching the reported figures.
6.2 Motion Alignment Score Trends
MLLM-only fine-tuning often stagnates or reduces MAS mid-training, whereas MotionNFT ensures progressive MAS improvement across training epochs. This suggests that enforcing explicit motion alignment reward is critical for continual gains in motion fidelity.
6.3 Preservation of General Editing Ability
On ImgEdit-Bench (static, non-motion editing tasks), MotionNFT maintains or slightly improves generative scores compared to unfine-tuned base architectures, indicating no trade-off with general editing capacity.
7. Conceptual Significance and Outlook
MotionNFT represents a rigorous framework for enhancing motion-centric edits in generative diffusion models. By fusing optical-flow-based motion alignment evaluation with negative-aware fine-tuning, it addresses deficiencies in baseline models that often fail to capture dynamic spatial transformations. Integration with both FLUX.1 Kontext and Qwen-Image-Edit demonstrates its architecture-agnostic potential. A plausible implication is the extension of this reward paradigm to downstream tasks such as controllable video synthesis, animation, and multi-person interaction modeling, where accurate representation of spatial dynamics is essential.