Motion Alignment Score (MAS) Overview
- Motion Alignment Score (MAS) is a metric that quantifies motion fidelity by comparing predicted optical flow fields with ground truth using normalized displacement measures.
- It computes both magnitude and directional deviations, applying normalization and weighting to ensure scale invariance and robustness against outliers.
- MAS is integrated into training frameworks like MotionNFT, serving as an evaluation metric and reward signal to enhance motion-centric image editing performance.
The Motion Alignment Score (MAS) quantifies the fidelity and precision of motion-centric image edits by comparing predicted optical flow fields between model outputs and ground-truth targets. Introduced alongside the MotionEdit-Bench benchmark and the MotionNFT fine-tuning framework in "MotionEdit: Benchmarking and Learning Motion-Centric Image Editing," MAS serves as both an evaluation metric and a reward signal that guides generative models toward accurate motion transformations while preserving semantic and identity consistency (Wan et al., 11 Dec 2025).
1. Formal Definition and Mathematical Construction
MAS is computed by first extracting optical flow fields between the original image and both the edited sample and the ground-truth image using a pretrained flow network $\mathcal{F}$ (such as UniMatch). For images $I_{\text{orig}}$, $I_{\text{edit}}$, and $I_{\text{gt}}$, the corresponding flows are

$$V_{\text{pred}} = \mathcal{F}(I_{\text{orig}}, I_{\text{edit}}), \qquad V_{\text{gt}} = \mathcal{F}(I_{\text{orig}}, I_{\text{gt}}).$$

Normalization by the image diagonal yields resolution-agnostic displacements:

$$\tilde{V} = \frac{V}{\sqrt{H^2 + W^2}}, \qquad V \in \{V_{\text{pred}}, V_{\text{gt}}\}.$$
Two per-pixel deviations are computed:
- Magnitude deviation:
$$D_{\text{mag}} = \frac{1}{HW}\sum_{i,j}\left(\left\|\tilde{V}_{\text{pred}}(i,j) - \tilde{V}_{\text{gt}}(i,j)\right\|_1 + \epsilon\right)^{q}$$
- Direction deviation (for all pixels with sufficient GT motion, $\|\tilde{V}_{\text{gt}}(i,j)\| > \tau$):
$$e_{\text{dir}}(i,j) = 1 - \cos\angle\!\left(\tilde{V}_{\text{pred}}(i,j),\ \tilde{V}_{\text{gt}}(i,j)\right)$$
Directional errors are weighted by the ground-truth flow magnitude, $w(i,j) = \|\tilde{V}_{\text{gt}}(i,j)\|$, giving

$$D_{\text{dir}} = \frac{\sum_{i,j} w(i,j)\, e_{\text{dir}}(i,j)}{\sum_{i,j} w(i,j)}.$$
A weighted overlay combines both:

$$D_{\text{ovl}} = \alpha\, D_{\text{mag}} + \beta\, D_{\text{dir}}.$$

The final MAS is normalized and clipped to $[0,1]$:

$$\text{MAS} = 1 - \operatorname{clip}\!\left(\frac{D_{\text{ovl}} - D_{\min}}{D_{\max} - D_{\min}},\ 0,\ 1\right).$$

If overall predicted motion is negligible (mean $\|\tilde{V}_{\text{pred}}\|$ below a threshold $\tau_{\text{pred}}$), then $\text{MAS} = 0$.
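A minimal NumPy sketch of this construction is given below. It is a reconstruction from the formulas above rather than the reference implementation: the directional error is assumed to be a cosine distance, and `q`, `alpha`, `beta`, `tau`, `tau_pred`, `D_min`, and `D_max` are illustrative placeholders, not the paper's calibrated constants.

```python
import numpy as np

def motion_alignment_score(V_pred, V_gt, q=0.5, alpha=0.5, beta=0.5,
                           tau=1e-3, tau_pred=1e-4,
                           D_min=0.0, D_max=1.0, eps=1e-8):
    """MAS sketch for two flow fields of shape (H, W, 2), in pixel units."""
    H, W = V_gt.shape[:2]
    diag = np.sqrt(H ** 2 + W ** 2)
    Vp, Vg = V_pred / diag, V_gt / diag          # resolution-agnostic flows

    # Zeroing rule: negligible predicted motion scores 0 outright.
    if np.mean(np.linalg.norm(Vp, axis=-1)) < tau_pred:
        return 0.0

    # Magnitude deviation: per-pixel L1 gap, exponent q for outlier robustness.
    D_mag = np.mean((np.abs(Vp - Vg).sum(axis=-1) + eps) ** q)

    # Direction deviation (assumed cosine distance), restricted to pixels with
    # sufficient GT motion and weighted by GT flow magnitude.
    mag_p = np.linalg.norm(Vp, axis=-1)
    mag_g = np.linalg.norm(Vg, axis=-1)
    e_dir = 1.0 - (Vp * Vg).sum(axis=-1) / (mag_p * mag_g + eps)
    w = mag_g * (mag_g > tau)
    D_dir = (w * e_dir).sum() / (w.sum() + eps)

    # Weighted overlay, mapped into [0, 1] and inverted so higher is better.
    D_ovl = alpha * D_mag + beta * D_dir
    return 1.0 - np.clip((D_ovl - D_min) / (D_max - D_min), 0.0, 1.0)
```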
2. Optical Flow Field Estimation
MAS relies critically on robust flow estimation between image pairs. The choice of pretrained flow network $\mathcal{F}$ can include UniMatch, RAFT, or GMFlow. Both the $(I_{\text{orig}}, I_{\text{edit}})$ and $(I_{\text{orig}}, I_{\text{gt}})$ pairs are processed through $\mathcal{F}$, producing dense displacement fields in pixel coordinates. All resulting flows are divided by the image diagonal $\sqrt{H^2 + W^2}$ to ensure scale invariance across resolutions (Wan et al., 11 Dec 2025).
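For illustration, here is a flow-extraction sketch using torchvision's RAFT as a stand-in for UniMatch (the paper's choice of network; loading UniMatch itself would require its own checkpoint code):

```python
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).eval()
flow_transforms = weights.transforms()

@torch.no_grad()
def estimate_flow(img1: torch.Tensor, img2: torch.Tensor) -> torch.Tensor:
    """img1, img2: (N, 3, H, W) float tensors in [0, 1]; H and W should be
    divisible by 8 for RAFT. Returns per-pixel flow of shape (N, 2, H, W)."""
    img1, img2 = flow_transforms(img1, img2)
    flow_iters = model(img1, img2)   # list of iteratively refined estimates
    return flow_iters[-1]            # keep the final, most refined flow

def normalize_flow(flow: torch.Tensor) -> torch.Tensor:
    """Divide by the image diagonal so displacements are resolution-agnostic."""
    H, W = flow.shape[-2:]
    return flow / (H ** 2 + W ** 2) ** 0.5
```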
3. Normalization, Weighting, and Thresholding in MAS
The MAS formulation incorporates several normalization and weighting mechanisms to address outlier suppression, dynamic relevance, and scale alignment:
- Pixel-wise magnitude deviations are raised to an exponent $q$ for outlier robustness.
- Directional errors are weighted by the relative magnitude of ground-truth motion; static or nearly static pixels ($\|\tilde{V}_{\text{gt}}(i,j)\| \le \tau$) are excluded.
- The overlay distance is produced through a convex combination with empirically selected coefficients $(\alpha, \beta)$.
- The final MAS score is derived by mapping $D_{\text{ovl}}$ into a bounded range, shifted and scaled relative to dataset-specific bounds $D_{\min}$ and $D_{\max}$, with a hard zeroing rule applied if model-applied motion is extremely weak (mean predicted flow magnitude below $\tau_{\text{pred}}$); a toy example follows this list.
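A toy usage of `motion_alignment_score` from the sketch in Section 1 makes the mapping concrete (parameter values are the illustrative defaults, not the paper's calibrated constants):

```python
import numpy as np

H, W = 64, 64
rng = np.random.default_rng(0)
V_gt = rng.normal(scale=5.0, size=(H, W, 2))     # synthetic GT flow, in pixels

perfect = motion_alignment_score(V_gt.copy(), V_gt)            # identical flows
noisy = motion_alignment_score(
    V_gt + rng.normal(scale=2.0, size=V_gt.shape), V_gt)       # perturbed flow
static = motion_alignment_score(np.zeros_like(V_gt), V_gt)     # no motion at all

print(perfect, noisy, static)  # perfect > noisy; static triggers the zeroing rule -> 0.0
```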
4. Integration of MAS into Training Objectives
Within the MotionNFT framework, MAS is implemented as a core reward signal during negative-aware fine-tuning of diffusion models. The process operates as follows:
- For each training instance, $k$ candidate samples are generated in response to an editing instruction.
- MAS is calculated for each, then discretized to $\{0, 0.2, 0.4, 0.6, 0.8, 1.0\}$.
- A parallel “MLLM reward” assesses instruction fidelity and stylistic alignment.
- The final reward is an affine combination: $r_{\text{raw}} = \lambda_{\text{motion}}\, r_{\text{motion}} + (1 - \lambda_{\text{motion}})\, r_{\text{mllm}}$, with $\lambda_{\text{motion}} \in [0, 1]$.
- Rewards are group-wise normalized and used to interpolate between positive and negative velocity terms in the diffusion flow-matching loss:
$$\mathcal{L}(\theta) = \mathbb{E}_i\!\left[\, r_i \left\|v^{+}_{\theta} - v_{\text{target}}\right\|^2 + (1 - r_i)\left\|v^{-}_{\theta} - v_{\text{target}}\right\|^2 \right].$$
This design ensures that model updates are directly sensitive to the degree of motion alignment achieved, as quantified by MAS, balanced against general editing quality metrics (Wan et al., 11 Dec 2025).
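As a concrete sketch of this reward-weighted interpolation, assuming batched tensors and illustrative function and variable names (nothing below is taken verbatim from the paper's code), the objective could be written as:

```python
import torch

def nft_flow_matching_loss(r, v_pos, v_neg, v_target):
    """Reward-weighted flow-matching loss (illustrative sketch).
    r:        (B,) group-normalized rewards in [0, 1]
    v_pos:    (B, ...) positive-branch velocity predictions v_theta^+
    v_neg:    (B, ...) negative-branch velocity predictions v_theta^-
    v_target: (B, ...) flow-matching target velocities
    """
    dims = tuple(range(1, v_target.dim()))   # reduce over all non-batch dims
    pos_err = ((v_pos - v_target) ** 2).mean(dim=dims)
    neg_err = ((v_neg - v_target) ** 2).mean(dim=dims)
    # High-reward samples pull the positive branch toward the target;
    # low-reward samples pull the negative branch toward it instead.
    return (r * pos_err + (1.0 - r) * neg_err).mean()
```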
High-level Pseudocode of the MAS-based Training Loop
```python
for step in range(N_steps):
    # sample a minibatch of (image I_orig, instruction c, target I_gt)
    for (I_orig, c, I_gt) in minibatch:
        # 1) generate k candidate edits
        edits = [model.sample(c, I_orig) for _ in range(k)]

        # 2) compute optical flows
        V_gt = FlowNet(I_orig, I_gt)                      # H x W x 2
        raw_rewards = []
        for I_edit in edits:
            V_pred = FlowNet(I_orig, I_edit)

            # 3) compute D_mag, D_dir per the formulas above
            D_mag = mean((l1(tilde(V_pred) - tilde(V_gt)) + eps) ** q)
            D_dir = sum(w * e_dir) / sum(w)
            D_ovl = alpha * D_mag + beta * D_dir

            # 4) normalize and invert into [0, 1]
            norm = clip((D_ovl - D_min) / (D_max - D_min), 0, 1)
            r_motion = 1 - norm

            # 5) quantize to {0, 0.2, ..., 1.0}
            r_motion = round(5 * r_motion) / 5

            # 6) query the MLLM reward for this candidate
            r_mllm = MLLM_reward(I_orig, I_edit, c)

            # 7) combine into the raw reward
            raw_rewards.append(lambda_motion * r_motion
                               + (1 - lambda_motion) * r_mllm)

        # 8) group-wise normalize {r_raw} -> {r_i} in [0, 1] (Diffusion NFT)
        r = groupwise_normalize(raw_rewards)

        # 9) form positive/negative velocity terms v_plus (v_theta^+), v_minus (v_theta^-)
        # 10) loss L(theta) = E_i[ r_i * ||v_plus - v_target||^2
        #                        + (1 - r_i) * ||v_minus - v_target||^2 ]
        loss = nft_loss(r, v_plus, v_minus, v_target)
        optimizer.step(loss)
```
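The group-wise normalization in step 8 is inherited from Diffusion NFT and is not spelled out above; a minimal min-max reading over each group of $k$ candidates, stated here as an assumption rather than the paper's exact scheme, would be:

```python
import numpy as np

def groupwise_normalize(raw_rewards, eps=1e-8):
    """Min-max normalize raw rewards within one group of k candidates to [0, 1].
    One plausible reading of step 8; Diffusion NFT may use a different scheme."""
    r = np.asarray(raw_rewards, dtype=float)
    return (r - r.min()) / (r.max() - r.min() + eps)

print(groupwise_normalize([0.3, 0.9, 0.5, 0.7]))  # -> [0.  1.  0.333  0.666] (approx)
```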
5. Ablation Studies and Sensitivity Analysis
Ablation experiments demonstrate the importance and optimal usage of MAS in fine-tuning:
- Varying $\lambda_{\text{motion}}$ in the reward blend affects final alignment: pure-motion ($\lambda_{\text{motion}} = 1$) underperforms relative to the best mixed setting ($\lambda_{\text{motion}} = 0.5$); pure-MLLM ($\lambda_{\text{motion}} = 0$) yields higher semantic fidelity but reduced geometric precision.
- MAS alone is insufficient for highest visual quality, but its inclusion is critical for accurate motion transfer.
- During training, policies optimizing only MLLM rewards plateau or degrade in MAS, while MAS-guided optimization produces consistent improvements (+1–2 MAS points).
- On MotionEdit-Bench, MotionNFT-tuned variants consistently achieve higher MAS than baseline diffusion models (see the range reported in Section 6); the average MAS gain observed with MotionNFT is 1.2–1.7 points over competitive baselines (see Tables and Figures in (Wan et al., 11 Dec 2025)).
| Reward Mixing ($\lambda_{\text{motion}}$) | Visual Fidelity | MAS Precision |
|---|---|---|
| 0.0 (MLLM only) | High | Low |
| 0.5 (Optimal blend) | High | High |
| 1.0 (MAS only) | Lower | Moderate (some artifacts) |
A plausible implication is that MAS, while robust for motion fidelity, must be harmonized with broader perceptual signals to yield semantically and visually optimal results.
6. Empirical Ranges and Practical Implications
Typical MAS values, as reported on MotionEdit-Bench, range from 18 for weak baselines up to 57 for MotionNFT-tuned models, calibrated on a [0,100] scale (i.e., the [0,1] score of Section 1 scaled by 100). These figures characterize both the difficulty of the motion-centric editing task and the incremental benefit of MAS-guided fine-tuning. The zeroing rule, which sets MAS to zero for negligible predicted motion, prevents reward leakage in trivial or static cases and keeps the metric informative and reliable when assessing meaningful edits (Wan et al., 11 Dec 2025).
This rigorous construction and the empirical performance of MAS establish it as a specialized, quantitatively sensitive metric for training and benchmarking motion-centric image editing systems.