
Motion Alignment Score (MAS) Overview

Updated 17 December 2025
  • Motion Alignment Score (MAS) is a metric that quantifies motion fidelity by comparing predicted optical flow fields with ground truth using normalized displacement measures.
  • It computes both magnitude and directional deviations, applying normalization and weighting to ensure scale invariance and robustness against outliers.
  • MAS is integrated into training frameworks like MotionNFT, serving as an evaluation metric and reward signal to enhance motion-centric image editing performance.

The Motion Alignment Score (MAS) quantifies the fidelity and precision of motion-centric image edits by comparing predicted optical flow fields between model outputs and ground-truth targets. Introduced with MotionEdit-Bench and the MotionNFT fine-tuning framework in "MotionEdit: Benchmarking and Learning Motion-Centric Image Editing," MAS serves as both an evaluation metric and a reward signal that guides generative models toward accurate motion transformations while preserving semantic and identity consistency (Wan et al., 11 Dec 2025).

1. Formal Definition and Mathematical Construction

MAS is computed by first extracting optical flow fields between the original image and both the edited sample and the ground-truth image using a pretrained flow network $\mathcal F$ (such as UniMatch). For images $I_{\rm orig}\in\mathbb R^{H\times W\times 3}$, $I_{\rm edited}$, and $I_{\rm gt}$, the corresponding flows are

$$V_{\rm pred} = \mathcal F(I_{\rm orig}, I_{\rm edited}), \qquad V_{\rm gt} = \mathcal F(I_{\rm orig}, I_{\rm gt}), \qquad V \in \mathbb R^{H\times W\times 2}.$$

Normalization by the image diagonal $d=\sqrt{H^2+W^2}$ yields resolution-agnostic displacements:

$$\tilde V_{\rm pred}(i,j)=V_{\rm pred}(i,j)/d, \qquad \tilde V_{\rm gt}(i,j)=V_{\rm gt}(i,j)/d.$$

Two per-pixel deviations are computed:

  • Magnitude deviation:

$$D_{\rm mag} = \frac{1}{HW}\sum_{i,j} \left\|\tilde V_{\rm pred}(i,j)-\tilde V_{\rm gt}(i,j)\right\|_1^q \quad\text{with } q=0.4$$

  • Direction deviation (for all pixels with sufficient ground-truth motion, $m_{\rm gt}(i,j)>\tau_m$):

$$m_{\rm gt}(i,j)=\|\tilde V_{\rm gt}(i,j)\|_2, \qquad \hat v_{\rm gt}(i,j)=\frac{\tilde V_{\rm gt}(i,j)}{m_{\rm gt}(i,j)+\epsilon}$$

$$\hat v_{\rm pred}(i,j)=\frac{\tilde V_{\rm pred}(i,j)}{\|\tilde V_{\rm pred}(i,j)\|_2+\epsilon}$$

$$e_{\rm dir}(i,j)=\frac{1}{2}\left(1-\hat v_{\rm pred}(i,j)^\top\hat v_{\rm gt}(i,j)\right)$$

Directional errors are weighted by

$$w(i,j)=\frac{m_{\rm gt}(i,j)}{\max_{u,v} m_{\rm gt}(u,v) + \epsilon}\,\mathbf 1[m_{\rm gt}(i,j)>\tau_m],$$

giving

$$D_{\rm dir} = \frac{\sum_{i,j} w(i,j)\, e_{\rm dir}(i,j)}{\sum_{i,j} w(i,j) + \epsilon}$$

A weighted overlay combines both:

$$D_{\rm ovl} = \alpha D_{\rm mag} + \beta D_{\rm dir} \qquad (\alpha=0.7,\ \beta=0.3).$$

The final MAS is normalized and clipped to $[0,100]$:

$$\text{MAS} = 100\left(1 - \operatorname{clip}\left(\frac{D_{\rm ovl} - d_{\min}}{d_{\max}-d_{\min}},\,0,\,1\right)\right).$$

If the overall predicted motion is negligible, i.e. $\mathbb E[m_{\rm pred}]/\mathbb E[m_{\rm gt}] < \rho_{\min}$, then $\text{MAS}=0$.
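For concreteness, the following is a minimal NumPy sketch of the computation above, intended as an illustration under these definitions rather than the authors' implementation. It assumes the two raw flow fields have already been estimated; the calibration bounds d_min and d_max are dataset-specific (see Section 3), so the defaults below are placeholders.

import numpy as np

def motion_alignment_score(v_pred, v_gt, q=0.4, tau_m=1e-3, alpha=0.7, beta=0.3,
                           d_min=0.0, d_max=1.0, rho_min=0.01, eps=1e-8):
    # v_pred, v_gt: raw flow fields of shape (H, W, 2) in pixel units
    H, W = v_gt.shape[:2]
    d = np.sqrt(H ** 2 + W ** 2)
    vp, vg = v_pred / d, v_gt / d                       # resolution-agnostic flows

    # Magnitude deviation: per-pixel L1 difference raised to q < 1 for outlier robustness
    D_mag = np.mean(np.abs(vp - vg).sum(axis=-1) ** q)

    # Direction deviation: cosine-based error, weighted by relative GT motion magnitude
    m_gt = np.linalg.norm(vg, axis=-1)
    m_pred = np.linalg.norm(vp, axis=-1)
    u_gt = vg / (m_gt[..., None] + eps)
    u_pred = vp / (m_pred[..., None] + eps)
    e_dir = 0.5 * (1.0 - (u_pred * u_gt).sum(axis=-1))
    w = (m_gt / (m_gt.max() + eps)) * (m_gt > tau_m)
    D_dir = (w * e_dir).sum() / (w.sum() + eps)

    # Overlay distance, mapped and clipped into [0, 100]
    D_ovl = alpha * D_mag + beta * D_dir
    mas = 100.0 * (1.0 - np.clip((D_ovl - d_min) / (d_max - d_min), 0.0, 1.0))

    # Zeroing rule: negligible predicted motion receives no credit
    if m_pred.mean() / (m_gt.mean() + eps) < rho_min:
        mas = 0.0
    return float(mas)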

2. Optical Flow Field Estimation

MAS relies critically on robust flow estimation between image pairs. The choice of pretrained flow network $\mathcal F$ can include UniMatch, RAFT, or GMFlow. Both $(I_{\rm orig}, I_{\rm edited})$ and $(I_{\rm orig}, I_{\rm gt})$ pairs are processed through $\mathcal F$, producing dense displacement fields in pixel coordinates. All resulting flows are divided by the diagonal $d$ to ensure scale invariance across resolutions (Wan et al., 11 Dec 2025).
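The paper's pipeline uses UniMatch; as a hedged illustration of this step, the sketch below substitutes torchvision's pretrained RAFT (one of the alternatives named above) and normalizes the resulting flow by the image diagonal. The function name and I/O conventions are assumptions for this example, and RAFT expects spatial dimensions divisible by 8.

import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

def estimate_normalized_flow(img_src, img_dst):
    # img_src, img_dst: float tensors of shape (1, 3, H, W) with values in [0, 1]
    weights = Raft_Large_Weights.DEFAULT
    model = raft_large(weights=weights).eval()
    src, dst = weights.transforms()(img_src, img_dst)  # RAFT-specific preprocessing
    with torch.no_grad():
        flow = model(src, dst)[-1]                      # last refinement iteration, (1, 2, H, W)
    _, _, H, W = img_src.shape
    d = (H ** 2 + W ** 2) ** 0.5
    return flow[0].permute(1, 2, 0) / d                 # (H, W, 2), divided by the diagonal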

3. Normalization, Weighting, and Thresholding in MAS

The MAS formulation incorporates several normalization and weighting mechanisms to address outlier suppression, dynamic relevance, and scale alignment:

  • Pixel-wise magnitude deviations are raised to the exponent $q<1$ ($q=0.4$) for outlier robustness.
  • Directional errors are weighted by the relative magnitude of ground-truth motion; static or nearly static pixels ($m_{\rm gt}<\tau_m$, $\tau_m=10^{-3}$) are excluded.
  • The overlay distance $D_{\rm ovl}$ is produced through a convex combination with empirically selected coefficients ($\alpha$, $\beta$).
  • The final MAS score is derived by mapping $D_{\rm ovl}$ into a bounded range, shifted and scaled relative to the dataset-specific $(d_{\min}, d_{\max})$, and applying a hard zeroing rule if the model-applied motion is extremely weak ($\rho_{\min}=0.01$); see the worked example below.
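As a purely illustrative calculation with hypothetical calibration values $d_{\min}=0.01$ and $d_{\max}=0.25$: a prediction with $D_{\rm ovl}=0.06$ maps to $\text{MAS} = 100\,(1 - (0.06-0.01)/(0.25-0.01)) \approx 79.2$, whereas any prediction whose mean motion ratio falls below $\rho_{\min}=0.01$ receives $\text{MAS}=0$ regardless of $D_{\rm ovl}$.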

4. Integration of MAS into Training Objectives

Within the MotionNFT framework, MAS is implemented as a core reward signal during negative-aware fine-tuning of diffusion models. The process operates as follows:

  • For each training instance, $k$ samples are generated in response to an editing instruction.
  • MAS is calculated for each, then discretized to $\{0.0, 0.2, \dots, 1.0\}$.
  • A parallel "MLLM reward" assesses instruction fidelity and stylistic alignment.
  • The final reward is an affine combination: $r_{\rm raw}=\lambda_{\rm motion}\, r_{\rm motion}+(1-\lambda_{\rm motion})\, r_{\rm mllm}$, with $\lambda_{\rm motion}=0.5$.
  • Rewards are group-wise normalized and used to interpolate between positive and negative velocity terms in the diffusion flow-matching loss:

$$\mathcal L(\theta) = \mathbb E\left[\, r\, \|v^+_\theta - v_{\rm target}\|^2 + (1-r)\, \|v^-_\theta - v_{\rm target}\|^2 \,\right]$$

This design ensures that model updates are directly sensitive to the degree of motion alignment achieved, as quantified by MAS, balanced against general editing quality metrics (Wan et al., 11 Dec 2025).

High-level Pseudocode of the MAS-based Training Loop

for step in 1..N_steps:
  sample a minibatch of (image I_orig, instruction c, target I_gt)
  for each example in the batch:
    # 1) generate k candidate edits
    {I_edit^i = model.sample(c, I_orig)}_{i=1..k}
    # 2) compute the ground-truth optical flow once per example
    V_gt = FlowNet(I_orig, I_gt)                            # H×W×2
    for i in 1..k:
      V_pred = FlowNet(I_orig, I_edit^i)
      # 3) compute D_mag, D_dir per the formulas above (flows normalized by d = sqrt(H^2 + W^2))
      D_mag = mean over pixels of ||~V_pred - ~V_gt||_1 ^ q         # q = 0.4
      D_dir = sum over pixels of w * e_dir / (sum over pixels of w + eps)
      D_ovl = alpha * D_mag + beta * D_dir                          # alpha = 0.7, beta = 0.3
      # 4) normalize and invert into [0, 1]
      norm = clip((D_ovl - d_min) / (d_max - d_min), 0, 1)
      r_motion^i = 1 - norm
      # 5) quantize to {0.0, 0.2, ..., 1.0}
      r_motion^i = round(5 * r_motion^i) / 5
    # 6) query the MLLM reward r_mllm^i for each I_edit^i
    # 7) combine into the raw reward
    r_raw^i = lambda_motion * r_motion^i + (1 - lambda_motion) * r_mllm^i
  # 8) group-wise normalize {r_raw^i} → {r_i} in [0, 1] (Diffusion NFT)
  # 9) form positive/negative velocity terms v^+_θ, v^-_θ
  # 10) compute loss L(θ) = E_i [ r_i ||v^+_θ - v_target||^2 + (1 - r_i) ||v^-_θ - v_target||^2 ]
  optimizer.step(L)
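As a complement to the pseudocode, the snippet below is a compact PyTorch sketch of steps 7-10 for a single group of k candidates. It assumes the per-candidate rewards and the positive/negative velocity predictions are already available; the min-max group normalization shown is a simple stand-in, not necessarily the exact Diffusion NFT normalization.

import torch

def nft_loss(r_motion, r_mllm, v_pos, v_neg, v_target, lambda_motion=0.5, eps=1e-8):
    # r_motion, r_mllm: shape (k,) rewards for one group of k candidates
    # v_pos, v_neg, v_target: shape (k, ...) velocity tensors
    # 7) blend the motion reward with the MLLM reward
    r_raw = lambda_motion * r_motion + (1.0 - lambda_motion) * r_mllm
    # 8) group-wise normalization into [0, 1] (min-max used here as a simple stand-in)
    r = (r_raw - r_raw.min()) / (r_raw.max() - r_raw.min() + eps)
    # 9)-10) interpolate between positive and negative velocity objectives
    pos_err = ((v_pos - v_target) ** 2).flatten(1).sum(dim=1)
    neg_err = ((v_neg - v_target) ** 2).flatten(1).sum(dim=1)
    return (r * pos_err + (1.0 - r) * neg_err).mean()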

5. Ablation Studies and Sensitivity Analysis

Ablation experiments demonstrate the importance and optimal usage of MAS in fine-tuning:

  • Varying $\lambda_{\rm motion}$ in the reward blend affects final alignment: pure-motion ($\lambda=1.0$) underperforms relative to the best mixed setting ($\lambda=0.5$); pure-MLLM ($\lambda=0$) yields higher semantic fidelity but reduced geometric precision.
  • MAS alone is insufficient for highest visual quality, but its inclusion is critical for accurate motion transfer.
  • During training, policies optimizing only MLLM rewards plateau or degrade in MAS, while MAS-guided optimization produces consistent improvements (+1–2 MAS points).
  • On MotionEdit-Bench, baseline diffusion models achieve MAS $\approx 18$, while MotionNFT-tuned variants reach MAS $\approx 57$. The average MAS gain observed with MotionNFT is 1.2–1.7 points over competitive baselines (see the tables and figures in (Wan et al., 11 Dec 2025)).

Reward Mixing ($\lambda_{\rm motion}$) | Visual Fidelity | MAS Precision
0.0 (MLLM only)                        | High            | Low
0.5 (optimal blend)                    | High            | High
1.0 (MAS only)                         | Lower           | Moderate (some artifacts)

A plausible implication is that MAS, while robust for motion fidelity, must be harmonized with broader perceptual signals to yield semantically and visually optimal results.

6. Empirical Ranges and Practical Implications

Typical MAS values, as reported on MotionEdit-Bench, range from $\sim 18$ for weak baselines up to $\sim 57$ for MotionNFT-tuned models on the $[0,100]$ scale. These figures characterize both the difficulty of the motion-centric editing task and the incremental benefits yielded by MAS-guided fine-tuning. The zeroing rule, which sets MAS to zero when predicted motion is negligible, prevents reward leakage in trivial or static cases, ensuring that the metric remains informative and reliable for assessing meaningful edits (Wan et al., 11 Dec 2025).

This rigorous construction and the empirical performance of MAS establish it as a specialized, quantitatively sensitive metric for training and benchmarking motion-centric image editing systems.

References

Wan et al. "MotionEdit: Benchmarking and Learning Motion-Centric Image Editing." 11 Dec 2025.
