NRVBench: Non-Rigid Video Editing Benchmark

Updated 2 February 2026
  • NRVBench is a comprehensive benchmark for non-rigid video editing, featuring a curated 180-clip dataset and detailed physics-based deformation categories.
  • It introduces NRVE-Acc, an innovative evaluation metric using vision-language models to assess instruction alignment, physical plausibility, and temporal consistency.
  • The official baseline, VM-Edit, employs dual-clock anchoring and region-conditioned sampling to achieve precise local edits while preserving global scene integrity.

NRVBench is the first dedicated and comprehensive benchmarking suite for non-rigid video editing, designed to address the limitations of prior text-driven video editing methods in generating physically plausible, temporally coherent non-rigid deformations. It establishes standardized protocols for dataset construction, task definition, evaluation metrics, and baseline methodologies, thereby enabling rigorous assessment and advancement of physics-aware video editing systems (Qu et al., 26 Jan 2026).

1. Dataset Curation and Task Taxonomy

NRVBench comprises a curated dataset of 180 high-quality video clips, obtained from DAVIS and Pexels, each featuring a single primary non-rigidly deformable subject. Clips are trimmed to 60 frames at a minimum 720p resolution, ensuring single-shot scenes. Precise segmentation masks are generated by SAM2 and manually refined to pixel-perfect quality via a structured three-role human-in-the-loop workflow (Annotator A → Reviewer B → Lead). Each video is annotated according to six physics-based deformation categories:

| Category | Description | Examples |
|---|---|---|
| ASB | Articulated Soft Bodies (joint constraints) | Humans, animals |
| CTS | Cloth/Thin-Shells (folding, surface continuity) | Fabrics, textiles |
| HFF | Hair/Fur/Feathers (fibrous coherence) | Animal hair, feathers |
| LFS | Liquid Free Surfaces (volume, flow) | Water, splashes |
| GSF | Gas/Smoke/Fire (turbulence, topology) | Smoke, fire |
| DSO | Deformable Solids (elastic recovery, integrity) | Rubber, clay |

Task instructions are defined via 2,340 fine-grained, physics-anchored edit prompts. Prompt templates stem from a three-level edit taxonomy (Degree, Topology, Attribute), refined on a pilot set (Benchmark-V0). Prompts are generated via GPT-4o, expressly referencing the deformable object and embedding physical constraints (e.g., “increase cloth stiffness to sharpen folds”), with explicit preservation of scene context. Diagnostic assessment employs 360 category-calibrated multiple-choice questions (MCQs) per clip and edit, covering identity (instruction alignment), physics, and temporal criteria. All MCQs are verified and adjudicated for exclusivity and clarity.

2. NRVE-Acc: VLM-Based Evaluation Protocol

NRVBench’s evaluation uses NRVE-Acc, a purpose-built metric leveraging vision-language models (Qwen2.5-VL) to overcome the insensitivity of classical metrics (LPIPS, SSIM, CLIP-Sim) to non-rigid motion and fine-grained physical dynamics. NRVE-Acc is computed as follows:

Components

  1. Instruction Alignment ($S_{\mathrm{instr}}$): Fraction of MCQs answered correctly. Given $N_{\mathrm{MCQs}}$ questions,

$$S_{\mathrm{instr}} = \frac{1}{N_{\mathrm{MCQs}}} \sum_{i=1}^{N_{\mathrm{MCQs}}} \mathbf{1}[\text{answer}_i = \text{gt}_i]$$

scaled to $[0, 100]$.

  2. Physical Plausibility ($S_{\mathrm{phy}}$): Likert-scale (1–5) VLM rating for physical law adherence (e.g., volume conservation, topology). Normalized as

$$S_{\mathrm{phy}} = \frac{\text{rating} - 1}{5 - 1}$$

scaled to $[0, 100]$.

  3. Temporal Consistency ($S_{\mathrm{temp}}$): VLM judgment on sampled optical-flow frames, with grades {A, B, C} mapped to scores $\{100, 50, 0\}$, reflecting degrees of flicker or “teleportation.”
  4. Aggregate Score:

$$\text{NRVE-Acc}(V_{\text{edit}}) = \prod_{k \in \{\mathrm{instr},\, \mathrm{phy},\, \mathrm{temp}\}} \left( \frac{S_k}{100} + \epsilon \right)^{1/3} \times 100$$

where $\epsilon = 10^{-6}$ for numerical stability.

This design ensures sharp penalties for deficiencies along any axis (instruction, physics, or temporal fidelity), creating a robust metric for complex non-rigid edits.
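The aggregate defined above can be sketched in a few lines (a minimal illustration of the formula; the function and variable names are my own, not from the benchmark’s released code):

```python
def nrve_acc(s_instr: float, s_phy: float, s_temp: float,
             eps: float = 1e-6) -> float:
    """Geometric mean of the three axis scores (each on [0, 100]),
    rescaled back to [0, 100]. A near-zero score on any single axis
    drags the whole aggregate toward zero."""
    prod = 1.0
    for s in (s_instr, s_phy, s_temp):
        prod *= s / 100.0 + eps  # eps keeps the product nonzero
    return prod ** (1.0 / 3.0) * 100.0

# Strong scores on every axis keep the aggregate near 100:
print(round(nrve_acc(100, 100, 100), 2))
# A single failing axis (e.g. severe flicker, S_temp = 0)
# collapses the aggregate:
print(round(nrve_acc(100, 100, 0), 2))
```

The multiplicative form is what enforces the “sharp penalty” behavior: unlike an arithmetic mean, one zero axis cannot be compensated by the other two.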

3. Baseline Method: VM-Edit

VM-Edit constitutes the official training-free baseline for NRVBench, operating as a region-conditioned, dual-region diffusion sampler. The technique applies a two-clock anchoring mechanism to foreground (editable) versus background regions, optimizing the tradeoff between structural preservation and dynamic deformation without retraining.

Algorithmic Steps:

  • Latent Encoding: The source video $V_{\text{src}}$ is encoded via a pretrained VAE to $z_0^{\text{src}}$; at diffusion step $t$:

$$z_t^{\text{src}} = \alpha_t z_0^{\text{src}} + \sigma_t \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$

  • Reverse Denoising: The conditioned denoiser $\Phi$ yields:

$$\hat{z}_{t-1} = \Phi(z_t, t, P_{\text{tgt}}, I_0)$$

  • Dual-Region Recomposition: A downsampled mask $m$ splits the update:

$$z_{t-1} \leftarrow m \odot z_{t-1}^{\text{fg}} + (1 - m) \odot z_{t-1}^{\text{bg}}$$

  • Two-Clock Anchoring:

$$z_{t-1}^{\text{bg}} = \begin{cases} z_{t-1}^{\text{src}}, & t > \tau_{\text{bg}} \\ \hat{z}_{t-1}, & t \leq \tau_{\text{bg}} \end{cases}$$

Foreground ($\tau_{\text{fg}}$) and background ($\tau_{\text{bg}}$) clocks are calibrated per category (e.g., ASB/HFF: (5, 3); CTS/DSO: (7, 5); LFS/GSF: (9, 6); on a 0–10 scale).
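The recomposition and anchoring steps above can be sketched per denoising iteration roughly as follows (a schematic with NumPy arrays standing in for latents; the helper names and toy values are illustrative assumptions, not the paper’s implementation):

```python
import numpy as np

def recompose_step(z_fg, z_bg, mask):
    """Dual-region recomposition: foreground latents inside the
    mask, background latents outside."""
    return mask * z_fg + (1.0 - mask) * z_bg

def background_latent(z_src_tm1, z_hat_tm1, t, tau_bg):
    """Two-clock anchoring for the background: copy the source
    latent while t > tau_bg (anchored phase), then switch to the
    denoiser's prediction once t <= tau_bg."""
    return z_src_tm1 if t > tau_bg else z_hat_tm1

# Toy latents: denoiser predicts 1s, source trajectory is 0s.
shape = (4, 4)
mask = np.zeros(shape)
mask[1:3, 1:3] = 1.0        # editable (foreground) region
z_hat = np.ones(shape)      # denoiser output at step t-1
z_src = np.zeros(shape)     # re-noised source latent at step t-1

# Early step (t = 8 > tau_bg = 5): background stays anchored.
z_bg = background_latent(z_src, z_hat, t=8, tau_bg=5)
z_next = recompose_step(z_hat, z_bg, mask)
assert (z_next == mask).all()  # only the masked region was updated
```

The sketch makes the design choice visible: while the background clock has not yet expired, pixels outside the mask follow the source trajectory exactly, which is what preserves global scene integrity during the edit.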

This approach achieves surgical local edits while preserving global background integrity, outperforming pure propagation, inversion, or full-frame generative baselines under region-conditioned tasks.

4. Experimental Benchmarking and Key Findings

NRVBench V1 comprises 180 clips × 60 frames, supporting extensive benchmarking across standard and custom metrics.

Summary Table: Core Metrics (V1):

| Method | Struct Dist (↓) | PSNR (↑) | LPIPS (↓) | SSIM (↑) | CLIP-S/full (↑) | Motion Fidelity (↑) | NIQE (↓) | FPS |
|---|---|---|---|---|---|---|---|---|
| VM-Edit | 8.69 | 35.79 | 47.89 | 95.72 | 26.15 | 60.94 | 7.36 | 2.21 |
| Wan-Edit | 17.66 | 29.37 | 111.92 | 92.15 | 26.63 | 60.65 | 8.28 | 2.67 |
| TokenFlow | 111.93 | 21.10 | 134.31 | 75.30 | 26.47 | 58.49 | 7.98 | 3.11 |

VM-Edit yields the highest spatial fidelity and background PSNR, robust perceptual quality, and competitive text and motion alignment at modest speed cost.

NRVE-Acc Results:

| Method | S_phy | S_temp | S_instr | NRVE-Acc | Time (s) |
|---|---|---|---|---|---|
| Wan-Edit | 73.22 | 54.56 | 60.65 | 36.88 | 0.37 |
| VM-Edit (O) | 71.44 | 49.44 | 68.89 | 34.71 | 0.45 |
| TokenFlow | 71.00 | 45.78 | 67.50 | 33.05 | 0.32 |
| AnyV2V | 71.33 | 43.11 | 65.56 | 30.35 | 0.35 |

Wan-Edit scores slightly higher on NRVE-Acc but suffers significant background drift (PSNR 6.42 dB lower than VM-Edit). VM-Edit uniquely combines targeted local edits with background integrity, securing high scores across all NRVE-Acc dimensions.

5. Comparative Analysis and Methodological Limitations

Propagation-based methods (AnyV2V, TokenFlow) exhibit pronounced structure distortion and temporal flicker under large non-rigid deformations due to broken feature correspondences. Full-frame generators (Wan-Edit, Pyramid-Edit) maintain global motion smoothness but induce unacceptable background drift and weak structure fidelity during region-specific tasks. Inversion-based techniques, while faster, oversmooth local details or cause topological collapse under heavy deformation scenarios.

Qualitative Observations:

  • ASB: VM-Edit maintains anatomically plausible limb proportions, unlike competitors which generate unrealistic stretching.
  • CTS: VM-Edit renders crisp cloth folds, while propagation methods introduce tearing.
  • LFS/GSF: VM-Edit generates natural splashes and smoke; other methods freeze fluid flow or show flicker.

A plausible implication is that segmentation precision and region-conditioned anchoring are crucial for physically plausible non-rigid video editing.

6. Impact and Future Extensions

NRVBench, with its physics-grounded dataset, prompt engineering, diagnostically verified MCQs, and NRVE-Acc metric, sets a new benchmark standard for evaluating non-rigid video editing systems. Process-level baselines, such as VM-Edit, illustrate the benefit of dual-clock anchoring and region-conditioned sampling in achieving local edit precision without sacrificing global consistency.

Potential future directions include expanding the taxonomy to encompass more nuanced deformation classes, integrating interactive or dialogue-based constraint specification, augmenting VLM QA modules for finer-grained instruction-physics alignment, and developing adversarial edits to probe model weaknesses. Community adoption of NRVBench may facilitate accelerated progress in physically accurate, robust video editing methodologies.
