NRVBench: Non-Rigid Video Editing Benchmark
- NRVBench is a comprehensive benchmark for non-rigid video editing, featuring a curated 180-clip dataset and detailed physics-based deformation categories.
- It introduces NRVE-Acc, an innovative evaluation metric using vision-language models to assess instruction alignment, physical plausibility, and temporal consistency.
- The official baseline, VM-Edit, employs dual-clock anchoring and region-conditioned sampling to achieve precise local edits while preserving global scene integrity.
NRVBench is the first dedicated and comprehensive benchmarking suite for non-rigid video editing, designed to address the limitations of prior text-driven video editing methods in generating physically plausible, temporally coherent non-rigid deformations. It establishes standardized protocols for dataset construction, task definition, evaluation metrics, and baseline methodologies, thereby enabling rigorous assessment and advancement of physics-aware video editing systems (Qu et al., 26 Jan 2026).
1. Dataset Curation and Task Taxonomy
NRVBench comprises a curated dataset of 180 high-quality video clips, obtained from DAVIS and Pexels, each featuring a single primary non-rigidly deformable subject. Clips are trimmed to 60 frames at a minimum 720p resolution, ensuring single-shot scenes. Precise segmentation masks are generated by SAM2 and manually refined to pixel-perfect quality via a structured three-role human-in-the-loop workflow (Annotator A → Reviewer B → Lead). Each video is annotated according to six physics-based deformation categories:
| Category | Description | Examples |
|---|---|---|
| ASB | Articulated Soft Bodies (joint constraints) | Humans, animals |
| CTS | Cloth/Thin-Shells (folding, surface continuity) | Fabrics, textiles |
| HFF | Hair/Fur/Feathers (fibrous coherence) | Animal hair, feathers |
| LFS | Liquid Free Surfaces (volume, flow) | Water, splashes |
| GSF | Gas/Smoke/Fire (turbulence, topology) | Smoke, fire |
| DSO | Deformable Solids (elastic recovery, integrity) | Rubber, clay |
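The six-category taxonomy above can be represented as a small enumeration; a minimal sketch (the class and field names are hypothetical, not from the benchmark's released code):

```python
from enum import Enum

class DeformationCategory(Enum):
    """The six physics-based deformation categories in NRVBench."""
    ASB = "Articulated Soft Bodies"   # joint constraints: humans, animals
    CTS = "Cloth/Thin-Shells"         # folding, surface continuity: fabrics
    HFF = "Hair/Fur/Feathers"         # fibrous coherence: animal hair, feathers
    LFS = "Liquid Free Surfaces"      # volume, flow: water, splashes
    GSF = "Gas/Smoke/Fire"            # turbulence, topology: smoke, fire
    DSO = "Deformable Solids"         # elastic recovery, integrity: rubber, clay

# Definition order matches the benchmark's category table.
category_codes = [c.name for c in DeformationCategory]
```

Keeping the category as an enum rather than a free-form string makes per-category calibration (e.g., the anchoring clocks discussed later) a simple lookup.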
Task instructions are defined via 2,340 fine-grained, physics-anchored edit prompts. Prompt templates stem from a three-level edit taxonomy (Degree, Topology, Attribute), refined on a pilot set (Benchmark-V0). Prompts are generated via GPT-4o, expressly referencing the deformable object and embedding physical constraints (e.g., “increase cloth stiffness to sharpen folds”), with explicit preservation of scene context. Diagnostic assessment employs 360 category-calibrated multiple-choice questions (MCQs) per clip and edit, covering identity (instruction alignment), physics, and temporal criteria. All MCQs are verified and adjudicated for exclusivity and clarity.
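A plausible record shape for the prompts and diagnostic MCQs described above might look as follows (field names are illustrative assumptions, not the benchmark's schema):

```python
from dataclasses import dataclass

@dataclass
class EditPrompt:
    # One physics-anchored edit instruction tied to a clip.
    clip_id: str
    category: str      # one of: ASB, CTS, HFF, LFS, GSF, DSO
    level: str         # edit-taxonomy level: Degree, Topology, or Attribute
    instruction: str   # e.g. "increase cloth stiffness to sharpen folds"

@dataclass
class MCQ:
    # Diagnostic multiple-choice question with one correct, mutually exclusive answer.
    axis: str          # evaluation axis: identity | physics | temporal
    question: str
    options: list
    answer: int        # index of the correct option

prompt = EditPrompt("pexels_0042", "CTS", "Attribute",
                    "increase cloth stiffness to sharpen folds")
mcq = MCQ("physics",
          "Do the folds sharpen while the cloth surface stays continuous?",
          ["Yes", "No", "Partially"], 0)
```

Storing the taxonomy level and evaluation axis explicitly lets accuracy be broken down per category and per criterion, which is what the NRVE-Acc protocol below reports.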
2. NRVE-Acc: VLM-Based Evaluation Protocol
NRVBench’s evaluation uses NRVE-Acc, a purpose-built metric leveraging vision-language models (Qwen2.5-VL) to overcome the insensitivity of classical metrics (LPIPS, SSIM, CLIP-Sim) to non-rigid motion and fine-grained physical dynamics. NRVE-Acc is computed as follows:
Components
- Instruction Alignment ($S_{\text{instr}}$): fraction of MCQs answered correctly. Given $c$ correct answers out of $N$ questions,
$$S_{\text{instr}} = \frac{c}{N} \times 100,$$
scaled to $[0, 100]$.
- Physical Plausibility ($S_{\text{phy}}$): Likert-scale (1–5) VLM rating $r$ for physical-law adherence (e.g., volume conservation, topology preservation), normalized as
$$S_{\text{phy}} = \frac{r - 1}{4} \times 100,$$
scaled to $[0, 100]$.
- Temporal Consistency ($S_{\text{temp}}$): VLM judgment on sampled optical-flow frames, with the letter grades {A, B, C} mapped to descending numeric scores reflecting increasing degrees of flicker or “teleportation.”
- Aggregate Score: NRVE-Acc combines the three components multiplicatively, with a small constant $\epsilon > 0$ included for numerical stability, so that a deficiency along any single axis sharply lowers the overall score.
This design ensures sharp penalties for deficiencies along any axis (instruction, physics, or temporal fidelity), creating a robust metric for complex non-rigid edits.
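The component scores above can be sketched in a few lines. The exact aggregation formula is not reproduced here, so the sketch assumes a geometric mean with a stability constant, and an assumed {A: 100, B: 50, C: 0} mapping for the temporal grades:

```python
# Hedged sketch of NRVE-Acc scoring; the aggregate formula and the
# letter-grade mapping below are assumptions, not the published definitions.
EPS = 1e-6
TEMP_GRADE = {"A": 100.0, "B": 50.0, "C": 0.0}  # assumed mapping

def s_instr(correct: int, total: int) -> float:
    """Instruction alignment: fraction of MCQs answered correctly, on [0, 100]."""
    return 100.0 * correct / total

def s_phy(likert: int) -> float:
    """Physical plausibility: Likert rating in 1..5, normalized to [0, 100]."""
    return 100.0 * (likert - 1) / 4

def s_temp(grade: str) -> float:
    """Temporal consistency: VLM letter grade mapped to a numeric score."""
    return TEMP_GRADE[grade]

def nrve_acc(instr: float, phy: float, temp: float) -> float:
    # Multiplicative (geometric-mean) aggregation: a weak score on any
    # axis pulls the overall score down sharply; EPS avoids a hard zero.
    return ((instr + EPS) * (phy + EPS) * (temp + EPS)) ** (1.0 / 3.0)

score = nrve_acc(s_instr(18, 20), s_phy(4), s_temp("B"))
```

Under this assumed aggregation, a clip that fully fails one axis (score near 0) drags the overall NRVE-Acc toward 0 regardless of the other two axes, which matches the "sharp penalties along any axis" design goal.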
3. Baseline Method: VM-Edit
VM-Edit constitutes the official training-free baseline for NRVBench, operating as a region-conditioned, dual-region diffusion sampler. The technique applies a two-clock anchoring mechanism to foreground (editable) versus background regions, optimizing the tradeoff between structural preservation and dynamic deformation without retraining.
Algorithmic Steps:
- Latent Encoding: the video is encoded via a pretrained VAE into latents $z_0$; at diffusion step $t$, the noised latent is
$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$
- Reverse Denoising: a prompt-conditioned denoiser yields the edited estimate $\hat{z}_{t-1}$ at each reverse step.
- Dual-Region Recomposition: a downsampled mask $M$ splits the update between foreground and background branches:
$$z_{t-1} = M \odot \hat{z}^{\,\text{fg}}_{t-1} + (1 - M) \odot z^{\,\text{bg}}_{t-1}.$$
- Two-Clock Anchoring: foreground ($\tau_{\text{fg}}$) and background ($\tau_{\text{bg}}$) clocks govern how long each region stays anchored to the source latents, and are calibrated per category (e.g., ASB/HFF: (5, 3); CTS/DSO: (7, 5); LFS/GSF: (9, 6); on a 0–10 scale).
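The dual-region recomposition and two-clock anchoring steps can be sketched as a masked blend inside a reverse-diffusion loop. This is a minimal NumPy sketch, not VM-Edit's implementation: the denoiser and source-latent functions are stubs, and the clock semantics (a larger clock value releases a region from the source earlier, permitting stronger deformation) are an assumption:

```python
import numpy as np

def recompose(z_edit, z_src, mask, anchor_fg, anchor_bg):
    """One dual-region step: in each region, keep the source latent while
    that region's clock says 'anchor', otherwise keep the edited latent."""
    z_fg = z_src if anchor_fg else z_edit
    z_bg = z_src if anchor_bg else z_edit
    return mask * z_fg + (1.0 - mask) * z_bg

def dual_clock_sample(z_T, z_src_at, mask, denoise, tau_fg, tau_bg, T=10):
    # z_T: initial noisy latent; mask: downsampled foreground mask in [0, 1].
    # denoise(z, t): one reverse step of a conditioned denoiser (stub).
    # z_src_at(t): re-noised source latent at step t (stub).
    # Assumed schedule: a region stays anchored for the first (T - tau)
    # reverse steps, so tau_fg > tau_bg frees the foreground earlier while
    # the background remains pinned to the source longer.
    z = z_T
    for t in range(T, 0, -1):
        z = denoise(z, t)
        z = recompose(z, z_src_at(t), mask,
                      anchor_fg=(t > T - tau_fg),
                      anchor_bg=(t > T - tau_bg))
    return z
```

With the category-calibrated pair (5, 3), for instance, the foreground would run free for the last 5 of 10 steps while the background runs free for only the last 3, trading deformation freedom inside the mask against preservation outside it.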
This approach achieves surgical local edits while preserving global background integrity, outperforming pure propagation, inversion, or full-frame generative baselines under region-conditioned tasks.
4. Experimental Benchmarking and Key Findings
NRVBench V1 comprises 180 clips × 60 frames, supporting extensive benchmarking across standard and custom metrics.
Summary Table: Core Metrics (V1):
| Method | Struct Dist (↓) | PSNR (↑) | LPIPS (↓) | SSIM (↑) | CLIP-S/full (↑) | Motion Fidelity (↑) | NIQE (↓) | FPS |
|---|---|---|---|---|---|---|---|---|
| VM-Edit | 8.69 | 35.79 | 47.89 | 95.72 | 26.15 | 60.94 | 7.36 | 2.21 |
| Wan-Edit | 17.66 | 29.37 | 111.92 | 92.15 | 26.63 | 60.65 | 8.28 | 2.67 |
| TokenFlow | 111.93 | 21.10 | 134.31 | 75.30 | 26.47 | 58.49 | 7.98 | 3.11 |
VM-Edit yields the highest spatial fidelity and background PSNR, robust perceptual quality, and competitive text and motion alignment at modest speed cost.
NRVE-Acc Results:
| Method | S_phy | S_temp | S_instr | NRVE-Acc | Time (s) |
|---|---|---|---|---|---|
| Wan-Edit | 73.22 | 54.56 | 60.65 | 36.88 | 0.37 |
| VM-Edit (O) | 71.44 | 49.44 | 68.89 | 34.71 | 0.45 |
| TokenFlow | 71.00 | 45.78 | 67.50 | 33.05 | 0.32 |
| AnyV2V | 71.33 | 43.11 | 65.56 | 30.35 | 0.35 |
Wan-Edit scores slightly higher on NRVE-Acc but suffers from significant background drift (background PSNR 6.42 dB lower than VM-Edit's). VM-Edit uniquely combines targeted local edits with background integrity, securing strong scores across all NRVE-Acc dimensions.
5. Comparative Analysis and Methodological Limitations
Propagation-based methods (AnyV2V, TokenFlow) exhibit pronounced structure distortion and temporal flicker under large non-rigid deformations due to broken feature correspondences. Full-frame generators (Wan-Edit, Pyramid-Edit) maintain global motion smoothness but induce unacceptable background drift and weak structure fidelity during region-specific tasks. Inversion-based techniques, while faster, oversmooth local details or cause topological collapse under heavy deformation scenarios.
Qualitative Observations:
- ASB: VM-Edit maintains anatomically plausible limb proportions, unlike competitors which generate unrealistic stretching.
- CTS: VM-Edit renders crisp cloth folds, while propagation methods introduce tearing.
- LFS/GSF: VM-Edit generates natural splashes and smoke; other methods freeze fluid flow or show flicker.
A plausible implication is that segmentation precision and region-conditioned anchoring are crucial for physically plausible non-rigid video editing.
6. Impact and Future Extensions
NRVBench, with its physics-grounded dataset, prompt engineering, diagnostically verified MCQs, and NRVE-Acc metric, sets a new benchmark standard for evaluating non-rigid video editing systems. Process-level baselines, such as VM-Edit, illustrate the benefit of dual-clock anchoring and region-conditioned sampling in achieving local edit precision without sacrificing global consistency.
Potential future directions include expanding the taxonomy to encompass more nuanced deformation classes, integrating interactive or dialogue-based constraint specification, augmenting VLM QA modules for finer-grained instruction-physics alignment, and developing adversarial edits to probe model weaknesses. Community adoption of NRVBench may facilitate accelerated progress in physically accurate, robust video editing methodologies.