NRVBench: Non-Rigid Video Editing Benchmark
- NRVBench is a comprehensive benchmark for non-rigid video editing, featuring a curated 180-clip dataset and detailed physics-based deformation categories.
- It introduces NRVE-Acc, an innovative evaluation metric using vision-language models to assess instruction alignment, physical plausibility, and temporal consistency.
- The official baseline, VM-Edit, employs dual-clock anchoring and region-conditioned sampling to achieve precise local edits while preserving global scene integrity.
NRVBench is the first dedicated and comprehensive benchmarking suite for non-rigid video editing, designed to address the limitations of prior text-driven video editing methods in generating physically plausible, temporally coherent non-rigid deformations. It establishes standardized protocols for dataset construction, task definition, evaluation metrics, and baseline methodologies, thereby enabling rigorous assessment and advancement of physics-aware video editing systems (Qu et al., 26 Jan 2026).
1. Dataset Curation and Task Taxonomy
NRVBench comprises a curated dataset of 180 high-quality video clips, obtained from DAVIS and Pexels, each featuring a single primary non-rigidly deformable subject. Clips are trimmed to 60 frames at a minimum 720p resolution, ensuring single-shot scenes. Precise segmentation masks are generated by SAM2 and manually refined to pixel-perfect quality via a structured three-role human-in-the-loop workflow (Annotator A → Reviewer B → Lead). Each video is annotated according to six physics-based deformation categories:
| Category | Description | Examples |
|---|---|---|
| ASB | Articulated Soft Bodies (joint constraints) | Humans, animals |
| CTS | Cloth/Thin-Shells (folding, surface continuity) | Fabrics, textiles |
| HFF | Hair/Fur/Feathers (fibrous coherence) | Animal hair, feathers |
| LFS | Liquid Free Surfaces (volume, flow) | Water, splashes |
| GSF | Gas/Smoke/Fire (turbulence, topology) | Smoke, fire |
| DSO | Deformable Solids (elastic recovery, integrity) | Rubber, clay |
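The six-category taxonomy above can be represented as a small enumeration; a minimal sketch (the class and field names are hypothetical, not from the benchmark's released code):

```python
from enum import Enum

class DeformationCategory(Enum):
    """The six physics-based deformation categories in NRVBench."""
    ASB = "Articulated Soft Bodies"   # joint constraints: humans, animals
    CTS = "Cloth/Thin-Shells"         # folding, surface continuity: fabrics
    HFF = "Hair/Fur/Feathers"         # fibrous coherence: animal hair, feathers
    LFS = "Liquid Free Surfaces"      # volume, flow: water, splashes
    GSF = "Gas/Smoke/Fire"            # turbulence, topology: smoke, fire
    DSO = "Deformable Solids"         # elastic recovery, integrity: rubber, clay

# Definition order matches the benchmark's category table.
category_codes = [c.name for c in DeformationCategory]
```

Keeping the category as an enum rather than a free-form string makes per-category calibration (e.g., the anchoring clocks discussed later) a simple lookup.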
Task instructions are defined via 2,340 fine-grained, physics-anchored edit prompts. Prompt templates stem from a three-level edit taxonomy (Degree, Topology, Attribute), refined on a pilot set (Benchmark-V0). Prompts are generated via GPT-4o, expressly referencing the deformable object and embedding physical constraints (e.g., “increase cloth stiffness to sharpen folds”), with explicit preservation of scene context. Diagnostic assessment employs 360 category-calibrated multiple-choice questions (MCQs) per clip and edit, covering identity (instruction alignment), physics, and temporal criteria. All MCQs are verified and adjudicated for exclusivity and clarity.
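A plausible record shape for the prompts and diagnostic MCQs described above might look as follows (field names are illustrative assumptions, not the benchmark's schema):

```python
from dataclasses import dataclass

@dataclass
class EditPrompt:
    # One physics-anchored edit instruction tied to a clip.
    clip_id: str
    category: str      # one of: ASB, CTS, HFF, LFS, GSF, DSO
    level: str         # edit-taxonomy level: Degree, Topology, or Attribute
    instruction: str   # e.g. "increase cloth stiffness to sharpen folds"

@dataclass
class MCQ:
    # Diagnostic multiple-choice question with one correct, mutually exclusive answer.
    axis: str          # evaluation axis: identity | physics | temporal
    question: str
    options: list
    answer: int        # index of the correct option

prompt = EditPrompt("pexels_0042", "CTS", "Attribute",
                    "increase cloth stiffness to sharpen folds")
mcq = MCQ("physics",
          "Do the folds sharpen while the cloth surface stays continuous?",
          ["Yes", "No", "Partially"], 0)
```

Storing the taxonomy level and evaluation axis explicitly lets accuracy be broken down per category and per criterion, which is what the NRVE-Acc protocol below reports.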
2. NRVE-Acc: VLM-Based Evaluation Protocol
NRVBench’s evaluation uses NRVE-Acc, a purpose-built metric leveraging vision-language models (Qwen2.5-VL) to overcome the insensitivity of classical metrics (LPIPS, SSIM, CLIP-Sim) to non-rigid motion and fine-grained physical dynamics. NRVE-Acc is computed as follows:
Components
- Instruction Alignment ($S_{\text{instr}}$): fraction of MCQs answered correctly. Given $c$ correct answers out of $N$ questions,
$$S_{\text{instr}} = \frac{c}{N} \times 100,$$
scaled to $[0, 100]$.
- Physical Plausibility ($S_{\text{phy}}$): Likert-scale (1–5) VLM rating $r$ for physical-law adherence (e.g., volume conservation, topology preservation), normalized as
$$S_{\text{phy}} = \frac{r - 1}{4} \times 100,$$
scaled to $[0, 100]$.
- Temporal Consistency ($S_{\text{temp}}$): VLM judgment on sampled optical-flow frames, with the letter grades {A, B, C} mapped to descending numeric scores reflecting increasing degrees of flicker or “teleportation.”
- Aggregate Score: NRVE-Acc combines the three components multiplicatively, with a small constant $\epsilon > 0$ included for numerical stability, so that a deficiency along any single axis sharply lowers the overall score.
This design ensures sharp penalties for deficiencies along any axis (instruction, physics, or temporal fidelity), creating a robust metric for complex non-rigid edits.
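The component scores above can be sketched in a few lines. The exact aggregation formula is not reproduced here, so the sketch assumes a geometric mean with a stability constant, and an assumed {A: 100, B: 50, C: 0} mapping for the temporal grades:

```python
# Hedged sketch of NRVE-Acc scoring; the aggregate formula and the
# letter-grade mapping below are assumptions, not the published definitions.
EPS = 1e-6
TEMP_GRADE = {"A": 100.0, "B": 50.0, "C": 0.0}  # assumed mapping

def s_instr(correct: int, total: int) -> float:
    """Instruction alignment: fraction of MCQs answered correctly, on [0, 100]."""
    return 100.0 * correct / total

def s_phy(likert: int) -> float:
    """Physical plausibility: Likert rating in 1..5, normalized to [0, 100]."""
    return 100.0 * (likert - 1) / 4

def s_temp(grade: str) -> float:
    """Temporal consistency: VLM letter grade mapped to a numeric score."""
    return TEMP_GRADE[grade]

def nrve_acc(instr: float, phy: float, temp: float) -> float:
    # Multiplicative (geometric-mean) aggregation: a weak score on any
    # axis pulls the overall score down sharply; EPS avoids a hard zero.
    return ((instr + EPS) * (phy + EPS) * (temp + EPS)) ** (1.0 / 3.0)

score = nrve_acc(s_instr(18, 20), s_phy(4), s_temp("B"))
```

Under this assumed aggregation, a clip that fully fails one axis (score near 0) drags the overall NRVE-Acc toward 0 regardless of the other two axes, which matches the "sharp penalties along any axis" design goal.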
3. Baseline Method: VM-Edit
VM-Edit constitutes the official training-free baseline for NRVBench, operating as a region-conditioned, dual-region diffusion sampler. The technique applies a two-clock anchoring mechanism to foreground (editable) versus background regions, optimizing the tradeoff between structural preservation and dynamic deformation without retraining.
Algorithmic Steps:
- Latent Encoding: the video is encoded via a pretrained VAE into latents $z_0$; at diffusion step $t$, the noised latent is
$$z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$
- Reverse Denoising: a prompt-conditioned denoiser yields the edited estimate $\hat{z}_{t-1}$ at each reverse step.
- Dual-Region Recomposition: a downsampled mask $M$ splits the update between foreground and background branches:
$$z_{t-1} = M \odot \hat{z}^{\,\text{fg}}_{t-1} + (1 - M) \odot z^{\,\text{bg}}_{t-1}.$$
- Two-Clock Anchoring: foreground ($\tau_{\text{fg}}$) and background ($\tau_{\text{bg}}$) clocks govern how long each region stays anchored to the source latents, and are calibrated per category (e.g., ASB/HFF: (5, 3); CTS/DSO: (7, 5); LFS/GSF: (9, 6); on a 0–10 scale).
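The dual-region recomposition and two-clock anchoring steps can be sketched as a masked blend inside a reverse-diffusion loop. This is a minimal NumPy sketch, not VM-Edit's implementation: the denoiser and source-latent functions are stubs, and the clock semantics (a larger clock value releases a region from the source earlier, permitting stronger deformation) are an assumption:

```python
import numpy as np

def recompose(z_edit, z_src, mask, anchor_fg, anchor_bg):
    """One dual-region step: in each region, keep the source latent while
    that region's clock says 'anchor', otherwise keep the edited latent."""
    z_fg = z_src if anchor_fg else z_edit
    z_bg = z_src if anchor_bg else z_edit
    return mask * z_fg + (1.0 - mask) * z_bg

def dual_clock_sample(z_T, z_src_at, mask, denoise, tau_fg, tau_bg, T=10):
    # z_T: initial noisy latent; mask: downsampled foreground mask in [0, 1].
    # denoise(z, t): one reverse step of a conditioned denoiser (stub).
    # z_src_at(t): re-noised source latent at step t (stub).
    # Assumed schedule: a region stays anchored for the first (T - tau)
    # reverse steps, so tau_fg > tau_bg frees the foreground earlier while
    # the background remains pinned to the source longer.
    z = z_T
    for t in range(T, 0, -1):
        z = denoise(z, t)
        z = recompose(z, z_src_at(t), mask,
                      anchor_fg=(t > T - tau_fg),
                      anchor_bg=(t > T - tau_bg))
    return z
```

With the category-calibrated pair (5, 3), for instance, the foreground would run free for the last 5 of 10 steps while the background runs free for only the last 3, trading deformation freedom inside the mask against preservation outside it.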
This approach achieves surgical local edits while preserving global background integrity, outperforming pure propagation, inversion, or full-frame generative baselines under region-conditioned tasks.
4. Experimental Benchmarking and Key Findings
NRVBench V1 comprises 180 clips × 60 frames, supporting extensive benchmarking across standard and custom metrics.
Summary Table: Core Metrics (V1):
| Method | Struct Dist (↓) | PSNR (↑) | LPIPS (↓) | SSIM (↑) | CLIP-S/full (↑) | Motion Fidelity (↑) | NIQE (↓) | FPS |
|---|---|---|---|---|---|---|---|---|
| VM-Edit | 8.69 | 35.79 | 47.89 | 95.72 | 26.15 | 60.94 | 7.36 | 2.21 |
| Wan-Edit | 17.66 | 29.37 | 111.92 | 92.15 | 26.63 | 60.65 | 8.28 | 2.67 |
| TokenFlow | 111.93 | 21.10 | 134.31 | 75.30 | 26.47 | 58.49 | 7.98 | 3.11 |
VM-Edit yields the highest spatial fidelity and background PSNR, robust perceptual quality, and competitive text and motion alignment at modest speed cost.
NRVE-Acc Results:
| Method | S_phy | S_temp | S_instr | NRVE-Acc | Time (s) |
|---|---|---|---|---|---|
| Wan-Edit | 73.22 | 54.56 | 60.65 | 36.88 | 0.37 |
| VM-Edit (O) | 71.44 | 49.44 | 68.89 | 34.71 | 0.45 |
| TokenFlow | 71.00 | 45.78 | 67.50 | 33.05 | 0.32 |
| AnyV2V | 71.33 | 43.11 | 65.56 | 30.35 | 0.35 |
Wan-Edit scores slightly higher on NRVE-Acc but suffers from significant background drift (background PSNR 6.42 dB lower than VM-Edit's). VM-Edit uniquely combines targeted local edits with background integrity, securing strong scores across all NRVE-Acc dimensions.
5. Comparative Analysis and Methodological Limitations
Propagation-based methods (AnyV2V, TokenFlow) exhibit pronounced structure distortion and temporal flicker under large non-rigid deformations due to broken feature correspondences. Full-frame generators (Wan-Edit, Pyramid-Edit) maintain global motion smoothness but induce unacceptable background drift and weak structure fidelity during region-specific tasks. Inversion-based techniques, while faster, oversmooth local details or cause topological collapse under heavy deformation scenarios.
Qualitative Observations:
- ASB: VM-Edit maintains anatomically plausible limb proportions, unlike competitors which generate unrealistic stretching.
- CTS: VM-Edit renders crisp cloth folds, while propagation methods introduce tearing.
- LFS/GSF: VM-Edit generates natural splashes and smoke; other methods freeze fluid flow or show flicker.
A plausible implication is that segmentation precision and region-conditioned anchoring are crucial for physically plausible non-rigid video editing.
6. Impact and Future Extensions
NRVBench, with its physics-grounded dataset, prompt engineering, diagnostically verified MCQs, and NRVE-Acc metric, sets a new benchmark standard for evaluating non-rigid video editing systems. Process-level baselines, such as VM-Edit, illustrate the benefit of dual-clock anchoring and region-conditioned sampling in achieving local edit precision without sacrificing global consistency.
Potential future directions include expanding the taxonomy to encompass more nuanced deformation classes, integrating interactive or dialogue-based constraint specification, augmenting VLM QA modules for finer-grained instruction-physics alignment, and developing adversarial edits to probe model weaknesses. Community adoption of NRVBench may facilitate accelerated progress in physically accurate, robust video editing methodologies.