VMReal: Large-scale Real-World Video Matting Dataset
- VMReal is a large-scale, real-world video matting dataset that provides 28,000 clips and 2.4 million annotated frames, balancing temporal stability with boundary precision.
- The dataset employs dual-branch automated annotation—video and image matting—with a learned Matting Quality Evaluator to fuse results and generate per-pixel reliability maps.
- MatAnyone 2, trained on VMReal with reference-frame training, attains the lowest error among the compared methods on the reported benchmark, including a roughly 25% relative reduction in evaluator-measured boundary error versus MatAnyone.
VMReal is a large-scale, fully real-world video matting dataset comprising 28,000 clips and approximately 2.4 million annotated frames, designed to advance the state of high-fidelity video alpha matting and provide per-pixel error/reliability maps for each sample. Developed by curating real video data through dual automated annotation pipelines and a learned Matting Quality Evaluator (MQE), VMReal combines semantic and boundary matting supervision while eliminating the need for large-scale manual labeling. Its curation and usage protocols address long-standing challenges in balancing temporal stability and boundary accuracy in video matting datasets (Yang et al., 12 Dec 2025).
1. Data Collection and Curation Pipeline
1.1. Raw Sources
VMReal assembles its content from two principal sources:
- A high-quality subset of 4,500 real, human-centric video clips at 1080p, extracted from commercial footage platforms and YouTube.
- Approximately 23,500 clips at 720p, derived from the SA-V VOS dataset and filtered to retain only clips with complete human-instance masks.
1.2. Dual-Branch Automated Annotation
Two complementary matting annotation branches are instantiated for each frame:
- Branch B_V (Video-Matting): Propagation-based video matting model (e.g., MatAnyone), initialized with a first-frame SAM 2 mask, yielding mattes with strong temporal (semantic) stability but impaired boundary precision.
- Branch B_I (Image-Matting): Per-frame guided image-matting model (e.g., MattePro with SAM 2 prompts), yielding mattes with accurate, sharp edges that are prone to temporal inconsistency.
1.3. MQE-Based Fusion and Reliability
The Matting Quality Evaluator (MQE) operates as follows:
- Input: For each frame, MQE ingests the frame and the candidate alpha matte under evaluation.
- Output: It produces a pixel-wise binary reliability map, where $1$ denotes a reliable pixel and $0$ an erroneous one. The map is obtained by thresholding the underlying per-pixel probability at $0.5$.
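- Fusion: MQE is applied separately to the mattes from B_V and B_I, producing one reliability map per branch. A fusion mask is computed from these two maps and smoothed with a small Gaussian blur; the fused alpha matte is then obtained by blending the two branch mattes according to the smoothed mask. A fused reliability map is produced alongside the fused matte.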
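The fusion step can be illustrated with a minimal NumPy sketch. The exact fusion formulas are not reproduced in this text, so the rule used here is an assumption: the image-branch matte is taken wherever the video branch is flagged unreliable and the image branch is flagged reliable, and a fused pixel counts as reliable if either branch is reliable there. The function names (`fuse_mattes`, `gaussian_blur`) and the blur width `sigma` are hypothetical.

```python
import numpy as np


def gaussian_blur(img: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    """Separable Gaussian blur with edge replication (NumPy only)."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1, dtype=np.float64)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(img.astype(np.float64), radius, mode="edge")
    # Filter along rows, then columns; 'valid' trims the padding back off.
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)


def fuse_mattes(alpha_v, alpha_i, prob_v, prob_i, sigma=1.0):
    """Fuse video-branch and image-branch mattes of shape (H, W).

    prob_v / prob_i are per-pixel reliability probabilities for each branch.
    """
    # Threshold probabilities at 0.5: 1 = reliable, 0 = erroneous.
    rel_v = (prob_v > 0.5).astype(np.float64)
    rel_i = (prob_i > 0.5).astype(np.float64)
    # ASSUMPTION: switch to the image branch only where the video branch
    # is unreliable and the image branch is reliable.
    fusion_mask = (1.0 - rel_v) * rel_i
    # A small Gaussian blur softens the seams between the two sources.
    soft = gaussian_blur(fusion_mask, sigma)
    alpha_fused = (1.0 - soft) * alpha_v + soft * alpha_i
    # ASSUMPTION: the fused pixel is reliable if either branch is reliable.
    rel_fused = np.maximum(rel_v, rel_i)
    return alpha_fused, rel_fused
```

Under this assumed rule the video branch acts as the default source, so temporal stability is preserved everywhere the image branch is not explicitly preferred.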
1.4. MQE Training
MQE is trained on P3M-10k using synthetic ground-truth generation. Images are partitioned into patches, and a merged patch-wise error metric is computed for each patch.
The normalized error is thresholded to generate binary labels. MQE is optimized with focal and dice losses, balancing detection of semantic errors (core regions) and boundary errors (detail regions) (Yang et al., 12 Dec 2025).
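The label-generation idea can be sketched as follows. This is only an illustration under stated assumptions: the merged error metric and its threshold are not reproduced in this text, so plain mean absolute error per patch is substituted, normalization divides by the largest patch error, and the `patch` size and `threshold` value are hypothetical. Labels follow the convention of Section 1.3 ($1$ = reliable, $0$ = erroneous).

```python
import numpy as np


def patch_error_labels(pred, gt, patch=4, threshold=0.1):
    """Binary reliability labels from a patch-wise error between two mattes.

    pred, gt: (H, W) alpha mattes in [0, 1]; H and W must be divisible
    by `patch`. Returns an (H, W) uint8 map with 1 = reliable.
    """
    h, w = gt.shape
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    err = np.abs(pred - gt)
    # ASSUMPTION: mean absolute error per non-overlapping patch stands in
    # for the merged patch-wise metric.
    per_patch = err.reshape(h // patch, patch, w // patch, patch).mean(axis=(1, 3))
    # ASSUMPTION: normalize by the largest patch error in the image.
    peak = per_patch.max()
    normalized = per_patch / peak if peak > 0 else per_patch
    # Patches whose normalized error exceeds the threshold are erroneous.
    labels = (normalized <= threshold).astype(np.uint8)
    # Broadcast each patch label back to pixel resolution.
    return np.kron(labels, np.ones((patch, patch), dtype=np.uint8))
```

A synthetically degraded matte (for example, a ground-truth matte with one region corrupted) then yields a label map that is zero over the corrupted patches and one elsewhere.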
2. Dataset Composition and Properties
VMReal encompasses the following attributes:
- Total Clips: 28,000
- Total Annotated Frames: ∼2,400,000
- Average Clip Length: 85 frames
- Resolution Breakdown:
- 4,500 clips at 1080p
- 23,500 clips at 720p
- Source Mix:
- 16% from the high-quality 1080p subset
- 84% from the SA-V dataset (720p)
- Scene and Instance Diversity:
- Both single- and multi-person settings
- Broad variance in age, gender, and clothing
- Indoors and outdoors, daylight, back-lit, low-light, and mixed lighting
- Motion types from static or slow gestures to rapid motion (hair, interaction with objects)
This diversity enables coverage of appearance, motion, and environmental combinations not tractable in prior datasets.
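The headline figures above are mutually consistent, which a few lines of arithmetic confirm. The snippet uses only the numbers quoted in this section; the variable names are illustrative.

```python
# Figures quoted in this section.
total_clips = 28_000
hq_clips = 4_500          # 1080p subset
sav_clips = 23_500        # 720p clips derived from SA-V
total_frames = 2_400_000  # approximate

# The two sources account for every clip.
assert hq_clips + sav_clips == total_clips

hq_share = hq_clips / total_clips            # fraction of 1080p clips
sav_share = sav_clips / total_clips          # fraction of 720p clips
avg_clip_len = total_frames / total_clips    # frames per clip

print(f"1080p share: {hq_share:.1%}")
print(f"720p share:  {sav_share:.1%}")
print(f"average clip length: {avg_clip_len:.1f} frames")
```

The shares round to the 16% and 84% listed under Source Mix, and the average comes out at about 85.7 frames, in line with the stated 85 given that the frame total is approximate.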
3. Annotation Protocol and Online Usage
3.1. Fully Automatic Pipeline
No manual matting is performed at scale. B_V provides temporally consistent background, while B_I supplements boundary detail. MQE fuses both modalities on a per-pixel basis per the protocol described in Section 1.3.
3.2. Online MQE Feedback During Training
For network training (e.g., MatAnyone 2 on VMReal), each training sample carries a per-pixel reliability map alongside its alpha label, and losses are masked to operate on reliable pixels only. Similar masked losses are applied for the Laplacian and temporal-coherence terms. An additional penalty encourages minimization of predicted error regions, focusing learning on the most reliable parts of the dataset for supervision (Yang et al., 12 Dec 2025).
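A reliability-masked loss can be sketched in a few lines. The exact loss formulas are not reproduced in this text, so both functions below are assumptions: `masked_l1` averages the absolute error over reliable pixels only, and `error_area_penalty` is one hypothetical form of the additional penalty, taken as the mean of an evaluator's per-pixel error probability on the network output.

```python
import numpy as np


def masked_l1(pred, target, reliability, eps=1e-6):
    """L1 loss averaged over reliable pixels only (reliability: 1 = reliable)."""
    weight = reliability.astype(np.float64)
    # Unreliable pixels contribute neither to the numerator nor to the count.
    return float((weight * np.abs(pred - target)).sum() / (weight.sum() + eps))


def error_area_penalty(error_prob):
    """ASSUMPTION: penalty as the mean predicted error probability."""
    return float(np.mean(error_prob))


# Usage: two of the four pixels are marked unreliable and are ignored.
pred = np.array([[0.0, 1.0], [1.0, 1.0]])
target = np.zeros((2, 2))
reliability = np.array([[1, 0], [1, 0]])
loss = masked_l1(pred, target, reliability)
```

Dividing by the number of reliable pixels, rather than by the image size, keeps the loss scale comparable between frames with large and small unreliable regions.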
4. Reference-Frame Training for Long-Term Variation
To address long-range temporal context and large appearance changes, each training iteration draws an auxiliary "reference frame" sampled outside the local window of consecutive frames. Appearance variations are simulated by masking patches near the boundaries and in the core regions of the reference. This mechanism leverages the natural length and diversity of VMReal sequences, enhancing model robustness to emergent structures or novel occluders (Yang et al., 12 Dec 2025).
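The sampling idea can be sketched as below. This is a simplified illustration, not the training code: the helper names, the uniform choice of reference index, and the patch count and size are all hypothetical, and patch locations are drawn uniformly over the frame rather than targeted at boundary and core regions as described above.

```python
import numpy as np


def sample_reference_index(num_frames, window_start, window_len, rng):
    """Pick a frame index that lies outside the local training window."""
    window = set(range(window_start, window_start + window_len))
    candidates = [t for t in range(num_frames) if t not in window]
    if not candidates:
        raise ValueError("clip is not longer than the local window")
    return int(rng.choice(candidates))


def mask_random_patches(frame, num_patches, patch, rng):
    """Zero out square patches to simulate appearance changes.

    SIMPLIFICATION: locations are uniform over the frame. Returns a copy;
    the input frame is left untouched.
    """
    out = frame.copy()
    h, w = out.shape[:2]
    for _ in range(num_patches):
        top = int(rng.integers(0, h - patch + 1))
        left = int(rng.integers(0, w - patch + 1))
        out[top:top + patch, left:left + patch] = 0
    return out


# Usage: an 85-frame clip with a local window covering frames 10..17.
rng = np.random.default_rng(0)
ref_idx = sample_reference_index(85, window_start=10, window_len=8, rng=rng)
ref_frame = mask_random_patches(np.ones((16, 16, 3)), num_patches=2, patch=4, rng=rng)
```

Because the reference is drawn from anywhere outside the window, longer clips naturally expose the model to larger gaps in time and appearance.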
5. Evaluation Protocols and Benchmarking
5.1. Reference-Based Metrics
On the CRGNN real-world benchmark (19 videos, every 10th frame manually annotated), the following metrics are reported: Mean Absolute Difference (MAD), Mean Squared Error (MSE), Gradient error (Grad), and Delta Temporal Sum of Squared Differences (dtSSD). Representative results:
| Method | MAD↓ | MSE↓ | Grad↓ | dtSSD↓ |
|---|---|---|---|---|
| RVM | 5.98 | 2.79 | 13.68 | 5.36 |
| RVM-Large | 5.75 | 2.47 | 13.26 | 5.17 |
| GVM | 5.03 | 2.15 | 14.28 | 4.86 |
| FTP-VM | 6.64 | 3.54 | 15.10 | 5.98 |
| MaGGIe | 9.50 | 6.11 | 16.51 | 6.02 |
| MatAnyone | 5.76 | 3.04 | 15.55 | 5.44 |
| MatAnyone 2 | 4.24 | 2.00 | 11.74 | 4.54 |
MatAnyone 2 attains the lowest error on all four metrics among the listed methods.
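For readers unfamiliar with these metrics, three of them can be sketched in NumPy using their commonly used definitions. This is an illustrative sketch rather than the benchmark's evaluation script: published tables usually multiply the raw values by fixed constants, and that scaling is omitted here. Grad is also omitted because it requires a Gaussian-derivative filter.

```python
import numpy as np


def mad(pred, gt):
    """Mean absolute difference between predicted and reference alpha."""
    return float(np.mean(np.abs(pred - gt)))


def mse(pred, gt):
    """Mean squared error between predicted and reference alpha."""
    return float(np.mean((pred - gt) ** 2))


def dtssd(pred, gt):
    """Temporal-coherence error for (T, H, W) sequences.

    Compares frame-to-frame changes of the prediction with those of the
    reference, so a constant offset is not penalized but flicker is.
    """
    d_pred = np.diff(pred, axis=0)
    d_gt = np.diff(gt, axis=0)
    per_step = np.sqrt(np.mean((d_pred - d_gt) ** 2, axis=(1, 2)))
    return float(per_step.mean())


# Usage: a constant offset hurts MAD and MSE but leaves dtSSD at zero.
gt_seq = np.zeros((3, 2, 2))
offset_seq = gt_seq + 0.5
flicker_seq = gt_seq.copy()
flicker_seq[1] += 0.1
```

The contrast between `offset_seq` and `flicker_seq` shows why dtSSD is reported alongside the per-frame metrics: it isolates temporal instability that MAD and MSE cannot distinguish from static error.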
5.2. Evaluator-Based (Non-Reference) Metrics
For test sets lacking ground-truth alpha, an error rate (ERR) is derived from the MQE output, along with two region-restricted variants:
- MER: error rate restricted to the foreground
- BER: error rate within a dilated-eroded boundary band
Comparison on a real, non-ground-truthed test set:
| Method | ERR%↓ | MER%↓ | BER%↓ |
|---|---|---|---|
| MatAnyone | 0.62 | 1.69 | 20.23 |
| MatAnyone 2 | 0.46 | 1.13 | 15.19 |
The boundary error rate falls from 20.23% to 15.19%, a relative reduction of roughly 25% for MatAnyone 2.
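The evaluator-based rates can be sketched as region-restricted averages of an error map. Several details here are assumptions rather than the published protocol: ERR is taken over all pixels, the foreground is supplied as a binary mask, the boundary band is the dilation minus the erosion of that mask with a square structuring element, and `radius` is a hypothetical band half-width.

```python
import numpy as np


def _shifted_views(mask, radius):
    """Stack all shifts of `mask` inside a (2r+1) x (2r+1) square window."""
    padded = np.pad(mask, radius, mode="edge")
    h, w = mask.shape
    size = 2 * radius + 1
    return np.stack([padded[dy:dy + h, dx:dx + w]
                     for dy in range(size) for dx in range(size)])


def dilate(mask, radius):
    return _shifted_views(mask, radius).max(axis=0)


def erode(mask, radius):
    return _shifted_views(mask, radius).min(axis=0)


def error_rates(reliability, foreground, radius=1):
    """Percent error overall (ERR), in the foreground (MER), in the band (BER)."""
    error = 1 - reliability  # 1 = erroneous pixel
    fg = foreground.astype(bool)
    band = dilate(fg, radius) & ~erode(fg, radius)

    def rate(region):
        count = region.sum()
        return float(100.0 * error[region].sum() / count) if count else 0.0

    return {"ERR": rate(np.ones_like(fg)), "MER": rate(fg), "BER": rate(band)}


def relative_reduction(baseline, improved):
    """Fractional drop from `baseline` to `improved`."""
    return (baseline - improved) / baseline


# The BER figures from the table above give a drop of about one quarter.
ber_drop = relative_reduction(20.23, 15.19)
```

Because the boundary band is small relative to the frame, BER is far more sensitive to edge defects than ERR, which is consistent with the much larger BER values in the table.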
6. Significance and Future Directions
VMReal constitutes the first large-scale, purely real-world video matting corpus combining high-fidelity alpha mattes with per-pixel error/reliability maps, an advancement made possible by fusing dual-branch matting results with learned pixel-wise arbitration via MQE (Yang et al., 12 Dec 2025). This methodology addresses longstanding tradeoffs between semantic consistency and boundary precision, fostering improved learning and more reliable generalization, particularly under challenging real-world conditions and long-term temporal variation. A plausible implication is the dataset's potential to enable future matting models beyond human-centric scenarios, assuming similar curation protocols are generalized or extended to new domains.