
VMReal: Large-scale Real-World Video Matting Dataset

Updated 12 March 2026
  • VMReal is a large-scale, real-world video matting dataset that provides 28,000 clips and 2.4 million annotated frames, balancing temporal stability with boundary precision.
  • The dataset employs dual-branch automated annotation—video and image matting—with a learned Matting Quality Evaluator to fuse results and generate per-pixel reliability maps.
  • Evaluation protocols and reference-frame training demonstrate state-of-the-art performance, including a roughly 25% relative reduction in evaluator-based boundary error rate compared with MatAnyone.

VMReal is a large-scale, fully real-world video matting dataset comprising 28,000 clips and approximately 2.4 million annotated frames, designed to advance the state of high-fidelity video alpha matting and provide per-pixel error/reliability maps for each sample. Developed by curating real video data through dual automated annotation pipelines and a learned Matting Quality Evaluator (MQE), VMReal combines semantic and boundary matting supervision while eliminating the need for large-scale manual labeling. Its curation and usage protocols address long-standing challenges in balancing temporal stability and boundary accuracy in video matting datasets (Yang et al., 12 Dec 2025).

1. Data Collection and Curation Pipeline

1.1. Raw Sources

VMReal assembles its content from two principal sources:

  • A high-quality subset of 4,500 real, human-centric video clips at 1080p, extracted from commercial footage platforms and YouTube.
  • The remaining ∼23,500 clips (720p) are derived from the SA-V VOS dataset, rigorously filtered to retain only clips with complete human-instance masks.

1.2. Dual-Branch Automated Annotation

Two complementary matting annotation branches are instantiated for each frame:

  • Branch $B_V$ (Video-Matting): A propagation-based video matting model (e.g., MatAnyone), initialized with a first-frame SAM 2 mask, yields $\alpha_V$ with strong temporal (semantic) stability but weaker boundary precision.
  • Branch $B_I$ (Image-Matting): A per-frame guided image-matting model (e.g., MattePro with SAM 2 prompts) yields $\alpha_I$ with accurate, sharp edges but is prone to temporal inconsistency.

1.3. MQE-Based Fusion and Reliability

The Matting Quality Evaluator (MQE) operates as follows:

  • Input: For each frame, MQE ingests $(I_\textrm{rgb}, \hat{\alpha}, M^\textrm{seg})$.
  • Output: It produces a pixel-wise binary reliability map $M^\textrm{eval}(x) \in \{0,1\}$, where $1$ denotes a reliable pixel and $0$ an erroneous one. The underlying probability $P^\textrm{eval}(x)$ is thresholded at $0.5$.
  • Fusion: For $\alpha_V$ and $\alpha_I$, MQE outputs $M^\textrm{eval}_V$ and $M^\textrm{eval}_I$, respectively. The fusion mask is computed as

$$M^\textrm{fuse}(x) = M^\textrm{eval}_I(x) \cdot (1 - M^\textrm{eval}_V(x))$$

A small Gaussian blur smooths $M^\textrm{fuse}$, after which the fused alpha matte is:

$$\alpha_\textrm{fused}(x) = (1 - M^\textrm{fuse}(x)) \cdot \alpha_V(x) + M^\textrm{fuse}(x) \cdot \alpha_I(x)$$

The fused reliability map is $M^\textrm{eval}_\textrm{fused}(x) = M^\textrm{eval}_V(x) \cup M^\textrm{eval}_I(x)$.
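
The fusion rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the blur width `sigma`, and the edge-padded separable blur are all assumptions, since the source only says that "a small Gaussian blur" is applied.

```python
import numpy as np

def gaussian_blur(mask, sigma=1.0):
    """Separable Gaussian blur with edge padding (NumPy only; sigma is assumed)."""
    radius = max(1, int(3 * sigma))
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(mask.astype(np.float64), radius, mode="edge")
    # Blur along rows, then columns; mode="valid" trims the padding again.
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")

def fuse_mattes(alpha_v, alpha_i, m_eval_v, m_eval_i, sigma=1.0):
    """Fuse video-branch and image-branch mattes using binary reliability maps."""
    # Select pixels where the image branch is reliable and the video branch is not.
    m_fuse = m_eval_i * (1 - m_eval_v)
    m_fuse = gaussian_blur(m_fuse, sigma)
    alpha_fused = (1 - m_fuse) * alpha_v + m_fuse * alpha_i
    # Union of two binary maps: reliable if either branch is reliable.
    m_eval_fused = np.maximum(m_eval_v, m_eval_i)
    return alpha_fused, m_eval_fused
```

Because the mask is blurred before blending, the transition between the two branches is soft rather than a hard per-pixel switch, which is presumably the reason for the smoothing step.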

1.4. MQE Training

MQE is trained on P3M-10k using synthetic ground-truth generation. Images are partitioned into $7 \times 7$ patches, and a merged patch-wise error metric is computed:

$$D(\alpha_\textrm{gt}, \hat{\alpha}) = 0.9\,\mathrm{MAD} + 0.1\,\mathrm{Grad}$$

Normalized $D$ is thresholded ($\delta = 0.2$) to generate binary labels $M^\textrm{eval}_\textrm{gt}$. MQE optimization uses focal and dice losses, balancing detection of semantic errors (core regions) and boundary errors (detail regions) (Yang et al., 12 Dec 2025).
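
As a rough sketch of how such patch-wise labels might be generated, the snippet below computes $D$ per patch and thresholds it. Several details are assumptions that the text does not specify: "$7 \times 7$ patches" is read here as a $7 \times 7$ grid over the image, the Grad term is taken as the absolute difference of gradient magnitudes, and normalization divides by the maximum patch error. All function names are hypothetical.

```python
import numpy as np

def grad_mag(a):
    """Gradient magnitude of a 2-D matte via finite differences."""
    gy, gx = np.gradient(a.astype(np.float64))
    return np.hypot(gx, gy)

def patch_mean(err, grid):
    """Average a per-pixel error map over a grid x grid partition (assumed layout)."""
    h, w = err.shape
    ys = np.linspace(0, h, grid + 1).astype(int)
    xs = np.linspace(0, w, grid + 1).astype(int)
    out = np.empty((grid, grid))
    for i in range(grid):
        for j in range(grid):
            out[i, j] = err[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean()
    return out

def reliability_labels(alpha_gt, alpha_pred, grid=7, delta=0.2):
    """Return a grid x grid label map: 1 = reliable patch, 0 = erroneous patch."""
    mad = patch_mean(np.abs(alpha_pred - alpha_gt), grid)
    grad = patch_mean(np.abs(grad_mag(alpha_pred) - grad_mag(alpha_gt)), grid)
    d = 0.9 * mad + 0.1 * grad
    d_norm = d / (d.max() + 1e-8)  # assumed normalization: scale by the worst patch
    return (d_norm <= delta).astype(np.uint8)
```

The image must be at least `grid` pixels on each side so that no patch is empty.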

2. Dataset Composition and Properties

VMReal encompasses the following attributes:

  • Total Clips: 28,000
  • Total Annotated Frames: ∼2,400,000
  • Average Clip Length: 85 frames
  • Resolution Breakdown:
    • 4,500 clips at $1080 \times 1920$
    • 23,500 clips at $720 \times 1280$
  • Source Mix:
    • 16% from the high-quality 1080p subset
    • 84% from the SA-V dataset (720p)
  • Scene and Instance Diversity:
    • Both single- and multi-person settings
    • Broad variance in age, gender, and clothing
    • Indoors and outdoors, daylight, back-lit, low-light, and mixed lighting
    • Motion types from static or slow gestures to rapid motion (hair, interaction with objects)

This diversity covers combinations of appearance, motion, and environment that prior datasets did not capture.
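
The composition figures listed above are mutually consistent, which a short arithmetic check makes explicit. The variable names are illustrative only; the numbers are those stated in this section.

```python
clips_1080p = 4_500
clips_720p = 23_500
avg_frames_per_clip = 85

total_clips = clips_1080p + clips_720p
approx_frames = total_clips * avg_frames_per_clip
share_1080p = 100 * clips_1080p / total_clips
share_720p = 100 * clips_720p / total_clips

print(total_clips)          # 28000
print(approx_frames)        # 2380000, close to the stated ~2.4 million
print(round(share_1080p))   # 16
print(round(share_720p))    # 84
```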

3. Annotation Protocol and Online Usage

3.1. Fully Automatic Pipeline

No manual matting is performed at scale. $\alpha_V$ provides the temporally consistent semantic base, while $\alpha_I$ supplies boundary detail. MQE fuses the two on a per-pixel basis, following the protocol described in Section 1.3.

3.2. Online MQE Feedback During Training

For network training (e.g., MatAnyone 2 on VMReal), each training tuple consists of $(I_\textrm{rgb}, \alpha_\textrm{fused}, M^\textrm{eval}_\textrm{fused})$. Losses are masked to operate on reliable pixels only:

$$\mathcal{L}_{\ell_1}^M = \frac{\|M^{\textrm{eval}} \odot (\hat{\alpha} - \alpha)\|_1}{\|M^{\textrm{eval}}\|_1 + \epsilon}$$

Similar masked losses are applied for the Laplacian and temporal-coherence terms. An additional penalty

$$\mathcal{L}_\textrm{eval} = \|P^\textrm{eval}\|_1$$

encourages the network to shrink the regions that MQE predicts as erroneous, while the masking concentrates supervision on the most reliable parts of the dataset (Yang et al., 12 Dec 2025).
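
The masked $\ell_1$ loss can be illustrated with a small NumPy function. This is a sketch of the formula only, assuming a binary reliability map with $1$ for reliable pixels; the function name and the default `eps` value are assumptions, and an actual training setup would use a differentiable framework rather than NumPy.

```python
import numpy as np

def masked_l1(alpha_pred, alpha_target, m_eval, eps=1e-6):
    """L1 error averaged over reliable pixels only (m_eval == 1)."""
    m = m_eval.astype(np.float64)
    numerator = np.abs(m * (alpha_pred - alpha_target)).sum()
    return numerator / (m.sum() + eps)

# Example: the only wrong pixel is marked unreliable, so it does not contribute.
pred = np.array([[1.0, 0.0], [0.0, 0.0]])
target = np.zeros((2, 2))
mask = np.array([[0, 1], [1, 1]])
loss_masked = masked_l1(pred, target, mask)
loss_full = masked_l1(pred, target, np.ones((2, 2)))
```

Here `loss_masked` is zero because the erroneous pixel lies outside the reliable region, whereas `loss_full` averages the same error over all four pixels.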

4. Reference-Frame Training for Long-Term Variation

To address long-range temporal context and large appearance changes, each training iteration draws an auxiliary "reference frame" sampled outside the local window of $T=8$ consecutive frames. Simulated appearance variations are induced by masking $0$ to $3$ patches near the boundaries and $0$ to $1$ in the core (patch sizes in $[50, 100]$ pixels) on the reference $I_\textrm{rgb}$ and $\alpha$. This mechanism leverages the natural length and diversity of VMReal sequences, enhancing model robustness to emergent structures or novel occluders (Yang et al., 12 Dec 2025).
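
One way to picture the reference-frame sampling and patch masking is the sketch below. It rests on several assumptions not stated in the text: "masking" is implemented as zeroing square patches, boundary pixels are those with fractional alpha, core pixels are those with alpha equal to $1$, and patches are centered on randomly chosen pixels of each kind. All function names are hypothetical.

```python
import numpy as np

def sample_reference_index(num_frames, window_start, window_len=8, rng=None):
    """Pick a frame index outside the local training window.

    Assumes the clip is longer than the window, so at least one index remains.
    """
    if rng is None:
        rng = np.random.default_rng()
    outside = [t for t in range(num_frames)
               if not (window_start <= t < window_start + window_len)]
    return int(rng.choice(outside))

def mask_random_patches(rgb, alpha, centers, max_patches, rng, size_range=(50, 100)):
    """Zero out between 0 and max_patches square patches centered on given pixels."""
    rgb, alpha = rgb.copy(), alpha.copy()
    for _ in range(int(rng.integers(0, max_patches + 1))):
        if len(centers) == 0:
            break
        cy, cx = centers[rng.integers(len(centers))]
        s = int(rng.integers(size_range[0], size_range[1] + 1))
        y0, x0 = max(int(cy) - s // 2, 0), max(int(cx) - s // 2, 0)
        rgb[y0:y0 + s, x0:x0 + s] = 0
        alpha[y0:y0 + s, x0:x0 + s] = 0
    return rgb, alpha

def augment_reference(rgb, alpha, rng):
    """Mask 0-3 patches near the boundary and 0-1 in the core of the reference frame."""
    boundary = np.argwhere((alpha > 0) & (alpha < 1))  # assumed boundary definition
    core = np.argwhere(alpha >= 1)                     # assumed core definition
    rgb, alpha_out = mask_random_patches(rgb, alpha, boundary, 3, rng)
    return mask_random_patches(rgb, alpha_out, core, 1, rng)
```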

5. Evaluation Protocols and Benchmarking

5.1. Reference-Based Metrics

On the CRGNN real-world benchmark (19 videos, every 10th frame manually annotated), the following metrics are reported: Mean Absolute Difference (MAD), Mean Squared Error (MSE), Gradient error (Grad), and Delta Temporal Sum of Squared Differences (dtSSD). Representative results:

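| Method | MAD↓ | MSE↓ | Grad↓ | dtSSD↓ |
|---|---|---|---|---|
| RVM | 5.98 | 2.79 | 13.68 | 5.36 |
| RVM-Large | 5.75 | 2.47 | 13.26 | 5.17 |
| GVM | 5.03 | 2.15 | 14.28 | 4.86 |
| FTP-VM | 6.64 | 3.54 | 15.10 | 5.98 |
| MaGGIe | 9.50 | 6.11 | 16.51 | 6.02 |
| MatAnyone | 5.76 | 3.04 | 15.55 | 5.44 |
| Ours (MatAnyone 2) | 4.24 | 2.00 | 11.74 | 4.54 |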

MatAnyone 2 achieves the lowest error among the compared methods on all four metrics.

5.2. Evaluator-Based (Non-Reference) Metrics

For test sets lacking ground-truth alpha, error rates are derived from MQE output:

  • $\mathrm{ERR} = 100 \cdot |\{x : M^\textrm{eval}(x) = 0\}| / (HW)$, where $H$ and $W$ are the frame height and width
  • $\mathrm{MER}$: the error rate restricted to the foreground
  • $\mathrm{BER}$: the error rate within a dilated-eroded boundary band
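
The three evaluator-based rates can be sketched as follows. This is an illustration under assumptions: the foreground region is taken from a binary mask `fg_mask` (for example a segmentation mask), the boundary band is the dilation of that mask minus its erosion with a square structuring element, and `band_radius` is a hypothetical parameter whose actual value is not given in the text.

```python
import numpy as np

def dilate(mask, r):
    """Binary dilation with a (2r+1) x (2r+1) square element via shifted ORs."""
    h, w = mask.shape
    padded = np.pad(mask.astype(bool), r, mode="constant")
    out = np.zeros((h, w), dtype=bool)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def erode(mask, r):
    """Binary erosion expressed as the complement of a dilated complement."""
    return ~dilate(~mask.astype(bool), r)

def error_rate(m_eval, region=None):
    """Percentage of pixels flagged erroneous (m_eval == 0) inside a region."""
    errors = (m_eval == 0)
    if region is None:
        region = np.ones_like(errors, dtype=bool)
    return 100.0 * (errors & region).sum() / max(region.sum(), 1)

def evaluator_metrics(m_eval, fg_mask, band_radius=2):
    """Compute ERR, MER, and BER from an MQE reliability map."""
    fg = fg_mask.astype(bool)
    band = dilate(fg, band_radius) & ~erode(fg, band_radius)
    return {
        "ERR": error_rate(m_eval),
        "MER": error_rate(m_eval, fg),
        "BER": error_rate(m_eval, band),
    }
```

Because the boundary band is typically a small fraction of the frame, BER tends to be much larger than ERR for the same number of erroneous pixels, which matches the scale difference between the columns reported for this protocol.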

Comparison on a real, non-ground-truthed test set:

| Method | ERR%↓ | MER%↓ | BER%↓ |
|---|---|---|---|
| MatAnyone | 0.62 | 1.69 | 20.23 |
| Ours (MatAnyone 2) | 0.46 | 1.13 | 15.19 |

BER falls from 20.23% to 15.19%, a relative reduction of roughly 25% in boundary errors for MatAnyone 2; ERR and MER fall by about 26% and 33%, respectively.

6. Significance and Future Directions

VMReal constitutes the first large-scale, purely real-world video matting corpus combining high-fidelity alpha mattes with per-pixel error/reliability maps, an advancement made possible by fusing dual-branch matting results with learned pixel-wise arbitration via MQE (Yang et al., 12 Dec 2025). This methodology addresses longstanding tradeoffs between semantic consistency and boundary precision, fostering improved learning and more reliable generalization, particularly under challenging real-world conditions and long-term temporal variation. A plausible implication is the dataset's potential to enable future matting models beyond human-centric scenarios, assuming similar curation protocols are generalized or extended to new domains.
