- The paper introduces a Matting Quality Evaluator (MQE) that dynamically guides both online training and offline data fusion.
- It demonstrates significant improvements in semantic accuracy, boundary fidelity, and temporal robustness across multiple benchmarks.
- The study scales video matting using a dual-branch annotation strategy to create the large-scale VMReal dataset with 28K clips and 2.4M frames.
MatAnyone 2: Scaling Video Matting via a Learned Quality Evaluator
Introduction
MatAnyone 2 introduces a unified framework for scaling video matting (VM) models through a learned Matting Quality Evaluator (MQE), addressing core limitations of prior approaches: the limited realism, scale, and annotation quality of available VM datasets. Together, MQE-based online training guidance, automated dual-branch annotation for data curation, and a reference-frame training strategy enable substantial improvements in semantic accuracy, boundary fidelity, and temporal robustness.
Figure 1: Comparison of MatAnyone and MatAnyone 2. The new model leverages MQE-based loss, online/offline quality assessment, and an expanded real-world dataset (VMReal) to remedy the boundary blurring, segmentation-like artifacts, and limited temporal range in MatAnyone.
Limitations of Prior Video Matting Paradigms
Previous VM methods relied primarily on synthetic data with limited diversity and imperfect compositing, or on segmentation-based supervision. Synthetic datasets, such as VM800, are orders of magnitude smaller and less scene-diverse than large-scale video object segmentation (VOS) datasets, resulting in a domain gap that hinders generalization to challenging, real-world scenarios.
Incorporating segmentation priors, whether through pretraining or joint training, ameliorates semantic instability in core regions but leaves boundary regions weakly supervised, especially when semantic and detail requirements diverge. This frequently results in “segmentation-like” mattes that lack fine structure, manifesting as indistinct hair strands, poor temporal consistency, and false positives in complex environments.
The Matting Quality Evaluator (MQE): Online and Offline Scaling
MatAnyone 2’s central contribution is the Matting Quality Evaluator, a discriminator of matting quality that produces a pixel-wise binary evaluation map distinguishing reliable from erroneous alpha matte regions. The MQE is built as a segmentation model with DINOv3 encoders for high-fidelity feature extraction; its training targets are error regions identified by comparing predictions against reference alpha mattes with standard metrics (MAD, Grad), and the class imbalance between reliable and erroneous pixels is handled with focal and dice losses.
Figure 2: Given an input tuple of video frame I_rgb, predicted matte, and segmentation mask, MQE outputs a pixel-wise map of reliable and erroneous regions.
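As a minimal sketch of how such an evaluator can be trained, the snippet below derives per-pixel error targets by thresholding the deviation between a predicted and a reference matte, then combines focal and dice losses to counter class imbalance. The use of reference mattes and focal/dice losses follows the description above, but the threshold value and the helper names (build_error_target, mqe_loss) are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def build_error_target(pred_alpha, ref_alpha, thresh=0.1):
    """Label a pixel as erroneous (1) when the predicted matte deviates from the
    reference matte by more than `thresh`, and reliable (0) otherwise.
    The paper derives targets from reference mattes and standard metrics
    (MAD, Grad); this thresholding rule is an illustrative simplification."""
    return (torch.abs(pred_alpha - ref_alpha) > thresh).float()

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    """Binary focal loss that down-weights the abundant 'reliable' pixels."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = p * target + (1 - p) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, target, eps=1e-6):
    """Dice loss over the (rare) erroneous-pixel class."""
    p = torch.sigmoid(logits)
    inter = (p * target).sum(dim=(-2, -1))
    union = p.sum(dim=(-2, -1)) + target.sum(dim=(-2, -1))
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def mqe_loss(logits, pred_alpha, ref_alpha):
    """Combined objective addressing the imbalance between reliable and
    erroneous regions."""
    target = build_error_target(pred_alpha, ref_alpha)
    return focal_loss(logits, target) + dice_loss(logits, target)
```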
MQE enables two scaling mechanisms:
- Online, it assesses predicted mattes during training and supplies the MQE-based loss that steers optimization toward regions flagged as erroneous (a sketch follows this list).
- Offline, it evaluates and fuses the outputs of the dual-branch annotation pipeline, enabling automated curation of large-scale real-world training data with per-pixel reliability maps (see VMReal below).
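As a rough illustration of the online mechanism, the sketch below assumes a frozen MQE exposed as a callable taking the (frame, predicted matte, segmentation mask) tuple from Figure 2 and returning per-pixel error logits. The re-weighting scheme (1 + boost × error probability) and the helper names mqe_error_map and mqe_guided_matting_loss are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mqe_error_map(mqe, frame, pred_alpha, seg_mask):
    """Run a frozen MQE on the (frame, predicted matte, segmentation mask)
    tuple and return a soft per-pixel error probability in [0, 1]."""
    logits = mqe(frame, pred_alpha, seg_mask)
    return torch.sigmoid(logits)

def mqe_guided_matting_loss(pred_alpha, gt_alpha, error_map, boost=2.0):
    """Re-weight the per-pixel matting loss so regions the MQE flags as
    erroneous receive extra gradient. The (1 + boost * error probability)
    weighting is an illustrative assumption."""
    per_pixel = F.l1_loss(pred_alpha, gt_alpha, reduction="none")
    weight = 1.0 + boost * error_map
    return (weight * per_pixel).mean()
```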
VMReal: Large-Scale, High-Quality Real-World Matting Data
Leveraging the dual-branch annotation pipeline and MQE-based quality fusion, MatAnyone 2 constructs VMReal: a large-scale, annotated, real-world video matting dataset (28K clips, 2.4M frames), far surpassing previous resources in scale and diversity. This dataset covers a broad range of human appearance, lighting, motion, and challenging environmental conditions.
Figure 4: The dual-branch annotation pipeline merges semantic stability (video branch) with high-fidelity detail (image branch) using MQE-driven fusion, populating VMReal with both high-quality alpha mattes and reliability maps.
Figure 5: VMReal includes diverse, challenging cases (close-ups, motion, low-light), with alpha mattes and evaluation maps for robust, detailed model supervision.
The annotation process mitigates the limitations of each branch: video matting models provide temporal consistency but lack fine detail at boundaries, while image matting models offer sharp details but temporally unstable semantics. The fusion enabled by MQE allows scaling to millions of annotated frames with both qualities preserved.
Figure 6: The annotation fusion process replaces erroneous boundary regions from video matting with reliable details from the image branch, as flagged by MQE, achieving matting annotations that are both temporally coherent and detail-preserving.
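A minimal sketch of the MQE-driven fusion described in Figure 6 is shown below, assuming the same callable MQE interface as before. The hard threshold and binary swap between branches are illustrative simplifications of the actual fusion rule; the function name fuse_annotations is hypothetical.

```python
import torch

@torch.no_grad()
def fuse_annotations(video_alpha, image_alpha, mqe, frame, seg_mask, thresh=0.5):
    """Keep the temporally stable video-branch matte by default, and swap in
    the detail-rich image-branch matte wherever the MQE flags the video-branch
    prediction as erroneous. Returns the fused matte plus a reliability map,
    mirroring the alpha matte / evaluation map pairs stored in VMReal."""
    error_prob = torch.sigmoid(mqe(frame, video_alpha, seg_mask))
    error_mask = (error_prob > thresh).float()
    fused_alpha = (1.0 - error_mask) * video_alpha + error_mask * image_alpha
    # Re-evaluate the fused matte to record which pixels remain unreliable.
    reliability = 1.0 - torch.sigmoid(mqe(frame, fused_alpha, seg_mask))
    return fused_alpha, reliability
```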
Reference-Frame Strategy for Large Appearance Variation
To further address large intra-video appearance changes (e.g., new body parts or objects appearing in long sequences), MatAnyone 2 introduces a reference-frame memory mechanism. This expands the model's temporal context by sampling distant frames outside local mini-batch windows while employing random dropout augmentation to prevent memorization, supporting the learning of long-range correspondences without excessive GPU memory overhead.
Figure 7: The reference-frame strategy enables correct matting of previously unseen regions in long videos, where baseline models fail to adapt.
Figure 8: MatAnyone 2’s robustness to unseen body parts and handheld object appearance in extended clips exceeds baseline models.
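The sampling logic behind the reference-frame strategy can be sketched as follows. The local window length, number of references, and dropout rate are assumed hyperparameters for illustration, and the function name sample_training_window is hypothetical.

```python
import random

def sample_training_window(num_frames, local_len=8, num_refs=2, ref_dropout=0.5):
    """Take a contiguous local window for the usual short-clip supervision,
    then add a few distant frames from outside that window as extra memory
    references. Each reference is randomly dropped to discourage the model
    from memorizing specific reference frames."""
    start = random.randint(0, max(0, num_frames - local_len))
    local_idx = list(range(start, min(start + local_len, num_frames)))

    outside = [i for i in range(num_frames) if i not in local_idx]
    refs = random.sample(outside, k=min(num_refs, len(outside)))
    refs = [r for r in refs if random.random() > ref_dropout]
    return local_idx, sorted(refs)
```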
Empirical Results
Comprehensive benchmarks on both synthetic (VideoMatte, YouTubeMatte) and real-world datasets (CRGNN) demonstrate superiority across all canonical matting metrics: mean absolute difference (MAD), mean squared error (MSE), gradient error (Grad), connectivity (Conn), and temporal stability (dtSSD). The approach consistently outperforms existing auxiliary-free, diffusion-based, and mask-guided models, including RVM, GVM, AdaM, FTP-VM, MaGGIe, and the original MatAnyone.
Key empirical findings:
- Reductions of 27.1% in Grad and 22.4% in Conn over the previous best mask-guided methods.
- Across real-world CRGNN benchmarks, lowest error rates in all metrics, indicating strong generalization.
- Qualitative results confirm improved detail, robustness to lighting and hair motion, and semantic stability in challenging settings.
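For reference, common definitions of several of these metrics are sketched below. The scaling factors and the Sobel-based gradient approximation follow typical reporting conventions and may differ from the exact benchmark protocol; connectivity (Conn) is omitted for brevity.

```python
import numpy as np
from scipy.ndimage import sobel

def mad(pred, gt):
    """Mean absolute difference, commonly reported scaled by 1e3."""
    return np.abs(pred - gt).mean() * 1e3

def mse(pred, gt):
    """Mean squared error, commonly reported scaled by 1e3."""
    return ((pred - gt) ** 2).mean() * 1e3

def grad_error(pred, gt):
    """Gradient error. The canonical metric uses first-order Gaussian
    derivative filters; a Sobel approximation and a 1e-3 scale are used
    here as a common convention."""
    gp = np.hypot(sobel(pred, axis=0), sobel(pred, axis=1))
    gg = np.hypot(sobel(gt, axis=0), sobel(gt, axis=1))
    return ((gp - gg) ** 2).sum() * 1e-3

def dtssd(pred_seq, gt_seq):
    """Temporal stability (dtSSD): RMS difference between the temporal
    derivatives of predicted and ground-truth matte sequences (T, H, W)."""
    dp = np.diff(pred_seq, axis=0)
    dg = np.diff(gt_seq, axis=0)
    return np.sqrt(((dp - dg) ** 2).mean(axis=(1, 2))).mean() * 1e2
```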
Figure 9: On unconstrained real-world video input, MatAnyone 2 consistently yields finer detail and more accurate semantics compared to leading alternative methods.
Figure 10: Additional qualitative evidence of improved detail, boundary quality, and semantic stability, especially under adverse or complex conditions.
Theoretical and Practical Implications
Theoretical Impact:
MQE reframes matting supervision as a dynamic, adaptive quality-assessment problem—enabling scalable training, reliable data fusion, and annotation without manual ground-truth mattes. This paradigm may generalize to other ill-posed, boundary-sensitive pixel prediction tasks, suggesting future avenues for learned evaluators to replace or augment hand-crafted loss functions and dataset curation strategies.
Practical Applications:
The resulting unified approach provides production-quality matting for visual effects, real-time video editing, and AR/VR with minimal manual annotation effort. The large-scale VMReal resource directly benefits commercial and open-source development of VM and segmentation models.
Future Directions
- Iterative, closed-loop refinement of the annotation/data pipeline: enhanced matting models can be leveraged to progressively improve dataset quality, creating a positive feedback loop between data and model.
- Extending the MQE approach to other video layer/separation problems (e.g., optical flow, video segmentation) where ground-truth label scarcity is an issue.
- Exploration of transformer/diffusion-based matting models using the combination of VMReal and MQE-driven supervision to push boundary detail and temporal stability even further.
Conclusion
MatAnyone 2 establishes a new paradigm for large-scale, high-fidelity video matting by unifying learned, dynamic quality evaluation, automated hybrid annotation, and data-efficient long-range training. The approach addresses key failures of prior work (domain gap, poor boundaries, weak supervision) and provides both a theoretical and a practical foundation for scaling up video matting and related dense prediction tasks. The VMReal dataset is likely to catalyze further advances in this area.