CelebAMat: Synthetic Face Matting

Updated 4 July 2026
  • CelebAMat is a synthetic face matting dataset with soft alpha annotations designed for occlusion-aware face transformation applications.
  • It employs a compositing pipeline that integrates clean face data with diverse occluders using stochastic transformations to simulate realistic occlusions.
  • The benchmark supports both matting and segmentation metrics, enabling robust evaluation and model training for face filters, swapping, and editing tasks.

CelebAMat is a synthetic face matting dataset introduced to make occlusion-aware face matting trainable and benchmarkable in the context of face filters, face swapping, stylization, and related face transformation pipelines. It is constructed by compositing clean, occlusion-free faces with diverse occluding objects and materials, together with corresponding soft alpha mattes, and it is explicitly designed for learning face matting and evaluating models under diverse, difficult face occlusions (Cho et al., 5 Aug 2025). Within the associated formulation, face matting is defined as predicting an alpha matte that separates occluding elements from facial regions so that downstream transformations can be applied correctly under hands, hair, accessories, and semi-transparent occlusions.

1. Definition and task formulation

CelebAMat is presented as a novel benchmark dataset for evaluating face matting models under diverse and challenging occlusion scenarios, and it also serves as the main training set for the FaceMat framework (Cho et al., 5 Aug 2025). Its stated purpose is to bridge a gap between two pre-existing resource types. Face occlusion datasets such as RealOcc and FaceOcc provide segmentation labels for occluded faces, but not matting-level alpha annotations with soft boundaries or semi-transparency. Generic matting datasets such as SIMD and AM2k provide high-quality alpha mattes, but they are not tailored to faces under complex, realistic occlusions and are not designed around occlusion-aware face filtering (Cho et al., 5 Aug 2025).

The underlying task is defined as face matting: given an image of a face with occlusions, predict an alpha matte that separates occluders from facial skin. This framing is central because the target is not merely binary segmentation. The dataset is intended to support alpha estimation with intermediate values in $[0,1]$, including soft transitions at boundaries and semi-transparent mixtures (Cho et al., 5 Aug 2025).

The general matting equation used to ground the construction is

$$I_i = \alpha_i F_i + (1-\alpha_i) B_i, \quad \alpha_i \in [0,1].$$

In CelebAMat’s use of this formulation, a clean face acts as one layer, occlusions as another, and the stored ground-truth alpha is the face matting alpha that defines how much of each pixel belongs to facial skin versus occlusion or background (Cho et al., 5 Aug 2025).
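
As a minimal illustration of the compositing equation above, the following Python sketch blends a single pixel. The function name `composite` and the sample colors are hypothetical and are not taken from the paper; plain tuples are used so the example has no dependencies.

```python
def composite(alpha, fg, bg):
    """Blend one pixel: I = alpha * F + (1 - alpha) * B."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))


# Hypothetical RGB values in [0, 1], chosen only for illustration.
skin = (0.8, 0.6, 0.5)      # foreground layer F (facial skin)
occluder = (0.1, 0.1, 0.1)  # background layer B (occluder or scene)

print(composite(1.0, skin, occluder))  # alpha = 1: pure skin
print(composite(0.0, skin, occluder))  # alpha = 0: pure occluder
print(composite(0.5, skin, occluder))  # alpha = 0.5: soft boundary pixel
```

With an alpha of 0.5 the pixel is an equal mix of both layers, which is the kind of intermediate value that a binary segmentation label cannot represent.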

2. Dataset construction pipeline

CelebAMat is purely synthetic: all occluded-face images and their ground-truth alpha mattes are generated via compositing (Cho et al., 5 Aug 2025). The clean face source is CelebAMask-HQ, from which samples with visible occlusions are removed using refined annotations and splits from RealOcc/NatOcc. The resulting clean-face pool contains 24,602 training images and 716 test images (Cho et al., 5 Aug 2025).

The occlusion sources are drawn from multiple datasets. The paper states that the construction uses SIMD and AM2k as matting datasets, HIU-data as a hand segmentation dataset for realistic hand occlusions, and DTD as a texture dataset to increase diversity through textured occlusion or background-like patterns (Cho et al., 5 Aug 2025). For HIU masks, a Gaussian blur is explicitly applied to the boundaries to obtain softer transitions and more realistic matting boundaries, so the resulting alpha is not purely binary at edges (Cho et al., 5 Aug 2025).
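
The boundary-softening step for binary hand masks can be sketched as follows. This is a dependency-free illustration of a separable Gaussian blur, not the paper's implementation: the kernel width (`sigma`, `radius`), the border handling, and all function names are assumptions, since the source does not specify them.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 1-D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma))
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_rows(mask, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    out = []
    for row in mask:
        n = len(row)
        new_row = []
        for x in range(n):
            acc = 0.0
            for k, w in enumerate(kernel):
                xx = min(max(x + k - radius, 0), n - 1)
                acc += w * row[xx]
            new_row.append(acc)
        out.append(new_row)
    return out


def transpose(mask):
    return [list(col) for col in zip(*mask)]


def soften_mask(mask, sigma=1.0, radius=2):
    """Separable blur: horizontal pass, then vertical pass."""
    kernel = gaussian_kernel(sigma, radius)
    horizontal = blur_rows(mask, kernel)
    return transpose(blur_rows(transpose(horizontal), kernel))


# 1 = hand, 0 = not hand; a hard vertical edge between columns 2 and 3.
binary_mask = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(6)]
soft_mask = soften_mask(binary_mask)
print([round(v, 3) for v in soft_mask[0]])
```

After blurring, pixels near the edge take fractional values while pixels far from it stay at 0 or 1, which is the property the soft alpha annotations rely on.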

The compositing pipeline is summarized in the paper as follows. It starts from a clean face image $F_{\text{face}}$ from CelebAMask-HQ, samples an occlusion image $I_{\text{occ}}$ and its corresponding mask or alpha from {SIMD, AM2k, HIU, DTD}, applies stochastic transformations to the occluder, and composites the occluder onto the face to create an occluded face (Cho et al., 5 Aug 2025). The stochastic transformations include random size, orientation, and position. For video training, the source datasets contain only still images, so motion of the occluding object is synthesized with per-frame affine transforms, resizing, horizontal flips, color jitter, and random “pauses”, following RVM (Cho et al., 5 Aug 2025).
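
The stochastic placement and the synthesized motion described above might be sketched like this. All sampling ranges, the 512-pixel canvas size, the pause probability, and the function names are hypothetical; the source names the transformation types but not their parameters.

```python
import math
import random


def sample_occluder_transform(rng, face_size=512):
    """Sample a random scale, rotation, and position as a 2x3 affine matrix.
    The ranges below are illustrative assumptions."""
    scale = rng.uniform(0.3, 1.0)
    angle = rng.uniform(-math.pi, math.pi)
    tx = rng.uniform(0.0, face_size)
    ty = rng.uniform(0.0, face_size)
    c, s = math.cos(angle) * scale, math.sin(angle) * scale
    return [[c, -s, tx], [s, c, ty]]


def apply_affine(m, x, y):
    """Map an occluder coordinate (x, y) into face-image coordinates."""
    return (m[0][0] * x + m[0][1] * y + m[0][2],
            m[1][0] * x + m[1][1] * y + m[1][2])


def synthesize_track(rng, n_frames, step=4.0, pause_prob=0.2):
    """Per-frame (x, y) offsets for the occluder. On a 'pause' frame the
    previous offset is repeated, so the occluder appears to stand still."""
    x = y = 0.0
    track = []
    for _ in range(n_frames):
        if rng.random() >= pause_prob:
            x += rng.uniform(-step, step)
            y += rng.uniform(-step, step)
        track.append((x, y))
    return track


rng = random.Random(0)  # fixed seed so the sketch is repeatable
matrix = sample_occluder_transform(rng)
track = synthesize_track(rng, n_frames=8)
print(matrix)
print(track)
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which is useful when a fixed set of occlusion configurations is needed for evaluation.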

The corresponding soft alpha matte is generated by blending the face mask and occlusion mask, enabling pixel-level soft mask ground truth for training (Cho et al., 5 Aug 2025). The paper does not provide a closed-form blending equation for this step, but it states that the final alpha is derived from the CelebAMask-HQ facial segmentation together with object masks or alphas from the occlusion datasets, including naturally soft or Gaussian-blurred edges (Cho et al., 5 Aug 2025). This suggests that CelebAMat is designed not only to encode semantic separation between face and occluder, but also to preserve physically relevant ambiguity at boundaries.
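
Because no closed-form blending rule is given, the following is only one plausible reading of "blending the face mask and occlusion mask": a pixel counts as face to the degree that it lies inside the face mask and is not covered by the occluder. The product rule and the name `blend_face_alpha` are assumptions, not the paper's stated method.

```python
def blend_face_alpha(face_mask, occluder_alpha):
    """Assumed rule: alpha_gt = face_mask * (1 - occluder_alpha), per pixel."""
    return [
        [f * (1.0 - o) for f, o in zip(face_row, occ_row)]
        for face_row, occ_row in zip(face_mask, occluder_alpha)
    ]


# One row of four pixels: background, bare skin, skin under a
# semi-transparent occluder, skin under an opaque occluder.
face_mask = [[0.0, 1.0, 1.0, 1.0]]
occluder_alpha = [[0.0, 0.0, 0.25, 1.0]]

alpha_gt = blend_face_alpha(face_mask, occluder_alpha)
print(alpha_gt)
```

Under this rule, soft or blurred occluder edges carry over directly into fractional values of the face alpha.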

3. Annotations, label semantics, and alpha definition

For each synthetic sample, CelebAMat provides the occluded face image $I$ and the ground-truth alpha matte $\alpha_{\text{gt}}$ for face matting, defined as skin foreground versus occlusion or background (Cho et al., 5 Aug 2025). The construction pipeline also necessarily uses the original clean face and occluder masks, although the paper does not explicitly state whether those intermediate assets are exposed in the released dataset (Cho et al., 5 Aug 2025).

The labeling scheme is more specific than generic portrait matting. The paper’s consistent formulation defines the skin area as foreground, encompassing all facial components, while the background consists of the remaining regions outside the foreground (Cho et al., 5 Aug 2025). In operational terms, the alpha is a face-skin matte:

  • $\alpha \approx 1$: pixels belonging to skin or tightly attached facial components.
  • $\alpha \approx 0$: occluding objects, hair, ears, or background.
  • $0 < \alpha < 1$: mixed regions arising from soft boundaries, motion blur, or semi-transparent occlusions (Cho et al., 5 Aug 2025).

The conceptual inclusion and exclusion rules are also stated explicitly. The skin region includes skin and the core face area such as eyes, nose, and mouth. Occlusions are defined as elements that should remain in front of the face when recomposing, including hair, hands, ears, accessories, heavy makeup, and some semi-transparent elements such as smoke and fire. An exception is made for “purely transparent lenses without shadows or color,” which are treated as non-occlusions (Cho et al., 5 Aug 2025).

A potential source of confusion is addressed in the paper itself. Although one part of the introduction mentions treating facial skin as background and occlusions as foreground, the more precise and consistent formulation in the “Definition of Face Matting” subsection defines skin as the foreground $F$ and occlusions plus everything else as the background $B$ (Cho et al., 5 Aug 2025). In practice, the dataset’s alpha target remains a skin matte even when the downstream compositing objective is to preserve occluders in front of a transformed face.

4. Scale, diversity, and benchmark protocol

CelebAMat is designed to maximize occlusion diversity while retaining face-domain specificity (Cho et al., 5 Aug 2025). The face diversity is inherited from CelebAMask-HQ, including multiple identities, genders, hair styles, poses, lighting conditions, and high-resolution imagery. The occlusion diversity is obtained from hands in HIU, generic matting objects from SIMD and AM2k, and texture patches from DTD (Cho et al., 5 Aug 2025).

The benchmark protocol separates training and evaluation splits at both the face and occluder levels. On the face side, the dataset uses 24,602 clean training faces and 716 clean test faces. On the occluder side, training always draws occlusion instances strictly from the training splits of SIMD, AM2k, HIU, and DTD, whereas evaluation uses only the test splits of those datasets. For the CelebAMat benchmark, the occlusion configurations for test are fixed to form a consistent benchmark (Cho et al., 5 Aug 2025).

The paper further states that benchmarking uses four test sets, “SIMD, AM2k, HIU, Rand,” generated by different combinations of occlusion types and placements (Cho et al., 5 Aug 2025). The ablation studies indicate that using all four occlusion sets improves performance, and they systematically vary the occlusion ratio, defined as the fraction of face area covered. The reported best results occur at ratio 0.25, described as moderate occlusion (Cho et al., 5 Aug 2025).
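
The occlusion ratio, described above as the fraction of face area covered, could be computed along these lines. Weighting each covered pixel by the occluder's soft alpha is an assumption made for this sketch, and the function name is hypothetical.

```python
def occlusion_ratio(face_mask, occluder_alpha):
    """Fraction of the face area covered by the occluder.
    Covered area is weighted by occluder alpha (an assumption)."""
    face_area = sum(sum(row) for row in face_mask)
    if face_area == 0:
        raise ValueError("face mask is empty")
    covered = sum(
        f * o
        for face_row, occ_row in zip(face_mask, occluder_alpha)
        for f, o in zip(face_row, occ_row)
    )
    return covered / face_area


# A 4x4 face region whose top row is fully covered by an opaque occluder.
face_mask = [[1.0] * 4 for _ in range(4)]
occluder_alpha = [[1.0] * 4] + [[0.0] * 4 for _ in range(3)]

print(occlusion_ratio(face_mask, occluder_alpha))
```

In this toy layout 4 of 16 face pixels are covered, so the ratio matches the 0.25 setting that the ablations report as performing best.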

For evaluation, CelebAMat supports both matting and segmentation metrics. The matting metrics are computed within the unknown region of a trimap derived from the ground-truth alpha $\alpha_{\text{gt}}$ and include MSE, SAD, Grad, and Conn. The segmentation metrics, derived from alpha, include IoU and pixel-wise accuracy (Cho et al., 5 Aug 2025). The dataset therefore functions both as a supervised source of soft alpha labels and as a controlled benchmark with multiple occlusion configurations.
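
A simplified sketch of how such metrics can be computed is given below. It covers only MSE, SAD, IoU, and pixel accuracy; Grad and Conn are omitted. Two simplifications are assumptions of this example rather than the paper's protocol: the unknown region is taken to be every pixel with a strictly fractional ground-truth alpha (practical trimaps are usually built with morphological erosion and dilation), and segmentation is obtained by thresholding alpha at 0.5.

```python
def unknown_region(alpha_gt, lo=0.0, hi=1.0):
    """Simplified trimap: a pixel is 'unknown' if lo < alpha < hi."""
    return [lo < a < hi for a in alpha_gt]


def matting_errors(pred, gt, unknown):
    """MSE and SAD restricted to the unknown region."""
    diffs = [p - g for p, g, u in zip(pred, gt, unknown) if u]
    if not diffs:
        return 0.0, 0.0
    mse = sum(d * d for d in diffs) / len(diffs)
    sad = sum(abs(d) for d in diffs)
    return mse, sad


def segmentation_scores(pred, gt, threshold=0.5):
    """IoU and pixel accuracy after binarizing both alpha maps."""
    p = [a >= threshold for a in pred]
    g = [a >= threshold for a in gt]
    inter = sum(1 for a, b in zip(p, g) if a and b)
    union = sum(1 for a, b in zip(p, g) if a or b)
    accuracy = sum(1 for a, b in zip(p, g) if a == b) / len(g)
    iou = inter / union if union else 1.0
    return iou, accuracy


# Flattened toy alpha maps with four pixels each.
gt = [1.0, 0.5, 0.0, 0.25]
pred = [1.0, 0.75, 0.0, 0.75]

unknown = unknown_region(gt)
mse, sad = matting_errors(pred, gt, unknown)
iou, accuracy = segmentation_scores(pred, gt)
print(mse, sad, iou, accuracy)
```

Restricting the matting errors to the unknown region keeps the large, trivially correct interior and exterior areas from dominating the score.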

5. Role in FaceMat training and uncertainty-guided supervision

CelebAMat is tightly coupled to the FaceMat training design, including teacher-student training, uncertainty modeling, and evaluation (Cho et al., 5 Aug 2025). In Stage 1, the teacher model is trained on synthetic video-like sequences generated from CelebAMat samples. For each base face and occlusion configuration, short sequences are fabricated by transforming the occluding object across frames. The teacher uses the ground-truth alpha and trimaps derived from that alpha to learn both alpha prediction and pixel-wise uncertainty (Cho et al., 5 Aug 2025).

Within the unknown trimap region, the Stage 1 loss includes a regression term on the predicted alpha, a pyramid Laplacian loss for multi-scale edge and detail consistency, and a temporal consistency loss for generated video sequences. The teacher additionally predicts a mean and variance for alpha and a mean and variance for uncertainty, trained via negative log-likelihood losses (Cho et al., 5 Aug 2025):

[equation not recoverable from the source text]

and

[equation not recoverable from the source text]

The full teacher loss is

[equation not recoverable from the source text]
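
The paper's own negative log-likelihood equations are not legible in this text, so the sketch below shows only the generic heteroscedastic Gaussian form that such losses commonly take. It is an assumption for illustration, not the FaceMat loss; the function name, the variance floor `eps`, and the dropped constant term are choices made here.

```python
import math


def gaussian_nll(target, mean, var, eps=1e-6):
    """Per-pixel Gaussian negative log-likelihood, up to an additive constant:
    0.5 * (log(var) + (target - mean)^2 / var)."""
    var = max(var, eps)  # floor the variance to keep the log and division finite
    return 0.5 * (math.log(var) + (target - mean) ** 2 / var)


# For a fixed prediction error, the loss is lowest when the predicted
# variance matches the squared error.
target, mean = 0.7, 0.5
for var in (0.01, 0.04, 0.16):
    print(var, gaussian_nll(target, mean, var))
```

A loss of this shape penalizes both overconfident predictions (small variance with large error) and needlessly inflated variance, which is what lets a predicted variance act as a per-pixel uncertainty estimate.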

In Stage 2, the student model is trimap-free and learns directly from CelebAMat alpha targets together with teacher uncertainty. The uncertainty-weighted objective is given as

[equation not recoverable from the source text]

with

[equation not recoverable from the source text]

(Cho et al., 5 Aug 2025; the accompanying symbol definitions are also not recoverable from the source text). Higher teacher uncertainty therefore produces stronger supervision weight in ambiguous regions. The Stage 2 objective is

[equation not recoverable from the source text]
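
The qualitative behavior described above, where higher teacher uncertainty gives a pixel more supervision weight, can be illustrated with a generic weighted loss. The weighting rule `1 + strength * u`, the absolute-error term, and all names are assumptions for this sketch; the paper's actual weighting function is not legible in this text.

```python
def uncertainty_weighted_loss(pred, target, uncertainty, strength=1.0):
    """Mean absolute error with per-pixel weights that grow with uncertainty.
    Assumed weight: w_i = 1 + strength * u_i."""
    weights = [1.0 + strength * u for u in uncertainty]
    errors = [abs(p - t) for p, t in zip(pred, target)]
    return sum(w * e for w, e in zip(weights, errors)) / len(errors)


pred = [0.5, 0.0]
target = [1.0, 0.0]

# Same prediction error, with and without uncertainty on the wrong pixel.
print(uncertainty_weighted_loss(pred, target, [0.0, 0.0]))
print(uncertainty_weighted_loss(pred, target, [1.0, 0.0]))
```

Marking the erroneous pixel as uncertain doubles its contribution here, so the student is pushed hardest where the teacher found the matte ambiguous.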

The paper states that the teacher is updated via EMA of the student parameters using CelebAMat as the common data stream (Cho et al., 5 Aug 2025). In this design, CelebAMat is not merely a passive dataset. Its alpha labels, derived trimaps, and compositional ambiguity directly structure the optimization problem.
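
The EMA update mentioned above has a standard form, sketched here on plain dictionaries of scalar parameters. The momentum value and the parameter names are hypothetical; the source does not state which momentum FaceMat uses.

```python
def ema_update(teacher, student, momentum=0.999):
    """Return new teacher parameters:
    teacher <- momentum * teacher + (1 - momentum) * student."""
    return {
        name: momentum * teacher[name] + (1.0 - momentum) * student[name]
        for name in teacher
    }


teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}

# A low momentum is used here only to make the movement visible.
for step in range(3):
    teacher = ema_update(teacher, student, momentum=0.9)
    print(step, teacher)
```

With a momentum close to 1 the teacher changes slowly, which smooths out step-to-step noise in the student's weights.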

6. Relation to prior resources, applications, and limitations

Relative to general matting datasets such as SIMD and AM2k, CelebAMat combines a face-specific source domain from CelebAMask-HQ with matting-level occluders, and it is structured to simulate realistic face occlusions such as hands across the face, hair over the forehead, and accessories over the eyes (Cho et al., 5 Aug 2025). Relative to face occlusion datasets such as RealOcc and FaceOcc, it is synthetic but large-scale, systematically controllable, and equipped with per-pixel alpha rather than only binary or categorical segmentation labels (Cho et al., 5 Aug 2025). The paper also notes that trimaps are used only for training the teacher, not at test time for the student model (Cho et al., 5 Aug 2025).

The reported downstream use cases are all tied to the dataset’s alpha definition. For face filters and AR effects, stylization, color filters, or makeup filters can be applied to skin while occluders remain in front of the stylized face. In face swapping, the alpha matte separates original face content from occlusions so that transformed facial content can be recomposited correctly. In face completion and editing, alpha-based isolation of occluders can support inpainting of occluded facial parts prior to recognition, reconstruction, or manipulation (Cho et al., 5 Aug 2025).
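
The face-filter use case can be sketched as a per-pixel recomposition: the filter is applied everywhere, and the skin alpha decides how much of the filtered result is kept. The `warm_tint` filter, the pixel values, and the function names are hypothetical and serve only to show the mechanism.

```python
def warm_tint(pixel):
    """A toy color filter: boost red slightly and reduce blue."""
    r, g, b = pixel
    return (min(r * 1.2, 1.0), g, b * 0.8)


def apply_face_filter(image, alpha, face_filter):
    """Per pixel: out = alpha * filter(pixel) + (1 - alpha) * pixel,
    so occluders (alpha near 0) keep their original appearance."""
    out = []
    for pixel, a in zip(image, alpha):
        styled = face_filter(pixel)
        out.append(tuple(a * s + (1.0 - a) * p for s, p in zip(styled, pixel)))
    return out


# Three pixels: bare skin, an opaque occluder, and a soft boundary.
image = [(0.5, 0.5, 0.5), (0.2, 0.2, 0.2), (0.5, 0.5, 0.5)]
alpha = [1.0, 0.0, 0.5]

print(apply_face_filter(image, alpha, warm_tint))
```

The occluder pixel passes through unchanged, while the boundary pixel receives half of the filter's effect, avoiding the hard seams that a binary mask would produce.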

The empirical claims reported for CelebAMat emphasize its training value. The paper states that training on CelebAMat gives a strong baseline for multiple existing matting models, that RVM trained from scratch on CelebAMat becomes the best baseline on the CelebAMat benchmark, and that the proposed FaceMat improves over RVM in MSE, SAD, Grad, Conn, IoU, and pixel accuracy (Cho et al., 5 Aug 2025). The ablations further report that using all occlusion sets improves performance and that a moderate occlusion ratio of 0.25 yields the best generalization (Cho et al., 5 Aug 2025). A plausible implication is that the benchmark is useful not only for model comparison but also for studying the interaction between occlusion composition and inductive bias in matting systems.

The limitations are stated clearly. CelebAMat is synthetic: faces are real, but occlusions are composited, which creates a domain gap relative to fully natural, in-the-wild occlusions. Some real-world artifacts, including lighting inconsistencies and subtle 3D occlusion cues, may not be fully captured. In addition, CelebAMask-HQ annotations are not matting-accurate at very fine detail, so thin structures such as glasses or fine facial hair may be imperfectly delineated, although the compositing and blurring steps partially mitigate this issue (Cho et al., 5 Aug 2025). The paper nevertheless reports good generalization to RealOcc and notes that maintaining temporal inpainting consistency in video remains an open challenge beyond alpha prediction (Cho et al., 5 Aug 2025).

The abstract states that the source code and CelebAMat dataset are available at the FaceMat GitHub repository, indicating public release through that channel (Cho et al., 5 Aug 2025). The main text does not spell out exact licensing terms, but it does make clear that the dataset is derived from CelebAMask-HQ, SIMD, AM2k, HIU-data, and DTD, so any downstream use must remain compatible with those upstream resources (Cho et al., 5 Aug 2025).

