AlphaLayers RGBA Dataset
- AlphaLayers is a multi-layer RGBA image dataset featuring 1,000 triplets with foreground, background, and composite images along with detailed pixel-level masks.
- The dataset is created using a rigorous synthesis and filtering pipeline that employs Qwen3-VL and ObjectClear to ensure high consistency and quality.
- It supports diverse tasks such as text-to-image generation, image matting, object removal, and layer decomposition, and serves as a benchmark for unified RGBA generative models.
AlphaLayers is a multi-layer RGBA image dataset specifically constructed to support the development and evaluation of unified, multi-task RGBA generative models. Designed for both image synthesis and editing tasks that require explicit manipulation of layer structure—including matting, inpainting, object removal, decomposition, and compositional generation—AlphaLayers assembles high-quality triplets (foreground, background, and composite) with detailed pixel-level masks and aligned textual descriptions. Its rigorous synthesis and filtering pipeline yields a clean, consistent benchmark for the training of sequence-to-sequence diffusion frameworks such as OmniAlpha, and directly addresses the limitations of conventional RGB datasets for layered, transparency-aware research (Yu et al., 25 Nov 2025).
1. Dataset Composition and Structure
AlphaLayers contains 1,000 triplets, each composed of three tightly aligned RGBA images at a shared, fixed resolution:
- Foreground (I_fg): an object (often with semi-transparent boundaries) together with its continuous alpha matte α ∈ [0, 1].
- Background (I_bg): the scene with the foreground object removed; the alpha channel is 1 everywhere.
- Composite (I_comp): the standard alpha composite, I_comp = α · I_fg + (1 − α) · I_bg.
Accompanying these images are:
- Four pixel-level masks: a binary precise mask, a three-level trimap, a rough mask, and the full continuous alpha map.
- Three aligned captions: an object description, a scene description, and a structured editing instruction.
All data are encoded as 4-channel PNGs.
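The triplet relationship above reduces to the standard alpha-compositing equation. Below is a minimal NumPy sketch; the function and array names are illustrative, not part of any released AlphaLayers loader.

```python
import numpy as np

def alpha_composite(fg_rgb, alpha, bg_rgb):
    """Standard alpha compositing: comp = alpha * fg + (1 - alpha) * bg.

    fg_rgb, bg_rgb: float arrays of shape (H, W, 3) with values in [0, 1]
    alpha:          float array of shape (H, W, 1) with values in [0, 1]
    """
    return alpha * fg_rgb + (1.0 - alpha) * bg_rgb

# Tiny example: a half-transparent white foreground over a black background.
fg = np.ones((2, 2, 3))
bg = np.zeros((2, 2, 3))
alpha = np.full((2, 2, 1), 0.5)
comp = alpha_composite(fg, alpha, bg)
```

Because the background layer is fully opaque, compositing only needs the foreground's alpha channel, which is why the triplet stores a single continuous matte.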
2. Synthesis and Filtering Pipeline
AlphaLayers is created via an automated, multi-stage pipeline:
- Raw RGBA foreground acquisition: Source single-layer RGBA samples from established matting benchmarks (Adobe, AM-2K, Distinctions-646, etc.).
- Foreground captioning: Use Qwen3-VL to generate a concise object description, T_fg.
- Scenario generation: Prompt Qwen3-VL with T_fg to obtain a composite scene caption, T_comp.
- Composite synthesis: Use Qwen-Image-Edit for “background replacement,” compositing the foreground over a generated background to obtain the composite image.
- Background recovery: Use ObjectClear to inpaint the object away from the composite, yielding the clean background; a background caption, T_bg, is also generated.
- Mask derivation: From the continuous alpha matte, produce the precise, trimap, and rough masks.
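The mask-derivation step can be sketched with plain NumPy. This is a hypothetical reconstruction: the thresholds and the dilation radius below are illustrative guesses, not values documented for AlphaLayers.

```python
import numpy as np

def dilate(mask, r=1):
    """Toy morphological dilation via shifted maxima. Note np.roll wraps at
    the borders, which is fine for this illustration but not for production."""
    out = mask.copy()
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out = np.maximum(out, np.roll(np.roll(mask, dy, axis=0), dx, axis=1))
    return out

def derive_masks(alpha, lo=0.05, hi=0.95):
    """Derive precise / trimap / rough masks from a continuous alpha in [0, 1].

    lo, hi, and the dilation radius are illustrative choices, not values
    specified for the AlphaLayers pipeline.
    """
    precise = (alpha >= 0.5).astype(np.uint8)           # binary object mask
    trimap = np.full(alpha.shape, 128, dtype=np.uint8)  # 128 = unknown band
    trimap[alpha <= lo] = 0                             # definite background
    trimap[alpha >= hi] = 255                           # definite foreground
    rough = dilate(precise, r=1)                        # loosely covers object
    return precise, trimap, rough

precise, trimap, rough = derive_masks(np.array([[0.0, 0.5, 1.0]]))
```

The key point is that all three discrete masks are deterministic functions of the ground-truth matte, which is why the paper reports minimal mask-quality artifacts.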
Triplets are rigorously filtered by a composite consistency score that aggregates agreement across the generated images, masks, and captions; only the top 1,000 triplets by this score are retained (Yu et al., 25 Nov 2025).
3. Annotation Protocol and Layer Metadata
Each AlphaLayers instance includes:
- Alpha and segmentation masks: Continuous alpha, binary "precise" mask, three-class trimap, and rough mask.
- Textual descriptions: Object-only (T_fg), composite scene (T_comp), structured editing (T_replace), and background (T_bg).
- Storage and format: All images are stored at a fixed resolution as four-channel PNGs, with captions and masks bundled per triplet.
All backgrounds are independently synthesized, and object boundaries are sourced directly from high-quality matting masks.
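As a rough illustration of consuming this storage format, a four-channel PNG can be split into RGB and alpha planes with Pillow and NumPy. The helper name and the in-memory round trip are ours, not part of any released loader.

```python
from io import BytesIO

import numpy as np
from PIL import Image

def load_rgba(png_bytes):
    """Load a 4-channel PNG and split it into float RGB and alpha arrays."""
    img = Image.open(BytesIO(png_bytes)).convert("RGBA")
    arr = np.asarray(img, dtype=np.float32) / 255.0
    return arr[..., :3], arr[..., 3:4]  # shapes (H, W, 3) and (H, W, 1)

# Round-trip a tiny synthetic RGBA image through PNG encoding.
src = np.zeros((4, 4, 4), dtype=np.uint8)
src[..., 0] = 255   # fully red
src[..., 3] = 128   # half-transparent alpha
buf = BytesIO()
Image.fromarray(src, mode="RGBA").save(buf, format="PNG")
rgb, alpha = load_rgba(buf.getvalue())
```

Keeping alpha as a trailing channel of one PNG (rather than a sidecar mask file) is what lets a single file represent each layer of the triplet.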
4. Dataset Statistics and Scope
AlphaLayers covers:
- 1,000 triplets total, with a split of 900 for training and 100 for held-out testing (“AlphaLayersTest”).
- Broad domain diversity: portraits, objects, transparent materials, animals, and synthetic composites.
- Backgrounds and scenes generated using LLM-based visual prompting (Qwen3-VL) and object inpainting (ObjectClear).
- Captions ranging from concise (object name) to long, descriptive scene prompts ("rich 40–50 word scene prompts" in the original data).
All masks are derived from ground-truth matting, so mask-quality artifacts are minimal. A limitation is that all examples are single-object foregrounds; multi-object, occlusion-rich scenes are not included.
5. Supported Tasks and Benchmark Protocols
AlphaLayers is expressly structured for multi-task, multi-modal RGBA model development:
- Task Categories (21 total):
  1. Text-to-Image Generation.
  2. Layer-Conditioned Completion (FG→BG, FG→Comp, BG→FG, BG→Comp).
  3. Image Matting (mask-free; alpha-, trimap-, precise-, rough-, and text-conditioned).
  4. Object Removal (foreground extraction, background inpainting).
  5. Layer Decomposition (recovering both FG and BG from the composite).
- Benchmark splits: 900/100 (train/test); OOD generalization on AIM-500, RefMatte-RW100, and RORD.
- Metrics: FID, CLIP-Score, pairwise win rates, SAD, MSE, GRAD, CONN, LPIPS, and PSNR (see Tables 2–6 in (Yu et al., 25 Nov 2025)).
- Evaluation protocol: Results are compared to strong baselines (LayerDiffuse, AlphaVAE, MAM, MatAny, TeachDiffusionMatting, LayerDecomp) on all primary and OOD benchmarks.
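For intuition, the two most common matting metrics in the list above, SAD and MSE, reduce to simple array operations. Reporting conventions vary across papers (SAD is often divided by 1,000), so this sketch uses the raw definitions.

```python
import numpy as np

def matting_metrics(pred_alpha, gt_alpha):
    """SAD (sum of absolute differences) and MSE for alpha mattes in [0, 1].

    Raw definitions only; any per-paper scaling (e.g. SAD / 1000) is applied
    afterwards by the evaluation harness.
    """
    diff = pred_alpha - gt_alpha
    sad = float(np.abs(diff).sum())     # total absolute alpha error
    mse = float(np.square(diff).mean()) # mean squared alpha error
    return sad, mse

# Two-pixel example: one pixel off by 0.5, one pixel exact.
sad, mse = matting_metrics(np.array([[0.5, 1.0]]), np.array([[0.0, 1.0]]))
```

GRAD and CONN extend these with gradient-magnitude and connectivity terms over the matte, penalizing blurry or fragmented boundaries.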
6. Applications and Limitations
AlphaLayers supports:
- Training and benchmarking of unified diffusion models for RGBA text-to-image, matting, inpainting, and compositional decomposition.
- Pretraining and fine-tuning for end-to-end transparency- and layer-aware editing.
- Unambiguous multi-modal (caption/mask/trimap) conditioning for model input.
Limitations:
- Dataset scale—1,000 triplets—is markedly smaller than RGB-focused datasets and limits fine-grained compositional learning.
- All samples are single-object; scenes with multiple overlapping or interacting transparency layers are not present.
- Synthetic backgrounds reflect Qwen3-VL and ObjectClear priors; biases may result from distributional artifacts.
- Fixed resolution and a single pipeline for caption style and compositing.
7. Impact and Availability
AlphaLayers provides the foundation for OmniAlpha, a unified, multi-task RGBA sequence-to-sequence framework that achieves state-of-the-art performance, e.g., an 84.8% reduction in SAD for mask-free matting (AIM-500) and >90% human preference win rate on layer-conditioned completion tasks compared to the previous best methods. The dataset is publicly released for research purposes, with licensing terms matched to the upstream matting datasets (Yu et al., 25 Nov 2025). Its unified structure and strict curation protocol make it the current standard for training and benchmarking next-generation RGBA-aware generative models.