GenCAMO-DB: Large-Scale Camouflage Dataset
- GenCAMO-DB is a comprehensive dataset comprising 34,200 images with rich multi-modal annotations for camouflage scene analysis.
- It integrates RGB, depth maps, scene graphs, structured captions, and generated masks to support tasks like CIG, S2RCDP, COD, and OVCOS.
- Benchmark results demonstrate its effectiveness, achieving lower FID/KID scores and improved detection metrics compared to previous datasets.
GenCAMO-DB is a large-scale, publicly released dataset designed to advance research in complex camouflage scene understanding and dense prediction. It provides 34,200 still images annotated with multi-modal information—including RGB frames, depth maps, scene graphs, fine-grained attribute lists, and structured text prompts—to train and benchmark models for tasks such as camouflage image–mask generation (CIG), synthetic-to-real camouflage dense prediction (S2RCDP), camouflage object detection (COD), and open-vocabulary camouflage object segmentation (OVCOS). GenCAMO-DB incorporates both real and synthetic images acquired or generated from twelve open-source collections, under a mask-free pipeline optimized for rich annotation and broad contextual diversity (Chen et al., 3 Jan 2026).
1. Dataset Composition and Annotation Modalities
GenCAMO-DB comprises 34,200 images collected and synthesized from three principal sources: open-domain RGB datasets with semantic graphs (including COCO-Stuff and Visual Genome), camouflage-image benchmarks, and salient/general segmentation data from LAKERED. For the CIG and S2RCDP tasks, a dedicated GenCAMO-DB-LAKERED split covers 4,040 training and 12,946 test images, representing "concealed," "salient," and "general" contexts at a roughly 1:3 ratio.
Each sample is annotated under four dense modalities:
- Scene Graphs: Stored as JSON, scene graphs model object categories (e.g., "chameleon", "leaf") and relations (e.g., "hides behind", "contacts"). Directed edges are enriched into quintuples $(o_s, a_s, r, o_t, a_t)$ that pair the source and target objects, and their concealment attributes, with the relation. Embeddings cover object IDs, attributes, and relations (a hypothetical JSON sketch follows this list).
- Concealment Attributes: Each object is assigned a set of concealment attributes drawn from a closed vocabulary describing color, pattern, and texture. The top-15 attributes are: green, brown, rough, textured, speckled, mottled, smooth, grey, yellow, granular, striped, rugged, dappled, tarnished, and shiny. These are stored in the scene-graph JSON.
- Text Prompts: Each image receives a single GPT-4o–generated caption under a structured template emphasizing subject–verb–object (SVO) syntax, concealment cues, environmental context, and explicit spatial/contact relations. GenCAMO-DB yields approximately 612,500 words of caption text.
- Foreground Masks: Although the generation pipeline itself is mask-free, a diffusion-based decoder (DiffuMask-style) followed by SAM2 refinement produces approximate segmentation masks stored as 8-bit PNGs.
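To make the edge quintuple concrete, a single scene-graph record might look as follows. This is a hypothetical sketch: the field names and values are illustrative, not the dataset's exact schema.

```python
# Hypothetical GenCAMO-DB scene-graph record (illustrative schema, not official).
# Edges are quintuples (source, source-attributes, relation, target, target-attributes).
scene_graph = {
    "image_id": "00023",
    "objects": [
        {"id": 0, "category": "chameleon", "attributes": ["green", "textured"]},
        {"id": 1, "category": "leaf", "attributes": ["green", "smooth"]},
    ],
    "edges": [
        {
            "source": 0, "source_attrs": ["green", "textured"],
            "relation": "hides behind",
            "target": 1, "target_attrs": ["green", "smooth"],
        },
    ],
}
```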
All images are provided at 512×512 px for generative/benchmarking purposes, with original resolutions preserved for further annotation.
2. Data Generation and Quality Assurance
Data acquisition leverages a semi-automatic annotation pipeline. Key steps include:
- Selection of camouflage-like scenes from open-domain RGB datasets containing scene graphs.
- Augmentation of existing camouflage benchmarks with generated depth, scene graphs, and captions.
- Extension of SOD and SEG samples from LAKERED with camouflage-style annotations.
Depth maps are predicted via Depth Anything; scene graphs via Universal SG, then verified and refined for camouflage-relevant relations; captions are generated by GPT-4o under a structured template. Each sample undergoes 5–10 minutes of human verification, during which cross-modal consistency and camouflage plausibility are inspected. Samples failing modality-alignment checks are re-annotated.
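The depth-annotation step can be approximated with an off-the-shelf monocular depth estimator. A minimal sketch, assuming the Hugging Face `transformers` depth-estimation pipeline with a Depth Anything checkpoint; the model id and file paths are illustrative:

```python
from PIL import Image
from transformers import pipeline

# Monocular depth estimation with a Depth Anything checkpoint (model id is an
# assumption; any Hugging Face depth-estimation model fits this interface).
depth_estimator = pipeline("depth-estimation", model="LiheYoung/depth-anything-small-hf")

image = Image.open("images/00023.png").convert("RGB")
result = depth_estimator(image)
result["depth"].save("depth/00023_depth.png")  # predicted depth as a PIL image
```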
The GenCAMO generator, based on Stable Diffusion v1.5, ControlNet, and OpenCLIP ViT-H/14, synthesizes camouflage image–annotation triplets. Its generative pipeline incorporates the following modules (a minimal backbone sketch follows the list):
- Depth–Layout Coherence Guided ControlNet (DLCG): Fuses scene-graph layout and depth features to maintain environment-aware consistency.
- Attribute-aware Mask Attention (AMA): Guarantees pixel-wise attention to correct object–attribute pairs.
- Unified LDM Decoder: Produces image, depth, and mask channels jointly.
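The DLCG and AMA modules are custom components of GenCAMO and are not available off the shelf, so the sketch below shows only the underlying Stable Diffusion v1.5 + depth-ControlNet substrate using standard `diffusers` components; the model ids, prompt, and file paths are illustrative:

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Backbone wiring only: SD v1.5 conditioned on a depth map via ControlNet.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Lossy 16-bit -> RGB conversion is acceptable for conditioning purposes.
depth_map = Image.open("depth/00023_depth.png").convert("RGB")
prompt = "a green chameleon hides behind a green leaf, mottled texture"
image = pipe(prompt, image=depth_map, num_inference_steps=30).images[0]
```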
Key objectives include:
- Depth–layout coherence loss $\mathcal{L}_{\mathrm{DLC}}$, which penalizes disagreement between the scene-graph layout and the depth structure of the generated image.
- Joint diffusion objective $\mathcal{L} = \lambda_{\mathrm{img}}\mathcal{L}_{\mathrm{img}} + \lambda_{\mathrm{depth}}\mathcal{L}_{\mathrm{depth}} + \lambda_{\mathrm{mask}}\mathcal{L}_{\mathrm{mask}}$ over the jointly decoded image, depth, and mask channels, with equal weights $\lambda_{\mathrm{img}} = \lambda_{\mathrm{depth}} = \lambda_{\mathrm{mask}} = 1$ (sketched in code below).
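A minimal sketch of the joint objective, assuming the unified decoder predicts noise for concatenated image/depth/mask latent channels. The equal weights match the text above; the channel split (4 image + 1 depth + 1 mask latent channels) is an assumption:

```python
import torch.nn.functional as F

def joint_diffusion_loss(eps_pred, eps_true, lambdas=(1.0, 1.0, 1.0)):
    """Equally weighted sum of per-modality diffusion MSE losses.
    eps_pred/eps_true: (B, 6, H, W) noise tensors; the 4/1/1 channel
    grouping for image/depth/mask is an assumption, not the paper's spec."""
    img_p, depth_p, mask_p = eps_pred[:, :4], eps_pred[:, 4:5], eps_pred[:, 5:6]
    img_t, depth_t, mask_t = eps_true[:, :4], eps_true[:, 4:5], eps_true[:, 5:6]
    l_img = F.mse_loss(img_p, img_t)
    l_depth = F.mse_loss(depth_p, depth_t)
    l_mask = F.mse_loss(mask_p, mask_t)
    return lambdas[0] * l_img + lambdas[1] * l_depth + lambdas[2] * l_mask
```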
3. File Formats and Access Infrastructure
GenCAMO-DB uses a consistent file organization:
- RGB images: 512×512 PNG, stored in /images/.
- Depth maps: 16-bit PNG, stored in /depth/.
- Scene graphs: JSON lists of nodes and relations, in /scene_graphs/.
- Captions: text files, one sentence each, in /captions/.
- Masks: 8-bit PNG, in /masks/.
Each file is indexed by a unique 4- or 5-digit ID (e.g., 00023.png, 00023_depth.png, 00023.json). Dataset access is facilitated via a PyTorch DataLoader and a command-line API, permitting iteration over (image, depth, graph, prompt, mask) tuples.
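A minimal loader over the documented layout might look as follows; this is a sketch, and the official DataLoader and CLI shipped with GenCAMO-DB may differ in naming and behavior (the caption file extension is an assumption):

```python
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset

class GenCAMODataset(Dataset):
    """Sketch of a loader yielding (image, depth, graph, prompt, mask) tuples."""

    def __init__(self, root):
        self.root = Path(root)
        # Unique 4- or 5-digit IDs are recovered from the image filenames.
        self.ids = sorted(p.stem for p in (self.root / "images").glob("*.png"))

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, idx):
        i = self.ids[idx]
        image = Image.open(self.root / "images" / f"{i}.png").convert("RGB")
        depth = Image.open(self.root / "depth" / f"{i}_depth.png")
        graph = json.loads((self.root / "scene_graphs" / f"{i}.json").read_text())
        prompt = (self.root / "captions" / f"{i}.txt").read_text().strip()  # .txt assumed
        mask = Image.open(self.root / "masks" / f"{i}.png")
        return image, depth, graph, prompt, mask
```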
Users may re-partition the dataset to suit specific experimental needs, for example into a conventional 70/10/20 train/val/test split or with customized balancing of camouflage difficulty, as sketched below.
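For instance, a reproducible 70/10/20 re-partition can be built with PyTorch's `random_split`, reusing the loader sketch above (the dataset path is illustrative):

```python
import torch
from torch.utils.data import random_split

dataset = GenCAMODataset("GenCAMO-DB")  # path is illustrative
n = len(dataset)
n_train, n_val = int(0.7 * n), int(0.1 * n)
train_set, val_set, test_set = random_split(
    dataset,
    [n_train, n_val, n - n_train - n_val],
    generator=torch.Generator().manual_seed(0),  # fixed seed for reproducibility
)
```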
4. Benchmarking Protocols and Results
GenCAMO-DB supports two principal benchmark families:
- Camouflage Image–Mask Generation (CIG): Evaluated by Fréchet Inception Distance (FID) and Kernel Inception Distance (KID), both lower-is-better. On the test split, GenCAMO achieves FID = 18.49 and KID = 0.0025, outperforming LAKERED (FID = 64.27, KID = 0.0355) and MIP-Adapter (FID = 68.26, KID = 0.0391); a metric-computation sketch appears at the end of this section.
- Synthetic-to-Real Camouflage Dense Prediction (S2RCDP): Encompasses RGB COD, RGB-D COD, and OVCOS.
- RGB/RGB-D COD metrics: MAE↓, S-measure↑, E-measure↑, and weighted F-measure↑. SINet-v2 + CSRDA fine-tuned on GenCAMO synthetic data surpasses the LAKERED baselines on all four metrics.
- OVCOS metrics: OV-Camo trained on GenCAMO data alone already performs strongly, and combining real data with GenCAMO sets a new state of the art across all reported OVCOS metrics.
A plausible implication is that synthetic data from GenCAMO-DB can significantly enhance model generalization in camouflage dense prediction benchmarks.
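The CIG protocol can be reproduced with `torchmetrics`; the following is a hedged sketch, as the paper's exact Inception settings and batching are assumptions (the random tensors stand in for real and generated image batches):

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

# FID/KID over uint8 image tensors of shape (B, 3, H, W).
fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=50)  # subset_size must be <= sample count

real = torch.randint(0, 255, (100, 3, 512, 512), dtype=torch.uint8)  # stand-in: real images
fake = torch.randint(0, 255, (100, 3, 512, 512), dtype=torch.uint8)  # stand-in: generated images

for metric in (fid, kid):
    metric.update(real, real=True)
    metric.update(fake, real=False)

print("FID:", fid.compute().item())
kid_mean, kid_std = kid.compute()
print("KID:", kid_mean.item())
```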
5. Annotation Schema and Attribute Statistics
GenCAMO-DB’s four annotation modalities are summarized below:
| Modality | Description | File Format/Location |
|---|---|---|
| RGB Image | 512×512 color image | PNG, /images/ |
| Depth Map | 16-bit predicted depth | PNG, /depth/ |
| Scene Graph | Nodes (objects), enriched edges (relations) | JSON, /scene_graphs/ |
| Caption | SVO-structured image description (~612,500 words total) | TXT, /captions/ |
| Mask | Approximate foreground mask | PNG, /masks/ |
Attribute frequency is led by “green,” “brown,” “rough,” and similar color/texture descriptors. Scene graphs articulate not only object categories and spatial relations but also pairwise concealment attributes, encapsulated in quintuple-form edges. This multi-modal schema is optimized to support environment-aware, contextually rich training and benchmarking.
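The top-15 attribute statistics can, in principle, be recomputed directly from the scene-graph JSONs. A minimal sketch, assuming the hypothetical `objects`/`attributes` field names from the record sketch in Section 1:

```python
import json
from collections import Counter
from pathlib import Path

# Count concealment-attribute frequencies across all scene graphs.
counts = Counter()
for path in Path("GenCAMO-DB/scene_graphs").glob("*.json"):
    graph = json.loads(path.read_text())
    for obj in graph.get("objects", []):        # field name is an assumption
        counts.update(obj.get("attributes", []))  # field name is an assumption

print(counts.most_common(15))  # expected to be led by "green", "brown", "rough"
```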
6. Applications and Prospective Extensions
GenCAMO-DB is suited for fine-grained scene understanding under occlusion, with demonstrable utility in agricultural pest monitoring, industrial defect inspection, ecological biodiversity assessment, and augmented-reality concealment mechanisms. Potential extensions include integration of thermal/multispectral modalities, temporal data (video camouflage), 3D point-cloud annotation, dynamic scene graphs, broader environmental coverage (e.g., underwater, desert scenes), physics-based lighting priors, and human-in-the-loop refinement.
This suggests an evolving utility of GenCAMO-DB as a foundational resource for diverse occlusion-centric tasks and multimodal scene analysis in complex domains (Chen et al., 3 Jan 2026).