
GenCAMO-DB: Large-Scale Camouflage Dataset

Updated 10 January 2026
  • GenCAMO-DB is a comprehensive dataset of 34,200 images with rich multi-modal annotations for camouflage scene analysis.
  • It integrates RGB images, depth maps, scene graphs, structured captions, and generated masks to support tasks such as CIG, S2RCDP, COD, and OVCOS.
  • Benchmark results demonstrate its effectiveness, with lower FID/KID scores and improved detection metrics compared to previous datasets.

GenCAMO-DB is a large-scale, publicly released dataset designed to advance research in complex camouflage scene understanding and dense prediction. It provides 34,200 still images annotated with multi-modal information—including RGB frames, depth maps, scene graphs, fine-grained attribute lists, and structured text prompts—to train and benchmark models for tasks such as camouflage image–mask generation (CIG), synthetic-to-real camouflage dense prediction (S2RCDP), camouflage object detection (COD), and open-vocabulary camouflage object segmentation (OVCOS). GenCAMO-DB incorporates both real and synthetic images acquired or generated from twelve open-source collections, under a mask-free pipeline optimized for rich annotation and broad contextual diversity (Chen et al., 3 Jan 2026).

1. Dataset Composition and Annotation Modalities

GenCAMO-DB comprises 34,200 images collected and synthesized from three principal sources: open-domain RGB datasets with semantic graphs (including COCO-Stuff and Visual Genome), camouflage-image benchmarks, and salient/general segmentation data from LAKERED. For the CIG and S2RCDP tasks, a dedicated GenCAMO-DB-LAKERED split covers 4,040 training and 12,946 test images, covering “concealed,” “salient,” and “general” contexts at roughly a 1:3 ratio.

Each sample is annotated under four dense modalities:

  • Scene Graphs: Stored in JSON, scene graphs $G = (O, E)$ model object categories ($o_i \in O$, e.g., “chameleon”, “leaf”) and relations ($e_{ij} \in E$, e.g., “hides behind”, “contacts”). Directed edges are enriched as quintuples $t_{ij} = (a_i, o_i, e_{ij}, o_j, a_j)$, incorporating source/target concealment attributes. Embeddings include $E^o_{\mathrm{emb}}$ (object IDs), $E^a_{\mathrm{emb}}$ (attributes), and $E^e_{\mathrm{emb}}$ (relations). A schematic record is sketched after this list.
  • Concealment Attributes: Each object is assigned attributes $a_i$, drawn from a closed vocabulary describing color, pattern, and texture. The fifteen most frequent attributes are: green, brown, rough, textured, speckled, mottled, smooth, grey, yellow, granular, striped, rugged, dappled, tarnished, and shiny. These are stored in the scene-graph JSON.
  • Text Prompts: Each image receives a single GPT-4o–generated caption $C_p$ under a structured template emphasizing subject–verb–object (SVO) syntax, concealment cues, environmental context, and explicit spatial/contact relations. GenCAMO-DB yields approximately 612,500 words of caption text.
  • Foreground Masks: Although the generation pipeline is mask-free, a diffusion-based decoder (DiffuMask-style) followed by SAM2 refinement produces approximate segmentation masks, stored as 8-bit PNGs.
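
A minimal sketch of what one scene-graph record might look like, with hypothetical field names (the released JSON schema may differ):

```python
# Illustrative scene-graph record for one image. Field names are
# hypothetical; consult the released JSON schema for the exact keys.
scene_graph = {
    "image_id": "00023",
    "objects": [
        {"id": 0, "category": "chameleon", "attributes": ["green", "textured"]},
        {"id": 1, "category": "leaf", "attributes": ["green", "smooth"]},
    ],
    # Each edge is the quintuple t_ij = (a_i, o_i, e_ij, o_j, a_j):
    # source attributes, source object, relation, target object, target attributes.
    "edges": [
        {"a_i": ["green", "textured"], "o_i": 0,
         "e_ij": "hides behind", "o_j": 1, "a_j": ["green", "smooth"]},
    ],
}
```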

All images are provided at 512×512 px for generative/benchmarking purposes, with original resolutions preserved for further annotation.

2. Data Generation and Quality Assurance

Data acquisition leverages a semi-automatic annotation pipeline. Key steps include:

  • Selection of camouflage-like scenes from open-domain RGB datasets containing scene graphs.
  • Augmentation of existing camouflage benchmarks with generated depth, scene graphs, and captions.
  • Extension of SOD and SEG samples from LAKERED with camouflage-style annotations.

Depth maps are predicted via Depth Anything (a sketch of this step follows below); scene graphs are produced via Universal SG, then verified and refined for camouflage-relevant relations; captions are generated by GPT-4o under a structured template. Each sample undergoes 5–10 minutes of human verification for cross-modal consistency and camouflage likelihood; samples failing modality alignment are re-annotated.
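
As a hedged sketch of the depth step, assuming the Hugging Face transformers depth-estimation pipeline and a public Depth Anything checkpoint (the paper does not specify this exact interface):

```python
# Hedged sketch of the depth-annotation step: Depth Anything through the
# Hugging Face `transformers` depth-estimation pipeline. The checkpoint
# name and post-processing are assumptions, not the paper's exact setup.
import numpy as np
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline("depth-estimation",
                           model="LiheYoung/depth-anything-small-hf")

image = Image.open("images/00023.png").convert("RGB")
pred = depth_estimator(image)["predicted_depth"].squeeze().cpu().numpy()

# Normalize to 16-bit, matching the dataset's 16-bit PNG depth format;
# resize to the source resolution first if the model output differs.
rng = pred.max() - pred.min()
depth16 = ((pred - pred.min()) / (rng + 1e-8) * 65535).astype(np.uint16)
Image.fromarray(depth16).save("depth/00023_depth.png")
```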

The GenCAMO generator, based on Stable Diffusion v1.5, ControlNet, and OpenCLIP ViT-H/14, synthesizes camouflage image–annotation triplets. Its generative pipeline incorporates:

  • Depth–Layout Coherence Guided ControlNet (DLCG): Fuses scene-graph layout and depth features to maintain environment-aware consistency.
  • Attribute-aware Mask Attention (AMA): Enforces pixel-wise attention to the correct object–attribute pairs; a schematic sketch follows this list.
  • Unified LDM Decoder: Produces image, depth, and mask channels jointly.
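
The core idea behind AMA can be illustrated with a generic masked cross-attention, in which each pixel attends only to the attribute tokens of the object covering it. This is a schematic sketch, not the authors' exact module:

```python
import torch

def masked_cross_attention(pixel_feats, attr_tokens, obj_masks):
    """Schematic masked cross-attention illustrating the AMA idea:
    each pixel attends only to attribute tokens of objects whose
    mask covers it. Shapes and interface are illustrative.

    pixel_feats: (N, D) flattened pixel/query features
    attr_tokens: (M, D) one token per object attribute
    obj_masks:   (N, M) 1 where pixel n lies inside token m's object
    """
    scale = pixel_feats.shape[-1] ** -0.5
    logits = (pixel_feats @ attr_tokens.t()) * scale         # (N, M)
    logits = logits.masked_fill(obj_masks == 0, float("-inf"))
    attn = torch.softmax(logits, dim=-1)                     # restricted to own object
    attn = torch.nan_to_num(attn)                            # background pixels -> all zeros
    return attn @ attr_tokens                                # (N, D) attribute-conditioned features
```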

Key objectives include:

  • Depth–layout coherence loss: $d_i = \min_m \big(1 - S(F_Q(i), p_m)\big)$, $L_{\mathrm{DLC}} = \frac{1}{N} \sum_{i=1}^{N} d_i$.
  • Joint diffusion objective: $L_{\mathrm{LDM}} = \mathbb{E}_{z,\, \epsilon \sim \mathcal{N}(0,I),\, t}\big[\Vert \epsilon - \epsilon_\theta(z_t, t, \hat{\tau}', F_Q) \Vert_2^2\big]$, with equal weights $\lambda_1 = \lambda_2 = 1$.
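
Under the stated definitions, and assuming $S(\cdot,\cdot)$ is cosine similarity over feature vectors (the choice of $S$ is not spelled out above), the depth–layout coherence loss might be implemented as:

```python
import torch
import torch.nn.functional as F

def depth_layout_coherence_loss(query_feats, layout_protos):
    """L_DLC = (1/N) * sum_i min_m (1 - S(F_Q(i), p_m)).

    query_feats:   (N, D) per-region query features F_Q(i)
    layout_protos: (M, D) depth/layout prototypes p_m
    S is assumed to be cosine similarity; the paper's exact choice
    of similarity is not given here.
    """
    q = F.normalize(query_feats, dim=-1)
    p = F.normalize(layout_protos, dim=-1)
    sim = q @ p.t()                       # (N, M) cosine similarities
    d = 1.0 - sim.max(dim=-1).values      # d_i = min_m (1 - S) = 1 - max_m S
    return d.mean()
```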

3. File Formats and Access Infrastructure

GenCAMO-DB uses a consistent file organization:

  • RGB images: 512×512 PNG, stored in /images/.
  • Depth maps: 16-bit PNG, stored in /depth/.
  • Scene graphs: JSON lists of nodes and relations, in /scene_graphs/.
  • Captions: Text files, one sentence each, in /captions/.
  • Masks: 8-bit PNG, in /masks/.

Each file is indexed by a unique 4- or 5-digit ID (e.g., 00023.png, 00023_depth.png, 00023.json). Dataset access is facilitated via a PyTorch DataLoader and a command-line API, permitting iteration over (image, depth, graph, prompt, mask) tuples.
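
A minimal loader sketch, assuming the directory layout and ID scheme above (class and field names are illustrative; the released DataLoader may differ):

```python
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset

class GenCAMODataset(Dataset):
    """Minimal loader matching the layout above; the official
    loader's class and field names may differ."""

    def __init__(self, root):
        self.root = Path(root)
        self.ids = sorted(p.stem for p in (self.root / "images").glob("*.png"))

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, i):
        sid = self.ids[i]
        image = Image.open(self.root / "images" / f"{sid}.png").convert("RGB")
        depth = Image.open(self.root / "depth" / f"{sid}_depth.png")
        graph = json.loads((self.root / "scene_graphs" / f"{sid}.json").read_text())
        prompt = (self.root / "captions" / f"{sid}.txt").read_text().strip()
        mask = Image.open(self.root / "masks" / f"{sid}.png")
        return image, depth, graph, prompt, mask
```

Because the items are PIL images and a Python dict, a custom collate_fn (or per-modality tensor transforms) is needed before batching with a DataLoader.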

Users may re-partition the dataset to suit specific experimental requirements, for example into a conventional 70/10/20 train/val/test split, including customized balancing of camouflage difficulty.
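
For instance, a reproducible 70/10/20 re-partition can be built on the loader sketch above with torch.utils.data.random_split:

```python
import torch
from torch.utils.data import random_split

dataset = GenCAMODataset("GenCAMO-DB")   # hypothetical root directory
n = len(dataset)
n_train, n_val = int(0.7 * n), int(0.1 * n)
n_test = n - n_train - n_val             # remainder keeps the sizes exact

train_set, val_set, test_set = random_split(
    dataset, [n_train, n_val, n_test],
    generator=torch.Generator().manual_seed(42),  # reproducible partition
)
```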

4. Benchmarking Protocols and Results

GenCAMO-DB supports two principal benchmark families:

  • Camouflage Image–Mask Generation (CIG): Evaluated by Fréchet Inception Distance (FID) and Kernel Inception Distance (KID). On the test split, GenCAMO achieves FID=18.49, KID=0.0025, outperforming LAKERED (FID=64.27, KID=0.0355) and MIP-Adapter (FID=68.26, KID=0.0391). A metric-computation sketch follows this list.
  • Synthetic-to-Real Camouflage Dense Prediction (S2RCDP): Encompasses RGB COD, RGB-D COD, and OVCOS.
    • RGB/RGB-D COD metrics: MAE↓, S-measure $S_m$↑, E-measure $E_m$↑, weighted F-measure $F_\omega^\beta$↑. Fine-tuned SINet-v2 + CSRDA on GenCAMO synthetic data achieves $S_m = 0.7874$, $F_\omega^\beta = 0.6338$, $E_m = 0.8622$, MAE $= 0.0431$, surpassing LAKERED baselines ($S_m \approx 0.7303$, MAE $\approx 0.0649$).
    • OVCOS metrics: $cS_m$, $cF_\omega^\beta$, $cMAE$, $cE_m$. OV-Camo trained on GenCAMO alone: $cS_m = 0.579$, $cF_\omega^\beta = 0.490$, $cMAE = 0.336$, $cE_m = 0.616$; combining real + GenCAMO data sets a new state of the art ($cS_m = 0.589$, $cF_\omega^\beta = 0.518$, $cMAE = 0.311$, $cE_m = 0.657$).
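
As one hedged illustration, the CIG metrics can be computed with torchmetrics (which wraps an Inception feature extractor); `eval_batches` below is a placeholder iterable of paired real/generated image batches, and the paper's exact Inception settings are not specified here:

```python
# Sketch of FID/KID evaluation with torchmetrics (requires torch-fidelity).
# `eval_batches` is a hypothetical iterable of (real, generated) pairs of
# uint8 image tensors with shape (B, 3, H, W) and values in [0, 255].
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=100)  # must not exceed sample count

for real_imgs, fake_imgs in eval_batches:
    fid.update(real_imgs, real=True)
    fid.update(fake_imgs, real=False)
    kid.update(real_imgs, real=True)
    kid.update(fake_imgs, real=False)

print("FID:", fid.compute().item())
kid_mean, kid_std = kid.compute()
print("KID:", kid_mean.item())
```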

A plausible implication is that synthetic data from GenCAMO-DB can significantly enhance model generalization in camouflage dense prediction benchmarks.

5. Annotation Schema and Attribute Statistics

GenCAMO-DB’s four annotation modalities are summarized below:

Modality      Description                                                File Format/Location
RGB Image     512×512 color image                                        PNG, /images/
Depth Map     16-bit predicted depth                                     PNG, /depth/
Scene Graph   Nodes (objects), enriched edges (relations)                JSON, /scene_graphs/
Caption       SVO-structured image description (~612,500 words total)    TXT, /captions/
Mask          Approximate foreground mask                                PNG, /masks/

Attribute frequency is led by “green,” “brown,” “rough,” and similar color/texture descriptors. Scene graphs articulate not only object categories and spatial relations but also pairwise concealment attributes, encapsulated in quintuple-form edges. This multi-modal schema is optimized to support environment-aware, contextually rich training and benchmarking.

6. Applications and Prospective Extensions

GenCAMO-DB is suited for fine-grained scene understanding under occlusion, with demonstrable utility in agricultural pest monitoring, industrial defect inspection, ecological biodiversity assessment, and augmented-reality concealment mechanisms. Potential extensions include integration of thermal/multispectral modalities, temporal data (video camouflage), 3D point-cloud annotation, dynamic scene graphs, broader environmental coverage (e.g., underwater, desert scenes), physics-based lighting priors, and human-in-the-loop refinement.

This suggests an evolving utility of GenCAMO-DB as a foundational resource for diverse occlusion-centric tasks and multimodal scene analysis in complex domains (Chen et al., 3 Jan 2026).

References

  • Chen et al., 3 Jan 2026.
