GenCAMO: Generative Camouflage Synthesis
- GenCAMO is a generative, environment-aware framework that integrates multi-modal inputs such as text, depth, and scene graphs for high-fidelity camouflage synthesis.
- It leverages structured scene-graph conditioning and diffusion-based models to overcome the scarcity of annotated camouflage data and improve dense prediction accuracy.
- The framework achieves superior performance in camouflaged object detection and segmentation, with notable improvements in FID, KID, and other evaluation metrics.
GenCAMO is a generative, environment-aware, mask-free framework for high-fidelity camouflage image synthesis and dense annotation, specifically targeting advances in concealed dense prediction (CDP), notably RGB-D camouflaged object detection and open-vocabulary camouflaged object segmentation. The method addresses the scarcity of large-scale, high-quality annotated camouflage datasets by leveraging structured scene-graph conditioning and multi-modal annotation to train diffusion-based generative models, yielding improved synthetic and synthetic-to-real dense prediction accuracy on complex camouflage scenes (Chen et al., 3 Jan 2026).
1. GenCAMO-DB: Multi-Modal Camouflage Dataset
GenCAMO-DB is a large-scale dataset comprising 34,200 images spanning natural, household, agricultural, and industrial scenes. It provides multi-modal annotation including:
- RGB images;
- Dense, single-channel depth maps (generated by “Depth Anything”);
- Scene graphs with nodes (object categories), edges (relations such as “lies on”, “hides in”), and fine-grained concealment attributes (e.g., color, pattern, material);
- Text prompts in structured SVO form including color, texture, spatial, and concealment cues (produced by GPT-4o, human-refined);
- Implicit mask annotations, used only for evaluation/refinement.
Images are sourced from COCO-Stuff, Visual Genome, and multiple camouflage/segmentation/saliency datasets (CAMO, COD10K, NC4K, USC12K, LAKERED). Annotation uses a semi-automatic pipeline with human refinement (5–10 minutes per image). The final dataset contains 612,500 words of text and 102,600 scene-graph tuples, with images split between generative-model training and evaluation (e.g., 4,040 images for “camouflage + salient” training and 12,946 for testing); a sketch of one record’s structure follows.
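For concreteness, a hypothetical sketch of how one GenCAMO-DB record might be organized is shown below; the field names, paths, and values are illustrative, not the dataset’s published schema.

```python
# Hypothetical GenCAMO-DB record; every field name and value here is
# illustrative, not the dataset's actual schema.
sample = {
    "image": "images/cod10k_001234.jpg",  # RGB image
    "depth": "depths/cod10k_001234.png",  # single-channel map from "Depth Anything"
    "scene_graph": {
        "nodes": [  # object categories with fine-grained concealment attributes
            {"id": 0, "category": "pygmy seahorse",
             "attributes": {"color": "pink", "pattern": "tubercled"}},
            {"id": 1, "category": "gorgonian coral",
             "attributes": {"color": "pink", "pattern": "branching"}},
        ],
        "edges": [  # relations such as "lies on", "hides in"
            {"subject": 0, "relation": "hides in", "object": 1},
        ],
    },
    # Structured SVO prompt carrying color, texture, spatial, and concealment cues
    "prompt": "A pink pygmy seahorse hides in branching gorgonian coral, "
              "matching its tubercled texture and hue.",
    "mask": "masks/cod10k_001234.png",    # implicit; used for evaluation/refinement only
}
```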
2. Overall GenCAMO Architecture and Data Flow
GenCAMO builds upon Stable Diffusion v1.5 and ControlNet, introducing multi-modal conditioning that integrates reference images, text prompts, depth maps, and scene graphs. The dataflow proceeds as follows:
- Text prompts are embedded by a frozen OpenCLIP ViT-H/14 text encoder: $c_{\text{text}} = E_{\text{txt}}(p)$.
- The reference image is encoded by the CLIP image branch: $c_{\text{img}} = E_{\text{img}}(I_{\text{ref}})$.
- The depth map is encoded by a compact CNN (“VisualEnc”): $f_d = \mathrm{VisualEnc}(D)$.
- Scene-graph objects, relations, and attributes produce embeddings $e_o$, $e_r$, and $e_a$.
- Scene-graph modules:
- Depth-Layout Coherence Guided ControlNet (DLCG)
- Attribute-Aware Mask Attention (AMA)
- Cross-attention fuses text, image, depth, and scene-graph tokens.
- The diffusion U-Net predicts the noise $\epsilon_\theta(z_t, t, c)$, where ControlNet injects the depth-layout conditioning features into the U-Net’s residual streams.
- Joint decoders generate the camouflage image, predicted depth map (MSE-trained), and coarse mask (DiffuMask style).
- Post-refinement uses “Depth Anything” and SAM2 on generated depth/masks.
This architecture enables mask-free, geometry- and context-conscious generation.
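As a rough illustration of this dataflow, the PyTorch sketch below wires toy stand-in encoders into a single cross-attention fusion step; the real system uses a frozen OpenCLIP ViT-H/14, SD-1.5’s U-Net, and ControlNet, and all module names and dimensions here are assumptions.

```python
import torch
import torch.nn as nn

D = 64  # toy embedding width

# Toy stand-ins for the real encoders (none of these are the actual modules)
text_enc  = nn.Linear(512, D)              # frozen CLIP text encoder stand-in
img_enc   = nn.Linear(512, D)              # CLIP image branch stand-in
depth_enc = nn.Conv2d(1, D, 3, padding=1)  # compact CNN ("VisualEnc") stand-in
graph_enc = nn.Linear(128, D)              # scene-graph GCN stand-in

cross_attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)

# Toy inputs
prompt_feat = torch.randn(1, 77, 512)      # tokenized text features
ref_feat    = torch.randn(1, 1, 512)       # reference image feature
depth_map   = torch.randn(1, 1, 32, 32)    # dense depth map
graph_feat  = torch.randn(1, 8, 128)       # object/relation/attribute embeddings

# Encode each modality into a shared token space
c_text  = text_enc(prompt_feat)
c_img   = img_enc(ref_feat)
f_depth = depth_enc(depth_map).flatten(2).transpose(1, 2)  # (B, HW, D)
f_graph = graph_enc(graph_feat)

# Concatenate conditioning tokens; the U-Net's cross-attention layers attend
# from latent tokens to this fused context (ControlNet additionally injects
# depth-layout features into the U-Net's residual streams).
context = torch.cat([c_text, c_img, f_depth, f_graph], dim=1)
latents = torch.randn(1, 16, D)            # toy latent tokens z_t
fused, _ = cross_attn(latents, context, context)
print(fused.shape)  # torch.Size([1, 16, 64])
```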
3. Scene-Graph Contextual Decoupling and Mask-Free Generation
Scene-Graph Representation:
- The graph encodes object semantics and spatial relationships; each node/edge tuple is fed into a GCN for attribute and context propagation.
Depth-Layout Coherence Guided ControlNet (DLCG):
- Depth encoding: $f_d = \mathrm{VisualEnc}(D)$.
- Layout encoding: $f_l = \mathrm{GCN}(e_o, e_r)$.
- Fusion: $f_{dl} = \alpha f_d + (1 - \alpha) f_l$ with learned $\alpha$.
- Learnable tokens with cross-attention yield scene prototypes $\{p_k\}_{k=1}^{K}$.
- Depth-layout coherence is enforced by minimizing the cosine distance from each fused feature $f_{dl}^{(i)}$ to its nearest prototype $p_{k^*(i)}$ (see the sketch after this list).
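A minimal sketch of the DLCG fusion and coherence objective, assuming the fused features and prototypes are plain token tensors (shapes, the prototype count, and the sigmoid parameterization of $\alpha$ are assumptions):

```python
import torch
import torch.nn.functional as F

# Toy shapes: B images, N fused tokens, D channels, K scene prototypes.
B, N, D, K = 2, 16, 64, 8
f_d = torch.randn(B, N, D)               # depth features, VisualEnc(D)
f_l = torch.randn(B, N, D)               # layout features from the scene-graph GCN
prototypes = torch.randn(K, D)           # scene prototypes from learnable tokens

# Fusion with a learned mixing weight alpha in (0, 1)
alpha_raw = torch.nn.Parameter(torch.zeros(1))
alpha = torch.sigmoid(alpha_raw)
f_dl = alpha * f_d + (1 - alpha) * f_l

# Cosine distance of every fused feature to every prototype
f_norm = F.normalize(f_dl, dim=-1)       # (B, N, D)
p_norm = F.normalize(prototypes, dim=-1) # (K, D)
dist = 1.0 - f_norm @ p_norm.T           # (B, N, K)

# Coherence loss: mean distance to each feature's *nearest* prototype
loss_dlc = dist.min(dim=-1).values.mean()
print(float(loss_dlc))
```

Taking the minimum over prototypes means gradients pull each fused feature toward its closest scene prototype, which is what enforces depth-layout coherence.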
Attribute-Aware Mask Attention (AMA):
- Semantic encoding: object and attribute embeddings are propagated through the GCN to produce semantic features $f_s = \mathrm{GCN}(e_o, e_a)$.
- Object-level grouping: a gate $g_{ij} = 1$ for token pairs belonging to the same entity, $0$ otherwise.
- Features are fused under a binary attention mask $M_{ij}$ ($M_{ij} = 1$ if tokens $i$ and $j$ share an entity, $0$ otherwise).
- No ground-truth foreground masks are provided; the mask decoder uses a self-supervised DiffuMask-style loss, with final masks refined by SAM2.
This design decouples object/scene context and enforces semantic/geometry alignment without explicit mask supervision during training.
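The entity-restricted attention at the core of AMA can be sketched as below, assuming single-head attention over a handful of tokens; the entity assignments and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

N, D = 6, 32
tokens = torch.randn(N, D)                 # attribute/semantic tokens
entity = torch.tensor([0, 0, 1, 1, 1, 2])  # entity id per token (assumed)

# Binary mask: M_ij = 1 if tokens i and j belong to the same entity, 0 otherwise
M = (entity.unsqueeze(0) == entity.unsqueeze(1)).float()

# Masked attention: disallowed pairs get -inf logits before the softmax
logits = tokens @ tokens.T / D**0.5
logits = logits.masked_fill(M == 0, float("-inf"))
attn = F.softmax(logits, dim=-1)
fused = attn @ tokens                      # features fused within each entity
print(fused.shape)  # torch.Size([6, 32])
```

Because each token always shares an entity with itself, every row of the mask has at least one valid entry, so the softmax stays well defined.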
4. Mathematical Foundations
Depth-Layout Coherence Loss
- Feature fusion: $f_{dl} = \alpha f_d + (1 - \alpha) f_l$.
- Distance to the nearest prototype: $d_i = 1 - \cos\big(f_{dl}^{(i)}, p_{k^*(i)}\big)$, where $k^*(i) = \arg\max_k \cos\big(f_{dl}^{(i)}, p_k\big)$.
- Loss: $\mathcal{L}_{\mathrm{DLC}} = \frac{1}{N} \sum_{i=1}^{N} d_i$.
Attribute-Aware Mask Attention & Masking
- Gate: $g_{ij} = 1$ for token pairs $(i, j)$ within the same entity, $0$ otherwise.
- Attention mask: $M_{ij} = 1$ if tokens $i$ and $j$ belong to the same object, $0$ otherwise; masked pairs receive $-\infty$ attention logits.
Diffusion Objective and Total Loss
- U-Net noise prediction: $\hat{\epsilon} = \epsilon_\theta(z_t, t, c)$, with $c$ aggregating text, image, depth, and scene-graph conditioning.
- Latent diffusion loss: $\mathcal{L}_{\mathrm{LDM}} = \mathbb{E}_{z_0, \epsilon \sim \mathcal{N}(0, I), t}\big[\|\epsilon - \epsilon_\theta(z_t, t, c)\|_2^2\big]$.
- Total loss: $\mathcal{L} = \mathcal{L}_{\mathrm{LDM}} + \lambda_1 \mathcal{L}_{\mathrm{DLC}} + \lambda_2 \mathcal{L}_{\mathrm{mask}} + \lambda_3 \mathcal{L}_{\mathrm{depth}}$, with empirically set weights $\lambda_1, \lambda_2, \lambda_3$ (combined in the sketch below).
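Putting the terms together, a sketch of the total objective; the λ weights below are placeholders, since the paper’s empirical values are not reproduced here.

```python
import torch

def total_loss(eps, eps_pred, loss_dlc, loss_mask, loss_depth,
               lam_dlc=0.1, lam_mask=0.1, lam_depth=0.1):
    # L_LDM = E[ ||eps - eps_theta(z_t, t, c)||^2 ]
    loss_ldm = torch.mean((eps - eps_pred) ** 2)
    # Weighted sum of the auxiliary terms; lambda values are placeholders
    return loss_ldm + lam_dlc * loss_dlc + lam_mask * loss_mask + lam_depth * loss_depth

# Toy usage with random noise targets and dummy auxiliary losses
eps, eps_pred = torch.randn(2, 4, 8, 8), torch.randn(2, 4, 8, 8)
loss = total_loss(eps, eps_pred,
                  loss_dlc=torch.tensor(0.3),
                  loss_mask=torch.tensor(0.5),
                  loss_depth=torch.tensor(0.2))
print(float(loss))
```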
5. Training Details
- Base: Stable Diffusion v1.5 + ControlNet.
- Image encoder: OpenCLIP ViT-H/14 (frozen).
- Optimizer: AdamW with decoupled weight decay.
- Default SD-1.5 hyperparameters; 250K training steps.
- Training set: all 34.2K GenCAMO-DB images.
- DLCG & AMA heads fine-tuned for 50K steps.
- Compute: 8 × A100 (40 GB) GPUs, roughly 3 days (an illustrative optimizer setup is sketched after this list).
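An illustrative AdamW setup consistent with the reported recipe (decoupled weight decay, SD-1.5 defaults); the learning-rate and weight-decay values are placeholders, not the paper’s settings.

```python
import torch

model = torch.nn.Linear(64, 64)  # stands in for the trainable DLCG/AMA heads
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,            # placeholder; the paper uses SD-1.5 defaults
    weight_decay=1e-2,  # decoupled weight decay (the "W" in AdamW)
    betas=(0.9, 0.999),
)
```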
6. Experimental Results and Evaluation
Camouflage Image-Mask Generation (CIG)
- Metrics: FID (Fréchet Inception Distance) and KID (Kernel Inception Distance), reported on camouflaged, salient, and general categories.
- Baselines: CI, DCI, LCGNet, LDM, LAKERED, “Camouflage Anything”, MIP-Adapter.
- GenCAMO (I+T+D) achieves the best FID = 18.49 and KID = 0.0123 (a metric-computation sketch follows this list).
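For reference, FID and KID can be computed with torchmetrics as below; the toy tensors stand in for real and generated images, and this is not the paper’s evaluation code.

```python
# requires: pip install "torchmetrics[image]" (pulls in torch-fidelity)
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

# Toy uint8 images standing in for real and generated sets
real = torch.randint(0, 256, (8, 3, 299, 299), dtype=torch.uint8)
fake = torch.randint(0, 256, (8, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=4)  # subset_size must be <= #samples
for metric in (fid, kid):
    metric.update(real, real=True)
    metric.update(fake, real=False)

print(f"FID: {fid.compute():.2f}")
kid_mean, kid_std = kid.compute()
print(f"KID: {kid_mean:.4f} ± {kid_std:.4f}")
```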
Synthetic-to-Real Dense Prediction (S2RCDP)
- RGB/RGB-D COD baselines: SINet, SINet-v2, RISNet+CSRDA. GenCAMO (synthetic→real) improves S-measure, F-measure, and E-measure; MAE drops from 0.065 to 0.043.
- Open-Vocabulary Camouflage Segmentation (OVCOS): OVCamo model.
- GenCAMO data only: 0.579 and 0.490 on the two reported OVCOS metrics.
- Real + GenCAMO: 0.589 and 0.518 (best results).
Ablation
- Baseline (no DLCG/AMA): FID = 54.32, KID = 0.0239.
- AMA only: FID = 43.45, KID = 0.0172.
- DLCG only: FID = 42.57, KID = 0.0192.
- DLCG + AMA (full): FID = 38.45, KID = 0.0123.
- Qualitative: improved semantic alignment, depth-consistent geometry, smoother occlusion blending, more accurate segmentation masks.
7. Limitations and Future Directions
Observed limitations include local artifacts under complex illumination (e.g., metallic glare, red goggles) and difficulty in accurate rendering under multiple light sources or hard shadows. Directions for future work involve incorporation of physics-aware lighting priors, higher-resolution scene-graph reasoning, and improved cross-modal feature alignment.
GenCAMO combines advances in dataset scale and semantic richness (GenCAMO-DB) with a mask-free, multi-modal, scene-graph-guided diffusion architecture, delivering superior performance and generalization on camouflage image synthesis and downstream dense prediction benchmarks (Chen et al., 3 Jan 2026).