GenCAMO: Generative Camouflage Synthesis

Updated 10 January 2026
  • GenCAMO is a generative, environment-aware framework integrating multi-modal inputs like text, depth, and scene graphs for high-fidelity camouflage synthesis.
  • It leverages structured scene-graph conditioning and diffusion-based models to overcome limited annotated datasets and improve dense prediction accuracy.
  • The framework achieves superior performance in camouflage object detection and segmentation, with notable improvements in FID, KID, and other evaluation metrics.

GenCAMO denotes a generative, environment-aware, mask-free framework for high-fidelity camouflage image synthesis and dense annotation, specifically targeting advances in concealed dense prediction (CDP), notably RGB-D camouflaged object detection and open-vocabulary camouflaged object segmentation. The method addresses the scarcity of large-scale, high-quality annotated camouflage datasets by leveraging structured scene-graph conditioning and multi-modal annotation to train diffusion-based generative models, resulting in improved synthetic and synthetic-to-real dense prediction accuracy on complex camouflage scenes (Chen et al., 3 Jan 2026).

1. GenCAMO-DB: Multi-Modal Camouflage Dataset

GenCAMO-DB is a large-scale dataset comprising 34,200 images spanning natural, household, agricultural, and industrial scenes. It provides multi-modal annotation including:

  • RGB images;
  • Dense, single-channel depth maps (generated by “Depth Anything”);
  • Scene graphs $G = (O, E)$ with nodes $O = \{o_i\}$ (object categories), edges $E = \{e_{ij}\}$ (relations such as "lies on", "hides in"), and fine-grained concealment attributes $A = \{a_i\}$ (e.g., color, pattern, material);
  • Text prompts in structured SVO form including color, texture, spatial, and concealment cues (produced by GPT-4o, human-refined);
  • Implicit mask annotations, used only for evaluation/refinement.

Images are sourced from COCO-Stuff, Visual Genome, and multiple camouflage/segmentation/saliency datasets (CAMO, COD10K, NC4K, USC12K, LAKERED). Annotation follows a semi-automatic pipeline with human refinement (5–10 minutes per image). The final dataset contains ~612,500 words of text and ~102,600 scene-graph tuples, with images split for generative model training ($N = 34.2$K) and downstream evaluation (e.g., 4,040 images for "camouflage + salient" training and 12,946 for testing).
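
The multi-modal annotation above can be pictured as one structured record per image. The sketch below is illustrative only: the field names and example values are assumptions for clarity, not the dataset's actual serialization format.

```python
# Hypothetical GenCAMO-DB record layout (field names and values are illustrative
# assumptions, not the dataset's actual serialization format).
sample = {
    "rgb": "images/forest_0001.jpg",           # RGB image path
    "depth": "depth/forest_0001.png",          # dense single-channel depth map (Depth Anything)
    "scene_graph": {
        "objects": [
            {"id": 0, "category": "gecko", "attributes": ["brown", "mottled", "scaly"]},
            {"id": 1, "category": "tree bark", "attributes": ["brown", "rough"]},
        ],
        "relations": [
            {"subject": 0, "predicate": "hides in", "object": 1},
        ],
    },
    # Structured SVO text prompt with color, texture, spatial, and concealment cues.
    "prompt": "A brown mottled gecko hides in rough tree bark, "
              "its scaly texture blending with the background.",
    "mask": "masks/forest_0001.png",           # implicit mask, evaluation/refinement only
}
```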

2. Overall GenCAMO Architecture and Data Flow

GenCAMO builds upon Stable Diffusion v1.5 and ControlNet, introducing multi-modal conditioning that integrates reference images, text prompts, depth maps, and scene graphs. The dataflow proceeds as follows:

  1. Text prompts $C_p$ are embedded with the frozen OpenCLIP ViT-H/14 text encoder: $E_{CLIP}(C_p)$.
  2. The reference image $C_r$ is encoded by the CLIP image branch: $F_e$.
  3. The depth map $C_d$ is encoded by a compact CNN ("VisualEnc"): $F_D$.
  4. Scene-graph objects, relations, and attributes produce embeddings $E^o_{emb}$, $E^e_{emb}$, $E^a_{emb}$.
  5. Scene-graph modules:
    • Depth-Layout Coherence Guided ControlNet (DLCG)
    • Attribute-Aware Mask Attention (AMA)
  6. Cross-attention fuses text, image, depth, and scene-graph tokens.
  7. The diffusion U-Net predicts noise $\varepsilon_\theta(z_t, t, \hat{\tau}, F_Q)$, with ControlNet injecting $G_\phi(F_Q)$.
  8. Joint decoders generate the camouflage image, a predicted depth map (MSE-trained), and a coarse mask (DiffuMask style).
  9. Post-refinement applies "Depth Anything" and SAM2 to the generated depth maps and masks.

This architecture enables mask-free, geometry- and context-conscious generation.
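
The data flow above can be summarized as a single forward pass over the conditioning branches. The sketch below is a high-level rendering under stated assumptions: `modules` and its attributes (`clip_text`, `visual_enc`, `layout_dec`, `ama`, `controlnet`, `W_L`) are placeholders for the components named in the text, not the authors' actual API.

```python
def gencamo_forward(batch, modules, t):
    """One denoising step of the GenCAMO conditioning pipeline (illustrative sketch;
    `modules` is an assumed container for the components described in the text)."""
    # 1-2. Frozen CLIP encoders for the text prompt and reference image.
    tau_text = modules.clip_text(batch["prompt"])       # E_CLIP(C_p)
    f_e = modules.clip_image(batch["reference"])        # F_e
    # 3. Compact CNN encoder for the depth map.
    f_d = modules.visual_enc(batch["depth"])            # F_D
    # 4. Scene-graph embeddings for objects, relations, and attributes.
    e_obj, e_rel, e_attr = modules.graph_enc(batch["scene_graph"])
    # 5. DLCG: fuse depth features with a layout decoded from objects ⊙ relations.
    f_lay = modules.layout_dec(e_obj * e_rel)
    f_q = f_d + f_lay @ modules.W_L                     # F_Q = F_D + F_lay · W^L
    # 5-6. AMA and cross-attention fuse text, image, and scene-graph tokens.
    tau_hat = modules.ama(tau_text, f_e, e_obj, e_attr)
    # 7. U-Net noise prediction with the ControlNet injection G_φ(F_Q).
    eps_hat = modules.unet(batch["z_t"], t, tau_hat) + modules.controlnet(f_q)
    return eps_hat
```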

3. Scene-Graph Contextual Decoupling and Mask-Free Generation

Scene-Graph Representation:

  • The graph $G = (O, E)$ encodes object semantics and spatial relationships; each node/edge tuple $(a_i, o_i, e_{ij}, o_j, a_j)$ is fed into a GCN for attribute and context propagation (see the sketch below).
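
As illustration, the tuples can be turned into node features and propagated with one graph-convolution step. The sketch below assumes simple learned embeddings and index-based message passing; it is a generic one-layer GCN under those assumptions, not the paper's exact graph encoder.

```python
import torch
import torch.nn as nn

class TupleGCN(nn.Module):
    """Minimal one-layer GCN over (a_i, o_i, e_ij, o_j, a_j) tuples (illustrative sketch)."""
    def __init__(self, n_obj, n_rel, n_attr, dim=256):
        super().__init__()
        self.obj_emb = nn.Embedding(n_obj, dim)
        self.rel_emb = nn.Embedding(n_rel, dim)
        self.attr_emb = nn.Embedding(n_attr, dim)
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, obj_ids, attr_ids, edges, rel_ids):
        # Node features: object embedding enriched with its attribute embedding.
        x = self.obj_emb(obj_ids) + self.attr_emb(attr_ids)                # (N, dim)
        # Messages along edges (i -> j) carry the relation embedding e_ij.
        msg = torch.zeros_like(x)
        msg.index_add_(0, edges[:, 1], x[edges[:, 0]] + self.rel_emb(rel_ids))
        # One propagation step: combine node state with aggregated messages.
        return torch.relu(self.update(torch.cat([x, msg], dim=-1)))
```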

Depth-Layout Coherence Guided ControlNet (DLCG):

  • Depth encoding: $F_D = \text{VisualEnc}(C_d)$.
  • Layout encoding: $F_{lay} = \text{LayoutDec}(E^o_{emb} \odot E^e_{emb})$.
  • Fusion: $F_Q = F_D + F_{lay} \cdot W^L$ with learned $W^L \in \mathbb{R}^{c \times c}$.
  • Learnable tokens $T$ with cross-attention yield scene prototypes $P = \{p_m\}$.
  • Depth-layout coherence is enforced by minimizing the cosine distance from each $F_Q(i)$ to its nearest prototype $p_m$, as sketched below.
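
A minimal sketch of the depth-layout fusion and prototype extraction described above, assuming standard multi-head attention over learnable tokens; the tensor shapes, head count, and prototype count are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DLCG(nn.Module):
    """Depth-layout fusion with learnable scene-prototype tokens (illustrative sketch)."""
    def __init__(self, dim=256, n_prototypes=8):
        super().__init__()
        self.W_L = nn.Parameter(torch.eye(dim))                      # learned W^L in R^{c x c}
        self.tokens = nn.Parameter(torch.randn(n_prototypes, dim))   # learnable tokens T
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, f_d, f_lay):
        # F_Q = F_D + F_lay · W^L  (both inputs are (B, N, dim) spatial token sequences).
        f_q = f_d + f_lay @ self.W_L
        # Learnable tokens attend over F_Q to produce scene prototypes P = {p_m}.
        q = self.tokens.unsqueeze(0).expand(f_q.size(0), -1, -1)
        prototypes, _ = self.attn(q, f_q, f_q)
        return f_q, prototypes
```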

Attribute-Aware Mask Attention (AMA):

  • Semantic encoding: $F_{sem} = \text{SemanticsDec}(E^o_{emb} \odot E^a_{emb})$.
  • Object-level conditioning: $\hat{c}_i = F_{lay}^{(i)} \odot F_{sem}^{(i)}$ for entity tokens, $c_{null}$ otherwise.
  • Fused features $[\tilde{V} \oplus \hat{C} \oplus \hat{E}^a]$ are attended under a mask $M_{ij}$ ($1$ if tokens $i, j$ share an entity, $-\infty$ otherwise).
  • No ground-truth foreground masks are provided; the mask decoder uses a self-supervised DiffuMask-style loss, and final masks are refined with SAM2.

This design decouples object/scene context and enforces semantic/geometry alignment without explicit mask supervision during training.
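
The attribute-aware masking above amounts to a standard attention score matrix whose off-entity entries are set to $-\infty$ before the softmax. The helper below is an illustrative assumption about how the entity mask $M_{ij}$ might be applied; the function name and the entity-id encoding are hypothetical.

```python
import torch

def attribute_masked_attention(tokens, entity_ids, scale=None):
    """Self-attention where token i may only attend to token j from the same entity.

    tokens:     (B, N, D) fused features, e.g. [V~ ⊕ C^ ⊕ E^a] from the text.
    entity_ids: (B, N) integer entity index per token (-1 for background/null tokens).
    """
    B, N, D = tokens.shape
    scale = scale or D ** -0.5
    scores = torch.einsum("bid,bjd->bij", tokens, tokens) * scale
    # M_ij = 1 if tokens i, j share an entity, -inf otherwise.
    same = entity_ids.unsqueeze(2) == entity_ids.unsqueeze(1)      # (B, N, N) bool
    scores = scores.masked_fill(~same, float("-inf"))
    attn = torch.softmax(scores, dim=-1)                           # diagonal keeps rows finite
    return attn @ tokens
```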

4. Mathematical Foundations

Depth-Layout Coherence Loss

  • Feature fusion: $F_Q = F_D + F_{lay} \cdot W^L$.
  • Distance to nearest prototype: $d_i = \min_m \left[ 1 - \cos\!\left(F_Q(i), p_m\right) \right]$.
  • Loss: $L_{DLC} = \frac{1}{N} \sum_{i=1}^{N} d_i$ (see the sketch below).
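
Given $F_Q$ and the prototypes from DLCG, the depth-layout coherence loss reduces to a nearest-prototype cosine distance averaged over spatial positions. A minimal PyTorch rendering under those assumptions:

```python
import torch
import torch.nn.functional as F

def depth_layout_coherence_loss(f_q, prototypes):
    """L_DLC = (1/N) * sum_i min_m [1 - cos(F_Q(i), p_m)]  (illustrative sketch).

    f_q:        (B, N, D) fused depth-layout features.
    prototypes: (B, M, D) scene prototypes.
    """
    # Pairwise cosine similarity between every position i and every prototype m.
    sim = F.cosine_similarity(f_q.unsqueeze(2), prototypes.unsqueeze(1), dim=-1)  # (B, N, M)
    d = (1.0 - sim).min(dim=-1).values   # distance to the nearest prototype
    return d.mean()
```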

Attribute-Aware Mask Attention & Masking

  • $\hat{c}_i = F_{lay}^{(i)} \odot F_{sem}^{(i)}$ for $i \leq N_o$, $c_{null}$ otherwise.
  • Mask $M_{ij} = 1$ if tokens $i, j$ belong to the same object, $-\infty$ otherwise.

Diffusion Objective and Total Loss

  • U-Net noise: $\hat{\varepsilon}_\theta = \varepsilon_\theta(z_t, t, \hat{\tau}') + G_\phi(F_Q)$.
  • Latent diffusion loss:

$L_{LDM} = \mathbb{E}_{z,\, \varepsilon \sim \mathcal{N}(0, I),\, t} \left\| \varepsilon - \hat{\varepsilon}_\theta \right\|_2^2$

  • Total loss:

$L_{total} = \lambda_1 L_{LDM} + \lambda_2 L_{DLC}$

(empirically $\lambda_1 = \lambda_2 = 1$).
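
A sketch of the combined objective follows. The callables `unet` and `controlnet` stand in for $\varepsilon_\theta$ and $G_\phi$ as described earlier, `depth_layout_coherence_loss` is the sketch given above, and $\lambda_1 = \lambda_2 = 1$ as in the text; this is an assumed composition, not the authors' training code.

```python
import torch.nn.functional as F

def gencamo_loss(unet, controlnet, z_t, t, tau_hat, f_q, noise, prototypes,
                 lambda1=1.0, lambda2=1.0):
    """Total objective L_total = λ1·L_LDM + λ2·L_DLC (illustrative sketch)."""
    # ε̂_θ = ε_θ(z_t, t, τ̂) + G_φ(F_Q): ControlNet residual added to the U-Net prediction.
    eps_hat = unet(z_t, t, tau_hat) + controlnet(f_q)
    # L_LDM: standard latent-diffusion noise regression against the sampled noise ε.
    l_ldm = F.mse_loss(eps_hat, noise)
    # L_DLC: nearest-prototype cosine distance (defined in the sketch above).
    l_dlc = depth_layout_coherence_loss(f_q, prototypes)
    return lambda1 * l_ldm + lambda2 * l_dlc
```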

5. Training Details

  • Base: Stable Diffusion v1.5 + ControlNet.
  • Image encoder: OpenCLIP ViT-H/14 (frozen).
  • Optimizer: AdamW with decoupled weight decay.
  • Default SD-1.5 hyperparameters (learning rate $\approx 1 \times 10^{-5}$, 250K steps).
  • Training set: all 34.2K GenCAMO-DB images.
  • DLCG & AMA heads fine-tuned for 50K steps.
  • Compute: 8×A100 (40 GB) GPUs, ~3 days.
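
The settings above map to a straightforward configuration and optimizer setup. The sketch below records the reported values; the `params` placeholder and the weight-decay value (the AdamW library default) are assumptions not stated in the text.

```python
import torch

# Reported settings: SD-1.5 + ControlNet, frozen OpenCLIP ViT-H/14, lr ≈ 1e-5,
# 250K base steps, 50K fine-tuning steps for the DLCG & AMA heads.
config = {
    "base_model": "stable-diffusion-v1-5",   # + ControlNet
    "image_encoder": "OpenCLIP ViT-H/14",    # frozen
    "learning_rate": 1e-5,
    "total_steps": 250_000,
    "head_finetune_steps": 50_000,           # DLCG & AMA heads
    "compute": "8 x A100 40GB, ~3 days",
}

def build_optimizer(params, cfg=config):
    # AdamW with decoupled weight decay, as stated; weight_decay is an assumed default.
    return torch.optim.AdamW(params, lr=cfg["learning_rate"], weight_decay=1e-2)
```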

6. Experimental Results and Evaluation

Camouflage Image-Mask Generation (CIG)

  • Metrics: FID (Fréchet Inception Distance) and KID (Kernel Inception Distance), computed on camouflaged, salient, and general categories.
  • Baselines: CI, DCI, LCGNet, LDM, LAKERED, “Camouflage Anything”, MIP-Adapter.
  • GenCAMO (I+T+D) achieves best FID = 18.49, KID = 0.0123.
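
FID and KID of the kind reported above can be computed with standard tooling. The sketch below uses torchmetrics; the choice of library, feature dimension, and subset size are assumptions for illustration, not the paper's stated evaluation code.

```python
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

def evaluate_fid_kid(real_loader, fake_loader):
    """Compute FID and KID between real and generated camouflage images (illustrative sketch)."""
    fid = FrechetInceptionDistance(feature=2048)
    kid = KernelInceptionDistance(subset_size=100)
    for real in real_loader:        # real: (B, 3, H, W) uint8 image tensors
        fid.update(real, real=True)
        kid.update(real, real=True)
    for fake in fake_loader:        # fake: generated samples in the same format
        fid.update(fake, real=False)
        kid.update(fake, real=False)
    kid_mean, _ = kid.compute()
    return fid.compute().item(), kid_mean.item()
```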

Synthetic-to-Real Dense Prediction (S2RCDP)

  • RGB/Depth COD: SINet, SINet-v2, RISNet+CSRDA. GenCAMO (synthetic + real) improves S-measure, F-measure, and E-measure; MAE drops from ~0.065 to ~0.043.
  • Open-Vocabulary Camouflage Segmentation (OVCOS): OVCamo model.
    • GenCAMO only: $cS_m$ = 0.579, $cF_v^\beta$ = 0.490.
    • Real + GenCAMO: $cS_m$ = 0.589, $cF_v^\beta$ = 0.518 (best results).

Ablation

  • Baseline (no DLCG/AMA): FID = 54.32, KID = 0.0239.
  • AMA only: 43.45/0.0172.
  • DLCG only: 42.57/0.0192.
  • DLCG+AMA: 38.45/0.0123.
  • Qualitative: improved semantic alignment, depth-consistent geometry, smoother occlusion blending, more accurate segmentation masks.

7. Limitations and Future Directions

Observed limitations include local artifacts under complex illumination (e.g., metallic glare, red goggles) and difficulty in accurate rendering under multiple light sources or hard shadows. Directions for future work involve incorporation of physics-aware lighting priors, higher-resolution scene-graph reasoning, and improved cross-modal feature alignment.


GenCAMO synthesizes advances in dataset scale and semantic richness (GenCAMO-DB) with a mask-free, multi-modal, scene-graph-guided diffusion generative architecture, providing superior performance and generalization for camouflage image synthesis and downstream dense prediction benchmarks (Chen et al., 3 Jan 2026).
