GenCAMO: Generative Camouflage Synthesis

Updated 10 January 2026

GenCAMO is a generative, environment-aware framework integrating multi-modal inputs like text, depth, and scene graphs for high-fidelity camouflage synthesis.
It leverages structured scene-graph conditioning and diffusion-based models to overcome limited annotated datasets and improve dense prediction accuracy.
The framework achieves superior performance in camouflage object detection and segmentation, with notable improvements in FID, KID, and other evaluation metrics.

GenCAMO denotes a generative, environment-aware, mask-free framework for high-fidelity camouflage image synthesis and dense annotation, specifically targeting advances in conceal dense prediction (CDP)—notably, RGB-D camouflage object detection and open-vocabulary camouflage object segmentation. The method addresses the challenge of limited large-scale, high-quality annotated camouflage datasets by leveraging structured scene-graph conditioning and multi-modal annotation to train diffusion-based generative models, resulting in improved synthetic and synthetic-to-real dense prediction accuracy on complex camouflage scenes (Chen et al., 3 Jan 2026).

GenCAMO-DB is a large-scale dataset comprising 34,200 images spanning natural, household, agricultural, and industrial scenes. It provides multi-modal annotation including:

RGB images;
Dense, single-channel depth maps (generated by “Depth Anything”);
Scene graphs $G = (O, E)$ with nodes %%%%1%%%% (object categories), edges $E = \{e_{ij}\}$ (relations such as “lies on”, “hides in”) and fine-grained concealment attributes $A = \{a_i\}$ (e.g. color, pattern, material);
Text prompts in structured SVO form including color, texture, spatial, and concealment cues (produced by GPT-4o, human-refined);
Implicit mask annotations, used only for evaluation/refinement.

Images are sourced from COCO-Stuff, Visual Genome, and multiple camouflage/segmentation/saliency datasets (CAMO, COD10K, NC4K, USC12K, LAKERED). Annotation involves a semi-automatic pipeline with human refinement (5–10 minutes per image). The final dataset contains $\sim$ 612,500 words of text and $\sim$ 102,600 scene-graph tuples, with images split for generative model training ( $N = 34.2$ K) and evaluation (e.g., 4040 for “camouflage + salient” training, 12,946 for testing).

2. Overall GenCAMO Architecture and Data Flow

GenCAMO builds upon Stable Diffusion v1.5 and ControlNet, introducing multi-modal conditioning that integrates reference images, text prompts, depth maps, and scene graphs. The dataflow proceeds as follows:

Text prompts $C_p$ are embedded using frozen OpenCLIP ViT-H/14: $E_{CLIP}(C_p)$ .
Reference image $C_r$ encoded by the CLIP-image branch: $F_e$ .
Depth map $C_d$ encoded via a compact CNN (“VisualEnc”): $F_D$ .
Scene graph objects, relations, and attributes produce embeddings $E^o_{emb}$ , $E^e_{emb}$ , $E^a_{emb}$ .
Scene-graph modules:
- Depth-Layout Coherence Guided ControlNet (DLCG)
- Attribute-Aware Mask Attention (AMA)
Cross-attention fuses text, image, depth, and scene-graph tokens.
Diffusion U-Net predicts noise $\varepsilon_\theta(z_t, t, \hat{\tau}, F_Q)$ where ControlNet injects $G_{\phi}(F_Q)$ .
Joint decoders generate the camouflage image, predicted depth map (MSE-trained), and coarse mask (DiffuMask style).
Post-refinement uses “Depth Anything” and SAM2 on generated depth/masks.

This architecture enables mask-free, geometry- and context-conscious generation.

3. Scene-Graph Contextual Decoupling and Mask-Free Generation

Scene-Graph Representation:

Graph $G=(O, E)$ encodes object semantics and spatial relationships, with each node/edge tuple $(a_i, o_i, e_{ij}, o_j, a_j)$ input into a GCN for attribute and context propagation.

Depth-Layout Coherence Guided ControlNet (DLCG):

Depth encoding: $F_D = \text{VisualEnc}(C_d)$ .
Layout encoding: $F_{lay} = \text{LayoutDec}(E^o_{emb} \odot E^e_{emb})$ .
Fusion: $F_Q = F_D + F_{lay} \cdot W^L$ with learned $W^L \in \mathbb{R}^{c \times c}$ .
Learnable tokens $T$ with cross-attention yield scene prototypes $P = \{p_m\}$ .
Depth-layout coherence enforced by minimizing cosine distance from each $F_Q(i)$ to nearest prototype $p_m$ .

Attribute-Aware Mask Attention (AMA):

Semantic encoding: $F_{sem} = \text{SemanticsDec}(E^o_{emb} \odot E^a_{emb})$ .
Object-level: $\hat{c}_i = F_{lay}^{(i)} \odot F_{sem}^{(i)}$ for entities, $c_{null}$ otherwise.
Features $[\tilde{V} \oplus \hat{C} \oplus \hat{E}^a]$ fused under a mask $M_{ij}$ (1 if $i, j$ share entity, $-\infty$ otherwise).
No ground-truth foreground masks are provided; mask decoder uses self-supervised DiffuMask loss, with final masks refined by SAM2.

This design decouples object/scene context and enforces semantic/geometry alignment without explicit mask supervision during training.

4. Mathematical Foundations

Depth-Layout Coherence Loss

Feature fusion: $F_Q = F_D + F_{lay} \cdot W^L$ .
Distance to prototype: $d_i = \min_m [1 - \cos(F_Q(i), p_m)]$ .
Loss: $L_{DLC} = \frac{1}{N} \sum_{i=1}^N d_i$ .

Attribute-Aware Mask Attention & Masking

$\hat{c}_i = F_{lay}^{(i)} \odot F_{sem}^{(i)}$ for $i \leq N_o$ , $c_{null}$ otherwise.
Mask $M_{ij} = 1$ if tokens $i,j$ belong to same object, $-\infty$ otherwise.

Diffusion Objective and Total Loss

U-Net noise: $\hat{\varepsilon}_\theta = \varepsilon_\theta(z_t, t, \hat{\tau}') + G_\phi(F_Q)$ .
Latent diffusion loss:

$L_{LDM} = \mathbb{E}_{z, \varepsilon \sim \mathcal{N}(0, I), t} \|\varepsilon - \hat{\varepsilon}_\theta\|_2^2$

Total loss:

$L_{total} = \lambda_1 L_{LDM} + \lambda_2 L_{DLC}$

(empirically $\lambda_1=\lambda_2=1$ .)

5. Training Details

Base: Stable Diffusion v1.5 + ControlNet.
Image encoder: OpenCLIP ViT-H/14 (frozen).
Optimizer: AdamW with decoupled weight decay.
Default SD-1.5 hyperparameters (learning rate $\approx 1\times10^{-5}$ , 250K steps).
Training set: all 34.2K GenCAMO-DB images.
DLCG & AMA heads fine-tuned for 50K steps.
Compute: 8 $\times$ A100 (40GB) GPUs, $\sim$ 3 days.

6. Experimental Results and Evaluation

Camouflage Image-Mask Generation (CIG)

Metrics: FID (Fréchet Inception Distance), KID, on camouflaged, salient, general categories.
Baselines: CI, DCI, LCGNet, LDM, LAKERED, “Camouflage Anything”, MIP-Adapter.
GenCAMO (I+T+D) achieves best FID = 18.49, KID = 0.0123.

Synthetic-to-Real Dense Prediction (S2RCDP)

RGB/Depth COD: SINet, SINet-v2, RISNet+CSRDA. GenCAMO (synthetic $\,+\,$ real) improves S-measure, F-measure, E-measure; MAE drops $\sim$ 0.065 → $\sim$ 0.043.
Open-Vocabulary Camouflage Segmentation (OVCOS): OVCamo model.
- GenCAMO only: $cS_m$ = 0.579, $cF_v^\beta$ = 0.490.
- Real+GenCAMO: $cS_m$ = 0.589, $cF_v^\beta$ = 0.518 (best results).

Ablation

Baseline (no DLCG/AMA): FID = 54.32, KID = 0.0239.
AMA only: 43.45/0.0172.
DLCG only: 42.57/0.0192.
DLCG+AMA: 38.45/0.0123.
Qualitative: improved semantic alignment, depth-consistent geometry, smoother occlusion blending, more accurate segmentation masks.

7. Limitations and Future Directions

Observed limitations include local artifacts under complex illumination (e.g., metallic glare, red goggles) and difficulty in accurate rendering under multiple light sources or hard shadows. Directions for future work involve incorporation of physics-aware lighting priors, higher-resolution scene-graph reasoning, and improved cross-modal feature alignment.

GenCAMO synthesizes advances in dataset scale and semantic richness (GenCAMO-DB) with a mask-free, multi-modal, scene-graph-guided diffusion generative architecture, providing superior performance and generalization for camouflage image synthesis and downstream dense prediction benchmarks (Chen et al., 3 Jan 2026).

PDF Markdown Chat (Pro)

References (1)

GenCAMO: Scene-Graph Contextual Decoupling for Environment-aware and Mask-free Camouflage Image-Dense Annotation Generation (2026)

Whiteboard

Generate a whiteboard explanation of this topic.

Topic to Video (Beta)

Generate a video overview of this topic.

Follow Topic

Get notified by email when new papers are published related to GenCAMO.

GenCAMO: Generative Camouflage Synthesis

2. Overall GenCAMO Architecture and Data Flow

3. Scene-Graph Contextual Decoupling and Mask-Free Generation

4. Mathematical Foundations

Depth-Layout Coherence Loss

Attribute-Aware Mask Attention & Masking

Diffusion Objective and Total Loss

5. Training Details

6. Experimental Results and Evaluation

Camouflage Image-Mask Generation (CIG)

Synthetic-to-Real Dense Prediction (S2RCDP)

Ablation

7. Limitations and Future Directions

Whiteboard

Topic to Video (Beta)

Follow Topic

Continue Learning

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research

GenCAMO: Generative Camouflage Synthesis

1. GenCAMO-DB: Multi-Modal Camouflage Dataset

2. Overall GenCAMO Architecture and Data Flow

3. Scene-Graph Contextual Decoupling and Mask-Free Generation

4. Mathematical Foundations

Depth-Layout Coherence Loss

Attribute-Aware Mask Attention & Masking

Diffusion Objective and Total Loss

5. Training Details

6. Experimental Results and Evaluation

Camouflage Image-Mask Generation (CIG)

Synthetic-to-Real Dense Prediction (S2RCDP)

Ablation

7. Limitations and Future Directions

Sponsor

Whiteboard

Topic to Video (Beta)

Follow Topic

Continue Learning

Related Topics

Don't miss out on important new AI/ML research

Sign up for free to explore the frontiers of research