Depth-Conditioned Generation Capability

Updated 17 December 2025
  • Depth-conditioned generation uses depth information (e.g., depth maps or depth tokens) to control image and scene synthesis, ensuring geometric consistency and precise spatial arrangements.
  • Architectural designs such as dual-path networks, joint RGBD modeling, and token-based fusion enable efficient depth signal injection and cross-modal attention, improving synthesis quality.
  • Challenges include accurate depth supervision, occlusion reasoning, and scalability, with future research focusing on unified joint models and robust multimodal fusion techniques.

Depth-conditioned generation capability refers to the explicit use of depth information—typically in the form of monocular or multi-view depth maps, depth tokens, or joint RGBD representations—as a controlling or guiding signal in image or scene synthesis models. This paradigm enables fine-grained spatial arrangement, geometric consistency, and controllable 3D content in generated outputs across diverse architectures including GANs, diffusion models, and autoregressive transformers. Depth serves as an expressive conditioning modality, surpassing semantic or edge cues in its ability to encode linear and planar structure, instance layout, and 3D correspondence.

1. Architectural Designs for Depth Conditioning

Depth conditioning architectures fall into several categories, each with mechanisms for injecting, fusing, or jointly modeling depth signals:

  • Dual-path networks and feature fusion: DepthGAN leverages a two-path generator, with a dedicated depth synthesis branch whose multi-scale features guide each block of the RGB appearance subnetwork via channel-wise fusion; a minimal sketch of this fusion pattern appears after this list. The switchable discriminator supports both RGBD real/fake classification and depth prediction (Shi et al., 2022).
  • Joint RGBD modeling via diffusion transformers: JointDiT models the full joint distribution $p_\theta(x, d)$ with modality-specific noise schedules and adaptive scheduling weights, dynamically controlling cross-modal attention based on relative noise levels. Depth-to-image generation is accomplished by fixing the depth branch at zero noise and sampling the RGB branch (Byung-Ki et al., 1 May 2025).
  • Token-based multimodal fusion: ContextAR encodes depth maps as VQ-VAE tokens with dedicated embeddings and hybrid positional encodings (RoPE+LPE), embedding all conditions directly in a unified transformer input sequence. Conditional Context-aware Attention ensures intra-condition perception and efficient compute (Chen et al., 18 May 2025).
  • GANs with depth-branch spatial mixing: StyLandGAN and GMPI (Generative Multiplane Images) extend StyleGAN to integrate spatial depth maps or multiplane alpha maps conditioned on explicit depths, thereby preserving view consistency and geometric structure (Lee et al., 2022, Zhao et al., 2022).
  • Decoupled multi-instance synthesis: 3DIS separates depth-based instance positioning and attribute rendering, generating coarse composite depth layouts followed by detail rendering for each instance using finetuning-free depth-conditioned ControlNet injection (Zhou et al., 16 Oct 2024).
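
To make the channel-wise fusion pattern concrete, here is a minimal PyTorch sketch of a depth-guided generator block in the spirit of DepthGAN's dual-path design; the module and layer names (DepthGuidedBlock, to_scale, to_shift) are illustrative assumptions rather than the paper's released code.

```python
import torch
import torch.nn as nn


class DepthGuidedBlock(nn.Module):
    """One RGB generator block whose activations are modulated, channel-wise,
    by features from a parallel depth-synthesis branch (simplified sketch)."""

    def __init__(self, rgb_channels: int, depth_channels: int):
        super().__init__()
        self.rgb_conv = nn.Conv2d(rgb_channels, rgb_channels, 3, padding=1)
        # Project depth features to a per-channel scale and shift for the RGB path.
        self.to_scale = nn.Conv2d(depth_channels, rgb_channels, 1)
        self.to_shift = nn.Conv2d(depth_channels, rgb_channels, 1)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        h = self.act(self.rgb_conv(rgb_feat))
        # Channel-wise fusion: depth features gate and shift the RGB activations.
        scale = torch.sigmoid(self.to_scale(depth_feat))
        shift = self.to_shift(depth_feat)
        return h * scale + shift


# Toy usage: 64-channel RGB features guided by 32-channel depth features at 16x16.
block = DepthGuidedBlock(rgb_channels=64, depth_channels=32)
out = block(torch.randn(1, 64, 16, 16), torch.randn(1, 32, 16, 16))
print(out.shape)  # torch.Size([1, 64, 16, 16])
```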

2. Mathematical Formulations and Training Objectives

Depth conditioning is operationalized through specialized loss functions, noise scheduling, and guidance mechanisms:

  • Joint diffusion and flow-matching objectives: JointDiT and UniCon introduce joint noise processes for each modality, optimizing model vector fields via a Joint Conditional Flow Matching loss and, at inference, controlling the noise schedule to condition on depth; a simplified loss sketch appears after this list (Byung-Ki et al., 1 May 2025, Li et al., 15 Oct 2024).
  • Classifier-free and pseudo-label guidance: DAG extracts intermediate features from frozen DDPMs for label-efficient depth prediction, then adds gradients of pseudo-label depth consistency and depth-domain priors during sampling to steer the generation towards geometric realism (Kim et al., 2022).
  • Score Distillation Sampling (SDS) with depth control: EucliDreamer and Control3D-IP optimize SDS objectives where rendered mesh depth maps are injected as conditioning channels at every U-Net scale, enabling backpropagation through both the diffusion model and the differentiable renderer (Le et al., 16 Apr 2024, Lee et al., 27 Nov 2025).
  • Per-label token embedding and transformer fusion: Spatially Multi-conditional Image Generation employs affine GeLU embeddings for each per-pixel spatial label (including depth), merging via pixel-wise self-attention and averaging to produce a concept tensor for GAN synthesis (Chakraborty et al., 2022).
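
As a heavily simplified illustration of a joint flow-matching objective with per-modality noise levels, the sketch below draws independent noise levels for the RGB and depth branches and regresses both velocity fields along a linear interpolation path; the TinyJointBackbone and the exact path are assumptions for illustration, not the JointDiT or UniCon formulation verbatim.

```python
import torch
import torch.nn as nn


class TinyJointBackbone(nn.Module):
    """Placeholder joint model: concatenates both noisy modalities and their noise
    levels, then predicts a velocity field for each (3-channel RGB, 1-channel depth)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3 + 1 + 2, 3 + 1, 3, padding=1)

    def forward(self, z_rgb, z_depth, t_rgb, t_depth):
        h, w = z_rgb.shape[-2:]
        t_maps = torch.cat([t_rgb.expand(-1, 1, h, w), t_depth.expand(-1, 1, h, w)], dim=1)
        out = self.net(torch.cat([z_rgb, z_depth, t_maps], dim=1))
        return out[:, :3], out[:, 3:]


def joint_flow_matching_loss(model, x_rgb, x_depth):
    """Each modality gets its own noise level t; the loss sums the two velocity
    regression terms along a linear (rectified-flow style) interpolation path."""
    b = x_rgb.shape[0]
    t_rgb, t_depth = torch.rand(b, 1, 1, 1), torch.rand(b, 1, 1, 1)
    noise_rgb, noise_depth = torch.randn_like(x_rgb), torch.randn_like(x_depth)
    z_rgb = (1 - t_rgb) * x_rgb + t_rgb * noise_rgb
    z_depth = (1 - t_depth) * x_depth + t_depth * noise_depth
    v_rgb_pred, v_depth_pred = model(z_rgb, z_depth, t_rgb, t_depth)
    loss_rgb = ((v_rgb_pred - (noise_rgb - x_rgb)) ** 2).mean()
    loss_depth = ((v_depth_pred - (noise_depth - x_depth)) ** 2).mean()
    return loss_rgb + loss_depth


loss = joint_flow_matching_loss(TinyJointBackbone(),
                                torch.randn(2, 3, 32, 32), torch.randn(2, 1, 32, 32))
# Depth-to-image conditioning at inference corresponds to holding t_depth at 0
# (clean depth) while integrating the RGB branch from t_rgb = 1 down to 0.
```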

3. Controllability, Manipulability, and Editing

Depth conditioning confers a spectrum of control mechanisms over generation:

  • Direct geometry manipulation: Varying the input depth map in Control3D-IP or EucliDreamer directly morphs the shape or extrusion of generated objects, demonstrating that models faithfully use depth to control 3D geometry (Lee et al., 27 Nov 2025, Le et al., 16 Apr 2024).
  • Generalized proxy-depth control: LooseControl introduces flexible predicates on scene boundaries or 3D boxes, converting rough user-supplied layouts into proxy depth maps that serve as upper bounds or object location constraints; a toy box-to-depth rasterization sketch appears after this list. LoRA-based fine-tuning enables robust adherence to these loose controls while preserving diversity (Bhat et al., 2023).
  • Instance-level composition and fusion: 3DIS utilizes SAM-extracted masks from depth-based layouts to prevent attribute bleeding and guarantee spatial layering in multi-instance compositions. The rendering pipeline merges instance-wise features via softmax fusion, supporting both global and local editing without further finetuning (Zhou et al., 16 Oct 2024).
  • Style and attribute editing: Key-value locking in LooseControl and cross-attention masking in Compose & Conquer preserve overall style while enabling localized changes in instance geometry or semantics, supporting both 3D box editing and latent-space attribute directionality (Bhat et al., 2023, Lee et al., 17 Jan 2024).
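
As a toy illustration of proxy-depth control, the sketch below rasterizes rough user-specified boxes into a proxy depth map: each pixel takes the nearest depth among the boxes covering it, with a flat background elsewhere. The function and its box format are illustrative assumptions, not LooseControl's actual predicate machinery.

```python
import numpy as np


def boxes_to_proxy_depth(height, width, boxes, background_depth=10.0):
    """Rasterize rough layout boxes into a proxy depth map.

    Each box is (x0, y0, x1, y1, depth) in pixel coordinates; a pixel covered by
    several boxes keeps the smallest (nearest) depth, acting as an upper bound on
    scene depth at that location."""
    depth = np.full((height, width), background_depth, dtype=np.float32)
    for x0, y0, x1, y1, d in boxes:
        depth[y0:y1, x0:x1] = np.minimum(depth[y0:y1, x0:x1], d)
    return depth


# Two overlapping boxes: the closer one (depth 2.0) wins inside the intersection.
proxy = boxes_to_proxy_depth(64, 64, [(8, 8, 40, 40, 5.0), (24, 24, 60, 60, 2.0)])
print(proxy.min(), proxy.max())  # 2.0 10.0
```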

4. Quantitative and Qualitative Evaluation Metrics

State-of-the-art models evaluate depth-conditioned generation using geometry-aware and perceptual metrics:

| Method    | FID (RGB) | FID (Depth) | Depth MAE | AbsRel (%) | Human Pref. (%) |
|-----------|-----------|-------------|-----------|------------|-----------------|
| DepthGAN  | 4.80      | 17.14       | N/A       | N/A        | N/A             |
| JointDiT  | 12.62     | N/A         | N/A       | 6.99       | 30.73           |
| UniCon    | 13.21     | N/A         | 0.0990    | 9.26       | N/A             |
| ContextAR | N/A       | N/A         | 165.56    | N/A        | N/A             |
| 3DIS      | 23.2      | N/A         | N/A       | N/A        | N/A             |
  • Metrics such as FID, AbsRel, depth MAE, and CLIP similarity characterize photorealism, depth accuracy, and semantic correspondence; minimal implementations of the depth-error metrics appear after this list.
  • Qualitative studies highlight consistent depth gradients, geometric coherence under pose changes, and mask-based layer ordering.
  • Human preference evaluations consistently favor JointDiT, LooseControl, and EucliDreamer outputs over baselines, particularly for their geometric faithfulness and controllability.
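
For reference, minimal implementations of the two depth-error metrics in the table, as they are commonly defined in the depth-estimation literature, are sketched below; published protocols differ in details such as scale alignment and validity masking.

```python
import numpy as np


def depth_mae(pred, gt):
    """Mean absolute error between predicted and reference depth maps."""
    return float(np.mean(np.abs(pred - gt)))


def abs_rel(pred, gt, eps=1e-6):
    """Absolute relative error: mean of |pred - gt| / gt over valid (positive) depths."""
    valid = gt > eps
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))


# Synthetic check: depths in [0.5, 10.0] m perturbed by Gaussian noise (std 0.1).
gt = np.random.uniform(0.5, 10.0, size=(240, 320))
pred = gt + np.random.normal(0.0, 0.1, size=gt.shape)
print(depth_mae(pred, gt), abs_rel(pred, gt))  # roughly 0.08 and 0.025 here
```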

5. Limitations and Open Challenges

While depth conditioning enhances spatial controllability, models face several limitations:

  • Depth supervision quality: The fidelity of generated geometry hinges on the accuracy and coverage of depth predictors (MiDaS, Depth-Anything), which often miss small objects or deliver pseudo-ground-truth lacking metric scale (Shi et al., 2022, Lee et al., 17 Jan 2024).
  • Complex mask and occlusion reasoning: MPI and multiplane representations are limited in occlusion modeling and non-Lambertian phenomena, with residual aliasing at silhouette boundaries (Zhao et al., 2022).
  • Limited control space: Most frameworks restrict pose control to fixed axes or limited range; full 6-DOF control and dynamic scene content remain open avenues (Lee et al., 27 Nov 2025).
  • Scalability: Larger joint models like JointDiT introduce parameter and inference overhead, trading off sampling speed against modularity (Byung-Ki et al., 1 May 2025).
  • Sparse and multimodal fusion: Efficient handling of missing, noisy, or partial depth information, and robust fusion with semantic or style conditions, is still an active research area (Chakraborty et al., 2022, Chen et al., 18 May 2025).

6. Extensions, Adaptability, and Future Prospects

Recent trends point towards broader and more flexible deployment of depth-conditioned generation:

  • Unified joint modeling: JointDiT and UniCon demonstrate that a single model can enable RGB-to-depth, depth-to-image, and joint RGBD synthesis by adjusting noise schedules and cross-modal fusion, obviating the need for modality-specific adapters (Byung-Ki et al., 1 May 2025, Li et al., 15 Oct 2024).
  • Finetuning-free and adapter-based integration: 3DIS shows that coarse depth layouts and instance-wise rendering can be achieved without additional training, leveraging finetuning-free detail renderers and pretrained backbones (Zhou et al., 16 Oct 2024).
  • Multi-condition, token-based autoregression: ContextAR embeds arbitrary spatial conditions, including depth, in a flexible token stream, supporting seamless stacking of control modalities and robust handling through hybrid positional encoding and attention masks; a schematic token-assembly sketch appears after this list (Chen et al., 18 May 2025).
  • Editing and compositionality workflows: LooseControl and Compose & Conquer make interactive scene editing (style-preserving, geometry-editing, attribute control) tractable for designers through proxy depth predicates, cross-attention gating, and disentangled depth streams (Bhat et al., 2023, Lee et al., 17 Jan 2024).
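
To ground the token-based formulation, the sketch below assembles depth, text, and image tokens into a single transformer input with per-modality type embeddings and builds a block attention mask in which condition tokens attend only within their own block while image tokens attend causally over the full prefix. Names, vocabulary sizes, and the mask layout are illustrative assumptions rather than ContextAR's exact design, which additionally uses hybrid RoPE+LPE positional encodings.

```python
import torch
import torch.nn as nn


def build_sequence(depth_tokens, text_tokens, image_tokens, tok_embed, type_embed):
    """Concatenate condition tokens (depth, text) with target image tokens and add
    a per-modality type embedding; all inputs are integer token-id tensors (B, L)."""
    seq = torch.cat([depth_tokens, text_tokens, image_tokens], dim=1)
    types = torch.cat([torch.zeros_like(depth_tokens),        # modality id 0: depth
                       torch.ones_like(text_tokens),          # modality id 1: text
                       torch.full_like(image_tokens, 2)],     # modality id 2: image
                      dim=1)
    return tok_embed(seq) + type_embed(types)


def block_attention_mask(lengths):
    """Boolean (L, L) mask: each condition block attends only within itself; the
    final (image) block attends causally over everything that precedes each token."""
    total = sum(lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for i, n in enumerate(lengths):
        end = start + n
        if i < len(lengths) - 1:
            mask[start:end, start:end] = True   # intra-condition attention only
        else:
            mask[start:end, :end] = torch.tril(
                torch.ones(n, end), diagonal=start).bool()
        start = end
    return mask


tok_embed, type_embed = nn.Embedding(1024, 64), nn.Embedding(3, 64)
depth_t = torch.randint(0, 1024, (2, 16))
text_t = torch.randint(0, 1024, (2, 8))
image_t = torch.randint(0, 1024, (2, 32))
x = build_sequence(depth_t, text_t, image_t, tok_embed, type_embed)
mask = block_attention_mask([16, 8, 32])
print(x.shape, mask.shape)  # torch.Size([2, 56, 64]) torch.Size([56, 56])
```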

These advances position depth-conditioned generation as a key enabler for 3D-aware content synthesis, controllable scene composition, and automated geometric design across image, video, and 3D domains. The versatility and modularity of depth conditioning—be it through direct fusion, joint modeling, or compositional editing—are driving ongoing research towards richer geometric realism, scalable control, and cross-modal adaptability.
