Color-Conditioned Depth & Normal Generation

Updated 22 May 2026

Color-conditioned depth and normal generation is a process that recovers dense per-pixel geometry aligned with RGB imagery.
It leverages diffusion models, latent VAEs, and cross-modal attention to enforce spatial coherence and structural consistency among color, depth, and normal predictions.
The technique underpins applications like text-to-3D synthesis, monocular geometry prediction, and image inpainting, achieving state-of-the-art performance on multiple benchmarks.

Color-conditioned depth and normal generation refers to the process of producing scene geometry representations—specifically depth maps (D) and surface normal fields (N)—that are spatially coherent and structurally aligned with a reference color (RGB) image. This paradigm is central to monocular geometry estimation, text-to-3D synthesis, image-to-geometry inpainting, neural rendering, and perception systems operating in complex or degenerate conditions (e.g., transparency, sparse depth).

1. Problem Formulation and Motivation

Color-conditioned geometry generation aims to recover, from a single RGB image, a dense per-pixel depth map and surface normal field that are structurally consistent with the image’s content. In recent work, diffusion-based models, latent-variable VAEs, and cross-modal attention mechanisms have supplanted earlier deterministic regression approaches. The principal challenge is to ensure tight spatial alignment (“cross-modal consistency”) and intrinsic coherence among color, depth, and normal predictions by exploiting the mutual dependencies among the three modalities.

Unlike traditional models that estimate depth or normals independently from images, leading to potential geometric and photometric misalignment, joint or color-conditioned models aim to fuse color and geometry in a unified, structured manner (Krishnan et al., 22 Jan 2025, Qiu et al., 2023, Han et al., 2 Dec 2025, Qiu et al., 2018, Xu et al., 29 Dec 2025).

2. Diffusion and Latent Variable Architectures for Joint Generation

State-of-the-art diffusion-based frameworks approach this problem by (1) encoding images, depth, and normals into a shared latent space using a VAE and (2) defining a generative (denoising) process in this space, conditioned on the color latent. The diffusion process can target single frames or video sequences and is trained by stochastic gradient descent on a mean-squared noise/velocity matching objective.
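
The velocity-matching target mentioned above is commonly defined through the v-parameterization. As a sketch, assuming a variance-preserving schedule with $\alpha_t^2 + \sigma_t^2 = 1$ (the schedule and the symbols $\alpha_t$, $\sigma_t$, $v_t$, $v_\theta$, and $\mathcal{L}_{\rm v}$ are notation introduced here, not taken from the cited models), with $x_0$ the clean geometry latent and the color latent supplied as conditioning:

$x_t = \alpha_t x_0 + \sigma_t \epsilon, \qquad v_t = \alpha_t \epsilon - \sigma_t x_0$

$\mathcal{L}_{\rm v} = \mathbb{E}_{t,\epsilon,x_0} \|v_t - v_\theta(x_t, t, \text{cond})\|^2$

Under this assumption, both the clean latent and the noise can be recovered from a velocity prediction, $x_0 = \alpha_t x_t - \sigma_t v_t$ and $\epsilon = \sigma_t x_t + \alpha_t v_t$, so noise matching and velocity matching differ in how errors are weighted across timesteps rather than in what the network can represent.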

Key model and training elements

Model	Conditioning Pathway	Joint Modeling
Orchid (Krishnan et al., 22 Jan 2025)	VAE joint encoding x_c, x_d, x_n	LDM in shared z
RichDreamer (Qiu et al., 2023)	Latent concat + U-Net	Text or image→(N,D)
LumiX (Han et al., 2 Dec 2025)	Color latent input, QBA self-attn	Multi-map U-Net
DKT (Xu et al., 29 Dec 2025)	Video VAE, DiT, concat RGB+depth	Spatiotemporal
DeepLiDAR (Qiu et al., 2018)	Parallel fusion (convnet)	Normal-guided

Orchid, RichDreamer, and LumiX introduce multi-modal or multi-map latent diffusion, allowing zero-shot or image-conditioned generation of geometry. In these models, the color latent may be injected into the diffusion U-Net by concatenation, channel-wise fusion in early layers, or via cross-attention. Temporal methods (DKT) extend this principle to video and transparent or reflective domains.
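
The concatenation pathway can be illustrated with a minimal NumPy sketch of one noise-matching loss evaluation. It is not the implementation of any cited model: the latent shapes, the fixed noise level `alpha_bar_t`, and `toy_denoiser` (a per-pixel linear map standing in for the diffusion U-Net, which would also receive the timestep) are all assumptions made for illustration, and the latents are random arrays rather than VAE encodings.

```python
import numpy as np

def add_noise(z0, alpha_bar_t, rng):
    """Forward process: blend the clean latent with Gaussian noise."""
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return z_t, eps

def toy_denoiser(x, weight):
    """Hypothetical stand-in for the U-Net: a linear map over channels."""
    return np.einsum("oc,chw->ohw", weight, x)

def color_conditioned_loss(z_color, z_depth, z_normal, alpha_bar_t, weight, rng):
    """Noise-matching MSE with the clean color latent concatenated as conditioning."""
    z_geom = np.concatenate([z_depth, z_normal], axis=0)   # joint geometry latent
    z_t, eps = add_noise(z_geom, alpha_bar_t, rng)          # only geometry is noised
    model_input = np.concatenate([z_color, z_t], axis=0)    # clean color + noised geometry
    eps_pred = toy_denoiser(model_input, weight)
    return float(np.mean((eps_pred - eps) ** 2)), model_input

rng = np.random.default_rng(0)
# Assumed latent shape: 4 channels on an 8x8 grid per modality.
z_color, z_depth, z_normal = (rng.standard_normal((4, 8, 8)) for _ in range(3))
weight = rng.standard_normal((8, 12)) * 0.1                 # 12 input -> 8 output channels
loss, model_input = color_conditioned_loss(z_color, z_depth, z_normal, 0.5, weight, rng)
print(model_input.shape, loss)
```

The point of the sketch is that the color channels enter the network untouched while only the geometry channels carry noise, so the denoiser can always read image structure directly.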

3. Color Conditioning Strategies

Color guidance is critical to ensure that depth and normal estimates respect image boundaries, appearance cues, and semantic edges. Several conditioning strategies have been formalized:

Channel-wise concatenation: The clean color latent is concatenated with noised depth/normal latents for input into the diffusion backbone (Krishnan et al., 22 Jan 2025, Xu et al., 29 Dec 2025, Han et al., 2 Dec 2025).
Query-Broadcast Attention (QBA): In LumiX, the query tensor from the color branch is broadcast to the depth and normal branches in each attention block. For every U-Net self-attention layer, $H^{(m)\prime} = \mathrm{softmax}(Q^{(c)}K^{(m)\top}/\sqrt{d})V^{(m)}$ for $m \in \{d,n\}$, forcing geometry to “look at” color’s structural content (Han et al., 2 Dec 2025).
Tensor LoRA: Low-rank adaptation that models the cross-modal attention perturbations with few trainable parameters, allowing adaptation without catastrophic forgetting (Han et al., 2 Dec 2025).
Multi-view or Camera Embedding: In text-to-3D and multi-view setups, camera extrinsics are embedded and fused with conditioning streams (Qiu et al., 2023).

Collectively, these mechanisms enforce edge alignment and cross-modal structural consistency, and they strengthen the correlation between appearance and geometry.
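
The QBA formula above can be written out in a few lines of NumPy. This is a single-head sketch under stated assumptions, not LumiX code: the token layout `(tokens, channels)`, the projection matrices `w_q`, `w_k`, `w_v`, and the choice of per-branch key and value projections are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_broadcast_attention(h_color, h_map, w_q, w_k, w_v):
    """H' = softmax(Q_c K_m^T / sqrt(d)) V_m, with queries from the color branch.

    h_color: color-branch tokens, shape (T, C)
    h_map:   depth- or normal-branch tokens, shape (T, C)
    """
    q_c = h_color @ w_q            # queries come from color, not from the map itself
    k_m = h_map @ w_k              # keys and values stay in the geometry branch
    v_m = h_map @ w_v
    d = q_c.shape[-1]
    weights = softmax(q_c @ k_m.T / np.sqrt(d), axis=-1)
    return weights @ v_m, weights

rng = np.random.default_rng(0)
tokens, channels, d = 16, 32, 8
h_color = rng.standard_normal((tokens, channels))
h_depth = rng.standard_normal((tokens, channels))
h_normal = rng.standard_normal((tokens, channels))
w_q = rng.standard_normal((channels, d))
w_k = rng.standard_normal((channels, d))
w_v = rng.standard_normal((channels, d))

# The same color queries drive attention in both geometry branches.
out_depth, attn_depth = query_broadcast_attention(h_color, h_depth, w_q, w_k, w_v)
out_normal, attn_normal = query_broadcast_attention(h_color, h_normal, w_q, w_k, w_v)
```

Because the query is shared, where each geometry branch attends is determined by color features, while what it reads remains branch-specific through its own keys and values.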

4. Loss Functions and Training Objectives

Training objectives are structured to penalize cross-modal inconsistency, enforce geometric plausibility, and regularize statistical priors:

Denoising/diffusion loss:

$\mathcal{L}_{\rm diff} = \mathbb{E}_{t,\epsilon,x_0} \|\epsilon - \epsilon_\theta(x_t, t, \text{cond})\|^2$

for all map modalities, where the conditioning can be textual, color, or camera-based.

KL and Latent Regularization: VAE latent KL divergence and, where applicable, latent-distillation loss to maintain VAE stability (Krishnan et al., 22 Jan 2025).
Geometry-Consistency and Edge Losses: Additional depth-normal orthogonality term $\mathbb{E}_p|\nabla D(p)\cdot N(p)|$ and unit-normal constraint $\mathbb{E}_p(\|N(p)\|_2-1)^2$ (Han et al., 2 Dec 2025).
Appearance-Guided Losses: For color-conditioned learning, RGB and albedo priors are imposed through further diffusion models (Qiu et al., 2023).
Inpainting and Masked Losses: Latent- or pixel-space masking for joint inpainting, e.g., RePaint-style, without the need for retraining (Krishnan et al., 22 Jan 2025).
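
The two geometry-consistency terms listed above can be sketched as follows. The unit-normal term follows the stated formula directly. The orthogonality term needs an interpretation, because a 2D depth gradient and a 3D normal do not share a dimension; the version here assumes an orthographic camera and penalizes the normal's projection onto the two surface tangents. That interpretation, and both function names, are assumptions rather than the formulation of the cited work.

```python
import numpy as np

def unit_normal_loss(normals):
    """E_p (||N(p)||_2 - 1)^2 for a normal field of shape (H, W, 3)."""
    norms = np.linalg.norm(normals, axis=-1)
    return float(np.mean((norms - 1.0) ** 2))

def depth_normal_orthogonality_loss(depth, normals):
    """Mean |t . N| over the two image-axis tangents of the depth surface.

    Assumption: orthographic camera, so the surface z = D(x, y) has tangents
    t_x = (1, 0, dD/dx) and t_y = (0, 1, dD/dy), which a correct normal is
    perpendicular to.
    """
    d_dy, d_dx = np.gradient(depth)      # axis 0 is rows (y), axis 1 is columns (x)
    ones, zeros = np.ones_like(depth), np.zeros_like(depth)
    t_x = np.stack([ones, zeros, d_dx], axis=-1)
    t_y = np.stack([zeros, ones, d_dy], axis=-1)
    dot_x = np.abs(np.sum(t_x * normals, axis=-1))
    dot_y = np.abs(np.sum(t_y * normals, axis=-1))
    return float(np.mean(dot_x + dot_y))

# Sanity check on a tilted plane D = 0.5 x + 0.2 y + 3, whose normal is
# proportional to (-0.5, -0.2, 1) everywhere.
ys, xs = np.mgrid[0:16, 0:16]
plane_depth = 0.5 * xs + 0.2 * ys + 3.0
n = np.array([-0.5, -0.2, 1.0])
plane_normals = np.broadcast_to(n / np.linalg.norm(n), (16, 16, 3))
print(unit_normal_loss(plane_normals),
      depth_normal_orthogonality_loss(plane_depth, plane_normals))
```

Both terms vanish for a consistent depth and normal pair and grow when the predicted normals drift away from the surface implied by the depth map.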

Loss aggregation weights and multi-stage training schedules are tuned to enforce stable reconstruction and guide the intermediate representations (e.g., normals in DeepLiDAR (Qiu et al., 2018)).

5. Applications: 3D Reconstruction, Perception, and Inpainting

Color-conditioned depth and normal generation models are foundational for:

Text-to-3D synthesis (Qiu et al., 2023, Han et al., 2 Dec 2025): Enabling pipelines to leverage language and image priors to jointly produce geometry and appearance for 3D object/scene modeling, with downstream mesh extraction and PBR material optimization.
Monocular geometry prediction (Krishnan et al., 22 Jan 2025, Qiu et al., 2018): Providing dense per-pixel geometry for indoor, outdoor, and transparent scenes, with state-of-the-art zero-shot depth and normal accuracy.
Intrinsic decomposition (Han et al., 2 Dec 2025, Qiu et al., 2023): Consistent estimation of albedo, normal, and depth maps from color images, useful for rendering, relighting, and image editing.
Video and transparency perception (Xu et al., 29 Dec 2025): Robust temporally-consistent geometry prediction for transparent and reflective materials, leveraged in manipulation, grasping, and robotics.
Image inpainting and completion (Krishnan et al., 22 Jan 2025): Joint infilling of missing color, depth, and normal regions, guided by the intrinsic structural relationships learned by the models.
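
The training-free inpainting idea can be sketched as a RePaint-style sampling loop: at every reverse step the observed region is overwritten with a freshly noised copy of the known content, and only the masked region comes from the denoiser. This is a simplified illustration that omits RePaint's resampling jumps; `denoise_step` is a hypothetical callable standing in for a trained joint model, and the schedule `alpha_bar` is an assumption.

```python
import numpy as np

def repaint_style_sample(z_known, known_mask, denoise_step, alpha_bar, rng):
    """Reverse diffusion that re-imposes known latent content at every step.

    z_known:      clean latent, valid wherever known_mask == 1
    known_mask:   1 for observed entries, 0 for entries to be generated
    denoise_step: callable (z_t, t) -> z_{t-1}, a stand-in for a trained model
    alpha_bar:    cumulative signal level per step, with alpha_bar[0] == 1 (clean)
    """
    num_steps = len(alpha_bar) - 1
    z_t = rng.standard_normal(z_known.shape)          # start from pure noise
    for t in range(num_steps, 0, -1):
        a_prev = alpha_bar[t - 1]
        noise = rng.standard_normal(z_known.shape)
        # Known region: sample z_{t-1} directly from the forward process.
        known_prev = np.sqrt(a_prev) * z_known + np.sqrt(1.0 - a_prev) * noise
        # Unknown region: take one reverse step with the model.
        unknown_prev = denoise_step(z_t, t)
        z_t = known_mask * known_prev + (1.0 - known_mask) * unknown_prev
    return z_t

rng = np.random.default_rng(0)
z_known = rng.standard_normal((4, 8, 8))
known_mask = np.zeros((4, 8, 8))
known_mask[:, :, :4] = 1.0                            # left half observed
alpha_bar = np.linspace(1.0, 0.01, 11)                # assumed 10-step schedule
toy_step = lambda z, t: 0.9 * z                       # placeholder, not a real denoiser
z_out = repaint_style_sample(z_known, known_mask, toy_step, alpha_bar, rng)
```

Applied to a joint latent, the mask can cover any combination of color, depth, and normal channels, so the same loop handles color-conditioned geometry prediction and joint infilling.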

A plausible implication is that advances in diffusion-based joint modeling have surpassed task-specific regression in both accuracy and cross-modal consistency, especially in under-constrained settings (e.g., sparse, occluded, or incomplete data).

6. Quantitative Results

Recent models set state-of-the-art metrics on several scene understanding benchmarks:

Orchid (Krishnan et al., 22 Jan 2025): AbsRel $\approx 5.7\%$ and $\delta_1 \approx 96.9\%$ for zero-shot depth on NYUv2, mean normal error $15.2^\circ$ (60.6% within $11.25^\circ$), and depth-normal consistency error near $0.04$, improving on the prior best.
RichDreamer (Qiu et al., 2023): State-of-the-art NeRF-optimized geometry CLIP and appearance CLIP scores (prior: 17.54/26.41), with LAION pretraining critical to geometric fidelity.
LumiX (Han et al., 2 Dec 2025): Reduces RMSE and AbsRel for depth by 38–45% and lowers mean normal error (MAE) relative to baselines.
DKT (Xu et al., 29 Dec 2025): On ClearPose, lower REL than prior work, together with a reported RMSE and temporally stable reconstructions in transparent scenes.
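
For reference, the metrics quoted above follow standard definitions, sketched here in NumPy. The function names are illustrative, predictions are assumed to be strictly positive, and the scale or shift alignment that zero-shot depth protocols often apply before scoring is not included.

```python
import numpy as np

def depth_metrics(pred, gt):
    """AbsRel, delta_1, and RMSE over pixels with valid (positive) ground truth."""
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)                    # fraction within a 1.25x ratio
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    return {"abs_rel": float(abs_rel), "delta1": float(delta1), "rmse": float(rmse)}

def normal_metrics(pred, gt):
    """Mean angular error in degrees and the fraction of pixels under 11.25 degrees."""
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=-1, keepdims=True)
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)
    angles = np.degrees(np.arccos(cos))
    return {"mean_deg": float(np.mean(angles)),
            "within_11_25": float(np.mean(angles < 11.25))}

# Tiny hand-checkable example; the last depth pixel has no ground truth.
gt_depth = np.array([1.0, 2.0, 4.0, 0.0])
pred_depth = np.array([1.1, 2.0, 6.0, 5.0])
d = depth_metrics(pred_depth, gt_depth)

gt_normals = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
pred_normals = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 0.0]])
n = normal_metrics(pred_normals, gt_normals)
```

Lower is better for AbsRel, RMSE, and mean angular error, while higher is better for $\delta_1$ and the $11.25^\circ$ inlier fraction.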

User studies and ablations report that joint modeling, multi-modal latent diffusion, and dedicated cross-modal conditioning each contribute measurable improvements in realism and perceptual quality.

7. Architectural and Algorithmic Innovations

Several innovations uniquely benefit color-conditioned depth and normal generation:

Joint or multi-map latent diffusion (Krishnan et al., 22 Jan 2025, Han et al., 2 Dec 2025) for improved cross-modal consistency.
Structured attention mechanisms (QBA) and Tensor LoRA (Han et al., 2 Dec 2025) for efficient and aligned fusion of color and geometry.
Explicit multi-stage or multi-pathway fusion (Qiu et al., 2018), including normal-as-intermediate strategies and adaptive confidence maps for outlier rejection.
Large-scale pretraining on image–geometry pairs (e.g., LAION with monocular priors (Qiu et al., 2023), Omnidata (Krishnan et al., 22 Jan 2025), TransPhy3D (Xu et al., 29 Dec 2025)).
Plug-and-play flexibility: Models serve for image→geometry regression, text→3D, or inpainting and enable efficient adaptation (LoRA), transfer, and domain generalization.
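
The efficient-adaptation point can be made concrete with a sketch of ordinary LoRA on a single linear layer. This is the standard low-rank update, shown only to illustrate the parameter savings; it is not LumiX's Tensor LoRA, whose structure is not described here, and the class name, rank, and scaling are illustrative choices.

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update: W + (alpha / r) * B A."""

    def __init__(self, weight, rank, alpha, rng):
        out_dim, in_dim = weight.shape
        self.weight = weight                                   # frozen pretrained weight
        self.a = rng.standard_normal((rank, in_dim)) * 0.01    # trainable down-projection
        self.b = np.zeros((out_dim, rank))                     # trainable, zero-initialized
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.a.T) @ self.b.T
        return base + self.scale * update

    def num_trainable(self):
        return self.a.size + self.b.size

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))
layer = LoRALinear(w, rank=4, alpha=8.0, rng=rng)
x = rng.standard_normal((10, 64))
y = layer(x)
print(layer.num_trainable(), w.size)
```

Because `b` starts at zero, the adapted layer initially reproduces the pretrained one exactly, which is one reason low-rank adapters are less prone to disturbing what the base model already knows.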

These advances collectively anchor the state of the art in color-conditioned geometric generation, spanning general scenes, transparency, and physically structured decomposition.