Color-Conditioned Depth & Normal Generation

Updated 22 May 2026
  • Color-conditioned depth and normal generation is a process that recovers dense per-pixel geometry aligned with RGB imagery.
  • It leverages diffusion models, latent VAEs, and cross-modal attention to enforce spatial coherence and structural consistency among color, depth, and normal predictions.
  • The technique underpins applications like text-to-3D synthesis, monocular geometry prediction, and image inpainting, achieving state-of-the-art performance on multiple benchmarks.

Color-conditioned depth and normal generation refers to the process of producing scene geometry representations—specifically depth maps (D) and surface normal fields (N)—that are spatially coherent and structurally aligned with a reference color (RGB) image. This paradigm is central to monocular geometry estimation, text-to-3D synthesis, image-to-geometry inpainting, neural rendering, and perception systems operating in complex or degenerate conditions (e.g., transparency, sparse depth).

1. Problem Formulation and Motivation

Color-conditioned geometry generation aims to recover, from a single RGB image, a dense per-pixel depth map and surface normal field that are structurally consistent with the image’s content. In recent advances, diffusion-based models, latent-variable VAEs, and cross-modal attention mechanisms have supplanted earlier deterministic regression approaches. The principal challenge is to ensure tight spatial alignment (“cross-modal consistency”) and intrinsic coherence among color, depth, and normal predictions, by leveraging their natural mutual dependencies.

Traditional models estimate depth or normals independently from images, which can lead to geometric and photometric misalignment. Joint or color-conditioned models instead fuse color and geometry in a unified, structured manner (Krishnan et al., 22 Jan 2025, Qiu et al., 2023, Han et al., 2 Dec 2025, Qiu et al., 2018, Xu et al., 29 Dec 2025).

2. Diffusion and Latent Variable Architectures for Joint Generation

State-of-the-art diffusion-based frameworks approach this problem by (1) encoding images, depth, and normals into a shared latent space using a VAE and (2) defining a generative (denoising) process in this space, conditioned on the color latent. The diffusion process can target single frames or video sequences and is trained by stochastic gradient descent on a mean-squared noise/velocity matching objective.
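The noise and velocity matching targets mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the training code of any cited model: the linear beta schedule, the latent shape, and the helper names (`make_alpha_bar`, `add_noise`, `velocity_target`, `matching_loss`) are all hypothetical, and the network is replaced by stand-in predictions.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal retention for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def add_noise(x0, eps, alpha_bar_t):
    """Forward process: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

def velocity_target(x0, eps, alpha_bar_t):
    """Velocity target: v = sqrt(a) * eps - sqrt(1 - a) * x0."""
    return np.sqrt(alpha_bar_t) * eps - np.sqrt(1.0 - alpha_bar_t) * x0

def matching_loss(prediction, target):
    """Mean-squared error between a network output and its target."""
    return float(np.mean((prediction - target) ** 2))

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
z_geo = rng.standard_normal((8, 16, 16))  # hypothetical clean depth+normal latent
eps = rng.standard_normal(z_geo.shape)
t = 500
z_t = add_noise(z_geo, eps, alpha_bar[t])
v = velocity_target(z_geo, eps, alpha_bar[t])

# Stand-ins for a denoiser: a perfect noise predictor and an all-zero predictor.
loss_perfect = matching_loss(eps, eps)
loss_zero = matching_loss(np.zeros_like(eps), eps)

# With the velocity parameterization, the clean latent is recoverable from (z_t, v).
z_recovered = np.sqrt(alpha_bar[t]) * z_t - np.sqrt(1.0 - alpha_bar[t]) * v
```

In a real model the prediction would come from a denoiser that also receives the color latent as conditioning; only the target construction and the loss are shown here.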

Key model and training elements

| Model | Conditioning Pathway | Joint Modeling |
|---|---|---|
| Orchid (Krishnan et al., 22 Jan 2025) | VAE joint encoding x_c, x_d, x_n | LDM in shared z |
| RichDreamer (Qiu et al., 2023) | Latent concat + U-Net | Text or image→(N,D) |
| LumiX (Han et al., 2 Dec 2025) | Color latent input, QBA self-attn | Multi-map U-Net |
| DKT (Xu et al., 29 Dec 2025) | Video VAE, DiT, concat RGB+depth | Spatiotemporal |
| DeepLiDAR (Qiu et al., 2018) | Parallel fusion (convnet) | Normal-guided |

Orchid, RichDreamer, and LumiX introduce multi-modal or multi-map latent diffusion, allowing zero-shot or image-conditioned generation of geometry. In these models, the color latent may be injected into the diffusion U-Net by concatenation, channel-wise fusion in early layers, or via cross-attention. Temporal methods (DKT) extend this principle to video and transparent or reflective domains.
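Of the injection routes listed above, concatenation is the simplest to illustrate. The sketch below assumes channel-first latents of shape (C, H, W) with four channels per modality; the function name `concat_condition` and all shapes are hypothetical and not taken from any of the cited models.

```python
import numpy as np

def concat_condition(z_color, z_depth_t, z_normal_t):
    """Stack a clean color latent with noisy geometry latents along channels."""
    if not (z_color.shape[1:] == z_depth_t.shape[1:] == z_normal_t.shape[1:]):
        raise ValueError("all latents must share the same spatial size")
    return np.concatenate([z_color, z_depth_t, z_normal_t], axis=0)

rng = np.random.default_rng(0)
z_color = rng.standard_normal((4, 32, 32))     # clean color latent (conditioning)
z_depth_t = rng.standard_normal((4, 32, 32))   # noisy depth latent at step t
z_normal_t = rng.standard_normal((4, 32, 32))  # noisy normal latent at step t

denoiser_input = concat_condition(z_color, z_depth_t, z_normal_t)
```

The color channels stay noise-free at every step, so the denoiser always sees the exact image content it must align the geometry to. Cross-attention conditioning would instead pass the color latent as keys and values rather than as extra input channels.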

3. Cross-Modal Conditioning and Attention Mechanisms

Color guidance is critical to ensure that depth and normal estimates respect image boundaries, appearance cues, and semantic edges. Several conditioning strategies have been formalized, among them the latent concatenation, early-layer channel-wise fusion, and cross-attention pathways noted in Section 2.

These mechanisms collectively enforce edge alignment and cross-modal structural consistency, and they improve appearance-geometry correlation.

4. Loss Functions and Training Objectives

Training objectives are structured to penalize cross-modal inconsistency, enforce geometric plausibility, and regularize statistical priors:

  • Denoising/diffusion loss:

$$\mathcal{L}_{\rm diff} = \mathbb{E}_{t,\epsilon,x_0} \|\epsilon - \epsilon_\theta(x_t, t, \text{cond})\|^2$$

for all map modalities, where the conditioning can be textual, color, or camera-based.

  • KL and Latent Regularization: VAE latent KL divergence and, where applicable, latent-distillation loss to maintain VAE stability (Krishnan et al., 22 Jan 2025).
  • Geometry-Consistency and Edge Losses: Additional depth-normal orthogonality term $\mathbb{E}_p|\nabla D(p)\cdot N(p)|$ and unit-normal constraint $\mathbb{E}_p(\|N(p)\|_2-1)^2$ (Han et al., 2 Dec 2025).
  • Appearance-Guided Losses: For color-conditioned learning, RGB and albedo priors are imposed through further diffusion models (Qiu et al., 2023).
  • Inpainting and Masked Losses: Latent- or pixel-space masking for joint inpainting, e.g., RePaint-style, without the need for retraining (Krishnan et al., 22 Jan 2025).
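The geometry-consistency terms in the list above can be made concrete with a small NumPy sketch. The unit-normal penalty follows the stated formula directly. For the orthogonality term, the sketch takes one possible reading as an assumption: under an orthographic camera, the surface tangents built from the depth gradient, (1, 0, ∂D/∂x) and (0, 1, ∂D/∂y), should be perpendicular to the predicted normal. The cited work may define the term differently, and the function names are hypothetical.

```python
import numpy as np

def unit_normal_penalty(normals):
    """E_p (||N(p)||_2 - 1)^2 for a normal map of shape (H, W, 3)."""
    norms = np.linalg.norm(normals, axis=-1)
    return float(np.mean((norms - 1.0) ** 2))

def tangent_orthogonality_penalty(depth, normals):
    """Mean |t . N| over the two image-axis tangents of the depth surface.

    Assumes an orthographic camera, so tangents are (1, 0, dD/dx), (0, 1, dD/dy).
    """
    d_dy, d_dx = np.gradient(depth)  # finite differences along rows, then columns
    ones, zeros = np.ones_like(depth), np.zeros_like(depth)
    t_x = np.stack([ones, zeros, d_dx], axis=-1)
    t_y = np.stack([zeros, ones, d_dy], axis=-1)
    dot_x = np.abs(np.sum(t_x * normals, axis=-1))
    dot_y = np.abs(np.sum(t_y * normals, axis=-1))
    return float(np.mean(dot_x + dot_y))

# Synthetic check: a tilted plane and its exact unit normal.
h, w = 32, 48
ys, xs = np.mgrid[0:h, 0:w].astype(float)
a, b = 0.02, -0.01
depth = 2.0 + a * xs + b * ys
n = np.array([-a, -b, 1.0])
n /= np.linalg.norm(n)
normals = np.broadcast_to(n, (h, w, 3))

consistent = tangent_orthogonality_penalty(depth, normals)
flipped = tangent_orthogonality_penalty(depth, normals * np.array([-1.0, 1.0, 1.0]))
```

For the consistent plane both penalties vanish up to floating-point error, while mirroring the normal's x component makes the orthogonality penalty strictly positive.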

Loss aggregation weights and multi-stage training schedules are tuned to enforce stable reconstruction and guide the intermediate representations (e.g., normals in DeepLiDAR (Qiu et al., 2018)).

5. Applications: 3D Reconstruction, Perception, and Inpainting

Color-conditioned depth and normal generation models are foundational for:

  • Text-to-3D synthesis (Qiu et al., 2023, Han et al., 2 Dec 2025): Enabling pipelines to leverage language and image priors to jointly produce geometry and appearance for 3D object/scene modeling, with downstream mesh extraction and PBR material optimization.
  • Monocular geometry prediction (Krishnan et al., 22 Jan 2025, Qiu et al., 2018): Providing dense per-pixel geometry for indoor, outdoor, and transparent scenes, with state-of-the-art zero-shot depth and normal accuracy.
  • Intrinsic decomposition (Han et al., 2 Dec 2025, Qiu et al., 2023): Consistent estimation of albedo, normal, and depth maps from color images, useful for rendering, relighting, and image editing.
  • Video and transparency perception (Xu et al., 29 Dec 2025): Robust temporally-consistent geometry prediction for transparent and reflective materials, leveraged in manipulation, grasping, and robotics.
  • Image inpainting and completion (Krishnan et al., 22 Jan 2025): Joint infilling of missing color, depth, and normal regions, guided by the intrinsic structural relationships learned by the models.
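The RePaint-style inpainting mentioned for joint infilling rests on a simple blending rule applied at every reverse step: the known region is re-noised from the clean observation to the current noise level, and the unknown region is taken from the model's sample. The sketch below shows that single blend only, assuming a mask where 1 marks known entries; the function name and shapes are hypothetical, and RePaint's resampling loop is omitted.

```python
import numpy as np

def repaint_blend(x_known_0, x_generated_prev, mask, alpha_bar_prev, rng):
    """Merge a re-noised known region with the model sample for the unknown region.

    mask == 1 marks known entries; alpha_bar_prev is the cumulative signal
    retention at the step being produced.
    """
    eps = rng.standard_normal(x_known_0.shape)
    x_known_prev = (np.sqrt(alpha_bar_prev) * x_known_0
                    + np.sqrt(1.0 - alpha_bar_prev) * eps)
    return mask * x_known_prev + (1.0 - mask) * x_generated_prev

rng = np.random.default_rng(0)
latent_known = rng.standard_normal((12, 8, 8))   # hypothetical stacked color/depth/normal latent
model_sample = rng.standard_normal((12, 8, 8))   # stand-in for the denoiser's reverse-step output
mask = np.zeros((12, 8, 8))
mask[:4] = 1.0                                   # color channels observed, geometry missing

# At the final step (alpha_bar_prev = 1) the known region is restored exactly.
blended = repaint_blend(latent_known, model_sample, mask, 1.0, rng)
```

Because the rule only changes sampling, it can be applied to an already trained joint model, which matches the "without the need for retraining" property described earlier.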

A plausible implication is that advances in diffusion-based joint modeling have surpassed task-specific regression in both accuracy and cross-modal consistency, especially in under-constrained settings (e.g., sparse, occluded, or incomplete data).

6. Quantitative Results and Cross-Modal Consistency

Recent models report state-of-the-art results on several scene understanding benchmarks:

  • Orchid (Krishnan et al., 22 Jan 2025): Zero-shot depth on NYUv2 with AbsRel $\approx 5.7\%$ and $\delta_1 \approx 96.9\%$, mean normal error $15.2^\circ$ (60.6% $<11.25^\circ$), and depth-normal consistency error near $0.04$ (compared with the prior best).
  • RichDreamer (Qiu et al., 2023): NeRF-optimized geometry and appearance CLIP scores reported as state of the art (vs. prior 17.54/26.41), with LAION pretraining critical to geometric fidelity.
  • LumiX (Han et al., 2 Dec 2025): Reduces depth RMSE and AbsRel by 38–45% and lowers mean normal error (MAE) relative to baselines.
  • DKT (Xu et al., 29 Dec 2025): On ClearPose, REL and RMSE are reported against prior methods, along with temporally stable reconstructions in transparent scenes.
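For reference, the depth and normal metrics quoted in these results have standard definitions that are easy to state in code. The sketch below uses the common conventions (AbsRel as mean relative error, $\delta_1$ as the fraction of pixels whose prediction-to-truth ratio is within 1.25, and angular error in degrees between unit normals); the function names are hypothetical, and the valid-pixel masking and scale alignment that benchmark protocols typically apply are omitted.

```python
import numpy as np

def abs_rel(pred, gt):
    """Mean absolute relative depth error."""
    return float(np.mean(np.abs(pred - gt) / gt))

def delta1(pred, gt, threshold=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) below the threshold."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < threshold))

def normal_angular_errors_deg(pred, gt):
    """Per-pixel angle in degrees between predicted and reference normals."""
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=-1, keepdims=True)
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

# Tiny hand-checkable example.
gt_depth = np.array([1.0, 2.0, 4.0])
pred_depth = np.array([1.1, 2.0, 6.0])
gt_normals = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
pred_normals = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 0.0]])

errors = normal_angular_errors_deg(pred_normals, gt_normals)
mean_error = float(np.mean(errors))
within_11_25 = float(np.mean(errors < 11.25))
```

Here the relative errors are 0.1, 0, and 0.5, so AbsRel is 0.2 and two of three pixels pass the $\delta_1$ test; the normal errors are 0 and 90 degrees, giving a mean of 45 degrees with half the pixels under 11.25 degrees.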

User studies and ablations report that joint modeling, multi-modal latent diffusion, and dedicated cross-modal conditioning each contribute measurable improvements in realism and perceptual quality.

7. Architectural and Algorithmic Innovations

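Several innovations uniquely benefit color-conditioned depth and normal generation. Drawing on the models surveyed in Section 2, these include joint VAE encoding of color, depth, and normals into a shared latent space (Orchid), latent concatenation for text- or image-conditioned normal and depth generation (RichDreamer), QBA self-attention across multiple maps (LumiX), video VAE and DiT backbones for spatiotemporal prediction (DKT), and normal-guided parallel fusion (DeepLiDAR).

These advances collectively anchor the state of the art in color-conditioned geometric generation, spanning general scenes, transparency, and physically structured decomposition.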
