
Marigold Framework for Dense Prediction

Updated 31 January 2026
  • The Marigold framework repurposes latent diffusion models from text-to-image synthesis for dense image tasks such as depth estimation and surface normals prediction.
  • It adapts generative pretraining by fine-tuning only the denoising UNet while leveraging geometric and photometric priors, achieving robust zero-shot generalization on standard benchmarks.
  • The approach yields state-of-the-art performance on multiple modalities using synthetic data, addressing challenges such as monocular depth ambiguity and sparse observations.

The Marigold framework is an approach for repurposing large-scale latent diffusion models—originally trained for text-to-image synthesis (e.g., Stable Diffusion)—to perform dense image analysis tasks such as monocular depth estimation, depth completion, surface normals prediction, and intrinsic image decomposition. Marigold extracts and adapts the geometric and photometric priors learned during generative pretraining and demonstrates state-of-the-art zero-shot generalization on standard benchmarks using only synthetic data for fine-tuning while requiring minimal architectural modifications.

1. Theoretical Motivations and Overview

Monocular depth estimation is fundamentally ill-posed, as a single image can correspond to an infinite set of plausible 3D structures. Prior deep learning methods overcome this ambiguity by learning dataset-specific priors—such as object sizes, perspective cues, and occlusion relationships—but generalization to novel domains remains challenging. Marigold addresses this by leveraging the strong 3D priors implicit in latent diffusion models, which acquire scene semantics and geometric understanding via large-scale pretraining on internet image–text pairs (Ke et al., 2023, Ke et al., 14 May 2025).

By fine-tuning only the denoising UNet of a pretrained LDM (with the VAE frozen), Marigold transfers its rich internal representations toward dense prediction tasks while retaining out-of-distribution robustness. The diffusion-based iterative denoising process forces the network to infer plausible 3D scene layouts at every step, enabling more consistent results compared to direct regression approaches.

2. Core Architecture and Modality Conditioning

At its core, Marigold makes use of a pretrained latent diffusion model comprising a VAE encoder/decoder and a denoising UNet $\epsilon_\theta$. The VAE is modality-agnostic and able to encode any complete, noise-free 3-channel image (RGB, depth, normals, or intrinsic decomposition outputs) into a latent space $z$. For single-channel modalities (e.g., depth), the map is replicated to three channels before encoding. The UNet is adapted for conditional generation via channel-wise concatenation: the noisy latent for the target modality $z_t^{(d)}$ and the fixed RGB image latent $z^{(x)}$ are concatenated along the channel dimension. The first convolutional layer is widened accordingly and initialized by duplicating and halving its weights to maintain activation statistics (Ke et al., 14 May 2025).
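The weight-duplication trick for the widened first convolution can be sketched in a few lines of numpy. The shapes below (a 4-channel latent mapped to 320 features with 3×3 kernels, as in Stable Diffusion's first conv) and the toy single-patch "convolution" are illustrative assumptions, not Marigold's actual implementation:

```python
import numpy as np

def widen_first_conv(weight: np.ndarray) -> np.ndarray:
    """Duplicate the input-channel weights and halve them, so that feeding
    the same latent into both halves reproduces the original activations."""
    return np.concatenate([weight, weight], axis=1) / 2.0

def conv_response(weight, x):
    # Toy stand-in for one conv step: a single 'valid' 3x3 patch.
    patch = x[:, :3, :3]
    return np.tensordot(weight, patch, axes=([1, 2, 3], [0, 1, 2]))

rng = np.random.default_rng(0)
w = rng.standard_normal((320, 4, 3, 3))   # (out_ch, in_ch, kH, kW)
w_wide = widen_first_conv(w)              # now accepts 8 input channels

# Sanity check: concatenating the same latent twice along the channel
# axis leaves the pre-activation statistics unchanged.
z = rng.standard_normal((4, 8, 8))
z_cat = np.concatenate([z, z], axis=0)
assert np.allclose(conv_response(w, z), conv_response(w_wide, z_cat))
```

In practice the second half of the widened input carries the RGB latent rather than a copy, but this initialization guarantees the fine-tuning starts from the pretrained model's activation scale.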

The forward diffusion process for the target modality follows:

$$z^{(d)}_t = \sqrt{\bar\alpha_t}\, z^{(d)}_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$

The conditional UNet receives $(z^{(d)}_t, z^{(x)}, t)$ and predicts noise to reverse the diffusion, optimizing the DDPM objective:

$$\mathcal{L}_{\text{diff}}(\theta) = \mathbb{E}_{z^{(d)}_0, \epsilon, t} \bigl\|\, \epsilon - \epsilon_\theta(z^{(d)}_t, z^{(x)}, t) \bigr\|^2_2.$$
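The forward-diffusion step and the noise-prediction loss above can be demonstrated with a toy numpy sketch. The linear $\bar\alpha_t$ schedule and latent shape here are illustrative placeholders, not the values used by Marigold or Stable Diffusion:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative noise schedule: alpha_bar decreases monotonically in t.
T = 1000
alpha_bar = np.linspace(0.9999, 0.0001, T)

def forward_diffuse(z0, t, eps):
    """z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps."""
    a = alpha_bar[t]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps

def ddpm_loss(eps_pred, eps_true):
    """Mean squared error between true and predicted noise."""
    return float(np.mean((eps_true - eps_pred) ** 2))

z0 = rng.standard_normal((4, 8, 8))   # clean target-modality latent
eps = rng.standard_normal(z0.shape)   # the noise to be predicted
zt = forward_diffuse(z0, t=500, eps=eps)

# A perfect noise predictor drives the objective to zero.
assert ddpm_loss(eps, eps) == 0.0
```

During training, a network $\epsilon_\theta(z_t^{(d)}, z^{(x)}, t)$ would stand in for the perfect predictor, and the loss is minimized over random $t$ and $\epsilon$.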

During inference for monocular depth, the final latent $z_0^{(d)}$ is decoded and aligned to metric space via affine correction:

$$D(x) = a\, m(x) + b, \qquad (a, b) = \arg\min_{a,\, b} \sum_i \bigl(a\, m_i + b - d_i^{\text{gt}}\bigr)^2.$$
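This affine alignment has a closed-form least-squares solution. A minimal numpy sketch on flattened valid pixels (the synthetic data below is purely illustrative):

```python
import numpy as np

def align_affine(pred: np.ndarray, gt: np.ndarray):
    """Least-squares (a, b) minimizing sum_i (a*pred_i + b - gt_i)^2,
    as used to map affine-invariant depth into metric space."""
    A = np.stack([pred, np.ones_like(pred)], axis=1)  # design matrix [m_i, 1]
    (a, b), *_ = np.linalg.lstsq(A, gt, rcond=None)
    return a, b

# Synthetic check: a known scale/shift is recovered exactly.
pred = np.linspace(0.1, 1.0, 50)          # affine-invariant prediction m(x)
gt = 3.0 * pred + 0.5                     # ground-truth metric depth
a, b = align_affine(pred, gt)
assert np.isclose(a, 3.0) and np.isclose(b, 0.5)
```

In evaluation, the fit is computed only over pixels with valid ground truth, so sparse or incomplete depth maps pose no obstacle.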

3. Training Protocols for Dense Prediction

Marigold is notable for relying exclusively on high-quality synthetic datasets (e.g., Hypersim and Virtual KITTI for depth; InteriorVerse, Sintel, and Hypersim for surface normals and intrinsic prediction) (Ke et al., 14 May 2025). Real depth maps are eschewed due to their typical incompleteness and noise. Target modality maps are normalized (affine-invariant for depth), preprocessed as appropriate for each modality, and then fed to the frozen VAE for encoding.

Fine-tuning is limited to the UNet using standard Adam, moderate learning rates (roughly $3\times10^{-5}$ to $6\times10^{-5}$), batch size 32, and typical augmentations (horizontal flip, color jitter, blur). Annealed multi-resolution noise schedules are employed for convergence speed and accuracy. Training is affordable, typically requiring 1–2.5 days on a single consumer GPU (e.g., RTX 4090).
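Multi-resolution ("pyramid") noise sums Gaussian noise drawn at several spatial scales, with coarser scales attenuated. The sketch below assumes power-of-two map sizes, nearest-neighbor upsampling, and a hypothetical `decay` parameterization; it illustrates the general technique, not Marigold's exact annealed schedule:

```python
import numpy as np

def multires_noise(shape, levels=4, decay=0.5, rng=None):
    """Sum coarse-to-fine Gaussian noise, upsampling each coarse level by
    nearest-neighbor repetition, then renormalize to unit variance."""
    rng = rng or np.random.default_rng()
    h, w = shape
    noise = rng.standard_normal(shape)
    weight = 1.0
    for lvl in range(1, levels):
        weight *= decay                       # attenuate coarser scales
        ch, cw = max(1, h >> lvl), max(1, w >> lvl)
        coarse = rng.standard_normal((ch, cw))
        up = np.kron(coarse, np.ones((h // ch, w // cw)))  # NN upsample
        noise += weight * up[:h, :w]
    return noise / np.std(noise)              # restore unit variance

n = multires_noise((64, 64), rng=np.random.default_rng(1))
assert n.shape == (64, 64)
```

The coarse components inject low-frequency structure into the noise, which has been reported to speed up convergence for image-like targets such as depth maps.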

4. Depth Completion: Marigold-DC and SteeredMarigold

Marigold-DC reframes depth completion as dense image-conditioned depth synthesis in latent space, informed by sparse depth observations cast as test-time guidance. The approach optimizes the depth latent and affine scale/shift parameters during iterative diffusion, anchored to sparse measurements using a per-step guidance loss:

$$\mathcal{L}(z_t, s, b) = \sum_{(i,j)\in\Omega} \bigl|\, s\,\phi_d(\hat{x}_0)_{i,j} + b - C_{i,j} \bigr| + \bigl( s\,\phi_d(\hat{x}_0)_{i,j} + b - C_{i,j} \bigr)^2,$$

where $\Omega$ is the set of pixels with valid sparse depth (Viola et al., 2024).
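The guidance term combines an L1 and a squared penalty over the valid sparse anchors. A minimal numpy sketch, where a decoded depth map stands in for $\phi_d(\hat{x}_0)$ and the 4×4 toy data is purely illustrative:

```python
import numpy as np

def guidance_loss(depth_pred, sparse_depth, valid_mask, s, b):
    """L1 + L2 penalty tying the affinely mapped prediction s*d + b
    to sparse measurements C over the valid-pixel set Omega."""
    d = s * depth_pred + b
    r = (d - sparse_depth)[valid_mask]        # residuals on Omega only
    return float(np.abs(r).sum() + (r ** 2).sum())

# Toy example: 4x4 prediction with 3 sparse anchor measurements.
pred = np.full((4, 4), 0.5)
sparse = np.zeros((4, 4))
mask = np.zeros((4, 4), dtype=bool)
sparse[0, 0] = sparse[1, 2] = sparse[3, 3] = 1.0
mask[0, 0] = mask[1, 2] = mask[3, 3] = True

# With s=2, b=0 the anchors are matched exactly and the loss vanishes.
assert guidance_loss(pred, sparse, mask, s=2.0, b=0.0) == 0.0
```

In Marigold-DC this loss is differentiated with respect to the depth latent and the scale/shift $(s, b)$ at every reverse-diffusion step, steering generation toward the measurements.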

SteeredMarigold employs plug-and-play steering by modifying the sample at each reverse diffusion step using both known and space-filling pseudo-observations, updating the generated depth relative to encoded sparse anchors via a soft correction in latent space (Gregorek et al., 2024). Both methods demonstrate robustness to irregular, extreme sparsity and large missing regions.

5. Experimental Evaluation and Zero-Shot Generalization

Comprehensive evaluations are performed in a zero-shot regime: trained solely on synthetic data, models are tested on real benchmarks (NYUv2, KITTI, ETH3D, ScanNet, DIODE). Metrics include AbsRel and $\delta_1$ accuracy (depth), mean angular error and the percentage of normals within $11.25^\circ$ (normals), and PSNR/SSIM/LPIPS (intrinsics). Marigold achieves state-of-the-art or superior performance across all tested modalities.
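The standard depth metrics are simple to state precisely. A numpy sketch over flattened valid pixels (the sample values are illustrative only):

```python
import numpy as np

def abs_rel(pred, gt):
    """AbsRel: mean of |pred - gt| / gt over valid pixels."""
    return float(np.mean(np.abs(pred - gt) / gt))

def delta1(pred, gt, thresh=1.25):
    """delta_1: fraction of pixels with max(pred/gt, gt/pred) < 1.25."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < thresh))

gt = np.array([1.0, 2.0, 4.0, 8.0])
pred = np.array([1.1, 2.0, 3.0, 8.0])

assert np.isclose(abs_rel(pred, gt), np.mean([0.1, 0.0, 0.25, 0.0]))
assert delta1(pred, gt) == 0.75   # the 3.0 vs 4.0 pixel fails the 1.25 test
```

For affine-invariant methods such as Marigold, predictions are first aligned to ground truth with the least-squares scale/shift fit before these metrics are computed.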

Depth results:

  • NYUv2: Marigold-Depth ensemble achieves AbsRel = 5.5% vs HDN's 6.9–9.8%, often with a ~20% relative improvement on select datasets (Ke et al., 2023, Ke et al., 14 May 2025).
  • KITTI: Performance matches top affine-invariant methods (AbsRel ≈ 10%) without real depth training.

Surface normals:

  • Marigold-Normals achieves the lowest mean angular error and the highest percentage of normals within $11.25^\circ$ on most benchmarks, outperforming methods with much larger real training sets.
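The surface-normal metrics are likewise easy to pin down: the angle between predicted and ground-truth unit normals, averaged or thresholded. A numpy sketch over per-pixel normal vectors (the toy inputs are illustrative):

```python
import numpy as np

def angular_error_deg(pred, gt):
    """Per-pixel angle (degrees) between unit normals; pred, gt: (N, 3)."""
    cos = np.clip(np.sum(pred * gt, axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def mean_angular_error_deg(pred, gt):
    return float(angular_error_deg(pred, gt).mean())

def pct_within(pred, gt, deg=11.25):
    """Fraction of pixels whose angular error is below `deg` degrees."""
    return float(np.mean(angular_error_deg(pred, gt) < deg))

n = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
assert mean_angular_error_deg(n, n) == 0.0
assert pct_within(n, n) == 1.0
```

The clipping guards against floating-point dot products marginally outside $[-1, 1]$, which would otherwise make `arccos` return NaN.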

Depth completion (Marigold-DC, SteeredMarigold):

  • NYUv2, ScanNet, VOID, iBims-1, KITTI DC, DDAD: Marigold-DC with ensemble yields best or near-best MAE/RMSE.
  • SteeredMarigold is robust against large central missing regions (NYUv2 RMSE = 0.19–0.26 m, where competing methods degrade to RMSE ≈ 1.3 m) (Gregorek et al., 2024).

Intrinsic decomposition:

  • Marigold-IID models match or exceed published diffusion-based IID methods on appearance and lighting benchmarks.

6. Limitations and Practical Considerations

Key limitations include run-time overhead (diffusion inference is slower than feed-forward nets; ensembling increases cost), stochastic outputs requiring ensembling for stability, reliance on high-quality synthetic training data (modalities lacking realistic simulators remain challenging), and resolution biases inherited from the base LDM, whose native processing resolutions are typically well below 1 Kpx per side. Metric depth prediction remains dependent on external cues; all outputs are affine-invariant by design (Ke et al., 2023, Ke et al., 14 May 2025). SteeredMarigold's soft steering does not enforce hard depth constraints; optimizing explicit constraint layers is an open direction.

7. Extensions and Future Directions

Research directions include:

  • Accelerating inference via diffusion distillation (e.g., 2–4 step emulators, LCM distillation).
  • Hybrid approaches combining Marigold’s priors with camera-intrinsic estimation for metric scale recovery.
  • Multi-view geometric consistency constraints to further refine spatial predictions.
  • Multi-task fine-tuning to extend cross-modal consistency for segmentation, instance masks, or video prediction.
  • Hard-constrained steering for depth completion to enforce anchor fidelity at inference (Ke et al., 2023, Gregorek et al., 2024, Viola et al., 2024, Ke et al., 14 May 2025).

A plausible implication is that the generative priors embedded in pretrained LDMs can unify a wide class of dense vision problems under conditional diffusion, demonstrating strong generalization from limited synthetic supervision. This suggests rethinking classical depth completion and monocular dense prediction as conditional generation anchored to sparse observations.
