
Marigold Framework for Dense Prediction

Updated 31 January 2026
  • The Marigold framework repurposes latent diffusion models from text-to-image synthesis for dense image tasks such as depth estimation and surface normals prediction.
  • It adapts generative pretraining by fine-tuning only the denoising UNet while leveraging geometric and photometric priors, achieving robust zero-shot generalization on standard benchmarks.
  • The approach yields state-of-the-art performance on multiple modalities using synthetic data, addressing challenges such as monocular depth ambiguity and sparse observations.

The Marigold framework is an approach for repurposing large-scale latent diffusion models—originally trained for text-to-image synthesis (e.g., Stable Diffusion)—to perform dense image analysis tasks such as monocular depth estimation, depth completion, surface normals prediction, and intrinsic image decomposition. Marigold extracts and adapts the geometric and photometric priors learned during generative pretraining and demonstrates state-of-the-art zero-shot generalization on standard benchmarks using only synthetic data for fine-tuning while requiring minimal architectural modifications.

1. Theoretical Motivations and Overview

Monocular depth estimation is fundamentally ill-posed, as a single image can correspond to an infinite set of plausible 3D structures. Prior deep learning methods overcome this ambiguity by learning dataset-specific priors—such as object sizes, perspective cues, and occlusion relationships—but generalization to novel domains remains challenging. Marigold addresses this by leveraging the strong 3D priors implicit in latent diffusion models, which acquire scene semantics and geometric understanding via large-scale pretraining on internet image–text pairs (Ke et al., 2023, Ke et al., 14 May 2025).

By fine-tuning only the denoising UNet of a pretrained LDM (with the VAE frozen), Marigold transfers its rich internal representations toward dense prediction tasks while retaining out-of-distribution robustness. The diffusion-based iterative denoising process forces the network to infer plausible 3D scene layouts at every step, enabling more consistent results compared to direct regression approaches.

2. Core Architecture and Modality Conditioning

At its core, Marigold makes use of a pretrained latent diffusion model comprising a VAE encoder/decoder and a denoising UNet $\epsilon_\theta$. The VAE is modality-agnostic and able to encode any complete, noise-free 3-channel image (RGB, depth, normals, or intrinsic decomposition outputs) into a latent space $z$. For single-channel modalities (e.g., depth), the map is replicated to three channels before encoding. The UNet is adapted for conditional generation via channel-wise concatenation: the noisy latent for the target modality $z_t^{(d)}$ and the fixed RGB image latent $z^{(x)}$ are concatenated along the channel dimension. The first convolutional layer is widened accordingly and initialized by duplicating and halving its weights to maintain activation statistics (Ke et al., 14 May 2025).
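The weight-duplication trick for the widened first convolution can be sketched in a few lines of numpy. The shapes below (a 4-channel latent mapped to 320 features with 3×3 kernels, as in Stable Diffusion's first conv) and the toy single-patch "convolution" are illustrative assumptions, not Marigold's actual implementation:

```python
import numpy as np

def widen_first_conv(weight: np.ndarray) -> np.ndarray:
    """Duplicate the input-channel weights and halve them, so that feeding
    the same latent into both halves reproduces the original activations."""
    return np.concatenate([weight, weight], axis=1) / 2.0

def conv_response(weight, x):
    # Toy stand-in for one conv step: a single 'valid' 3x3 patch.
    patch = x[:, :3, :3]
    return np.tensordot(weight, patch, axes=([1, 2, 3], [0, 1, 2]))

rng = np.random.default_rng(0)
w = rng.standard_normal((320, 4, 3, 3))   # (out_ch, in_ch, kH, kW)
w_wide = widen_first_conv(w)              # now accepts 8 input channels

# Sanity check: concatenating the same latent twice along the channel
# axis leaves the pre-activation statistics unchanged.
z = rng.standard_normal((4, 8, 8))
z_cat = np.concatenate([z, z], axis=0)
assert np.allclose(conv_response(w, z), conv_response(w_wide, z_cat))
```

In practice the second half of the widened input carries the RGB latent rather than a copy, but this initialization guarantees the fine-tuning starts from the pretrained model's activation scale.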

The forward diffusion process for the target modality follows:

$$z^{(d)}_t = \sqrt{\bar\alpha_t}\, z^{(d)}_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$

The conditional UNet receives $(z^{(d)}_t, z^{(x)}, t)$ and predicts noise to reverse the diffusion, optimizing the DDPM objective:

$$\mathcal{L}_{\text{diff}}(\theta) = \mathbb{E}_{z^{(d)}_0, \epsilon, t} \bigl\|\, \epsilon - \epsilon_\theta(z^{(d)}_t, z^{(x)}, t) \bigr\|^2_2.$$
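The forward-diffusion step and the noise-prediction loss above can be demonstrated with a toy numpy sketch. The linear $\bar\alpha_t$ schedule and latent shape here are illustrative placeholders, not the values used by Marigold or Stable Diffusion:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative noise schedule: alpha_bar decreases monotonically in t.
T = 1000
alpha_bar = np.linspace(0.9999, 0.0001, T)

def forward_diffuse(z0, t, eps):
    """z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps."""
    a = alpha_bar[t]
    return np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps

def ddpm_loss(eps_pred, eps_true):
    """Mean squared error between true and predicted noise."""
    return float(np.mean((eps_true - eps_pred) ** 2))

z0 = rng.standard_normal((4, 8, 8))   # clean target-modality latent
eps = rng.standard_normal(z0.shape)   # the noise to be predicted
zt = forward_diffuse(z0, t=500, eps=eps)

# A perfect noise predictor drives the objective to zero.
assert ddpm_loss(eps, eps) == 0.0
```

During training, a network $\epsilon_\theta(z_t^{(d)}, z^{(x)}, t)$ would stand in for the perfect predictor, and the loss is minimized over random $t$ and $\epsilon$.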

During inference for monocular depth, the final latent $z_0^{(d)}$ is decoded and aligned to metric space via affine correction:

$$D(x) = a\, m(x) + b, \qquad (a, b) = \arg\min_{a,\, b} \sum_i \bigl(a\, m_i + b - d_i^{\text{gt}}\bigr)^2.$$
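This affine alignment has a closed-form least-squares solution. A minimal numpy sketch on flattened valid pixels (the synthetic data below is purely illustrative):

```python
import numpy as np

def align_affine(pred: np.ndarray, gt: np.ndarray):
    """Least-squares (a, b) minimizing sum_i (a*pred_i + b - gt_i)^2,
    as used to map affine-invariant depth into metric space."""
    A = np.stack([pred, np.ones_like(pred)], axis=1)  # design matrix [m_i, 1]
    (a, b), *_ = np.linalg.lstsq(A, gt, rcond=None)
    return a, b

# Synthetic check: a known scale/shift is recovered exactly.
pred = np.linspace(0.1, 1.0, 50)          # affine-invariant prediction m(x)
gt = 3.0 * pred + 0.5                     # ground-truth metric depth
a, b = align_affine(pred, gt)
assert np.isclose(a, 3.0) and np.isclose(b, 0.5)
```

In evaluation, the fit is computed only over pixels with valid ground truth, so sparse or incomplete depth maps pose no obstacle.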

3. Training Protocols for Dense Prediction

Marigold is notable for relying exclusively on high-quality synthetic datasets (e.g., Hypersim and Virtual KITTI for depth; InteriorVerse, Sintel, and Hypersim for surface normals and intrinsic prediction) (Ke et al., 14 May 2025). Real depth maps are eschewed due to their typical incompleteness and noise. Target modality maps are normalized (affine-invariant for depth), preprocessed as appropriate for each modality, and then fed to the frozen VAE for encoding.

Fine-tuning is limited to the UNet using standard Adam, moderate learning rates (roughly $3\times10^{-5}$ to $6\times10^{-5}$), batch size 32, and typical augmentations (horizontal flip, color jitter, blur). Annealed multi-resolution noise schedules are employed for convergence speed and accuracy. Training is affordable, typically requiring 1–2.5 days on a single consumer GPU (e.g., RTX 4090).
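Multi-resolution ("pyramid") noise sums Gaussian noise drawn at several spatial scales, with coarser scales attenuated. The sketch below assumes power-of-two map sizes, nearest-neighbor upsampling, and a hypothetical `decay` parameterization; it illustrates the general technique, not Marigold's exact annealed schedule:

```python
import numpy as np

def multires_noise(shape, levels=4, decay=0.5, rng=None):
    """Sum coarse-to-fine Gaussian noise, upsampling each coarse level by
    nearest-neighbor repetition, then renormalize to unit variance."""
    rng = rng or np.random.default_rng()
    h, w = shape
    noise = rng.standard_normal(shape)
    weight = 1.0
    for lvl in range(1, levels):
        weight *= decay                       # attenuate coarser scales
        ch, cw = max(1, h >> lvl), max(1, w >> lvl)
        coarse = rng.standard_normal((ch, cw))
        up = np.kron(coarse, np.ones((h // ch, w // cw)))  # NN upsample
        noise += weight * up[:h, :w]
    return noise / np.std(noise)              # restore unit variance

n = multires_noise((64, 64), rng=np.random.default_rng(1))
assert n.shape == (64, 64)
```

The coarse components inject low-frequency structure into the noise, which has been reported to speed up convergence for image-like targets such as depth maps.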

4. Depth Completion: Marigold-DC and SteeredMarigold

Marigold-DC reframes depth completion as dense image-conditioned depth synthesis in latent space, informed by sparse depth observations cast as test-time guidance. The approach optimizes the depth latent and affine scale/shift parameters during iterative diffusion, anchored to sparse measurements using a per-step guidance loss:

$$\mathcal{L}(z_t, s, b) = \sum_{(i,j)\in\Omega} \bigl|\, s\,\phi_d(\hat{x}_0)_{i,j} + b - C_{i,j} \bigr| + \bigl( s\,\phi_d(\hat{x}_0)_{i,j} + b - C_{i,j} \bigr)^2,$$

where $\Omega$ is the set of pixels with valid sparse depth (Viola et al., 2024).
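The guidance term combines an L1 and a squared penalty over the valid sparse anchors. A minimal numpy sketch, where a decoded depth map stands in for $\phi_d(\hat{x}_0)$ and the 4×4 toy data is purely illustrative:

```python
import numpy as np

def guidance_loss(depth_pred, sparse_depth, valid_mask, s, b):
    """L1 + L2 penalty tying the affinely mapped prediction s*d + b
    to sparse measurements C over the valid-pixel set Omega."""
    d = s * depth_pred + b
    r = (d - sparse_depth)[valid_mask]        # residuals on Omega only
    return float(np.abs(r).sum() + (r ** 2).sum())

# Toy example: 4x4 prediction with 3 sparse anchor measurements.
pred = np.full((4, 4), 0.5)
sparse = np.zeros((4, 4))
mask = np.zeros((4, 4), dtype=bool)
sparse[0, 0] = sparse[1, 2] = sparse[3, 3] = 1.0
mask[0, 0] = mask[1, 2] = mask[3, 3] = True

# With s=2, b=0 the anchors are matched exactly and the loss vanishes.
assert guidance_loss(pred, sparse, mask, s=2.0, b=0.0) == 0.0
```

In Marigold-DC this loss is differentiated with respect to the depth latent and the scale/shift $(s, b)$ at every reverse-diffusion step, steering generation toward the measurements.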

SteeredMarigold employs plug-and-play steering by modifying the sample at each reverse diffusion step using both known and space-filling pseudo-observations, updating the generated depth relative to encoded sparse anchors via a soft correction in latent space (Gregorek et al., 2024). Both methods demonstrate robustness to irregular, extreme sparsity and large missing regions.

5. Experimental Evaluation and Zero-Shot Generalization

Comprehensive evaluations are performed in a zero-shot regime: trained solely on synthetic data, models are tested on real benchmarks (NYUv2, KITTI, ETH3D, ScanNet, DIODE). Metrics include AbsRel and $\delta_1$ accuracy (depth), mean angular error and the percentage of normals within $11.25^\circ$ (normals), and PSNR/SSIM/LPIPS (intrinsics). Marigold achieves state-of-the-art or superior performance across all tested modalities.
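The standard depth metrics are simple to state precisely. A numpy sketch over flattened valid pixels (the sample values are illustrative only):

```python
import numpy as np

def abs_rel(pred, gt):
    """AbsRel: mean of |pred - gt| / gt over valid pixels."""
    return float(np.mean(np.abs(pred - gt) / gt))

def delta1(pred, gt, thresh=1.25):
    """delta_1: fraction of pixels with max(pred/gt, gt/pred) < 1.25."""
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < thresh))

gt = np.array([1.0, 2.0, 4.0, 8.0])
pred = np.array([1.1, 2.0, 3.0, 8.0])

assert np.isclose(abs_rel(pred, gt), np.mean([0.1, 0.0, 0.25, 0.0]))
assert delta1(pred, gt) == 0.75   # the 3.0 vs 4.0 pixel fails the 1.25 test
```

For affine-invariant methods such as Marigold, predictions are first aligned to ground truth with the least-squares scale/shift fit before these metrics are computed.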

Depth results:

  • NYUv2: Marigold-Depth ensemble achieves AbsRel = 5.5% vs HDN's 6.9–9.8%, often with a ~20% relative improvement on select datasets (Ke et al., 2023, Ke et al., 14 May 2025).
  • KITTI: Performance matches top affine-invariant methods (AbsRel ≈ 10%) without real depth training.

Surface normals:

  • Marigold-Normals achieves the lowest mean angular error and the highest percentage of normals within $11.25^\circ$ on most benchmarks, outperforming methods with much larger real training sets.
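The surface-normal metrics are likewise easy to pin down: the angle between predicted and ground-truth unit normals, averaged or thresholded. A numpy sketch over per-pixel normal vectors (the toy inputs are illustrative):

```python
import numpy as np

def angular_error_deg(pred, gt):
    """Per-pixel angle (degrees) between unit normals; pred, gt: (N, 3)."""
    cos = np.clip(np.sum(pred * gt, axis=1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))

def mean_angular_error_deg(pred, gt):
    return float(angular_error_deg(pred, gt).mean())

def pct_within(pred, gt, deg=11.25):
    """Fraction of pixels whose angular error is below `deg` degrees."""
    return float(np.mean(angular_error_deg(pred, gt) < deg))

n = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
assert mean_angular_error_deg(n, n) == 0.0
assert pct_within(n, n) == 1.0
```

The clipping guards against floating-point dot products marginally outside $[-1, 1]$, which would otherwise make `arccos` return NaN.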

Depth completion (Marigold-DC, SteeredMarigold):

  • NYUv2, ScanNet, VOID, iBims-1, KITTI DC, DDAD: Marigold-DC with ensemble yields best or near-best MAE/RMSE.
  • SteeredMarigold is robust against large central missing regions (NYUv2 RMSE = 0.19–0.26 m, where competing methods degrade to RMSE ≈ 1.3 m) (Gregorek et al., 2024).

Intrinsic decomposition:

  • Marigold-IID models match or exceed published diffusion-based IID methods on appearance and lighting benchmarks.

6. Limitations and Practical Considerations

Key limitations include run-time overhead (diffusion inference is slower than feed-forward nets; ensembling increases cost), stochastic outputs requiring ensembling for stability, reliance on high-quality synthetic training data (modalities lacking realistic simulators remain challenging), and resolution biases inherited from the base LDM, whose native processing resolutions are typically well below 1 Kpx per side. Metric depth prediction remains dependent on external cues; all outputs are affine-invariant by design (Ke et al., 2023, Ke et al., 14 May 2025). SteeredMarigold's soft steering does not enforce hard depth constraints; optimizing explicit constraint layers is an open direction.

7. Extensions and Future Directions

Research directions include:

  • Accelerating inference via diffusion distillation (e.g., 2–4 step emulators, LCM distillation).
  • Hybrid approaches combining Marigold’s priors with camera-intrinsic estimation for metric scale recovery.
  • Multi-view geometric consistency constraints to further refine spatial predictions.
  • Multi-task fine-tuning to extend cross-modal consistency for segmentation, instance masks, or video prediction.
  • Hard-constrained steering for depth completion to enforce anchor fidelity at inference (Ke et al., 2023, Gregorek et al., 2024, Viola et al., 2024, Ke et al., 14 May 2025).

A plausible implication is that the generative priors embedded in pretrained LDMs can unify a wide class of dense vision problems under conditional diffusion, demonstrating strong generalization from limited synthetic supervision. This suggests rethinking classical depth completion and monocular dense prediction as conditional generation anchored to sparse observations.
