- The paper introduces a novel adaptation of diffusion processes using x0-prediction to reduce variance and improve dense prediction accuracy.
- It refines the standard multi-step process into a single-step procedure and employs a detail preserver strategy for enhanced efficiency and detail retention.
- Empirical results demonstrate that Lotus outperforms existing methods in zero-shot depth estimation while enabling real-time applications with significantly less training data.
Insightful Overview of "Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction"
"Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction" by Jing He et al. introduces Lotus, a diffusion-based visual foundation model aimed at advancing dense prediction tasks such as zero-shot depth and normal estimation. The authors critically assess traditional diffusion formulations and provide strategic adaptations to enhance both the quality and efficiency of dense predictions, leading to State-of-The-Art (SoTA) performance.
The paper systematically analyzes and refines the standard diffusion process, traditionally optimized for image generation, for dense prediction tasks. The key insights derived from the analysis led to several significant modifications:
- Parameterization Type: The paper highlights the inadequacy of the conventional noise-prediction parameterization (ϵ-prediction) for dense prediction, owing to its large prediction variance at the initial denoising steps. Lotus instead adopts x0-prediction, which regresses the annotation directly and avoids this harmful variance, yielding more stable and accurate predictions (see the loss sketch after this list).
- Number of Time-Steps: The authors show that the original multi-step diffusion process adds unnecessary computational cost and propagates errors across denoising steps. By reformulating it as a single-step procedure, Lotus markedly simplifies optimization and accelerates inference (see the single-step sketch below).
- Detail Preserver: To counter the loss of fine details in dense prediction, the paper introduces a "detail preserver" strategy: a task switcher lets the model alternate between generating annotations and reconstructing the input image, preserving fine-grained detail and enhancing prediction accuracy (a task-switcher sketch follows the list).
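To make the parameterization point concrete, here is a minimal PyTorch sketch contrasting the two training objectives. The linear noise schedule and the generic `model(y_t, t, image_cond)` interface are assumptions for illustration, not the paper's code:

```python
import torch
import torch.nn.functional as F

# Assumed linear beta schedule; the actual schedule and T are
# hyper-parameters of the underlying Stable Diffusion backbone.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def noise_sample(y0, t):
    # Forward diffusion: y_t = sqrt(a_bar)*y0 + sqrt(1-a_bar)*eps
    eps = torch.randn_like(y0)
    ab = alphas_bar[t].view(-1, 1, 1, 1)
    return ab.sqrt() * y0 + (1.0 - ab).sqrt() * eps, eps

def eps_prediction_loss(model, y0, t, image_cond):
    # Standard eps-prediction: regress the injected noise. At large t the
    # signal is nearly gone, so small errors in the predicted noise amplify
    # into large variance in the recovered annotation -- the failure mode
    # the paper analyzes.
    y_t, eps = noise_sample(y0, t)
    return F.mse_loss(model(y_t, t, image_cond), eps)

def x0_prediction_loss(model, y0, t, image_cond):
    # x0-prediction: regress the clean annotation y0 directly,
    # sidestepping the variance amplification above.
    y_t, _ = noise_sample(y0, t)
    return F.mse_loss(model(y_t, t, image_cond), y0)
```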
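The single-step reformulation then collapses inference to one forward pass at the fixed time-step t = T. A hedged sketch, reusing `T` and the assumed `model` interface from above (the deterministic variant additionally replaces the noise input; names here are illustrative, not the paper's API):

```python
@torch.no_grad()
def single_step_predict(model, image_latent):
    # One denoising step at t = T: with x0-prediction the network maps
    # (noise, image condition) straight to the annotation map.
    t = torch.full((image_latent.shape[0],), T - 1, dtype=torch.long)
    y_T = torch.randn_like(image_latent)  # stochastic input (Lotus-G style)
    return model(y_T, t, image_latent)    # predicted clean annotation
```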
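Finally, one plausible rendering of the detail preserver is a learnable task switch that tells a shared backbone whether to produce the annotation or reconstruct the input image; the `extra_emb` hook on the backbone is an assumption made for this sketch:

```python
class DetailPreserver(torch.nn.Module):
    # Sketch of the task-switcher idea: a two-entry embedding (0 = predict
    # annotation, 1 = reconstruct input) conditions a shared backbone, so
    # the reconstruction task pushes the model to retain fine detail.
    ANNOTATE, RECONSTRUCT = 0, 1

    def __init__(self, backbone, emb_dim):
        super().__init__()
        self.backbone = backbone
        self.task_emb = torch.nn.Embedding(2, emb_dim)

    def forward(self, y_t, t, image_cond, task_id):
        ids = torch.full((y_t.shape[0],), task_id, dtype=torch.long)
        # Assumed hook: backbone accepts an extra conditioning embedding.
        return self.backbone(y_t, t, image_cond, extra_emb=self.task_emb(ids))
```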
The empirical results are compelling. Lotus-G (the generative variant) performs strongly across key geometric benchmarks. In zero-shot depth estimation, Lotus achieves the best average rank across datasets including NYUv2, KITTI, ETH3D, and ScanNet. On NYUv2, Lotus-G attains an AbsRel error of 5.4 (lower is better), outperforming Marigold (5.5) and approaching DepthAnything (4.3) despite using considerably less training data (59K vs. 62.6M images for DepthAnything).
Additionally, the proposed single-step diffusion process leads to dramatic efficiency gains. Lotus is reported to be hundreds of times faster than other diffusion-based methods. This efficiency opens up various practical applications, such as joint estimation tasks and single/multi-view 3D reconstruction, while ensuring real-time applicability.
On the theoretical front, this work underscores the necessity of revisiting and customizing foundational model formulations when adapting them to new tasks. The systematic dissection of noise prediction and multi-step processes reveals potential oversights in adopting pre-trained models uncritically. Such insights are invaluable for future explorations in adapting generative models for diverse predictive tasks.
Future Developments
Several exciting directions emerge from this work:
- Scaling Training Data: Similar to DepthAnything, the performance of Lotus may be further enhanced by scaling up the training datasets. This could unlock higher levels of accuracy and robustness.
- Expanding to Other Dense Prediction Tasks: While primarily focused on depth and normal estimation, the adaptable framework of Lotus can be extended to other pixel-level tasks like segmentation and image matting, potentially broadening its impact.
- Stochastic Predictions: Exploiting the stochastic nature of diffusion models, Lotus demonstrates the practical utility of generating uncertainty maps, which can be integral in applications requiring probabilistic interpretations (a short sketch follows this list).
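Because the generative variant is stochastic, an uncertainty map can be estimated by sampling several predictions under different noise draws and taking a per-pixel dispersion statistic. A minimal sketch, reusing the hypothetical `single_step_predict` from earlier:

```python
@torch.no_grad()
def uncertainty_map(model, image_latent, n_samples=10):
    # Repeated stochastic inference: the pixel-wise standard deviation
    # across samples serves as an (unnormalized) uncertainty estimate.
    preds = torch.stack([single_step_predict(model, image_latent)
                         for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)
```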
The findings and methodologies proposed by this paper pave the way for future explorations in visual predictive modeling, particularly in efficiently leveraging pre-trained diffusion models for accurate and real-time dense predictions.