- The paper introduces MiDaS v3.1, which leverages transformer-based encoders to achieve up to a 28% improvement in monocular relative depth estimation quality.
- It integrates multiple vision transformers and convolutional backbones within an encoder-decoder framework to optimize performance and scalability.
- The enhanced model, validated across diverse benchmarks and datasets, offers broad applicability in fields like 3D reconstruction and autonomous driving.
MiDaS v3.1: Enhancements in Monocular Relative Depth Estimation
The paper "MiDaS v3.1 -- A Model Zoo for Robust Monocular Relative Depth Estimation" presents an advanced iteration of the MiDaS framework for monocular depth estimation. This release builds upon previous versions by incorporating a broader array of transformer-based encoder backbones, improving depth estimation quality and widening the range of available quality-runtime trade-offs.
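As a quick orientation, the sketch below shows how one of the released models could be loaded and run through PyTorch Hub. The model identifier and transform attribute ("DPT_BEiT_L_512", "dpt_transform") follow the naming used by the public intel-isl/MiDaS repository and are assumptions here rather than details taken from the paper; the repository exposes per-model transforms, so the exact attribute for the 512-resolution variant may differ.

```python
import cv2
import torch

# Load a MiDaS v3.1 model and an input transform via torch.hub.
# Names below follow the public intel-isl/MiDaS hub entry points (assumed).
model_type = "DPT_BEiT_L_512"
midas = torch.hub.load("intel-isl/MiDaS", model_type)
midas_transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = midas_transforms.dpt_transform  # assumed; swap for the 512-res transform if provided

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
midas.to(device).eval()

# Read an image (BGR -> RGB) and predict a relative (inverse) depth map.
img = cv2.cvtColor(cv2.imread("example.jpg"), cv2.COLOR_BGR2RGB)
with torch.no_grad():
    batch = transform(img).to(device)
    prediction = midas(batch)
    # Upsample the prediction back to the input resolution.
    prediction = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze()

depth = prediction.cpu().numpy()  # relative depth, not metric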
Research Contribution
MiDaS v3.1 introduces a variety of new models, leveraging recent vision transformers such as BEiT, Swin, SwinV2, Next-ViT, and LeViT, alongside convolutional backbones. These encoders were chosen to explore and capitalize on the performance-runtime trade-offs offered by different architecture types. The best-performing model improves depth estimation quality by 28% over the previous release.
Methodological Innovations
The architecture of MiDaS v3.1 retains the encoder-decoder paradigm but integrates state-of-the-art vision transformer encoders. These include hierarchical encoders whose attention mechanisms, already proven in other vision tasks, improve depth estimation quality.
Key methodological highlights include:
- Encoder Integration: The paper details the process of integrating these backbones, including encoder-decoder connections tailored to each specific architecture (see the sketch after this list).
- Dataset Augmentation: The training dataset mix has been expanded to include KITTI and NYU Depth v2, enhancing the models' applicability across diverse environments.
- Framework Scalability: The paper provides a general strategy for incorporating future backbones into MiDaS, emphasizing its adaptability to upcoming technological advances.
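To make the encoder-integration point concrete, here is a minimal sketch of the generic wiring: tap one feature map per encoder stage, project each to a common width, and fuse coarse-to-fine before a depth head. The backbone name (convnext_tiny from timm) and the plain convolutional fusion blocks are stand-ins for illustration; MiDaS v3.1 itself uses backbone-specific hooks and RefineNet-style fusion blocks.

```python
import timm
import torch
import torch.nn as nn
import torch.nn.functional as F


class DepthModelSketch(nn.Module):
    """Schematic MiDaS-style encoder-decoder wiring around a swappable backbone.

    Illustration only: the actual MiDaS v3.1 code attaches backbone-specific
    hooks and uses residual RefineNet-style fusion blocks in the decoder.
    """

    def __init__(self, backbone: str = "convnext_tiny", decoder_dim: int = 256):
        super().__init__()
        # features_only exposes one feature map per encoder stage.
        self.encoder = timm.create_model(backbone, pretrained=False, features_only=True)
        channels = self.encoder.feature_info.channels()  # e.g. [96, 192, 384, 768]
        # 1x1 projections bring every stage to a common decoder width.
        self.projections = nn.ModuleList(nn.Conv2d(c, decoder_dim, 1) for c in channels)
        # One simple fusion conv per skip connection (coarse-to-fine).
        self.fusions = nn.ModuleList(
            nn.Conv2d(decoder_dim, decoder_dim, 3, padding=1) for _ in channels[:-1]
        )
        self.head = nn.Conv2d(decoder_dim, 1, 3, padding=1)

    def forward(self, x):
        feats = [proj(f) for proj, f in zip(self.projections, self.encoder(x))]
        out = feats[-1]  # start at the coarsest stage
        for fusion, skip in zip(self.fusions, reversed(feats[:-1])):
            out = F.interpolate(out, size=skip.shape[-2:], mode="bilinear", align_corners=False)
            out = fusion(out + skip)
        return self.head(out).squeeze(1)  # relative (inverse) depth map


if __name__ == "__main__":
    model = DepthModelSketch()
    print(model(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 64, 64])
```

Swapping in a different hierarchical backbone then amounts to changing the timm model name and, where needed, adapting how its stage outputs are extracted, which is the scalability argument the paper makes.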
Numerical Results and Analysis
The research presents extensive evaluations across several benchmarks, including DIW, ETH3D, Sintel, KITTI, NYU Depth v2, and TUM, utilizing metrics such as WHDR, REL, and δ1. The BEiT512-L model consistently achieved superior performance, underscoring its improvement over previous iterations.
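The depth metrics themselves are standard. Below is a minimal sketch of REL (absolute relative error) and δ1, preceded by the scale-and-shift alignment that relative depth evaluation typically requires; WHDR, used for DIW, instead compares ordinal relations between point pairs and is omitted here. The exact masking and alignment protocol of the paper may differ from this sketch.

```python
import numpy as np


def align_scale_shift(pred, target, mask):
    """Least-squares scale/shift alignment of a relative prediction to the target.

    Assumes pred and target live in the same (inverse-)depth space; MiDaS-style
    relative depth is only defined up to such an affine transform.
    """
    p, t = pred[mask], target[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, t, rcond=None)
    return scale * pred + shift


def abs_rel(pred, target, mask):
    """REL / AbsRel: mean of |pred - target| / target over valid pixels."""
    p, t = pred[mask], target[mask]
    return np.mean(np.abs(p - t) / t)


def delta1(pred, target, mask, thresh=1.25):
    """delta_1: fraction of valid pixels with max(pred/target, target/pred) < 1.25."""
    p, t = pred[mask], target[mask]
    ratio = np.maximum(p / t, t / p)
    return np.mean(ratio < thresh)
```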
Moreover, the models accommodate a range of compute budgets, highlighted by lightweight variants such as the LeViT-based model, which provides efficient depth estimation on devices with limited computing power.
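A rough way to see this trade-off in practice is to time a forward pass at each model's native input resolution. The sketch below is a generic latency probe; the commented hub names follow the public MiDaS repository and are assumptions, and measured numbers will of course depend on hardware.

```python
import time
import torch


def mean_latency(model, input_size, device="cpu", runs=10):
    """Average forward-pass time in seconds at a square input resolution."""
    device = torch.device(device)
    model = model.to(device).eval()
    x = torch.randn(1, 3, input_size, input_size, device=device)
    with torch.no_grad():
        model(x)  # warm-up
        if device.type == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs


# Hypothetical comparison of a heavy and a lightweight MiDaS v3.1 model
# (hub names as exposed by the intel-isl/MiDaS repository; assumed here):
# heavy = torch.hub.load("intel-isl/MiDaS", "DPT_BEiT_L_512")
# light = torch.hub.load("intel-isl/MiDaS", "DPT_LeViT_224")
# print(mean_latency(heavy, 512), mean_latency(light, 224))
```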
Implications and Future Directions
The improvements in MiDaS v3.1 have substantial implications for applications like generative AI, 3D reconstruction, and autonomous driving. Enhanced accuracy in relative depth estimation translates to more reliable inputs for downstream tasks, such as reconstructing large-scale 3D scenes or powering interactive environments.
Future work may focus on further refining these models, possibly integrating more diverse datasets for training to boost generalization. Additionally, exploring hybrid models combining transformers and convolutional elements could yield even more advanced depth estimation capabilities.
In conclusion, MiDaS v3.1 represents a significant step forward in monocular depth estimation. Its improved model zoo offers robust and versatile options for researchers and practitioners aiming to enhance AI systems across a spectrum of applications.