
Any360D: Towards 360 Depth Anything with Unlabeled 360 Data and Möbius Spatial Augmentation (2406.13378v1)

Published 19 Jun 2024 in cs.CV

Abstract: Recently, Depth Anything Model (DAM) - a type of depth foundation model - reveals impressive zero-shot capacity for diverse perspective images. Despite its success, it remains an open question regarding DAM's performance on 360 images that enjoy a large field-of-view (180°×360°) but suffer from spherical distortions. To this end, we establish, to our knowledge, the first benchmark that aims to 1) evaluate the performance of DAM on 360 images and 2) develop a powerful 360 DAM for the benefit of the community. For this, we conduct a large suite of experiments that consider the key properties of 360 images, e.g., different 360 representations, various spatial transformations, and diverse indoor and outdoor scenes. This way, our benchmark unveils some key findings, e.g., DAM is less effective for diverse 360 scenes and sensitive to spatial transformations. To address these challenges, we first collect a large-scale unlabeled dataset including diverse indoor and outdoor scenes. We then propose a semi-supervised learning (SSL) framework to learn a 360 DAM, dubbed Any360D. Under the umbrella of SSL, Any360D first learns a teacher model by fine-tuning DAM via metric depth supervision. Then, we train the student model by uncovering the potential of large-scale unlabeled data with pseudo labels from the teacher model. Möbius transformation-based spatial augmentation (MTSA) is proposed to impose consistency regularization between the unlabeled data and spatially transformed ones. This subtly improves the student model's robustness to various spatial transformations even under severe distortions. Extensive experiments demonstrate that Any360D outperforms DAM and many prior data-specific models, e.g., PanoFormer, across diverse scenes, showing impressive zero-shot capacity for being a 360 depth foundation model.

Authors (4)
  1. Zidong Cao (13 papers)
  2. Jinjing Zhu (15 papers)
  3. Weiming Zhang (135 papers)
  4. Lin Wang (403 papers)

Summary

  • The paper benchmarks DAM on 360 images, revealing ERP's superior zero-shot performance and challenges with spatial distortions.
  • The paper introduces Any360D, a semi-supervised framework that employs a teacher-student paradigm and Möbius Spatial Augmentation to enhance depth estimation.
  • Experiments show that Any360D significantly improves depth estimation in diverse indoor and outdoor scenes, benefiting VR and autonomous navigation.

Any360D: Towards 360 Depth Anything with Unlabeled 360 Data and Möbius Spatial Augmentation

The paper "Any360D: Towards 360 Depth Anything with Unlabeled 360 Data and Möbius Spatial Augmentation" explores advancements in depth estimation for 360-degree images—a challenging yet critical task in 3D scene perception with various applications, including virtual reality (VR) and autonomous driving.

Introduction and Motivation

360-degree cameras capture the surrounding environment in a single shot, providing extensive fields of view but facing significant spherical distortions. Existing monocular 360 depth estimation methods often suffer from limited datasets and struggle to generalize beyond indoor scenes. This paper aims to address these challenges by first evaluating the performance of the Depth Anything Model (DAM) on 360-degree images through a comprehensive benchmark and subsequently proposing a novel semi-supervised learning framework dubbed Any360D.

Key Contributions

  1. Benchmarking DAM on 360 Images:

The authors established the first benchmark analyzing DAM's performance on 360-degree images, considering factors such as image representations, spatial transformations, diverse indoor and outdoor scenes, optimization spaces, and backbone model sizes. Their key findings were:

  • ERP (equirectangular projection) representations show the best zero-shot capacity.
  • DAM's robustness to spatial transformations like zoom and vertical rotation is limited.
  • Metric depth supervision improves structural details at the equator compared to disparity supervision.
  • DAM performs well in specific scenes but struggles with objects positioned at the equator in some indoor and outdoor scenarios.
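To make the ERP representation concrete: an equirectangular projection stores the full 180°×360° field of view as a rectangular image whose columns map linearly to longitude and rows to latitude, which is also the source of the spherical distortion near the poles. The following sketch (not code from the paper) shows the standard pixel-to-spherical mapping under the assumption that pixel centers are sampled uniformly:

```python
import numpy as np

def erp_to_spherical(h, w):
    """Map an h-by-w ERP pixel grid to spherical coordinates.

    Columns span longitude theta in [-pi, pi) and rows span
    latitude phi in [-pi/2, pi/2], sampled at pixel centers.
    Illustrative helper; the paper does not specify this exact code.
    """
    theta = (np.arange(w) + 0.5) / w * 2 * np.pi - np.pi  # longitude per column
    phi = np.pi / 2 - (np.arange(h) + 0.5) / h * np.pi    # latitude per row
    return np.meshgrid(theta, phi)  # each output has shape (h, w)

theta, phi = erp_to_spherical(512, 1024)
```

Because a row of constant latitude near the poles covers far less physical area than the same row at the equator, perspective-trained models like DAM see heavily stretched content there, which is consistent with the benchmark's equator-versus-pole observations.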

  2. Semi-Supervised Learning Framework:

To overcome these limitations, the authors proposed Any360D, a semi-supervised framework that fine-tunes DAM using the benchmark findings and then exploits a large-scale unlabeled dataset. The approach involves:

  • Collecting a diverse dataset encompassing both indoor and outdoor scenes.
  • Introducing a teacher-student training paradigm, in which the teacher is obtained by fine-tuning DAM with metric depth supervision and then supplies pseudo labels for training the student on the unlabeled data.
  • Employing Möbius transformation-based spatial augmentation (MTSA) to impose consistency regularization between unlabeled images and their spatially transformed versions, improving robustness to spatial transformations.
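The geometric core of MTSA is the Möbius transformation, which acts on the sphere via stereographic projection: project sphere points to the complex plane, apply a conformal map f(z) = (az + b)/(cz + d), and project back. A minimal sketch of a zoom-type Möbius map (f(z) = s·z) follows; the function names and the specific parameterization are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sphere_to_plane(x, y, z):
    """Stereographic projection from the north pole (0, 0, 1)
    onto the complex plane."""
    return (x + 1j * y) / (1.0 - z)

def plane_to_sphere(w):
    """Inverse stereographic projection back to the unit sphere."""
    d = 1.0 + np.abs(w) ** 2
    return (2 * w.real / d, 2 * w.imag / d, (np.abs(w) ** 2 - 1) / d)

def mobius_zoom(w, s):
    """Zoom-type Mobius transformation f(w) = s * w (a=s, b=c=0, d=1).
    s > 1 zooms toward the north pole, s < 1 away from it."""
    return s * w

# Round trip with s = 1 recovers the original sphere point.
x, y, z = plane_to_sphere(mobius_zoom(sphere_to_plane(0.6, 0.0, -0.8), 1.0))
```

In the SSL setting described above, such a transform T is applied to an unlabeled ERP image, and the consistency term penalizes the difference between the student's prediction on T(image) and T applied to the teacher's pseudo label, which is what drives the robustness to zoom and rotation reported for Any360D.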

Experimental Setup and Results

Dataset Collection

The authors collected a substantial, diverse dataset named Diverse360, comprising both indoor and outdoor scenes, to facilitate fine-tuning and evaluation. The dataset included scenes of varied complexity, reflecting real-world applications and enhancing the evaluation scope.

Performance Evaluation

The paper presents a thorough evaluation involving SOTA monocular 360 depth estimation methods and validates Any360D's performance through both qualitative and quantitative metrics. Critically, it compares the results of DAM and Any360D under different configurations and transformations. The main observations are:

  • Evaluation Metrics: The paper used Absolute Relative Error (Abs Rel) and Root Mean Squared Error (RMSE) to assess performance.
  • Different Representations: ERP yielded the best results among various 360 image representations without any post-processing.
  • Spatial Transformations: The DAM faced pronounced performance degradation under significant vertical rotation and zoom levels. Any360D showed improved robustness due to MTSA.
  • Indoor and Outdoor Scenarios: Any360D outperformed DAM significantly in diverse scenes, demonstrating notable improvements in estimating depths for objects located at the equator.
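For reference, the two reported metrics have standard definitions: Abs Rel averages the per-pixel relative depth error, while RMSE penalizes large absolute errors more heavily. A short sketch using the conventional formulas (the paper names the metrics but this code is ours):

```python
import numpy as np

def abs_rel(pred, gt):
    """Absolute Relative Error: mean of |pred - gt| / gt over valid pixels."""
    return float(np.mean(np.abs(pred - gt) / gt))

def rmse(pred, gt):
    """Root Mean Squared Error in depth units (e.g., meters)."""
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

pred = np.array([2.0, 2.0])
gt = np.array([1.0, 2.0])
# abs_rel(pred, gt) -> 0.5, rmse(pred, gt) -> sqrt(0.5)
```

Lower is better for both; Abs Rel is scale-relative, so it complements RMSE when scenes span very different depth ranges, as in the indoor/outdoor comparison above.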

The benchmark provided a rigorous assessment, revealing crucial insights and motivating the design of the Any360D framework. The semi-supervised learning with MTSA notably enhanced the model's stability and representation capability.

Implications and Future Work

This work underscores the importance of effectively leveraging unlabeled data to enhance depth models' generalization capabilities. By integrating robust spatial augmentations with semi-supervised learning, Any360D offers a promising approach to 360 depth estimation.

Practical Implications

In practical terms, Any360D's enhanced depth estimation capabilities could significantly benefit various applications. In VR, it can provide more accurate and immersive 3D experiences. In autonomous driving, improved depth perception, especially in outdoor environments, could enhance navigation and obstacle avoidance systems.

Theoretical Implications

From a theoretical perspective, the paper demonstrates the efficacy of metric depth supervision over disparity-based methods for 360 images, which could influence future research and methodologies in depth foundation models. The use of MTSA for augmenting training data presents a novel approach likely to inspire further research in spatial transformations.

Future Work

Future research directions might include:

  • Expanding the amount of labeled 360 data, particularly for outdoor scenes, to further enhance model training.
  • Investigating alternative augmentation techniques and exploring their impact on model robustness.
  • Applying the principles from Any360D to other computer vision tasks, such as semantic segmentation of 360 images, to evaluate their adaptability and performance gains.

Conclusion

The paper successfully establishes a comprehensive evaluation benchmark for 360 depth models, alongside introducing Any360D, a semi-supervised framework that leverages extensive unlabeled data and robust augmentations to enhance depth estimation. The results signify substantial improvement over existing SOTA methods, emphasizing the potential and effectiveness of the proposed approach. Future developments inspired by this work could drive significant advancements in 360-degree imaging technologies and applications.
