- The paper introduces EGformer, an equirectangular geometry-biased transformer that combines local attention with explicit equirectangular geometry biases for efficient 360-degree depth estimation.
- Its core components are equirectangular relative position embedding (ERPE), distance-based attention score (DAS), and equirectangular-aware attention rearrangement (EaAR), which together balance local and global context.
- Experiments on Structured3D and Pano3D datasets demonstrate superior depth quality with reduced computational cost compared to state-of-the-art methods.
Introduction
Estimating depth from 360-degree (equirectangular) images presents unique challenges due to the inherent distortion of their 180° (vertical) × 360° (horizontal) field of view. Convolutional neural networks (CNNs) struggle with this task because of their limited receptive field, while transformers with global attention achieve superior results at a computational cost that hinders their practicality. What is needed is a balanced approach: one that handles the equirectangular geometry effectively while remaining computationally efficient. To this end, we introduce EGformer, an equirectangular geometry-biased transformer that harnesses local attention to achieve high-quality depth estimation without the computational overhead of global attention methods.
Background
Equirectangular images map a spherical surface onto a flat plane, introducing latitude-dependent distortion that complicates depth estimation. Prior work has addressed this geometry either directly, through specialized convolutional operators, or indirectly, by exploiting the structure inherent in equirectangular images. In parallel, transformers have reshaped vision tasks, including depth estimation, through their ability to model long-range dependencies; however, their quadratic computational complexity in the number of tokens is a significant obstacle for high-resolution 360-degree images. Local attention mechanisms reduce this cost, but they often fall short when confronted with the equirectangular geometry and the expansive field of view of these images.
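To make the latitude-dependent distortion concrete, the following minimal sketch (ours, not from the paper; the function names and angle conventions are assumptions) maps equirectangular pixel coordinates to spherical angles and prints the horizontal stretching factor, which is 1 at the equator and grows without bound toward the poles.

```python
import numpy as np

def pixel_to_spherical(u, v, width, height):
    """Map equirectangular pixel (u, v) to spherical angles (lon, lat).

    Longitude spans [-pi, pi) across the width and latitude spans
    [-pi/2, pi/2] across the height; these names and angle conventions
    are our own, not the paper's.
    """
    lon = (u / width - 0.5) * 2.0 * np.pi
    lat = (0.5 - v / height) * np.pi
    return lon, lat

def horizontal_stretch(v, height):
    """Horizontal stretching factor of the projection at image row v.

    Near the poles one pixel covers a much shorter horizontal arc on
    the sphere, so the same scene content is smeared across many more
    pixels than at the equator; the factor is 1 / cos(latitude).
    """
    _, lat = pixel_to_spherical(0.0, v, 1.0, height)
    return 1.0 / max(np.cos(lat), 1e-6)

# Distortion grows from the equator (row height/2) toward the poles:
height = 512
for row in (256, 384, 448, 500):
    print(f"row {row}: stretch ~{horizontal_stretch(row, height):.2f}x")
```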
Methodology
EGformer combines the strengths of local attention with an explicit equirectangular geometry bias, addressing both the distorted geometry and the need for a large receptive field. The architecture comprises equirectangular relative position embedding (ERPE), distance-based attention score (DAS), and equirectangular-aware attention rearrangement (EaAR). ERPE and DAS operate within local windows to account for the equirectangular geometry, while EaAR modulates these local attentions across the image, compensating for the limited receptive field typically associated with local attention. This arrangement lets EGformer capture global context efficiently, substantially reducing computation and model parameters without compromising depth estimation quality.
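To illustrate the general idea of geometry-biased local attention, here is a hand-rolled PyTorch sketch. It is not EGformer's exact ERPE/DAS/EaAR formulation: we simply bias windowed attention logits with the negative great-circle (haversine) distance between token positions on the sphere, so geometrically nearby tokens receive more attention. All names and the specific bias term are our assumptions.

```python
import torch
import torch.nn.functional as F

def geometry_biased_window_attention(q, k, v, lat, lon):
    """Windowed attention with an illustrative equirectangular bias.

    q, k, v: (num_windows, window_len, dim) tokens from local windows.
    lat, lon: (num_windows, window_len) spherical angles of each token.
    Illustrative stand-in for EGformer's ERPE/DAS, not the paper's
    formulation: logits are biased with the negative great-circle
    distance so geometrically near tokens attend more to each other.
    """
    scale = q.shape[-1] ** -0.5
    logits = torch.einsum("wqd,wkd->wqk", q, k) * scale

    # Great-circle (angular) distance between every token pair in a window.
    dlat = lat.unsqueeze(-1) - lat.unsqueeze(-2)            # (w, q, k)
    dlon = lon.unsqueeze(-1) - lon.unsqueeze(-2)
    a = torch.sin(dlat / 2) ** 2 + \
        torch.cos(lat).unsqueeze(-1) * torch.cos(lat).unsqueeze(-2) * \
        torch.sin(dlon / 2) ** 2
    ang_dist = 2 * torch.asin(a.clamp(0, 1).sqrt())         # haversine

    attn = F.softmax(logits - ang_dist, dim=-1)             # distance bias
    return torch.einsum("wqk,wkd->wqd", attn, v)

# Toy usage: 4 windows of 16 tokens with 32-dim features.
w, L, d = 4, 16, 32
q, k, v = (torch.randn(w, L, d) for _ in range(3))
lat = torch.rand(w, L) * torch.pi - torch.pi / 2    # [-pi/2, pi/2]
lon = torch.rand(w, L) * 2 * torch.pi - torch.pi    # [-pi, pi]
out = geometry_biased_window_attention(q, k, v, lat, lon)
print(out.shape)  # torch.Size([4, 16, 32])
```

Injecting the bias additively into the logits mirrors the common relative-position-bias pattern used by windowed transformers such as Swin.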
Experimental Validation
We evaluated EGformer against recent state-of-the-art methods on the Structured3D and Pano3D datasets, where it shows superior performance in both depth estimation quality and computational efficiency. Specifically, EGformer achieved the best results across all metrics on both datasets while having the lowest computational cost and the fewest parameters among the transformers evaluated. These results demonstrate the effectiveness of incorporating equirectangular geometry biases into a transformer architecture and highlight the model's practicality for real-world applications.
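For context, 360-degree depth benchmarks are typically scored with standard monocular depth metrics. The sketch below computes a few common ones (absolute relative error, RMSE, and the δ < 1.25 accuracy); the summary above does not list EGformer's exact evaluation protocol, so this is illustrative rather than the paper's.

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-6):
    """Common monocular depth metrics on valid (positive) ground truth."""
    valid = gt > eps
    pred, gt = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(pred - gt) / gt)      # absolute relative error
    rmse = np.sqrt(np.mean((pred - gt) ** 2))      # root mean squared error
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)                 # fraction within 25% of gt
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}
```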
Implications and Future Work
The success of EGformer in 360-degree depth estimation marks a significant step forward in efficiently processing equirectangular images for depth-related tasks. By demonstrating that local attention can be leveraged successfully when properly biased toward the image geometry, EGformer opens the door to geometry-biased models in other tasks involving spherical or otherwise geometrically distorted images. Future work may extend this approach to dynamic scenes, pursue further efficiency improvements, and explore applications beyond depth estimation.
Conclusion
EGformer is a novel approach at the intersection of transformer models and equirectangular image processing, addressing the significant challenges of depth estimation in 360-degree images. This work lays the groundwork for future research on efficiently and effectively leveraging the unique properties of equirectangular geometry within the transformer framework, with promising implications for immersive technology applications.