- The paper introduces EGformer, an equirectangular geometry-biased transformer that combines local attention with explicit equirectangular geometry biases for efficient 360-degree depth estimation.
- Its core components are equirectangular relative position embedding (ERPE), distance-based attention score (DAS), and equirectangular-aware attention rearrangement (EaAR), which together balance local and global context.
- Experiments on Structured3D and Pano3D datasets demonstrate superior depth quality with reduced computational cost compared to state-of-the-art methods.
Introduction
Estimating depth from 360-degree (equirectangular) images presents unique challenges due to the inherent distortion of their 180° (vertical) × 360° (horizontal) field of view. Convolutional neural networks (CNNs) struggle with this task because of their limited receptive field, while transformers with global attention achieve superior results at a computational cost that hinders their practicality. What is needed is a balanced approach: one that handles the equirectangular geometry effectively while remaining computationally efficient. To this end, we introduce EGformer, an equirectangular geometry-biased transformer that harnesses local attention to achieve high-quality depth estimation without the computational overhead of global attention methods.
Background
Equirectangular images map a spherical surface onto a flat plane, introducing latitude-dependent distortion that complicates depth estimation. Prior work has addressed this geometry either directly, through specialized convolutional operators, or indirectly, by exploiting the structure inherent in equirectangular images. In parallel, transformers have reshaped vision tasks, including depth estimation, through their ability to model long-range dependencies; however, their quadratic computational complexity in the number of tokens is a significant obstacle for high-resolution 360-degree images. Local attention mechanisms reduce this cost, but they often fall short when confronted with the equirectangular geometry and the expansive field of view of these images.
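To make the latitude-dependent distortion concrete, the following minimal sketch (ours, not from the paper; the function names and angle conventions are assumptions) maps equirectangular pixel coordinates to spherical angles and prints the horizontal stretching factor, which is 1 at the equator and grows without bound toward the poles.

```python
import numpy as np

def pixel_to_spherical(u, v, width, height):
    """Map equirectangular pixel (u, v) to spherical angles (lon, lat).

    Longitude spans [-pi, pi) across the width and latitude spans
    [-pi/2, pi/2] across the height; these names and angle conventions
    are our own, not the paper's.
    """
    lon = (u / width - 0.5) * 2.0 * np.pi
    lat = (0.5 - v / height) * np.pi
    return lon, lat

def horizontal_stretch(v, height):
    """Horizontal stretching factor of the projection at image row v.

    Near the poles one pixel covers a much shorter horizontal arc on
    the sphere, so the same scene content is smeared across many more
    pixels than at the equator; the factor is 1 / cos(latitude).
    """
    _, lat = pixel_to_spherical(0.0, v, 1.0, height)
    return 1.0 / max(np.cos(lat), 1e-6)

# Distortion grows from the equator (row height/2) toward the poles:
height = 512
for row in (256, 384, 448, 500):
    print(f"row {row}: stretch ~{horizontal_stretch(row, height):.2f}x")
```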
Methodology
EGformer combines the strengths of local attention with an explicit equirectangular geometry bias, addressing both the distorted geometry and the need for a large receptive field. The architecture comprises equirectangular relative position embedding (ERPE), distance-based attention score (DAS), and equirectangular-aware attention rearrangement (EaAR). ERPE and DAS operate within local windows to account for the equirectangular geometry, while EaAR modulates these local attentions across the image, compensating for the limited receptive field typically associated with local attention. This arrangement lets EGformer capture global context efficiently, substantially reducing computation and model parameters without compromising depth estimation quality.
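To illustrate the general idea of geometry-biased local attention, here is a hand-rolled PyTorch sketch. It is not EGformer's exact ERPE/DAS/EaAR formulation: we simply bias windowed attention logits with the negative great-circle (haversine) distance between token positions on the sphere, so geometrically nearby tokens receive more attention. All names and the specific bias term are our assumptions.

```python
import torch
import torch.nn.functional as F

def geometry_biased_window_attention(q, k, v, lat, lon):
    """Windowed attention with an illustrative equirectangular bias.

    q, k, v: (num_windows, window_len, dim) tokens from local windows.
    lat, lon: (num_windows, window_len) spherical angles of each token.
    Illustrative stand-in for EGformer's ERPE/DAS, not the paper's
    formulation: logits are biased with the negative great-circle
    distance so geometrically near tokens attend more to each other.
    """
    scale = q.shape[-1] ** -0.5
    logits = torch.einsum("wqd,wkd->wqk", q, k) * scale

    # Great-circle (angular) distance between every token pair in a window.
    dlat = lat.unsqueeze(-1) - lat.unsqueeze(-2)            # (w, q, k)
    dlon = lon.unsqueeze(-1) - lon.unsqueeze(-2)
    a = torch.sin(dlat / 2) ** 2 + \
        torch.cos(lat).unsqueeze(-1) * torch.cos(lat).unsqueeze(-2) * \
        torch.sin(dlon / 2) ** 2
    ang_dist = 2 * torch.asin(a.clamp(0, 1).sqrt())         # haversine

    attn = F.softmax(logits - ang_dist, dim=-1)             # distance bias
    return torch.einsum("wqk,wkd->wqd", attn, v)

# Toy usage: 4 windows of 16 tokens with 32-dim features.
w, L, d = 4, 16, 32
q, k, v = (torch.randn(w, L, d) for _ in range(3))
lat = torch.rand(w, L) * torch.pi - torch.pi / 2    # [-pi/2, pi/2]
lon = torch.rand(w, L) * 2 * torch.pi - torch.pi    # [-pi, pi]
out = geometry_biased_window_attention(q, k, v, lat, lon)
print(out.shape)  # torch.Size([4, 16, 32])
```

Injecting the bias additively into the logits mirrors the common relative-position-bias pattern used by windowed transformers such as Swin.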
Experimental Validation
We evaluated EGformer against recent state-of-the-art methods on the Structured3D and Pano3D datasets, where it shows superior performance in both depth estimation quality and computational efficiency. Specifically, EGformer achieved the best results across all metrics on both datasets while having the lowest computational cost and the fewest parameters among the transformers evaluated. These results demonstrate the effectiveness of incorporating equirectangular geometry biases into a transformer architecture and highlight the model's practicality for real-world applications.
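For context, 360-degree depth benchmarks are typically scored with standard monocular depth metrics. The sketch below computes a few common ones (absolute relative error, RMSE, and the δ < 1.25 accuracy); the summary above does not list EGformer's exact evaluation protocol, so this is illustrative rather than the paper's.

```python
import numpy as np

def depth_metrics(pred, gt, eps=1e-6):
    """Common monocular depth metrics on valid (positive) ground truth."""
    valid = gt > eps
    pred, gt = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(pred - gt) / gt)      # absolute relative error
    rmse = np.sqrt(np.mean((pred - gt) ** 2))      # root mean squared error
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)                 # fraction within 25% of gt
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}
```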
Implications and Future Work
The success of EGformer in 360-degree depth estimation marks a significant step forward in efficiently processing equirectangular images for depth-related tasks. By demonstrating that local attention can be leveraged successfully when properly biased toward the image geometry, EGformer opens the door to geometry-biased models in other tasks involving spherical or otherwise geometrically distorted images. Future work may extend this approach to dynamic scenes, pursue further efficiency improvements, and explore applications beyond depth estimation.
Conclusion
EGformer is a novel approach at the intersection of transformer models and equirectangular image processing, addressing the significant challenges of depth estimation in 360-degree images. This work lays the groundwork for future research on efficiently and effectively leveraging the unique properties of equirectangular geometry within the transformer framework, with promising implications for immersive technology applications.