Elite360D: Towards Efficient 360 Depth Estimation via Semantic- and Distance-Aware Bi-Projection Fusion (2403.16376v2)
Abstract: 360 depth estimation has recently received great attention for 3D reconstruction owing to its omnidirectional field of view (FoV). Recent approaches focus predominantly on cross-projection fusion with geometry-based re-projection: they fuse 360 images in equirectangular projection (ERP) with another projection type, e.g., cubemap projection, to estimate depth in the ERP format. However, these methods suffer from 1) limited local receptive fields, which make it difficult to capture large-FoV scenes, and 2) prohibitive computational cost, caused by the complex design of the cross-projection fusion module. In this paper, we propose Elite360D, a novel framework that takes as input the ERP image and an icosahedron projection (ICOSAP) point set, which is undistorted and spatially continuous. Elite360D learns a representation from a local-with-global perspective. Alongside a flexible ERP image encoder, it includes an ICOSAP point encoder and a Bi-projection Bi-attention Fusion (B2F) module (about 1M parameters in total). Specifically, the ERP image encoder can use various backbones trained on perspective images (e.g., ResNet, Transformer) to extract local features. The point encoder extracts global features from the ICOSAP point set. The B2F module then captures the semantic- and distance-aware dependencies between each pixel of the ERP feature map and the entire ICOSAP feature set. Without a specialized backbone design or a notable increase in computational cost, Elite360D outperforms prior art on several benchmark datasets.
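To make the idea of semantic- and distance-aware dependencies concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes several things the abstract does not state: the 12 vertices of a regular icosahedron stand in for the ICOSAP point set, semantic attention is a scaled dot product of features, distance attention is a softmax over negative Euclidean distances on the unit sphere, and the two weight sets are simply averaged. All function names (`erp_pixel_to_sphere`, `icosahedron_vertices`, `fuse_pixel`) and the toy features are hypothetical.

```python
import math

def erp_pixel_to_sphere(row, col, height, width):
    """Map the centre of an ERP pixel to a unit vector (x, y, z)."""
    lon = ((col + 0.5) / width - 0.5) * 2.0 * math.pi  # longitude in [-pi, pi)
    lat = (0.5 - (row + 0.5) / height) * math.pi       # latitude in (-pi/2, pi/2)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

def icosahedron_vertices():
    """Return the 12 vertices of a regular icosahedron on the unit sphere."""
    phi = (1.0 + math.sqrt(5.0)) / 2.0  # golden ratio
    raw = []
    for a in (-1.0, 1.0):
        for b in (-phi, phi):
            raw += [(0.0, a, b), (a, b, 0.0), (b, 0.0, a)]
    norm = math.sqrt(1.0 + phi * phi)
    return [(x / norm, y / norm, z / norm) for x, y, z in raw]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def fuse_pixel(pixel_feat, pixel_xyz, point_feats, point_xyz):
    """Aggregate all point features for one ERP pixel using two attention maps."""
    d = len(pixel_feat)
    # Semantic attention (assumed form): scaled dot-product feature similarity.
    semantic = softmax([dot(pixel_feat, f) / math.sqrt(d) for f in point_feats])
    # Distance attention (assumed form): nearer points get larger weights.
    distance = softmax([-math.dist(pixel_xyz, p) for p in point_xyz])
    # Assumed fusion rule: plain average of the two weight sets.
    weights = [(s + t) / 2.0 for s, t in zip(semantic, distance)]
    fused = [sum(w * f[k] for w, f in zip(weights, point_feats)) for k in range(d)]
    return fused, weights

# Toy usage: point features are just coordinates plus a constant channel.
points = icosahedron_vertices()
point_feats = [[x, y, z, 1.0] for x, y, z in points]
xyz = erp_pixel_to_sphere(row=1, col=3, height=4, width=8)
fused, weights = fuse_pixel([*xyz, 1.0], xyz, point_feats, points)
```

Because every ERP pixel attends to the whole point set, each pixel receives global context in a single step; the cost grows with the number of pixels times the number of points, so a small point set keeps this cheap.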