Elite360M: Efficient 360 Multi-task Learning via Bi-projection Fusion and Cross-task Collaboration (2408.09336v1)
Abstract: 360° cameras capture the entire surrounding environment with a large FoV, providing comprehensive visual information from which 3D structure, e.g., depth and surface normals, and semantics can be inferred simultaneously. Existing works predominantly specialize in a single task, leaving multi-task learning of 3D geometry and semantics largely unexplored. Achieving this objective is challenging, however, due to: 1) the inherent spherical distortion of the planar equirectangular projection (ERP) and the insufficient global perception induced by the 360° image's ultra-wide FoV; 2) the non-trivial difficulty of effectively merging geometry and semantics across different tasks so that they benefit each other. In this paper, we propose a novel end-to-end multi-task learning framework, named Elite360M, that simultaneously infers 3D structure via depth and surface normal estimation and semantics via semantic segmentation. Our key idea is to build a representation with strong global perception and little distortion while exploring the inter- and cross-task relationships between geometry and semantics. We incorporate the distortion-free, spatially continuous icosahedron projection (ICOSAP) points and combine them with ERP to enhance global perception. At negligible cost, a Bi-projection Bi-attention Fusion module is designed to capture the semantic- and distance-aware dependencies between each pixel of the region-aware ERP feature and the ICOSAP point feature set. Moreover, we propose a novel Cross-task Collaboration module that explicitly extracts task-specific geometric and semantic information from the learned representation to make preliminary predictions, and then integrates the spatial contextual information among tasks to realize cross-task fusion. Extensive experiments demonstrate the effectiveness and efficiency of Elite360M.
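To make the abstract's two mechanisms concrete, here is a minimal sketch of the described pipeline: cross-attention in which flattened ERP pixel features query the ICOSAP point feature set, followed by per-task preliminary prediction heads and a cross-task fusion step over the pooled task context. All class names, shapes, and the point count (642, a level-3 icosphere) are illustrative assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BiProjectionFusion(nn.Module):
    """Hypothetical bi-projection fusion: region-aware ERP pixels (queries)
    attend to the ICOSAP point feature set (keys/values)."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, erp: torch.Tensor, ico: torch.Tensor) -> torch.Tensor:
        # erp: (B, H*W, C) flattened ERP feature map; ico: (B, N, C) point set
        fused, _ = self.attn(query=erp, key=ico, value=ico)
        return self.norm(erp + fused)  # residual keeps the ERP pixel layout

class CrossTaskCollaboration(nn.Module):
    """Hypothetical cross-task collaboration: task-specific convs extract
    per-task features and make preliminary predictions; each task is then
    refined from the concatenated context of all tasks (cross-task fusion)."""
    def __init__(self, dim: int, tasks: dict):
        super().__init__()
        self.extract = nn.ModuleDict({t: nn.Conv2d(dim, dim, 3, padding=1)
                                      for t in tasks})
        self.prelim = nn.ModuleDict({t: nn.Conv2d(dim, c, 1)
                                     for t, c in tasks.items()})
        self.fuse = nn.ModuleDict({t: nn.Conv2d(dim * len(tasks), dim, 1)
                                   for t in tasks})

    def forward(self, shared: torch.Tensor):
        task_feats = {t: conv(shared) for t, conv in self.extract.items()}
        prelim = {t: self.prelim[t](f) for t, f in task_feats.items()}
        context = torch.cat(list(task_feats.values()), dim=1)
        refined = {t: self.fuse[t](context) for t in task_feats}
        return prelim, refined

# Toy forward pass: depth (1 ch), surface normals (3 ch), segmentation (13 cls).
B, C, H, W = 1, 64, 8, 16
fusion = BiProjectionFusion(dim=C)
rep = fusion(torch.randn(B, H * W, C), torch.randn(B, 642, C))  # 642 ICOSAP pts
rep = rep.transpose(1, 2).reshape(B, C, H, W)
ctc = CrossTaskCollaboration(C, {"depth": 1, "normal": 3, "seg": 13})
prelim, refined = ctc(rep)
```

The fused output keeps the ERP grid layout, so the refined per-task features can feed standard dense-prediction decoders; the preliminary predictions would additionally be supervised to force the extraction convs to specialize per task.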