
Elite360M: Efficient 360 Multi-task Learning via Bi-projection Fusion and Cross-task Collaboration (2408.09336v1)

Published 18 Aug 2024 in cs.CV

Abstract: 360 cameras capture the entire surrounding environment with a large FoV, providing comprehensive visual information from which 3D structure, e.g., depth and surface normals, and semantics can be inferred simultaneously. Existing works predominantly specialize in a single task, leaving multi-task learning of 3D geometry and semantics largely unexplored. Achieving this objective is, however, challenging due to: 1) the inherent spherical distortion of the planar equirectangular projection (ERP) and the insufficient global perception induced by the 360 image's ultra-wide FoV; 2) the non-trivial difficulty of effectively merging geometry and semantics across different tasks so that they benefit each other. In this paper, we propose a novel end-to-end multi-task learning framework, named Elite360M, that simultaneously infers 3D structure via depth and surface normal estimation and semantics via semantic segmentation. Our key idea is to build a representation with strong global perception and little distortion while exploring the inter- and cross-task relationships between geometry and semantics. We incorporate the distortion-free and spatially continuous icosahedron projection (ICOSAP) points and combine them with ERP to enhance global perception. At negligible cost, a Bi-projection Bi-attention Fusion module captures the semantic- and distance-aware dependencies between each pixel of the region-aware ERP feature and the ICOSAP point feature set. Moreover, we propose a novel Cross-task Collaboration module that explicitly extracts task-specific geometric and semantic information from the learned representation to make preliminary predictions, and then integrates spatial contextual information across tasks to realize cross-task fusion. Extensive experiments demonstrate the effectiveness and efficiency of Elite360M.
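The abstract sketches two components: a Bi-projection Bi-attention Fusion module that lets each ERP pixel attend to the ICOSAP point feature set, and a Cross-task Collaboration module that makes preliminary per-task predictions and then exchanges spatial context across tasks. To make the data flow concrete, below is a minimal PyTorch sketch of both ideas. It is not the authors' implementation: the paper's bi-attention is semantic- and distance-aware and its cross-task fusion is more elaborate, whereas this stand-in uses plain multi-head attention; all class names, tensor shapes, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' code): bi-projection fusion
# and cross-task collaboration approximated with standard multi-head attention.
import torch
import torch.nn as nn


class BiProjectionFusion(nn.Module):
    """ERP pixels (queries) attend to ICOSAP point features (keys/values)."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, erp_feat: torch.Tensor, icosap_feat: torch.Tensor) -> torch.Tensor:
        # erp_feat:    (B, C, H, W) region-aware ERP feature map
        # icosap_feat: (B, N, C)    ICOSAP point feature set
        b, c, h, w = erp_feat.shape
        q = erp_feat.flatten(2).transpose(1, 2)            # (B, H*W, C)
        fused, _ = self.attn(q, icosap_feat, icosap_feat)  # cross-attention
        fused = self.norm(q + fused)                       # residual + norm
        return fused.transpose(1, 2).reshape(b, c, h, w)


class CrossTaskCollaboration(nn.Module):
    """Task-specific features from a shared representation, then per-pixel
    attention across the task dimension as a crude cross-task exchange."""

    def __init__(self, dim: int = 256, num_heads: int = 8,
                 tasks=("depth", "normal", "semantic")):
        super().__init__()
        # One lightweight head per task extracts task-specific information.
        self.heads = nn.ModuleDict({t: nn.Conv2d(dim, dim, 3, padding=1) for t in tasks})
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, shared_feat: torch.Tensor) -> dict:
        b, c, h, w = shared_feat.shape
        feats = [head(shared_feat) for head in self.heads.values()]  # T x (B, C, H, W)
        # (B, H*W, T, C): at every pixel, the T task features form a short sequence.
        seq = torch.stack([f.flatten(2).transpose(1, 2) for f in feats], dim=2)
        t = len(feats)
        tokens = seq.reshape(b * h * w, t, c)
        fused, _ = self.attn(tokens, tokens, tokens)       # cross-task attention
        fused = fused.reshape(b, h * w, t, c)
        return {name: fused[:, :, i].transpose(1, 2).reshape(b, c, h, w)
                for i, name in enumerate(self.heads)}


# Toy usage (all shapes arbitrary):
# bbf = BiProjectionFusion(dim=256)
# ctc = CrossTaskCollaboration(dim=256)
# erp = torch.randn(1, 256, 32, 64)   # ERP feature map
# pts = torch.randn(1, 642, 256)      # ICOSAP point features
# out = ctc(bbf(erp, pts))            # {'depth': ..., 'normal': ..., 'semantic': ...}
```

The flattening in CrossTaskCollaboration treats the handful of task features at each pixel as one short attention sequence, which keeps the exchange cheap; this is one plausible reading of "integrating spatial contextual information among tasks", not the paper's exact mechanism.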

