
Memory-based Adapters for Online 3D Scene Perception (2403.06974v1)

Published 11 Mar 2024 in cs.CV

Abstract: In this paper, we propose a new framework for online 3D scene perception. Conventional 3D scene perception methods are offline, i.e., they take an already reconstructed 3D scene geometry as input. This does not suit robotic applications, where the input is a stream of RGB-D video rather than a complete 3D scene reconstructed from pre-collected RGB-D videos. To handle online 3D scene perception tasks, where data collection and perception happen simultaneously, the model should process 3D scenes frame by frame and make use of temporal information. To this end, we propose an adapter-based plug-and-play module for the backbones of 3D scene perception models, which constructs a memory to cache and aggregate the extracted RGB-D features, giving offline models the ability to learn temporal information. Specifically, we propose a queued memory mechanism to cache the supporting point cloud and image features. We then devise aggregation modules that operate directly on the memory and pass temporal information to the current frame. We further propose a 3D-to-2D adapter to enhance image features with strong global context. Our adapters can be easily inserted into mainstream offline architectures for different tasks and significantly boost their performance on online tasks. Extensive experiments on the ScanNet and SceneNN datasets demonstrate that our approach achieves leading performance on three 3D scene perception tasks compared with state-of-the-art online methods, simply by finetuning existing offline models without any model- or task-specific designs. Project page: https://xuxw98.github.io/Online3D/
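The queued memory and memory-based aggregation described in the abstract can be illustrated with a minimal sketch. It assumes that each frame is a list of 3D points with one feature vector per point, that the memory keeps only the most recent `max_frames` frames, and that aggregation is a simple radius-neighborhood mean added residually to the current frame's features. `FrameFeatures`, `QueuedMemory`, `max_frames`, and `radius` are hypothetical names chosen for illustration; the radius-mean aggregation is a stand-in, not the paper's aggregation modules, and the 3D-to-2D adapter is not modeled.

```python
from collections import deque
from dataclasses import dataclass
from math import dist
from typing import Deque, List, Tuple

Point = Tuple[float, float, float]


@dataclass
class FrameFeatures:
    """Hypothetical per-frame container: 3D coordinates and one feature vector per point."""
    coords: List[Point]
    feats: List[List[float]]


class QueuedMemory:
    """Hypothetical fixed-length FIFO cache of features from past frames."""

    def __init__(self, max_frames: int) -> None:
        # A deque with maxlen discards the oldest frame automatically once full,
        # so memory cost stays bounded as the RGB-D stream grows.
        self.frames: Deque[FrameFeatures] = deque(maxlen=max_frames)

    def push(self, frame: FrameFeatures) -> None:
        self.frames.append(frame)

    def aggregate(self, current: FrameFeatures, radius: float) -> List[List[float]]:
        """Add to each current feature the mean of cached features within `radius`.

        Stand-in aggregation (assumption): a residual neighborhood mean,
        not the learned aggregation modules from the paper.
        """
        out: List[List[float]] = []
        for p, f in zip(current.coords, current.feats):
            # Gather features of all cached points close to the current point.
            neighbors = [
                g
                for frame in self.frames
                for q, g in zip(frame.coords, frame.feats)
                if dist(p, q) <= radius
            ]
            if neighbors:
                mean = [sum(col) / len(neighbors) for col in zip(*neighbors)]
                out.append([a + b for a, b in zip(f, mean)])
            else:
                # No temporal context yet: pass the feature through unchanged.
                out.append(list(f))
        return out


# Usage: stream three single-point frames through a two-frame memory.
memory = QueuedMemory(max_frames=2)
enhanced: List[List[float]] = []
for t in range(3):
    frame = FrameFeatures(coords=[(float(t), 0.0, 0.0)], feats=[[float(t)]])
    # Aggregate before pushing so the current frame never matches itself.
    enhanced = memory.aggregate(frame, radius=1.5)
    memory.push(frame)
```

Aggregating before pushing keeps the current frame out of its own context, and the bounded `deque` gives constant memory per step, which is the property an online model needs when frames arrive indefinitely.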

Authors (7)
  1. Xiuwei Xu (16 papers)
  2. Chong Xia (2 papers)
  3. Ziwei Wang (128 papers)
  4. Linqing Zhao (7 papers)
  5. Yueqi Duan (47 papers)
  6. Jie Zhou (687 papers)
  7. Jiwen Lu (192 papers)
Citations (4)
