AnyView: Generalizable Indoor 3D Object Detection with Variable Frames
Abstract: In this paper, we propose a novel network framework for indoor 3D object detection that handles a variable number of input frames in practical scenarios. Existing methods consider only a fixed number of input frames for a single detector, such as monocular RGB-D images or point clouds reconstructed from dense multi-view RGB-D images. In practical applications such as robot navigation and manipulation, however, the raw input to a 3D detector is a set of RGB-D images with a variable number of frames rather than a reconstructed scene point cloud. Previous approaches, which handle only fixed-frame input, perform poorly when the number of input frames varies. To make 3D object detection suitable for practical tasks, we present a novel 3D detection framework named AnyView, which generalizes well across different numbers of input frames with a single model. Specifically, we propose a geometric learner to mine the local geometric features of each input RGB-D frame and implement local-global feature interaction through a designed spatial mixture module. We further use a dynamic token strategy to adaptively adjust the number of features extracted from each frame, which ensures consistent global feature density and further enhances generalization after fusion. Extensive experiments on the ScanNet dataset show that our method achieves both strong generalizability and high detection accuracy with a simple and clean architecture containing a similar number of parameters to the baselines.
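The dynamic token strategy described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the per-frame feature count is chosen, so the rule below is an assumption made only for illustration: a fixed global budget (`total_tokens`, a hypothetical parameter) is split as evenly as possible across the input frames, so the total number of fused features stays constant whether 1 or 50 frames are given. The function name `tokens_per_frame` is likewise hypothetical and not taken from the paper.

```python
from typing import List


def tokens_per_frame(num_frames: int, total_tokens: int = 1024) -> List[int]:
    """Split a fixed global token budget as evenly as possible across frames.

    Assumption for illustration only: an even split of a constant budget.
    The paper's actual allocation rule is not given in the abstract.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    base, extra = divmod(total_tokens, num_frames)
    # The first `extra` frames get one additional token so the sum is exact.
    return [base + 1 if i < extra else base for i in range(num_frames)]


if __name__ == "__main__":
    for n in (1, 4, 7):
        counts = tokens_per_frame(n)
        print(n, "frames ->", counts[0], "tokens for the first frame, total", sum(counts))
```

Under this assumed rule, more frames means fewer features per frame, which is one simple way to keep the fused scene representation at a consistent density regardless of the frame count.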