Online Robot Navigation and Manipulation with Distilled Vision-Language Models (2401.17083v4)
Abstract: Autonomous robot navigation in dynamic, unknown environments is of crucial significance for mobile robotic applications, including last-mile delivery and robot-enabled automated supply delivery in industrial and hospital settings. Current solutions still suffer from limitations: the robot can neither recognize unknown objects in real time nor navigate freely through dynamic, narrow, and complex environments. We propose a complete software framework for autonomous robot perception and navigation amid very dense obstacles and dense human crowds. First, we propose a framework that accurately detects and segments open-world object categories in a zero-shot manner, overcoming the over-segmentation limitation of the current SAM model. Second, we propose a distillation strategy that transfers the knowledge required to segment the free space of the walkway for robot navigation without any labels. Meanwhile, we design a trimming strategy that works collaboratively with distillation to enable lightweight inference, so that the neural network can be deployed on edge devices such as the NVIDIA TX2 or Xavier NX during autonomous navigation. Integrated into the robot navigation system and validated in extensive experiments, our proposed framework achieves superior accuracy and efficiency in both robot scene perception and autonomous robot navigation.
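The open-world perception step can be illustrated with a minimal sketch that scores class-agnostic SAM masks against free-form text prompts with CLIP. This assumes the public `segment_anything` and `clip` packages, a downloaded `sam_vit_h_4b8939.pth` checkpoint, and an illustrative prompt list; a common remedy for SAM's over-segmentation is then to union neighboring masks that receive the same label, though the paper's actual mechanism is not reproduced here.

```python
# Hedged sketch: zero-shot open-vocabulary labeling of class-agnostic SAM
# masks with CLIP. Package names, checkpoint path, and prompt list are
# assumptions for illustration; the paper's pipeline may differ.
import numpy as np
import torch
import clip
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

device = "cuda" if torch.cuda.is_available() else "cpu"

# The vocabulary is open-ended: it can be edited at run time, no retraining.
PROMPTS = ["a photo of a walkway", "a photo of a pedestrian",
           "a photo of a door", "a photo of an obstacle"]

clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)
with torch.no_grad():
    text_feat = clip_model.encode_text(clip.tokenize(PROMPTS).to(device))
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(device)
mask_gen = SamAutomaticMaskGenerator(sam)

def label_masks(image_rgb: np.ndarray):
    """Return (binary_mask, class_index, similarity) for one RGB uint8 image."""
    results = []
    for m in mask_gen.generate(image_rgb):
        x, y, w, h = (int(v) for v in m["bbox"])      # SAM boxes are XYWH
        if w < 2 or h < 2:                            # skip degenerate crops
            continue
        crop = Image.fromarray(image_rgb[y:y + h, x:x + w])
        with torch.no_grad():
            img_feat = clip_model.encode_image(
                clip_preprocess(crop).unsqueeze(0).to(device))
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            sims = (img_feat @ text_feat.T).squeeze(0)
        score, cls = sims.max(dim=0)
        results.append((m["segmentation"], int(cls), float(score)))
    return results
```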
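For the label-free free-space distillation, a minimal PyTorch sketch is shown below: a frozen teacher (for instance, the open-vocabulary pipeline above, producing walkway/non-walkway logits) supervises a lightweight student with soft (KL-divergence) and hard (pseudo-label cross-entropy) targets. The function name `distill_step`, the temperature `T`, and the equal loss weighting are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch: pixel-wise knowledge distillation for label-free free-space
# segmentation. `teacher` stands in for the frozen open-vocabulary pipeline;
# `student` is any lightweight segmentation network with matching output shape.
import torch
import torch.nn.functional as F

def distill_step(student, teacher, images, optimizer, T: float = 2.0):
    """One optimization step: match student logits to teacher soft labels."""
    with torch.no_grad():                         # teacher stays frozen
        t_logits = teacher(images)                # (B, C, H, W)
    s_logits = student(images)                    # same shape as teacher output
    # Soft-target KL term, scaled by T^2 as in Hinton et al.'s formulation.
    kl = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                  F.softmax(t_logits / T, dim=1),
                  reduction="batchmean") * T * T
    # Hard term: the teacher's argmax serves as a free pseudo-label,
    # so no human annotation of the walkway is needed.
    ce = F.cross_entropy(s_logits, t_logits.argmax(dim=1))
    loss = kl + ce
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```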
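The "trimming" step plausibly corresponds to structured pruning; the hedged sketch below uses PyTorch's built-in L1-norm channel pruning as one such strategy. Note that this only zeroes whole channels in place: obtaining real speedups on a TX2 or Xavier NX additionally requires rebuilding the layers into a physically smaller network (or exporting through a structural-pruning tool), which is not shown.

```python
# Hedged sketch: magnitude-based structured channel pruning with PyTorch's
# built-in utilities, as one plausible form of the trimming step. The
# function name and the 30% pruning ratio are illustrative assumptions.
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_conv_channels(model: nn.Module, amount: float = 0.3) -> nn.Module:
    """Zero the `amount` fraction of lowest-L1-norm output channels per conv."""
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            prune.ln_structured(module, name="weight",
                                amount=amount, n=1, dim=0)
            prune.remove(module, "weight")  # bake the mask into the weights
    return model
```

A typical way pruning "works collaboratively with distillation", as the abstract puts it, is to prune the student and then fine-tune it with further `distill_step` iterations so the teacher's soft targets recover the accuracy lost to the removed channels.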
Kangcheng Liu