
BEVTrack: A Simple and Strong Baseline for 3D Single Object Tracking in Bird's-Eye View (2309.02185v6)

Published 5 Sep 2023 in cs.CV and cs.AI

Abstract: 3D Single Object Tracking (SOT) is a fundamental task in computer vision, essential for applications such as autonomous driving. It remains challenging to localize the target amid its surroundings due to appearance variations, distractors, and the high sparsity of point clouds. To address these issues, prior Siamese and motion-centric trackers both require elaborate designs and the solution of multiple subtasks. In this paper, we propose BEVTrack, a simple yet effective baseline method. By estimating the target motion in Bird's-Eye View (BEV) to perform tracking, BEVTrack demonstrates surprising simplicity in various aspects, i.e., network design, training objectives, and tracking pipeline, while achieving superior performance. In addition, to achieve accurate regression for targets with diverse attributes (e.g., sizes and motion patterns), BEVTrack constructs the likelihood function with learned underlying distributions adapted to different targets, rather than making a fixed Laplacian or Gaussian assumption as in previous works. This provides valuable priors for tracking and thus further boosts performance. Despite using only a single regression loss with a plain convolutional architecture, BEVTrack achieves state-of-the-art performance on three large-scale datasets (KITTI, NuScenes, and the Waymo Open Dataset) while maintaining a high inference speed of about 200 FPS. The code will be released at https://github.com/xmm-prio/BEVTrack.
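To make the abstract's two key ideas concrete, the sketch below illustrates (1) regressing the target's motion directly from fused BEV feature maps with a plain convolutional head and a single regression loss, and (2) replacing a fixed Laplacian/Gaussian loss with a likelihood whose scale is predicted per target. This is a minimal illustration, not the authors' implementation: all module names, tensor shapes, and the Gaussian-with-predicted-scale distribution are assumptions made for clarity (BEVTrack learns a more flexible distribution).

```python
# Minimal sketch (assumed names/shapes; not BEVTrack's actual code):
# motion regression in BEV with a likelihood whose scale adapts per target.
import torch
import torch.nn as nn

class BEVMotionHead(nn.Module):
    def __init__(self, in_channels: int = 128):
        super().__init__()
        # Plain convolutional trunk over the concatenated BEV features
        # of the previous and current frames.
        self.trunk = nn.Sequential(
            nn.Conv2d(2 * in_channels, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Predict the motion mean (dx, dy, dz, dyaw) and a per-dimension
        # log-scale, so the distribution widens for hard/ambiguous targets.
        self.mean = nn.Linear(128, 4)
        self.log_scale = nn.Linear(128, 4)

    def forward(self, bev_prev: torch.Tensor, bev_curr: torch.Tensor):
        x = self.trunk(torch.cat([bev_prev, bev_curr], dim=1))
        return self.mean(x), self.log_scale(x)

def motion_nll(mean, log_scale, target):
    # Gaussian negative log-likelihood (up to a constant) with a
    # predicted scale, instead of a fixed-variance L2/L1 assumption.
    scale = log_scale.exp()
    return (((target - mean) / scale) ** 2 / 2 + log_scale).mean()

# Usage: bev_prev, bev_curr are (B, 128, H, W) BEV feature maps;
# gt_motion is the (B, 4) ground-truth inter-frame target motion.
head = BEVMotionHead()
bev_prev, bev_curr = torch.randn(2, 128, 64, 64), torch.randn(2, 128, 64, 64)
mean, log_scale = head(bev_prev, bev_curr)
loss = motion_nll(mean, log_scale, torch.zeros(2, 4))
```

Because the predicted mean alone gives the motion update, tracking reduces to translating and rotating the previous box by the regressed offsets, with no proposal generation, matching, or auxiliary subtasks.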
