
Inverse Neural Rendering for Explainable Multi-Object Tracking (2404.12359v1)

Published 18 Apr 2024 in cs.CV, cs.GR, and cs.RO

Abstract: Today, most methods for image understanding tasks rely on feed-forward neural networks. While this approach has allowed for empirical accuracy, efficiency, and task adaptation via fine-tuning, it also comes with fundamental disadvantages. Existing networks often struggle to generalize across different datasets, even on the same task. By design, these networks ultimately reason about high-dimensional scene features, which are challenging to analyze. This is especially true when attempting to predict 3D information from 2D images. We propose to recast 3D multi-object tracking from RGB cameras as an Inverse Rendering (IR) problem: optimizing through a differentiable rendering pipeline over the latent space of pre-trained 3D object representations to retrieve the latents that best represent the object instances in a given input image. To this end, we optimize an image loss over generative latent spaces that inherently disentangle shape and appearance properties. Beyond offering an alternate take on tracking, our method also enables examining the generated objects, reasoning about failure situations, and resolving ambiguous cases. We validate the generalization and scaling capabilities of our method by learning the generative prior exclusively from synthetic data and assessing camera-based 3D tracking on the nuScenes and Waymo datasets. Both datasets are completely unseen by our method, and no fine-tuning is required. Videos and code are available at https://light.princeton.edu/inverse-rendering-tracking/.
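
The core idea in the abstract is test-time latent optimization: a pre-trained generative 3D object model is "inverted" by rendering its output differentiably and minimizing an image loss against the observed object crop. The following is a minimal PyTorch sketch of that loop under stated assumptions; the `generator`, `render`, `camera`, attribute names, loss choice, and hyperparameters are illustrative placeholders, not the authors' actual implementation.

```python
# Hypothetical sketch of inverse-rendering object fitting.
# Assumes a pre-trained generative model `generator(z_shape, z_app)` that
# returns a textured 3D asset, and a differentiable renderer `render(...)`.
import torch
import torch.nn.functional as F

def fit_object_latents(image_crop, generator, render, camera, num_steps=100):
    """Optimize disentangled shape/appearance latents and an object pose so
    the rendered object matches the observed image crop."""
    device = image_crop.device
    # Latent codes for shape and appearance, plus an initial pose guess
    # (3 translation + 3 rotation parameters); all optimized at test time.
    z_shape = torch.zeros(1, generator.shape_dim, device=device, requires_grad=True)
    z_app = torch.zeros(1, generator.app_dim, device=device, requires_grad=True)
    pose = torch.zeros(1, 6, device=device, requires_grad=True)

    optimizer = torch.optim.Adam([z_shape, z_app, pose], lr=1e-2)
    for _ in range(num_steps):
        optimizer.zero_grad()
        mesh, texture = generator(z_shape, z_app)       # generate a textured 3D object
        rendered = render(mesh, texture, pose, camera)  # differentiable rendering to image space
        loss = F.l1_loss(rendered, image_crop)          # photometric image loss (illustrative)
        loss.backward()                                 # gradients flow back through the renderer
        optimizer.step()
    return z_shape.detach(), z_app.detach(), pose.detach()
```

In a tracking setting, the recovered latents and poses can then serve as per-object state that is associated across frames, and the rendered objects themselves can be inspected to explain successes and failures.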

