
Visual Point Cloud Forecasting enables Scalable Autonomous Driving (2312.17655v1)

Published 29 Dec 2023 in cs.CV

Abstract: In contrast to extensive studies on general vision, pre-training for scalable visual autonomous driving remains rarely explored. Visual autonomous driving applications require features that encompass semantics, 3D geometry, and temporal information simultaneously for joint perception, prediction, and planning, which poses considerable challenges for pre-training. To resolve this, we propose a new pre-training task termed visual point cloud forecasting: predicting future point clouds from historical visual input. The key merit of this task is that it captures the synergistic learning of semantics, 3D structures, and temporal dynamics; hence it shows superiority in various downstream tasks. To cope with this new problem, we present ViDAR, a general model to pre-train downstream visual encoders. It first extracts historical embeddings with the encoder. These representations are then transformed to 3D geometric space via a novel Latent Rendering operator for future point cloud prediction. Experiments show significant gains in downstream tasks, e.g., 3.1% NDS on 3D detection, ~10% error reduction on motion forecasting, and ~15% lower collision rate on planning.
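
To make the forecasting task concrete, the sketch below scores a predicted point cloud against an observed one. The abstract does not say how predictions are compared with ground truth; the symmetric Chamfer distance is a common choice in point cloud work and is used here purely as an illustrative assumption. The function name `chamfer_distance` and the toy points are hypothetical and are not taken from ViDAR.

```python
import math


def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between two point sets.

    Each point set is a non-empty list of (x, y, z) tuples.
    Illustrative assumption: this is not stated as ViDAR's loss or metric.
    """
    if not pred or not target:
        raise ValueError("both point clouds must be non-empty")

    def one_sided(src, dst):
        # Mean distance from each point in src to its nearest neighbour in dst.
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    # Average the two directions so the score is symmetric.
    return 0.5 * (one_sided(pred, target) + one_sided(target, pred))


# Toy example: a "forecast" cloud and the cloud actually observed later.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
observed = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
score = chamfer_distance(predicted, observed)
print(score)  # 0.5
```

This brute-force version is quadratic in the number of points and is only suitable for small examples; real LiDAR-scale clouds would normally need a spatial index or a batched GPU implementation.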
