
MoST: Multi-modality Scene Tokenization for Motion Prediction (2404.19531v1)

Published 30 Apr 2024 in cs.CV

Abstract: Many existing motion prediction approaches rely on symbolic perception outputs to generate agent trajectories, such as bounding boxes, road graph information and traffic lights. This symbolic representation is a high-level abstraction of the real world, which may render the motion prediction model vulnerable to perception errors (e.g., failures in detecting open-vocabulary obstacles) while missing salient information from the scene context (e.g., poor road conditions). An alternative paradigm is end-to-end learning from raw sensors. However, this approach suffers from the lack of interpretability and requires significantly more training resources. In this work, we propose tokenizing the visual world into a compact set of scene elements and then leveraging pre-trained image foundation models and LiDAR neural networks to encode all the scene elements in an open-vocabulary manner. The image foundation model enables our scene tokens to encode the general knowledge of the open world while the LiDAR neural network encodes geometry information. Our proposed representation can efficiently encode the multi-frame multi-modality observations with a few hundred tokens and is compatible with most transformer-based architectures. To evaluate our method, we have augmented Waymo Open Motion Dataset with camera embeddings. Experiments over Waymo Open Motion Dataset show that our approach leads to significant performance improvements over the state-of-the-art.


Summary

  • The paper introduces scene tokenization by fusing multi-modal sensor data, enabling context-aware motion prediction.
  • It employs image foundation models and LiDAR networks to generate compact scene tokens for transformer-based motion forecasting.
  • Experiments on the Waymo Open Motion Dataset show a 10.3% soft-mAP boost and a 6.6% reduction in minADE over baseline approaches.

An Expert Overview of MoST: Multi-modality Scene Tokenization for Motion Prediction

The paper "MoST: Multi-modality Scene Tokenization for Motion Prediction" proposes a novel approach to enhancing motion prediction for autonomous systems through advanced sensor data representation. The authors, from Waymo LLC, present a methodology that tokenizes the visual scene elements from raw sensor inputs, combining insights from pre-trained image models and LiDAR data to improve motion prediction accuracy and robustness.

Methodological Innovation

Traditionally, motion prediction models in autonomous systems rely on symbolic perception outputs such as bounding boxes and road graphs. These symbolic outputs have drawbacks, including limited context sensitivity and vulnerability to perception errors. As an alternative, the authors introduce scene tokenization, which efficiently combines symbolic representations with multi-modality data from raw sensors.

The MoST approach employs image foundation models to extract general knowledge of the visual world and LiDAR neural networks to capture scene geometry. These features are encoded into a compact set of scene tokens, multi-modality representations suitable for transformer-based architectures. This tokenization supports open-vocabulary scene understanding, allowing the model to account for previously unrecognized objects and conditions, such as open-vocabulary obstacles or poor road conditions.
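To make the fusion idea concrete, the sketch below shows one plausible way scene tokens could be assembled: per-element image-foundation-model embeddings and LiDAR geometry features are projected to a shared width and concatenated into a token sequence consumed by a standard transformer encoder. The module names, dimensions, and concatenation-based fusion are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SceneTokenizerSketch(nn.Module):
    """Hypothetical fusion of image and LiDAR features into scene tokens."""

    def __init__(self, img_dim=768, lidar_dim=256, token_dim=256, num_layers=2):
        super().__init__()
        # Project each modality's per-element feature into a shared token width.
        self.img_proj = nn.Linear(img_dim, token_dim)
        self.lidar_proj = nn.Linear(lidar_dim, token_dim)
        # Small transformer encoder operating on the fused token sequence.
        layer = nn.TransformerEncoderLayer(
            d_model=token_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, img_feats, lidar_feats):
        # img_feats:   (B, N_img, img_dim)   embeddings from a frozen image foundation model
        # lidar_feats: (B, N_pts, lidar_dim) pooled LiDAR features per scene element
        tokens = torch.cat(
            [self.img_proj(img_feats), self.lidar_proj(lidar_feats)], dim=1)
        return self.encoder(tokens)  # (B, N_img + N_pts, token_dim) scene tokens

# Toy usage: a few hundred tokens summarizing the multi-modality scene.
tokenizer = SceneTokenizerSketch()
scene_tokens = tokenizer(torch.randn(1, 200, 768), torch.randn(1, 100, 256))
print(scene_tokens.shape)  # torch.Size([1, 300, 256])
```

The resulting token sequence can then be consumed by any transformer-based motion forecaster alongside agent history and map features.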

Experimental Foundation

The authors evaluate on the Waymo Open Motion Dataset (WOMD) augmented with camera embeddings. The dataset combines LiDAR and camera data, establishing a benchmark in which one second of history is used to predict future trajectories.

Experimental results demonstrate notable improvements over state-of-the-art baselines: MoST improves soft mean Average Precision (soft-mAP) by 10.3% and reduces minimum Average Displacement Error (minADE) by 6.6%. These findings underscore the value of fusing multi-modality data into enriched scene tokens.
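For reference, minADE is conventionally computed as the average L2 displacement of the best of K predicted trajectories relative to the ground truth. The snippet below is a minimal sketch of that metric; shapes and variable names are illustrative and not tied to the WOMD evaluation code.

```python
import numpy as np

def min_ade(pred, gt):
    """Minimum Average Displacement Error.

    pred: (K, T, 2) K candidate future trajectories over T timesteps (x, y)
    gt:   (T, 2)    ground-truth future trajectory
    """
    # Per-candidate mean L2 distance to the ground truth, then keep the best candidate.
    dists = np.linalg.norm(pred - gt[None], axis=-1)  # (K, T)
    return dists.mean(axis=-1).min()

# Toy example: 6 candidate trajectories over an 8-step horizon.
pred = np.random.randn(6, 8, 2)
gt = np.zeros((8, 2))
print(min_ade(pred, gt))
```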

Implications and Future Directions

The practical implications of MoST are significant in the field of autonomous driving and robotics. By bridging the gap between raw sensor data and symbolic perception, MoST offers a refined framework that aligns well with the real-world complexities encountered by autonomous systems.

Looking ahead, the use of large pre-trained models for scene tokenization could extend to a range of autonomous-system applications, from robotics to advanced driver-assistance systems (ADAS). The adaptability of MoST might also help improve interpretability and robustness in other domains that depend on complex scene understanding.

Overall, the MoST method serves as a compelling alternative to current motion prediction standards, offering a structured yet multifaceted approach to improving the interaction of autonomous systems with their environments. As transformer-based frameworks continue to evolve, integrating such multi-modality scene tokenization strategies could become a key component in advancing the efficacy and safety of autonomous technologies.
