Trajeglish: Traffic Modeling as Next-Token Prediction (2312.04535v2)

Published 7 Dec 2023 in cs.LG and cs.RO

Abstract: A longstanding challenge for self-driving development is simulating dynamic driving scenarios seeded from recorded driving logs. In pursuit of this functionality, we apply tools from discrete sequence modeling to model how vehicles, pedestrians and cyclists interact in driving scenarios. Using a simple data-driven tokenization scheme, we discretize trajectories to centimeter-level resolution using a small vocabulary. We then model the multi-agent sequence of discrete motion tokens with a GPT-like encoder-decoder that is autoregressive in time and takes into account intra-timestep interaction between agents. Scenarios sampled from our model exhibit state-of-the-art realism; our model tops the Waymo Sim Agents Benchmark, surpassing prior work along the realism meta metric by 3.3% and along the interaction metric by 9.9%. We ablate our modeling choices in full autonomy and partial autonomy settings, and show that the representations learned by our model can quickly be adapted to improve performance on nuScenes. We additionally evaluate the scalability of our model with respect to parameter count and dataset size, and use density estimates from our model to quantify the saliency of context length and intra-timestep interaction for the traffic modeling task.


Summary

  • The paper introduces an autoregressive model built on a "k-disks" tokenization scheme that discretizes trajectories with roughly 1 cm expected error.
  • It employs a GPT-like transformer architecture to capture intra-timestep interactions among vehicles, pedestrians, and cyclists.
  • Experiments on the Waymo Sim Agents Benchmark show state-of-the-art improvements of 3.3% on the realism meta metric and 9.9% on the interaction metric, along with favorable scaling and transfer behavior.

Trajeglish: Learning the Language of Driving Scenarios

Introduction

The paper introduces "Trajeglish," a novel autoregressive model designed for simulating dynamic driving scenarios by imitating the interactions among various road users, including vehicles, pedestrians, and cyclists. The model leverages the principles from discrete sequence modeling, akin to those used in natural language processing, to produce highly realistic traffic simulations. This research aims to bridge a critical gap in self-driving technology by enhancing simulation environments for autonomous vehicles (AVs).

Methodology

Data-Driven Tokenization: Trajeglish employs a data-driven tokenization scheme termed "k-disks" to discretize driving trajectories to centimeter-level accuracy. The scheme uses a small vocabulary of 384 tokens, enabling faithful modeling of motion data from the Waymo Open Motion Dataset (WOMD).
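A minimal sketch of this style of tokenization (illustrative names, not the paper's implementation): given a fixed vocabulary of template per-step displacements, each raw step maps to the index of its nearest template, and a trajectory is reconstructed by accumulating the chosen templates.

```python
import numpy as np

def tokenize_trajectory(steps, vocab):
    """steps: (T, 2) per-timestep (dx, dy) displacements in meters.
    vocab: (V, 2) template displacements, e.g. V = 384.
    Returns (T,) token ids, each the nearest template to that step."""
    # Squared distance from every step to every template action.
    d2 = ((steps[:, None, :] - vocab[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def detokenize(tokens, vocab):
    """Reconstruct the trajectory by accumulating template displacements."""
    return np.cumsum(vocab[tokens], axis=0)
```

The discretization error is then the gap between the accumulated templates and the accumulated raw steps, which a well-chosen vocabulary keeps at the centimeter level.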

Transformer-Based Architecture: The model features a GPT-like transformer architecture that functions autoregressively in time. It also incorporates a mechanism to account for intra-timestep interaction between agents, thereby capturing the nuances in how road users influence each other's movement within a single timestep.
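One way to picture the factorization (a sketch assuming a fixed agent order and equal-length token histories, not the authors' exact scheme): tokens are flattened timestep-major, so within each timestep an agent's token is predicted conditioned on the tokens already emitted by earlier agents in that same timestep.

```python
def flatten_tokens(agent_tokens):
    """agent_tokens: list over agents of per-timestep token lists of
    equal length. Returns one flat sequence ordered timestep-major:
    at step t, agent i's token follows agents 0..i-1's tokens for the
    same step, so a causal model sees intra-timestep context."""
    num_steps = len(agent_tokens[0])
    seq = []
    for t in range(num_steps):
        for tokens in agent_tokens:
            seq.append(tokens[t])
    return seq
```

Under this ordering, standard causal attention suffices to condition each prediction on both past timesteps and same-timestep moves by earlier agents.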

Key Contributions and Results

The main contributions of the paper are:

  1. Tokenization Method: The k-disks approach for discretizing trajectory data achieves an expected discretization error of merely 1 cm, providing a granular and accurate representation of motion data.
  2. Transformer-Based Model: The proposed model conditions on map information and initial states of agents to produce a distribution over future actions. This enables dynamic interaction modeling that is particularly effective for simulating driving environments.
  3. State-of-the-Art Performance: When evaluated on the Waymo Sim Agents Benchmark, Trajeglish surpasses previous state-of-the-art models by 3.3% on the realism meta metric and by 9.9% on the interaction metric, demonstrating its ability to generate more realistic and interactive traffic scenarios.
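Since the model outputs a distribution over the discrete action vocabulary at each step, scenario generation reduces to repeatedly sampling one token. A hedged sketch of that sampling step (not the authors' code; a temperature parameter is a common knob for trading diversity against likelihood):

```python
import numpy as np

def sample_action(logits, temperature=1.0, rng=None):
    """Draw one motion token from unnormalized scores over the
    action vocabulary (e.g. 384-way). Lower temperature sharpens
    the distribution toward the most likely action."""
    rng = rng or np.random.default_rng()
    z = logits / temperature
    p = np.exp(z - z.max())  # subtract max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(p), p=p))
```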

Experimental Validation

The authors validate their model on multiple fronts:

  • Partial and Full Control Settings: Experiments demonstrate the robustness of Trajeglish in both full and partial control scenarios, where some agents are controlled by the model while the rest replay logged trajectories.
  • Scalability: The model's scaling behavior is evaluated with respect to parameter count and dataset size; performance improves with both, indicating that Trajeglish benefits from more extensive training data.
  • Transferability: The model's ability to generalize across different datasets is tested using the nuScenes dataset. Fine-tuning Trajeglish on nuScenes scenarios yields lower negative log-likelihood (NLL) compared to training a model from scratch, underscoring its adaptability.
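The partial-control setup can be sketched as follows (helper names are hypothetical): replayed agents emit their logged tokens, model-controlled agents sample from the model, and every token joins a shared history so later predictions condition on both kinds of agent.

```python
def rollout(model_sample, logged, controlled, num_steps):
    """model_sample(history) -> next token for a model-controlled agent.
    logged: {agent_id: [token per step]} replay tokens for every agent.
    controlled: set of agent ids driven by the model.
    Returns {agent_id: [token per step]} for the rolled-out scenario."""
    history = []
    out = {a: [] for a in logged}
    for t in range(num_steps):
        for a in sorted(logged):  # fixed intra-timestep agent order
            tok = model_sample(history) if a in controlled else logged[a][t]
            out[a].append(tok)
            history.append(tok)
    return out
```

In the full-control setting every agent id is in `controlled`; in replay-only evaluation the set is empty and the log is reproduced exactly.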

Implications and Future Directions

The practical implications of Trajeglish are noteworthy:

  • Improved Simulation Realism: By modeling interactions more accurately, Trajeglish can significantly enhance the quality of traffic simulations used for testing and developing self-driving systems.
  • Safety Enhancements: Enhanced simulation environments can potentially contribute to safer AV deployment by allowing for more rigorous and realistic testing.

From a theoretical perspective, Trajeglish pushes the boundary in modeling dynamic, multi-agent environments using discrete sequence techniques. The success of the k-disks tokenization approach also opens avenues for similar methods to be applied in other domains requiring fine-grained spatial and temporal modeling.

Future work may delve into further optimizing intra-timestep interactions, exploring larger datasets to fully leverage the model's scalability, and extending the model's application to more complex driving scenarios involving edge cases and rare events.

Conclusion

Trajeglish represents a sophisticated step forward in the domain of self-driving simulations. By combining granular tokenization with a robust autoregressive model, it achieves significant improvements in realism and interaction modeling. This work sets the stage for more advanced and safer simulation environments, critical for the development and deployment of autonomous vehicles.