- The paper introduces CarFormer, a novel method that leverages self-supervised slot attention and an autoregressive transformer to encode complex BEV scenes.
- The method boosts driving performance, achieving a Driving Score of 74.89 on the Longest6 benchmark, and exhibits lower variance across evaluation runs than the object-level PlanT baseline.
- Ablation studies show that block attention and increased slot numbers are crucial for enhancing both dynamic forecasting and overall driving accuracy.
CarFormer: Self-Driving with Learned Object-Centric Representations
Introduction
The task of urban driving poses significant challenges due to the complex interactions between various objects in a scene. Traditional self-driving systems rely on a variety of representation spaces including pixels, coordinates, stixels, and 3D points. Bird's Eye View (BEV) representations have gained prominence for their ability to provide a top-down summary of the scene, encapsulating essential elements such as lanes and vehicles.
This paper proposes a novel approach named CarFormer, which employs learned object-centric representations in BEV to distill complex scenes into actionable information for self-driving. The core innovation is representing the scene as a set of object-centric slots that capture spatial and temporal context about the objects without requiring their exact attributes to be provided explicitly. An autoregressive transformer is then trained on these slots to drive and to reason about future states.
Methodology
CarFormer's approach is two-fold: the extraction of object-centric representations and the training of an autoregressive transformer to interpret these representations for self-driving.
- Slot Extraction:
- A slot attention model for video, SAVi, groups the objects in BEV sequences into slots.
- Each slot implicitly captures spatial and temporal information such as an object's position, speed, and orientation.
- The process is self-supervised and does not require exact object attributes to be provided explicitly (a minimal slot-attention sketch appears after this list).
- Transformer-based Driving:
- The extracted slots serve as input to an autoregressive transformer.
- The transformer jointly learns to drive while predicting the dynamics of the other vehicles in the scene.
- Block attention replaces the typical causal attention mask so that tokens belonging to the same time step can interact, improving object-to-object and object-to-route interaction modeling (a mask sketch appears after this list).
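To make the slot-extraction step above concrete, below is a minimal sketch of a single slot-attention update in PyTorch. It illustrates the general slot-attention mechanism that SAVi builds on, not the paper's exact implementation; the module name, dimensions, and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class SlotAttentionStep(nn.Module):
    """One iterative slot-attention update (illustrative, not the paper's exact SAVi code)."""
    def __init__(self, slot_dim: int, input_dim: int):
        super().__init__()
        self.to_q = nn.Linear(slot_dim, slot_dim, bias=False)
        self.to_k = nn.Linear(input_dim, slot_dim, bias=False)
        self.to_v = nn.Linear(input_dim, slot_dim, bias=False)
        self.gru = nn.GRUCell(slot_dim, slot_dim)
        self.scale = slot_dim ** -0.5

    def forward(self, slots, inputs):
        # slots:  (B, K, D)    current slot states for one frame
        # inputs: (B, N, D_in) flattened BEV features for that frame
        q = self.to_q(slots)                          # (B, K, D)
        k, v = self.to_k(inputs), self.to_v(inputs)   # (B, N, D)
        attn = torch.softmax(self.scale * q @ k.transpose(1, 2), dim=1)  # slots compete for inputs
        attn = attn / attn.sum(dim=-1, keepdim=True)                     # weighted mean over inputs
        updates = attn @ v                            # (B, K, D)
        b, k_, d = slots.shape
        slots = self.gru(updates.reshape(b * k_, d), slots.reshape(b * k_, d)).reshape(b, k_, d)
        return slots

# Usage with illustrative shapes: 7 slots of dim 128 over a BEV grid flattened to 2500 cells.
step = SlotAttentionStep(slot_dim=128, input_dim=64)
slots = torch.randn(2, 7, 128)
bev_feats = torch.randn(2, 2500, 64)
slots = step(slots, bev_feats)  # in practice this update is iterated a few times per frame
```

In SAVi-style video models, the slots for each frame are typically initialized from the previous frame's slots via a learned predictor, which is what gives the slot representation its temporal consistency across the BEV sequence.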
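For the block attention mentioned in the last item above, here is a hedged sketch of how a block-causal mask could be constructed: tokens within the same time step (for instance, a route token plus the slot tokens of one frame) attend to each other and to all earlier steps, but never to future steps. The token layout and the helper `block_causal_mask` are illustrative assumptions, not the paper's exact design.

```python
import torch

def block_causal_mask(block_sizes):
    """Boolean attention mask where tokens in the same block attend to each other
    and to all earlier blocks, but never to later blocks. `block_sizes` lists how
    many tokens each time step contributes; True = attention allowed."""
    total = sum(block_sizes)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for size in block_sizes:
        end = start + size
        mask[start:end, :end] = True  # attend within own block and to everything before it
        start = end
    return mask

# Example: 3 time steps, each contributing 7 tokens (e.g. 1 route token + 6 slot tokens).
mask = block_causal_mask([7, 7, 7])
# A plain causal mask would instead be torch.tril(torch.ones(21, 21, dtype=torch.bool)),
# which prevents slot tokens of the same frame from attending to each other.
# `mask` can be converted to the expected convention and passed as the attention mask of
# torch.nn.functional.scaled_dot_product_attention or a transformer layer.
```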
Results and Evaluation
Driving Performance
The evaluation on the Longest6 benchmark demonstrates that CarFormer with slots significantly outperforms traditional scene-level and object-level approaches. Specifically:
- CarFormer achieves a Driving Score (DS) of 74.89, outperforming scene-level methods like AIM-BEV and ROACH, which score 17.07 and 55.27, respectively.
- Compared to the object-level PlanT, CarFormer exhibits lower variance in Driving Score, indicating more consistent performance across evaluation runs.
Forecasting Future States
A pivotal aspect of CarFormer is its capability to forecast future slots, serving both as a policy predictor and as a world model:
- Compared to a naive copy-the-input baseline, CarFormer achieves substantially higher Adjusted Rand Index (ARI) and mean Intersection over Union (mIoU) at future time steps (a possible evaluation sketch follows this list).
- This forecasting ability underscores the model’s proficiency in capturing and predicting the dynamic interactions between objects in urban driving scenes.
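As a rough illustration of how forecasted slots might be scored, the sketch below computes per-frame ARI (via scikit-learn) and mIoU between predicted and ground-truth BEV maps. The exact evaluation protocol in the paper (for example, whether background cells are excluded, as in foreground-only ARI) may differ; the function names and inputs are assumptions.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def ari_for_frame(pred_slot_map, gt_instance_map):
    """ARI between a predicted slot assignment and ground-truth object ids for one
    BEV frame. Both inputs are (H, W) integer maps. Illustrative only: the paper's
    exact protocol (e.g. foreground-only ARI) may differ."""
    return adjusted_rand_score(gt_instance_map.ravel(), pred_slot_map.ravel())

def miou_for_frame(pred_mask, gt_mask, num_classes):
    """Mean IoU over classes for one BEV frame of semantic labels."""
    ious = []
    for c in range(num_classes):
        pred_c, gt_c = pred_mask == c, gt_mask == c
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # class absent in both prediction and ground truth; skip it
        ious.append(np.logical_and(pred_c, gt_c).sum() / union)
    return float(np.mean(ious)) if ious else 0.0

# A copy-the-input baseline would simply reuse the last observed frame as the
# "forecast" for every future step and be scored with the same two functions.
```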
Ablation Studies
Key design choices were scrutinized through ablation studies:
- Block Attention: Removing block attention substantially degraded performance, underscoring its importance in modeling dynamics between all scene objects.
- Slot Number and Enlarging Small Objects: Increasing the number of slots to 30 and enlarging small objects in the BEV grid both yield noteworthy gains in forecasting accuracy and Driving Score (a possible preprocessing sketch follows below).
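The "enlarging small objects" step could, for example, dilate instances that occupy only a few BEV cells before slot extraction so that they cover enough of the grid to be captured by a slot. The sketch below is one plausible way to do this; the cell threshold, dilation settings, and function name are assumptions, not the paper's exact procedure.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def enlarge_small_objects(instance_map, min_cells=20, iterations=1):
    """Dilate instances that occupy fewer than `min_cells` BEV cells so that small
    objects cover enough of the grid to be picked up by a slot. `instance_map` is an
    (H, W) integer grid with 0 = background; thresholds are illustrative."""
    out = instance_map.copy()
    for obj_id in np.unique(instance_map):
        if obj_id == 0:
            continue  # skip background
        mask = instance_map == obj_id
        if mask.sum() < min_cells:
            grown = binary_dilation(mask, iterations=iterations)
            out[np.logical_and(grown, out == 0)] = obj_id  # grow only into free background cells
    return out
```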
Implications and Future Work
The implications of this work are multi-faceted:
- Practical: The robustness and accuracy achieved by incorporating object-centric slots can improve the reliability and safety of self-driving systems, highlighting the benefits of moving from explicit object attributes to more holistic, learned representations.
- Theoretical: CarFormer paves the way for future research in self-driving by integrating object-centric learning and sequence modeling. The promise shown in forecasting future states opens avenues for self-supervised learning and potentially more advanced multi-step reasoning frameworks.
For future work, advancements in extracting accurate BEV representations from raw sensor data would complement the CarFormer framework, mitigating the reliance on ground truth BEV. Additionally, extending the architecture to accommodate multi-step forecasting with reinforcement learning could further enhance decision-making capabilities in complex driving scenarios.
Conclusion
CarFormer marks a significant stride in self-driving technology, with its novel use of object-centric slots and autoregressive transformers. By encoding complex spatio-temporal relationships into a learned representation, it achieves superior performance and robustness compared to existing methods. This work not only addresses current challenges but also sets the stage for future innovations in autonomous driving research.