- The paper introduces CarFormer, a novel method that leverages self-supervised slot attention and an autoregressive transformer to encode complex BEV scenes.
- The method boosts driving performance, achieving a Driving Score of 74.89 on the Longest6 benchmark, and exhibits lower variance across evaluation runs than the object-level PlanT baseline.
- Ablation studies show that block attention and increased slot numbers are crucial for enhancing both dynamic forecasting and overall driving accuracy.
CarFormer: Self-Driving with Learned Object-Centric Representations
Introduction
The task of urban driving poses significant challenges due to the complex interactions between various objects in a scene. Traditional self-driving systems rely on a variety of representation spaces including pixels, coordinates, stixels, and 3D points. Bird's Eye View (BEV) representations have gained prominence for their ability to provide a top-down summary of the scene, encapsulating essential elements such as lanes and vehicles.
This paper proposes a novel approach named CarFormer, which employs learned object-centric representations in BEV to distill complex scenes into actionable information for self-driving. The core innovation is representing the scene as a set of object-centric slots that capture spatial and temporal context about the objects without requiring their exact attributes to be provided explicitly. An autoregressive transformer is then trained on these slots to drive and to reason about future states.
Methodology
CarFormer's approach is two-fold: the extraction of object-centric representations and the training of an autoregressive transformer to interpret these representations for self-driving.
- Slot Extraction:
- A slot attention model for video, SAVi, groups the objects in BEV sequences into slots.
- Each slot implicitly captures spatial and temporal information such as an object's position, speed, and orientation.
- The process is self-supervised and does not require exact object attributes to be provided explicitly (a minimal slot-attention sketch appears after this list).
- Transformer-based Driving:
- The extracted slots serve as input to an autoregressive transformer.
- The transformer jointly learns to drive while predicting the dynamics of the other vehicles in the scene.
- Block attention replaces the typical causal attention mask so that tokens belonging to the same time step can interact, improving object-to-object and object-to-route interaction modeling (a mask sketch appears after this list).
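To make the slot-extraction step above concrete, below is a minimal sketch of a single slot-attention update in PyTorch. It illustrates the general slot-attention mechanism that SAVi builds on, not the paper's exact implementation; the module name, dimensions, and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class SlotAttentionStep(nn.Module):
    """One iterative slot-attention update (illustrative, not the paper's exact SAVi code)."""
    def __init__(self, slot_dim: int, input_dim: int):
        super().__init__()
        self.to_q = nn.Linear(slot_dim, slot_dim, bias=False)
        self.to_k = nn.Linear(input_dim, slot_dim, bias=False)
        self.to_v = nn.Linear(input_dim, slot_dim, bias=False)
        self.gru = nn.GRUCell(slot_dim, slot_dim)
        self.scale = slot_dim ** -0.5

    def forward(self, slots, inputs):
        # slots:  (B, K, D)    current slot states for one frame
        # inputs: (B, N, D_in) flattened BEV features for that frame
        q = self.to_q(slots)                          # (B, K, D)
        k, v = self.to_k(inputs), self.to_v(inputs)   # (B, N, D)
        attn = torch.softmax(self.scale * q @ k.transpose(1, 2), dim=1)  # slots compete for inputs
        attn = attn / attn.sum(dim=-1, keepdim=True)                     # weighted mean over inputs
        updates = attn @ v                            # (B, K, D)
        b, k_, d = slots.shape
        slots = self.gru(updates.reshape(b * k_, d), slots.reshape(b * k_, d)).reshape(b, k_, d)
        return slots

# Usage with illustrative shapes: 7 slots of dim 128 over a BEV grid flattened to 2500 cells.
step = SlotAttentionStep(slot_dim=128, input_dim=64)
slots = torch.randn(2, 7, 128)
bev_feats = torch.randn(2, 2500, 64)
slots = step(slots, bev_feats)  # in practice this update is iterated a few times per frame
```

In SAVi-style video models, the slots for each frame are typically initialized from the previous frame's slots via a learned predictor, which is what gives the slot representation its temporal consistency across the BEV sequence.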
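For the block attention mentioned in the last item above, here is a hedged sketch of how a block-causal mask could be constructed: tokens within the same time step (for instance, a route token plus the slot tokens of one frame) attend to each other and to all earlier steps, but never to future steps. The token layout and the helper `block_causal_mask` are illustrative assumptions, not the paper's exact design.

```python
import torch

def block_causal_mask(block_sizes):
    """Boolean attention mask where tokens in the same block attend to each other
    and to all earlier blocks, but never to later blocks. `block_sizes` lists how
    many tokens each time step contributes; True = attention allowed."""
    total = sum(block_sizes)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for size in block_sizes:
        end = start + size
        mask[start:end, :end] = True  # attend within own block and to everything before it
        start = end
    return mask

# Example: 3 time steps, each contributing 7 tokens (e.g. 1 route token + 6 slot tokens).
mask = block_causal_mask([7, 7, 7])
# A plain causal mask would instead be torch.tril(torch.ones(21, 21, dtype=torch.bool)),
# which prevents slot tokens of the same frame from attending to each other.
# `mask` can be converted to the expected convention and passed as the attention mask of
# torch.nn.functional.scaled_dot_product_attention or a transformer layer.
```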
Results and Evaluation
Driving Performance
The evaluation on the Longest6 benchmark demonstrates that CarFormer with slots significantly outperforms traditional scene-level and object-level approaches. Specifically:
- CarFormer achieves a Driving Score (DS) of 74.89, outperforming scene-level methods like AIM-BEV and ROACH, which score 17.07 and 55.27, respectively.
- Compared to the object-level PlanT, CarFormer exhibits lower variance in Driving Score, indicating more consistent performance across evaluation runs.
Forecasting Future States
A pivotal aspect of CarFormer is its capability to forecast future slots, serving both as a policy predictor and as a world model:
- Compared to a naive copy-the-input baseline, CarFormer achieves substantially higher Adjusted Rand Index (ARI) and mean Intersection over Union (mIoU) at future time steps (a possible evaluation sketch follows this list).
- This forecasting ability underscores the model’s proficiency in capturing and predicting the dynamic interactions between objects in urban driving scenes.
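As a rough illustration of how forecasted slots might be scored, the sketch below computes per-frame ARI (via scikit-learn) and mIoU between predicted and ground-truth BEV maps. The exact evaluation protocol in the paper (for example, whether background cells are excluded, as in foreground-only ARI) may differ; the function names and inputs are assumptions.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def ari_for_frame(pred_slot_map, gt_instance_map):
    """ARI between a predicted slot assignment and ground-truth object ids for one
    BEV frame. Both inputs are (H, W) integer maps. Illustrative only: the paper's
    exact protocol (e.g. foreground-only ARI) may differ."""
    return adjusted_rand_score(gt_instance_map.ravel(), pred_slot_map.ravel())

def miou_for_frame(pred_mask, gt_mask, num_classes):
    """Mean IoU over classes for one BEV frame of semantic labels."""
    ious = []
    for c in range(num_classes):
        pred_c, gt_c = pred_mask == c, gt_mask == c
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:
            continue  # class absent in both prediction and ground truth; skip it
        ious.append(np.logical_and(pred_c, gt_c).sum() / union)
    return float(np.mean(ious)) if ious else 0.0

# A copy-the-input baseline would simply reuse the last observed frame as the
# "forecast" for every future step and be scored with the same two functions.
```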
Ablation Studies
Key design choices were scrutinized through ablation studies:
- Block Attention: Removing block attention substantially degraded performance, underscoring its importance in modeling dynamics between all scene objects.
- Slot Number and Enlarging Small Objects: Increasing the number of slots to 30 and enlarging small objects in the BEV grid both yield noteworthy gains in forecasting accuracy and Driving Score (a possible preprocessing sketch follows below).
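The "enlarging small objects" step could, for example, dilate instances that occupy only a few BEV cells before slot extraction so that they cover enough of the grid to be captured by a slot. The sketch below is one plausible way to do this; the cell threshold, dilation settings, and function name are assumptions, not the paper's exact procedure.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def enlarge_small_objects(instance_map, min_cells=20, iterations=1):
    """Dilate instances that occupy fewer than `min_cells` BEV cells so that small
    objects cover enough of the grid to be picked up by a slot. `instance_map` is an
    (H, W) integer grid with 0 = background; thresholds are illustrative."""
    out = instance_map.copy()
    for obj_id in np.unique(instance_map):
        if obj_id == 0:
            continue  # skip background
        mask = instance_map == obj_id
        if mask.sum() < min_cells:
            grown = binary_dilation(mask, iterations=iterations)
            out[np.logical_and(grown, out == 0)] = obj_id  # grow only into free background cells
    return out
```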
Implications and Future Work
The implications of this work are multi-faceted:
- Practical: The robustness and accuracy achieved by incorporating object-centric slots can improve the reliability and safety of self-driving systems, highlighting the benefits of moving from explicit object attributes to more holistic, learned representations.
- Theoretical: CarFormer paves the way for future research in self-driving by integrating object-centric learning and sequence modeling. The promise shown in forecasting future states opens avenues for self-supervised learning and potentially more advanced multi-step reasoning frameworks.
For future work, advancements in extracting accurate BEV representations from raw sensor data would complement the CarFormer framework, mitigating the reliance on ground truth BEV. Additionally, extending the architecture to accommodate multi-step forecasting with reinforcement learning could further enhance decision-making capabilities in complex driving scenarios.
Conclusion
CarFormer marks a significant stride in self-driving technology, with its novel use of object-centric slots and autoregressive transformers. By encoding complex spatio-temporal relationships into a learned representation, it achieves superior performance and robustness compared to existing methods. This work not only addresses current challenges but also sets the stage for future innovations in autonomous driving research.