Multimodal Motion Prediction with Stacked Transformers
The increasing deployment of autonomous vehicles demands reliable motion prediction for surrounding traffic. The paper "Multimodal Motion Prediction with Stacked Transformers" proposes mmTransformer, a model that addresses the challenge of predicting multiple plausible future trajectories for nearby vehicles. The authors improve both the diversity and the accuracy of trajectory predictions by combining a stacked transformer architecture with a region-based training strategy. This summary covers the key components, results, and implications of the work.
Background and Motivation
Accurate and diverse motion prediction is vital for autonomous driving systems to navigate safely and effectively. Traditional approaches fall into two families, probabilistic methods and proposal-based methods, and both have limitations: probabilistic methods often suffer from optimization instability and mode collapse, while proposal-based methods rely heavily on predefined anchors that may not capture the full multimodality of traffic scenarios. This motivates alternative architectures that handle such complexity directly.
Methodology
The mmTransformer framework is built on a stacked transformer architecture that models multimodality at the feature level by integrating multiple sources of contextual information. The model consists of three primary components, the Motion Extractor, the Map Aggregator, and the Social Constructor, each dedicated to a different facet of the context: past vehicle trajectories, road topology, and interactions with surrounding agents. This hierarchical design allows mmTransformer to learn diverse proposal features while mitigating biases inherent in previous approaches.
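To make the stage-wise design concrete, below is a minimal PyTorch sketch of how trajectory proposal queries could be refined by three stacked transformer stages, one per context source. The stage names follow the paper, but the tensor shapes, feature dimensions, learned proposal embeddings, and the trajectory and confidence heads are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class Stage(nn.Module):
    """One stacked stage: proposal queries cross-attend to one source of context."""
    def __init__(self, d_model=128, nhead=8, num_layers=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)

    def forward(self, proposals, context):
        # proposals: (B, K, d_model), context: (B, N, d_model)
        return self.decoder(tgt=proposals, memory=context)

class MMTransformerSketch(nn.Module):
    """Hypothetical stack: motion -> map -> social, followed by trajectory/score heads."""
    def __init__(self, d_model=128, num_proposals=6, horizon=30):
        super().__init__()
        # learned proposal embeddings, shared across scenes (an assumption)
        self.proposals = nn.Parameter(torch.randn(num_proposals, d_model))
        self.motion_extractor = Stage(d_model)    # past trajectory of the target agent
        self.map_aggregator = Stage(d_model)      # lane / road-topology features
        self.social_constructor = Stage(d_model)  # features of surrounding agents
        self.traj_head = nn.Linear(d_model, horizon * 2)  # (x, y) per future step
        self.score_head = nn.Linear(d_model, 1)           # confidence per proposal

    def forward(self, motion_feats, map_feats, social_feats):
        # each *_feats tensor: (B, N_i, d_model), already embedded upstream
        B = motion_feats.size(0)
        q = self.proposals.unsqueeze(0).expand(B, -1, -1)
        q = self.motion_extractor(q, motion_feats)
        q = self.map_aggregator(q, map_feats)
        q = self.social_constructor(q, social_feats)
        trajs = self.traj_head(q).reshape(B, q.size(1), -1, 2)  # (B, K, T, 2)
        scores = self.score_head(q).squeeze(-1)                 # (B, K)
        return trajs, scores
```

In this reading, each stage reuses the proposal features produced by the previous one, so context is folded in progressively rather than concatenated once up front, which is what lets the proposals diversify at the feature level.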
In addition, the paper introduces a Region-based Training Strategy (RTS). RTS divides the prediction space into several non-overlapping regions and trains different proposals to specialize in different regions, which keeps the proposals distinct from one another and strengthens the multimodality of the predicted trajectories.
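The following is a simplified, hypothetical sketch of how such a region-based loss might look: only proposals assigned to the region containing the ground-truth endpoint receive the regression loss, and a classification term pushes the confidence scores toward that region. The nearest-center partition, the grouping of proposals by index, and the specific loss terms are assumptions for illustration; the paper's exact partitioning and loss weighting may differ.

```python
import torch
import torch.nn.functional as F

def assign_region(endpoints, centers):
    """Nearest-center region assignment (hypothetical partition of the endpoint space).
    endpoints: (B, 2), centers: (R, 2) -> (B,) region index."""
    return torch.cdist(endpoints, centers).argmin(dim=-1)

def rts_loss(trajs, scores, gt_traj, centers, proposals_per_region):
    """Illustrative region-based training loss.
    trajs: (B, K, T, 2) predicted proposals, scores: (B, K) confidences,
    gt_traj: (B, T, 2) ground truth, centers: (R, 2) region centers,
    with K == R * proposals_per_region (assumed grouping by index)."""
    B, K, T, _ = trajs.shape
    gt_region = assign_region(gt_traj[:, -1], centers)                        # (B,)
    prop_region = torch.arange(K, device=trajs.device) // proposals_per_region
    in_region = prop_region.unsqueeze(0) == gt_region.unsqueeze(1)            # (B, K)

    # regression loss: only the best proposal inside the ground-truth region
    ade = (trajs - gt_traj.unsqueeze(1)).norm(dim=-1).mean(dim=-1)            # (B, K)
    reg_loss = ade.masked_fill(~in_region, float("inf")).min(dim=1).values.mean()

    # classification loss: confidence mass should land on in-region proposals
    target = in_region.float() / in_region.sum(dim=1, keepdim=True)
    cls_loss = F.kl_div(F.log_softmax(scores, dim=1), target, reduction="batchmean")
    return reg_loss + cls_loss
```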
Experimental Results
Empirical evaluations on the Argoverse dataset show that mmTransformer achieves state-of-the-art performance. It improves prediction coverage across varying traffic scenarios, as reflected in better scores on metrics such as Average Displacement Error (ADE), Final Displacement Error (FDE), and Miss Rate (MR). Using 36 trajectory proposals with RTS, the model attains an MR of only 9.2% on the Argoverse validation set, underscoring its ability to predict multiple plausible futures accurately. This performance is further validated by mmTransformer's leading position on the Argoverse leaderboard.
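For reference, these metrics are computed over the best of the K predicted trajectories per agent. The sketch below shows the standard definitions of minimum ADE, minimum FDE, and Miss Rate; the 2-meter endpoint threshold matches the Argoverse convention, while the tensor layout is an assumption.

```python
import torch

def multimodal_metrics(trajs, gt, miss_threshold=2.0):
    """Best-of-K displacement metrics (generic sketch).
    trajs: (B, K, T, 2) predicted proposals, gt: (B, T, 2) ground truth.
    A sample counts as a miss when even the best endpoint is off by more
    than miss_threshold meters (2 m is the Argoverse convention)."""
    dist = (trajs - gt.unsqueeze(1)).norm(dim=-1)   # (B, K, T) per-step error
    min_ade = dist.mean(dim=-1).min(dim=1).values   # best average error over K
    min_fde = dist[..., -1].min(dim=1).values       # best final-step error over K
    miss_rate = (min_fde > miss_threshold).float().mean()
    return min_ade.mean().item(), min_fde.mean().item(), miss_rate.item()
```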
Implications and Future Directions
The contributions of this research are twofold. Practically, the combination of stacked transformers and RTS lays the groundwork for more robust autonomous navigation systems capable of handling uncertainty in real-world environments. Theoretically, the paper extends the literature on attention mechanisms beyond language processing, demonstrating their efficacy in complex multimodal tasks such as motion prediction.
Future work could explore several avenues, such as investigating different partition schemes or optimizing the balance between proposal diversity and accuracy. Exploring the model's scalability to more general traffic scenarios, or its integration into broader autonomous driving frameworks, could further extend its applicability.
In conclusion, the mmTransformer framework represents a meaningful step toward better multimodal motion prediction for autonomous vehicles. By integrating contextual information through stacked transformers and enforcing proposal diversity with a region-based training strategy, this work presents a compelling approach to the nuanced challenges of autonomous vehicle navigation.