Real-Time Motion Prediction via Heterogeneous Polyline Transformer with Relative Pose Encoding

Published 19 Oct 2023 in cs.CV and cs.RO | (2310.12970v1)

Abstract: The real-world deployment of an autonomous driving system requires its components to run on-board and in real-time, including the motion prediction module that predicts the future trajectories of surrounding traffic participants. Existing agent-centric methods have demonstrated outstanding performance on public benchmarks. However, they suffer from high computational overhead and poor scalability as the number of agents to be predicted increases. To address this problem, we introduce the K-nearest neighbor attention with relative pose encoding (KNARPE), a novel attention mechanism allowing the pairwise-relative representation to be used by Transformers. Then, based on KNARPE we present the Heterogeneous Polyline Transformer with Relative pose encoding (HPTR), a hierarchical framework enabling asynchronous token update during the online inference. By sharing contexts among agents and reusing the unchanged contexts, our approach is as efficient as scene-centric methods, while performing on par with state-of-the-art agent-centric methods. Experiments on Waymo and Argoverse-2 datasets show that HPTR achieves superior performance among end-to-end methods that do not apply expensive post-processing or model ensembling. The code is available at https://github.com/zhejz/HPTR.

Summary

The paper introduces the Heterogeneous Polyline Transformer with Relative Pose Encoding (HPTR), built on the novel K-nearest Neighbor Attention with Relative Pose Encoding (KNARPE), to improve efficiency and scalability in real-time motion prediction for autonomous driving.
HPTR is computationally efficient: it predicts for up to 64 agents at 40 frames per second and reduces memory use and inference latency by 80%, while maintaining competitive accuracy (e.g., 0.4222 soft mAP on the Waymo validation set).
The HPTR architecture offers potential for deployment in real-time onboard systems due to its efficiency and provides a foundation for future enhancements like 3D scenarios and intent-based models.

The paper addresses real-time motion prediction for autonomous driving with a novel Heterogeneous Polyline Transformer with Relative Pose Encoding (HPTR). To operate safely and efficiently, an autonomous driving system must accurately predict the future trajectories of nearby vehicles, pedestrians, and other dynamic agents. Traditional agent-centric prediction methods are accurate but computationally expensive, and they scale poorly as the number of agents to be predicted grows. To address these limitations, the paper introduces K-nearest Neighbor Attention with Relative Pose Encoding (KNARPE), an attention mechanism that allows Transformers to use a pairwise-relative representation, improving the efficiency and scalability of motion prediction.

Key Contributions

KNARPE Attention Mechanism: The paper proposes KNARPE, which restricts each token's attention to its K nearest neighbors and represents inputs in a pairwise-relative coordinate system. Because only relative poses between tokens are encoded, the representation is invariant to rotation and translation of the global frame, which is critical for accurate motion prediction.
Hierarchical Transformer Architecture: HPTR uses a hierarchical Transformer framework that organizes polylines representing different scene elements (e.g., roads, vehicles) into heterogeneous tokens. This hierarchy enables asynchronous token updates during online inference, significantly reducing computation by reusing static map features and restricting interactions to geographically local neighborhoods.
Efficient Inference and Scalability: HPTR, built on KNARPE, can predict for up to 64 agents in real time at 40 frames per second while maintaining accuracy competitive with state-of-the-art agent-centric approaches. The summary reports an 80% reduction in memory use and inference latency compared to traditional agent-centric methods.
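The idea behind the first contribution can be sketched in a few lines of NumPy. This is a simplified, single-head illustration and not the authors' implementation: the function names, the weight matrices `w_q`, `w_k`, `w_v`, `w_rpe`, the choice of Euclidean distance for neighbor selection, and the way the relative pose encoding is added to the neighbor features are all assumptions made for this example.

```python
import numpy as np

def wrap_angle(a):
    """Wrap angles to the interval [-pi, pi)."""
    return (a + np.pi) % (2 * np.pi) - np.pi

def relative_pose(pose_q, pose_kv):
    """Pose of every key token expressed in every query token's local frame.

    pose_q: (Nq, 3), pose_kv: (Nk, 3); columns are x, y, heading.
    Returns an array of shape (Nq, Nk, 3).
    """
    dx = pose_kv[None, :, 0] - pose_q[:, None, 0]
    dy = pose_kv[None, :, 1] - pose_q[:, None, 1]
    c, s = np.cos(pose_q[:, None, 2]), np.sin(pose_q[:, None, 2])
    # Rotate the global offset by minus the query heading.
    x_rel = c * dx + s * dy
    y_rel = -s * dx + c * dy
    yaw_rel = wrap_angle(pose_kv[None, :, 2] - pose_q[:, None, 2])
    return np.stack([x_rel, y_rel, yaw_rel], axis=-1)

def knn_relative_attention(feat_q, feat_kv, pose_q, pose_kv, k,
                           w_q, w_k, w_v, w_rpe):
    """Single-head attention over the k nearest tokens (hypothetical sketch).

    feat_q: (Nq, D), feat_kv: (Nk, D); w_q, w_k, w_v: (D, D); w_rpe: (4, D).
    """
    rel = relative_pose(pose_q, pose_kv)                  # (Nq, Nk, 3)
    dist = np.linalg.norm(rel[..., :2], axis=-1)          # (Nq, Nk)
    idx = np.argsort(dist, axis=1)[:, :k]                 # (Nq, k)
    rows = np.arange(rel.shape[0])[:, None]
    rel_k = rel[rows, idx]                                # (Nq, k, 3)

    # Encode the relative pose; the heading enters through cos and sin.
    rpe_in = np.concatenate(
        [rel_k[..., :2], np.cos(rel_k[..., 2:]), np.sin(rel_k[..., 2:])],
        axis=-1,
    )                                                     # (Nq, k, 4)
    rpe = rpe_in @ w_rpe                                  # (Nq, k, D)

    q = feat_q @ w_q                                      # (Nq, D)
    neighbors = feat_kv[idx] + rpe                        # (Nq, k, D)
    keys = neighbors @ w_k
    values = neighbors @ w_v

    logits = np.einsum("qd,qkd->qk", q, keys) / np.sqrt(q.shape[-1])
    logits = logits - logits.max(axis=1, keepdims=True)   # stable softmax
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=1, keepdims=True)
    return np.einsum("qk,qkd->qd", attn, values)          # (Nq, D)
```

Since the function only ever sees poses relative to the query token, applying the same rotation and translation to every input pose should leave the output unchanged, which is the invariance property described above. Attention cost per query also depends on `k` rather than on the total number of tokens.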

Numerical Performance

The authors validate their approach on the Waymo Open Motion Dataset and the Argoverse-2 dataset, showing that HPTR performs strongly without model ensembling or expensive post-processing, both of which are common in this domain. HPTR achieves a soft mean Average Precision (soft mAP) of 0.4222 on the Waymo validation set. The summary also reports reduced miss rates and prediction errors, particularly in complex and interactive urban driving environments.
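For readers unfamiliar with the error measures mentioned above, the following is a minimal sketch of displacement-based metrics for multi-modal trajectory prediction. It is a simplification: the function name and the fixed `miss_threshold` value are illustrative assumptions, the official benchmark definitions differ in detail, and the soft mAP used by Waymo is not reproduced here.

```python
import numpy as np

def min_ade_fde_miss(pred, gt, miss_threshold=2.0):
    """Simplified multi-modal displacement metrics (illustrative only).

    pred: (M, T, 2) array of M candidate trajectories over T steps.
    gt:   (T, 2) ground-truth trajectory.
    miss_threshold: final-displacement threshold in meters (assumed value).
    Returns (min_ade, min_fde, missed).
    """
    err = np.linalg.norm(pred - gt[None], axis=-1)  # (M, T) per-step errors
    min_ade = float(err.mean(axis=1).min())         # best average error
    min_fde = float(err[:, -1].min())               # best final-step error
    missed = bool(min_fde > miss_threshold)         # no candidate ends close
    return min_ade, min_fde, missed
```

A miss rate is then the fraction of evaluated agents for which `missed` is true.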

Implications and Future Directions

HPTR's efficiency and performance highlight its potential application in real-time onboard systems deployed in autonomous vehicles, where computational resources are limited. The architecture's adaptability and the innovative use of relative pose encoding offer promising avenues for further optimization in motion forecasting.
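The saving from reusing unchanged context during online inference can be illustrated with a small caching sketch. It assumes, as the pairwise-relative representation suggests, that encoded map tokens do not depend on any particular agent's frame and can therefore be kept across frames. The class `CachedSceneEncoder`, the `map_id` key, and the two encoder callables are hypothetical names introduced for this example, not part of the HPTR code base.

```python
import numpy as np

def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps

class CachedSceneEncoder:
    """Re-encode the static map only when it changes (hypothetical sketch)."""

    def __init__(self, encode_map, encode_agents):
        self.encode_map = encode_map        # expensive, static context
        self.encode_agents = encode_agents  # cheap, runs every frame
        self._map_id = None
        self._map_tokens = None
        self.map_encoder_calls = 0

    def step(self, map_id, map_polylines, agent_states):
        # Refresh the cached map tokens only if a different map arrives.
        if map_id != self._map_id:
            self._map_tokens = self.encode_map(map_polylines)
            self._map_id = map_id
            self.map_encoder_calls += 1
        # Agent tokens are updated every frame against the cached map tokens.
        return self.encode_agents(agent_states, self._map_tokens)
```

At 40 frames per second, `frame_budget_ms(40)` gives 25 ms per frame, so keeping the map encoder off the per-frame path is one plausible way to stay within that budget.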

Looking forward, enhancements could include extending the relative pose encoding to three-dimensional scenarios and incorporating more sophisticated distance metrics that account for environmental and spatio-temporal factors intrinsic to traffic flow. Additionally, integrating goal-conditioned or intent-based prediction models could improve the model's multi-modality and the confidence estimates of its trajectory forecasts.

By addressing critical efficiency and scalability issues, HPTR represents an important step toward deploying real-time, accurate motion prediction systems within the autonomous driving stack, paving the way for safer and more reliable self-driving technology.
