Multimodal Interaction-aware Motion Prediction for Autonomous Street Crossing

Published 21 Aug 2018 in cs.RO and cs.CV | (1808.06887v5)

Abstract: For mobile robots navigating on sidewalks, it is essential to be able to safely cross street intersections. Most existing approaches rely on the recognition of the traffic light signal to make an informed crossing decision. Although these approaches have been crucial enablers for urban navigation, the capabilities of robots employing such approaches are still limited to navigating only on streets containing signalized intersections. In this paper, we address this challenge and propose a multimodal convolutional neural network framework to predict the safety of a street intersection for crossing. Our architecture consists of two subnetworks; an interaction-aware trajectory estimation stream IA-TCNN, that predicts the future states of all observed traffic participants in the scene, and a traffic light recognition stream AtteNet. Our IA-TCNN utilizes dilated causal convolutions to model the behavior of the observable dynamic agents in the scene without explicitly assigning priorities to the interactions among them. While AtteNet utilizes Squeeze-Excitation blocks to learn a content-aware mechanism for selecting the relevant features from the data, thereby improving the noise robustness. Learned representations from the traffic light recognition stream are fused with the estimated trajectories from the motion prediction stream to learn the crossing decision. Furthermore, we extend our previously introduced Freiburg Street Crossing dataset with sequences captured at different types of intersections, demonstrating complex interactions among the traffic participants. Extensive experimental evaluations on public benchmark datasets and our proposed dataset demonstrate that our network achieves state-of-the-art performance for each of the subtasks, as well as for the crossing safety prediction.

Abstract PDF Upgrade to Chat

Citations (49)

View on Semantic Scholar

Summary

The paper proposes a unified framework that fuses AtteNet’s robust traffic light recognition with IA-TCNN's dilated causal convolutions for interaction-aware trajectory prediction.
The motion prediction module achieves state-of-the-art average and final displacement error metrics on datasets such as ETH, UCY, and L-CAS.
Real-world experiments using the extended FSC dataset validate the system's capacity to improve decision-making for safe autonomous street crossing.

Multimodal Interaction-aware Motion Prediction for Autonomous Street Crossing

The paper presents a framework for enabling mobile robots to safely cross street intersections by predicting intersection safety through multimodal data integration. This work introduces a system composed of a traffic light recognition network and an interaction-aware motion prediction network, both of which are crucial for making informed decisions about crossing.

System Architecture

The proposed system architecture integrates two primary components: AtteNet for traffic light recognition and IA-TCNN for motion prediction.

Figure 1: Schematic representation of our proposed system for autonomous street crossing. Our approach is comprised of two main modules; a traffic light recognition network and an interaction-aware motion prediction network. Feature map outputs of both modules are utilized to predict the safety of the intersection for crossing.

AtteNet: This network is designed to recognize traffic light states robustly across varying illumination conditions. It incorporates Squeeze-Excitation (SE) blocks to emphasize relevant features and suppress irrelevant ones, enhancing feature map recalibration.
IA-TCNN: The architecture uses dilated causal convolutions to predict future states of dynamic agents (e.g., pedestrians, vehicles) without predefined interaction modeling. It provides simultaneous trajectory estimates leveraging all observed agents, optimizing prediction through an innovative sequence-to-sequence modeling method.

Figure 2: Illustration of the proposed network architecture for interaction-aware motion prediction. We propose two variants of our architecture: (a) IA-TCNN and (b) IA-DResTCNN. The legend enclosed in red dashed lines shows the constituting layers for each block.

The system fuses outputs from these networks to form a robust road-crossing predictor that considers traffic light status, observed motions, and predicted trajectories.

Motion Prediction

For motion prediction, IA-TCNN utilizes trajectory data to model complex agent interactions without explicit priority assignments. This is achieved through temporal convolutional layers with dilations, which enhance receptive field sizes without increasing network depth significantly.

The network predicts future trajectories over intervals longer than the observed trajectory, demonstrating superior predictive performance compared to recurrent neural network approaches such as LSTMs. This is verified across various datasets, including ETH, UCY, and L-CAS, where it achieves state-of-the-art metrics on average and final displacement errors.

Traffic Light Recognition

AtteNet, the proposed traffic light recognition network, is built upon ResNet-50 architecture but includes enhancements such as SE blocks and usage of ELUs for activation. These design choices contribute to higher robustness against noisy inputs and a significant improvement in classification accuracy. Evaluations on publicly available datasets (e.g., Nexar, Bosch) highlight its superiority over traditional networks like SqueezeNet and DenseNet, especially in precision and recall metrics across traffic light states.

Figure 3: Precision-Recall plots comparing the performance of various methods on the Nexar dataset. The proposed AtteNet outperforms the compared approaches for all traffic light states.

Crossing Safety Prediction

Predictive performance for intersection safety combines learned outputs from both subnetworks in a fusion mechanism. This fusion incorporates uncertainty modeling to bolster robustness against individual network prediction errors.

Real-world validation experiments executed on a robotic platform affirm the framework's viability for autonomous street crossing in urban environments. The extended Freiburg Street Crossing dataset, augmented with new sequences and annotations of varying intersection types, serves as an evaluation ground to further substantiate the findings.

Figure 4: Depiction of various crossing scenarios from the FSC dataset. Detected dynamic objects are represented by arrows, where magenta arrows signify objects moving towards the robot, and green arrows signify objects moving away from the robot.

Conclusion

This research advances the autonomous navigation capability of mobile robots by integrating perception for traffic dynamics and signal recognition into a unified decision-making model. It demonstrates strong performance across multiple challenging domains and provides a foundation for future systems seeking to operate safely and efficiently in mixed-traffic environments. Future work is suggested in enriching the spatial-awareness dimension of the model, particularly through semantic understanding and structural map integration.

Markdown