
End-to-end Learning of Driving Models from Large-scale Video Datasets (1612.01079v2)

Published 4 Dec 2016 in cs.CV

Abstract: Robust perception-action models should be learned from training data with diverse visual appearances and realistic behaviors, yet current approaches to deep visuomotor policy learning have been generally limited to in-situ models learned from a single vehicle or a simulation environment. We advocate learning a generic vehicle motion model from large scale crowd-sourced video data, and develop an end-to-end trainable architecture for learning to predict a distribution over future vehicle egomotion from instantaneous monocular camera observations and previous vehicle state. Our model incorporates a novel FCN-LSTM architecture, which can be learned from large-scale crowd-sourced vehicle action data, and leverages available scene segmentation side tasks to improve performance under a privileged learning paradigm.

Authors (4)
  1. Huazhe Xu (93 papers)
  2. Yang Gao (761 papers)
  3. Fisher Yu (104 papers)
  4. Trevor Darrell (324 papers)
Citations (797)

Summary

Insights into "End-to-end Learning of Driving Models from Large-scale Video Datasets"

The paper "End-to-end Learning of Driving Models from Large-scale Video Datasets" by Huazhe Xu, Yang Gao, Fisher Yu, and Trevor Darrell presents a novel approach to developing autonomous driving models. This paper leverages large-scale, crowd-sourced video datasets to train a deep learning architecture designed for real-world driving scenarios. The primary focus is on the end-to-end learning of visuomotor policies, specifically targeting the prediction of future vehicle movements from visual inputs and previous vehicle states.

Core Contributions

The paper makes four main contributions to the domain of autonomous driving:

  1. Generic Motion Models: The authors propose learning a generic vehicle motion model for visuomotor policies, independent of any vehicle-specific actuation mechanism. This is crucial for training models that generalize across different vehicles and driving environments.
  2. FCN-LSTM Architecture: They introduce a novel architecture that pairs a Fully Convolutional Network (FCN) for encoding visual input with a Long Short-Term Memory (LSTM) network for modeling temporal dependencies. This combination improves the model's ability to predict future vehicle motion.
  3. Privileged Training Paradigm: The paper employs a training paradigm that leverages side tasks, such as semantic segmentation, to improve the main task of motion prediction. This "privileged" approach enhances the model's learning efficiency and robustness.
  4. Large-Scale Dataset: The paper presents the Berkeley DeepDrive Video dataset (BDDV), comprising over 10,000 hours of driving video data. This dataset is made publicly available and serves as a valuable resource for the community.

Methodology and Experiments

Architecture and Data

The authors employ an FCN to encode visual information from video frames and an LSTM to capture temporal dependencies over previous states and actions. By integrating these two components, the model predicts future vehicle egomotion in both discrete and continuous action spaces.
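To make the dataflow concrete, here is a minimal PyTorch sketch of the FCN-LSTM idea. The toy convolutional encoder (the paper uses a dilated FCN backbone), the two-dimensional vehicle state, the hidden size, and the four-way discrete action head are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class FCNLSTM(nn.Module):
    """Minimal sketch of an FCN-LSTM driving model.

    A convolutional encoder summarizes each frame; an LSTM fuses the
    visual feature with the previous vehicle state and outputs logits
    over discrete future actions (straight, stop, left, right).
    """

    def __init__(self, state_dim=2, hidden_dim=64, num_actions=4):
        super().__init__()
        # Toy encoder standing in for the paper's dilated FCN.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pool -> one vector per frame
        )
        self.lstm = nn.LSTM(32 + state_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_actions)

    def forward(self, frames, states):
        # frames: (batch, time, 3, H, W); states: (batch, time, state_dim)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).flatten(1).view(b, t, -1)
        out, _ = self.lstm(torch.cat([feats, states], dim=-1))
        return self.head(out)  # per-timestep action logits

model = FCNLSTM()
frames = torch.randn(2, 8, 3, 96, 96)  # 2 clips of 8 frames each
states = torch.randn(2, 8, 2)          # e.g. previous speed and angular velocity
probs = model(frames, states).softmax(dim=-1)  # (2, 8, 4) action distribution
```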

The dataset plays a critical role in this research. The BDDV dataset's extensive and diverse nature, covering multiple cities, weather conditions, and driving scenarios, distinguishes it from other available datasets. This allows the model to be trained under varied conditions, improving its generalizability.

Evaluation Metrics

The model's performance is evaluated using two primary metrics: predictive perplexity and accuracy. Perplexity measures how much probability the model assigns to the ground-truth future actions (lower is better), while accuracy measures the fraction of predicted actions that match the ground truth.
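As a concrete reference, the sketch below computes both metrics from a table of predicted action probabilities, taking perplexity as the exponential of the mean negative log-likelihood of the ground-truth actions; the paper's exact normalization may differ.

```python
import numpy as np

def perplexity(probs, targets):
    """exp of the mean negative log-likelihood assigned to the
    ground-truth actions; lower is better."""
    nll = -np.log(probs[np.arange(len(targets)), targets] + 1e-12)
    return float(np.exp(nll.mean()))

def accuracy(probs, targets):
    """Fraction of steps where the argmax action matches ground truth."""
    return float((probs.argmax(axis=1) == targets).mean())

# Toy example: 3 timesteps, 4 discrete actions
# (straight, stop, left, right).
probs = np.array([[0.70, 0.10, 0.10, 0.10],
                  [0.20, 0.60, 0.10, 0.10],
                  [0.25, 0.25, 0.25, 0.25]])
targets = np.array([0, 1, 2])
print(perplexity(probs, targets))  # ~2.12
print(accuracy(probs, targets))    # ~0.67
```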

Results and Implications

The results indicate that the proposed FCN-LSTM model can effectively predict future vehicle motions under diverse driving conditions. Notably, the experiments on discrete actions (e.g., going straight, stopping, turning left or turning right) showcase the model's ability to learn complex driving behaviors, such as responding to traffic lights and maintaining safe distances.
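For the continuous side of the problem, one common device is to quantize continuous egomotion, such as a steering or heading angle, into bins and predict a softmax over them; the bin edges below are invented for illustration and are not the paper's.

```python
import numpy as np

# Hypothetical bin edges (degrees): hard left ... hard right.
BIN_EDGES = np.array([-90.0, -30.0, -10.0, 10.0, 30.0, 90.0])

def discretize_angle(angle_deg):
    """Map a continuous angle to the index of its discrete bin."""
    return int(np.digitize(angle_deg, BIN_EDGES))

print(discretize_angle(-45.0))  # 1 -> a "hard left" bin
print(discretize_angle(2.0))    # 3 -> the near-straight bin
```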

Incorporating privileged information, particularly semantic segmentation, significantly enhances performance. The privileged training paradigm demonstrates efficiency in learning, especially when labeled training data is limited. This suggests that integrating auxiliary tasks can help in capturing fine-grained details essential for accurate motion prediction.
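A minimal sketch of how such a side task can enter training, assuming a shared encoder feeds both an action head and a hypothetical segmentation head; the 0.1 weighting is an arbitrary illustrative hyperparameter, and the side term is dropped at test time.

```python
import torch
import torch.nn.functional as F

def privileged_loss(action_logits, action_targets,
                    seg_logits, seg_targets, seg_weight=0.1):
    """Main motion-prediction loss plus a semantic-segmentation
    side loss; only the action head is used at inference."""
    motion = F.cross_entropy(action_logits, action_targets)
    side = F.cross_entropy(seg_logits, seg_targets)
    return motion + seg_weight * side

# Toy shapes: batch of 4, 4 actions; 8x8 segmentation maps, 19 classes.
action_logits = torch.randn(4, 4, requires_grad=True)
action_targets = torch.randint(0, 4, (4,))
seg_logits = torch.randn(4, 19, 8, 8, requires_grad=True)
seg_targets = torch.randint(0, 19, (4, 8, 8))

loss = privileged_loss(action_logits, action_targets, seg_logits, seg_targets)
loss.backward()  # in a real model the gradient shapes the shared FCN encoder
```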

Theoretical and Practical Implications

Theoretical

The research advances the understanding of end-to-end learning for autonomous driving by demonstrating that models can benefit from large-scale, uncalibrated video data. The FCN-LSTM architecture sets a precedent for future studies aiming to combine spatial and temporal features for predictive tasks in dynamic environments.

Practical

From a practical standpoint, the ability to predict future egomotion with high accuracy has significant implications for real-world autonomous driving systems. The model's generalizability across different vehicles and conditions could accelerate the deployment of autonomous driving technologies in varied environments. Furthermore, the BDDV dataset provides a rich resource for future research, facilitating the development and validation of more robust driving models.

Future Directions

The paper outlines several potential avenues for future work, including extending the model to control real vehicles and improving coverage of regions of the policy space that are underrepresented in the demonstrations. Addressing these challenges will require further exploration of reinforcement learning techniques and the integration of richer sensory inputs.

Conclusion

The work by Xu et al. marks a substantial step toward developing more generalized and robust autonomous driving models. The combination of large-scale data, innovative architecture, and privileged learning paves the way for future advancements in the field. Continued research in this direction promises to enhance the safety and efficiency of autonomous vehicles, ultimately leading to broader adoption and deployment of autonomous driving technologies.