Direct Perception for Autonomous Driving: An Intermediate Representation Approach
The paper "DeepDriving: Learning Affordance for Direct Perception in Autonomous Driving" proposes a third paradigm for vision-based autonomous driving that bridges the gap between mediated perception and behavior reflex approaches. The methodology, termed "direct perception," maps an input image to a small set of key perception indicators that directly relate to the affordance of driving, thereby simplifying the decision-making process for vehicle control.
Existing Paradigms in Autonomous Driving
The two major paradigms within vision-based autonomous driving systems are:
- Mediated Perception Approaches: These systems parse an entire scene to reconstruct a semantic 3D world before making driving decisions. While these systems can achieve a high-level understanding of the environment, they introduce unnecessary complexity by requiring solutions to multiple challenging vision tasks.
- Behavior Reflex Approaches: These methods directly map sensory inputs to driving actions. Despite their elegance, they can struggle in traffic or during complex maneuvers, because the problem becomes ill-posed when multiple plausible actions exist for the same scene.
Proposed Direct Perception Approach
The direct perception approach introduced in this paper aims to estimate the affordance for driving actions by mapping an input image to specific perception indicators. This approach strikes a balance between the complexity of mediated perception and the simplicity of behavior reflex by encoding the scene into a compact, yet comprehensive set of indicators.
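The distinction between the three paradigms can be sketched as function signatures. The following Python sketch is purely illustrative: all helper logic and values are stand-in stubs, not from the paper.

```python
from typing import Dict, List, Tuple

# Toy stand-ins; a real system would use image tensors and control commands.
Image = List[List[float]]      # placeholder for a camera frame
Action = Tuple[float, float]   # (steering, throttle), illustrative only

def detect_everything(image: Image) -> Dict[str, float]:
    """Stub for full scene parsing (lanes, cars, signs, free space...)."""
    return {"lane_offset": 0.2, "lead_car_dist": 30.0}

def plan_from_scene(scene: Dict[str, float]) -> Action:
    """Stub planner operating on the reconstructed scene."""
    return (-scene["lane_offset"], 0.5)

def mediated_perception(image: Image) -> Action:
    # Reconstruct a semantic 3D world first, then plan on top of it.
    return plan_from_scene(detect_everything(image))

def behavior_reflex(image: Image) -> Action:
    # Map pixels directly to an action; ill-posed when several
    # actions are plausible for the same input.
    return (0.0, 0.5)

def direct_perception(image: Image) -> Action:
    # Map pixels to a few affordance indicators, then let a
    # simple controller act on them.
    affordances = {"angle": 0.1, "to_lane_center": 0.2}
    steering = -(affordances["angle"] + affordances["to_lane_center"])
    return (steering, 0.5)
```

The middle paradigm carries far more state than is needed to drive, while the last compresses the scene into only what the controller consumes.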
Key components of the direct perception model include:
- Affordance Indicators: These indicators encompass the car's relative angle to the road, distances to lane markings, and distances to other vehicles. The representation provides sufficient information to allow a simple controller to make driving decisions.
- Convolutional Neural Network (ConvNet): A deep ConvNet is employed to learn the mapping from input images to affordance indicators, trained on 12 hours of driving data from a car racing video game (TORCS).
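To illustrate how a simple controller can act on such indicators, here is a minimal lane-keeping sketch in Python. The field names, gains, and control rule are hypothetical; the paper's actual indicator set and controller are more elaborate.

```python
from dataclasses import dataclass

@dataclass
class Affordances:
    """A compact set of driving indicators, loosely modeled on the
    paper's description: heading angle, lane-marking distances, and
    distance to the preceding car. Field names are illustrative."""
    angle: float             # angle between car heading and road tangent (rad)
    to_marking_left: float   # distance to left lane marking (m)
    to_marking_right: float  # distance to right lane marking (m)
    dist_preceding: float    # distance to the car ahead in lane (m)

def steer(aff: Affordances, gain: float = 0.5) -> float:
    """Steer toward the lane center by combining the heading error
    with the lateral offset from the lane center."""
    lane_width = aff.to_marking_left + aff.to_marking_right
    # Positive offset means the car sits left of the lane center.
    lateral_offset = (aff.to_marking_right - aff.to_marking_left) / lane_width
    return gain * (lateral_offset - aff.angle)

def throttle(aff: Affordances, safe_gap: float = 30.0) -> float:
    """Reduce throttle proportionally when the preceding car is
    closer than a safe following distance."""
    return min(1.0, aff.dist_preceding / safe_gap)
```

A car centered in its lane with zero heading error (e.g. `Affordances(0.0, 2.0, 2.0, 60.0)`) yields zero steering and full throttle; the point is that once the ConvNet supplies these few numbers, no further scene understanding is needed.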
Evaluation and Performance
The efficacy of the direct perception approach is demonstrated across various virtual environments and real-world datasets, including the KITTI dataset. The results indicate that the proposed model generalizes well to real driving scenarios. Specifically:
- TORCS Evaluation: The system drives autonomously in diverse simulated environments, with reliable lane and car perception. The accuracy of the affordance indicators is evaluated on a set of tracks and cars held out from the training set.
- Real-world Testing: Testing on real driving videos and the KITTI dataset showed that the ConvNet-based system could predict the distance to preceding vehicles with a mean absolute error comparable to state-of-the-art methods.
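The mean absolute error used in this distance comparison is straightforward to compute; the sketch below uses made-up placeholder distances, not the paper's data.

```python
def mean_absolute_error(predicted, ground_truth):
    """MAE between predicted and ground-truth distances (meters)."""
    assert len(predicted) == len(ground_truth)
    return sum(abs(p - g) for p, g in zip(predicted, ground_truth)) / len(predicted)

# Placeholder distances to preceding vehicles (m); not from the paper.
pred = [12.0, 25.0, 40.0]
truth = [11.0, 26.0, 38.0]
mae = mean_absolute_error(pred, truth)  # average per-vehicle error in meters
```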
Comparative Analysis
Comparative studies included several baseline methods, such as behavior reflex systems and mediated perception approaches based on the DPM (Deformable Part Model) car detector. The direct perception ConvNet demonstrated superior or comparable performance, especially in estimating distances to nearby vehicles, without requiring complex intermediate representations.
Implications and Future Work
The introduction of the direct perception paradigm represents a significant advancement in simplifying the architecture of autonomous driving systems. This compact scene representation, which is directly tied to actionable driving indicators, can potentially lead to more efficient and cost-effective autonomous vehicles.
The research suggests potential future work in the enhancement and generalization of the direct perception model, such as:
- Extending the Training Dataset: Acquiring more diverse real-world driving data to further improve the robustness and accuracy of affordance predictions.
- Exploring Additional Indicators: Expanding the set of affordance indicators to include more complex driving scenarios and maneuvers.
Conclusion
The paper presents a compelling case for direct perception as a middle ground between mediated perception and behavior reflex. By focusing on the few perception indicators relevant to driving actions, this methodology simplifies the control process, potentially paving the way for more capable and scalable autonomous driving systems.