- The paper introduces ADAPT, a transformer-based framework that jointly generates driving captions and predicts control signals to improve interpretability.
- It employs a dual-task architecture integrating a Video Swin Transformer for encoding and a vision-language transformer for sequential narration.
- Evaluations on the BDD-X dataset show that ADAPT outperforms prior methods on key captioning metrics such as CIDEr and BLEU.
The paper "ADAPT: Action-aware Driving Caption Transformer" introduces a novel framework aimed at enhancing the interpretability of autonomous driving systems through natural language explanations of driving actions. This architecture addresses transparency issues in autonomous systems by generating user-friendly narrations that describe vehicular actions and offer reasoning for these actions.
Motivation and Problem Statement
The deployment of autonomous vehicles is often hindered by their opaque decision-making processes. Existing interpretability aids, such as attention maps, are not readily comprehensible to everyday users. ADAPT addresses this gap by combining end-to-end learning with natural language generation to explain vehicular actions in real time.
Methodology
ADAPT employs a transformer-based, dual-task framework that integrates driving caption generation (DCG) and control signal prediction (CSP). A shared video representation forms the basis for both tasks (a minimal sketch follows the list below):
- Video Encoder: A Video Swin Transformer encodes raw video frames into a sequence of feature tokens.
- Text Generation Head: A vision-language transformer generates the action narration and its reasoning. Tokens are produced autoregressively, each conditioned on previous outputs, and attention between the narration and reasoning segments keeps the explanation grounded in the described action.
- Control Signal Prediction Head: A motion transformer predicts control signals (e.g., speed and course) from the same video tokens.
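The listing below is a minimal PyTorch sketch of this shared-encoder, dual-head layout, not the released ADAPT code: the class names, feature dimensions, and layer counts are illustrative assumptions, the Video Swin backbone is replaced by a trivial pooling stub, and a generic transformer decoder stands in for the paper's vision-language transformer.

```python
# Illustrative sketch of a shared video encoder feeding two task heads.
# Names and hyperparameters are assumptions, not the authors' implementation.
import torch
import torch.nn as nn


class SharedVideoEncoder(nn.Module):
    """Stand-in for the Video Swin Transformer: maps frames to feature tokens."""

    def __init__(self, feat_dim: int = 768):
        super().__init__()
        # Trivial stub: a real implementation would wrap a Video Swin backbone.
        self.proj = nn.Linear(3, feat_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, channels, height, width)
        pooled = frames.mean(dim=(-2, -1))   # crude global average pool per frame
        return self.proj(pooled)             # (batch, num_frames, feat_dim)


class CaptionHead(nn.Module):
    """Generic text decoder standing in for the vision-language transformer."""

    def __init__(self, feat_dim: int = 768, vocab_size: int = 30522):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, feat_dim)
        layer = nn.TransformerDecoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(feat_dim, vocab_size)

    def forward(self, video_tokens: torch.Tensor, caption_ids: torch.Tensor) -> torch.Tensor:
        seq_len = caption_ids.size(1)
        # Causal mask so each caption token attends only to earlier tokens.
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        hidden = self.decoder(self.embed(caption_ids), video_tokens, tgt_mask=causal)
        return self.lm_head(hidden)          # (batch, seq_len, vocab_size) logits


class ControlSignalHead(nn.Module):
    """Motion-transformer-style head: regresses per-frame control signals."""

    def __init__(self, feat_dim: int = 768, num_signals: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.regressor = nn.Linear(feat_dim, num_signals)   # e.g. speed and course

    def forward(self, video_tokens: torch.Tensor) -> torch.Tensor:
        return self.regressor(self.encoder(video_tokens))   # (batch, num_frames, num_signals)


# Both heads consume the same tokens produced by the shared encoder.
encoder, cap_head, ctrl_head = SharedVideoEncoder(), CaptionHead(), ControlSignalHead()
frames = torch.randn(2, 8, 3, 224, 224)           # dummy 2-clip batch, 8 frames each
caption_ids = torch.randint(0, 30522, (2, 12))    # dummy caption token ids
tokens = encoder(frames)
caption_logits = cap_head(tokens, caption_ids)    # (2, 12, 30522)
signal_preds = ctrl_head(tokens)                  # (2, 8, 2)
```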
The key contribution is aligning the two tasks at the semantic level through the shared video representation: joint training improves the model's performance on both.
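Because both heads are trained jointly on the shared representation, the overall objective can be summarized as a weighted sum of a caption cross-entropy term and a control-signal regression term. The sketch below assumes mean-squared error for the control head and a single weighting coefficient `lambda_ctrl`; both are illustrative choices, not values taken from the paper.

```python
# Hedged sketch of a joint training objective: caption cross-entropy plus a
# weighted control-signal regression term. `lambda_ctrl` is an illustrative
# hyperparameter, not a value from the paper.
import torch
import torch.nn.functional as F


def joint_loss(caption_logits: torch.Tensor,    # (batch, seq, vocab)
               caption_targets: torch.Tensor,   # (batch, seq) token ids, -100 = ignore
               signal_preds: torch.Tensor,      # (batch, frames, num_signals)
               signal_targets: torch.Tensor,    # (batch, frames, num_signals)
               lambda_ctrl: float = 1.0) -> torch.Tensor:
    caption_loss = F.cross_entropy(
        caption_logits.flatten(0, 1),           # (batch*seq, vocab)
        caption_targets.flatten(),              # (batch*seq,)
        ignore_index=-100,                      # skip padding positions
    )
    control_loss = F.mse_loss(signal_preds, signal_targets)
    return caption_loss + lambda_ctrl * control_loss
```

Whatever the exact weighting, the point is that gradients from both heads flow back into the shared video encoder, which is what aligns the two tasks.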
Dataset and Implementation
Empirical evaluations are conducted on the BDD-X dataset, which pairs driving videos and control signals with human-annotated descriptions of driving behavior. Jointly training the shared video representation with both heads substantially boosts captioning performance, as reflected in metrics such as CIDEr and BLEU, and ADAPT outperforms prior approaches such as S2VT and related architectures.
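For reference, CIDEr and BLEU can be computed with the widely used pycocoevalcap toolkit. This is a common choice for these metrics rather than the authors' exact evaluation code, and the clip IDs and captions below are made-up examples; CIDEr in particular is a corpus-level metric, so scores over a couple of toy clips are not meaningful.

```python
# Hedged example of scoring captions with pycocoevalcap (a common toolkit for
# CIDEr/BLEU, not necessarily what the ADAPT authors used). Toy captions only.
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.cider.cider import Cider

# Keys identify clips; values are lists of whitespace-tokenized captions.
references = {
    "clip_0": ["the car slows down because the light turns red"],
    "clip_1": ["the car merges left because the lane ahead is closed"],
}
hypotheses = {
    "clip_0": ["the car is slowing down because the traffic light is red"],
    "clip_1": ["the car changes to the left lane because its lane is closed"],
}

bleu_scores, _ = Bleu(n=4).compute_score(references, hypotheses)  # BLEU-1..BLEU-4
cider_score, _ = Cider().compute_score(references, hypotheses)

print("BLEU-4:", bleu_scores[3])
print("CIDEr:", cider_score)
```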
Results and Evaluation
The framework performs well on both automatic metrics and human evaluations. In the human-centric evaluation, ADAPT's captions accurately capture both the vehicle's actions and the underlying reasoning. Including multiple types of control signals (e.g., speed and course) in the prediction task further improves performance.
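Control-signal prediction itself is typically scored with regression-style metrics. The sketch below computes a per-signal RMSE and a tolerance accuracy (the fraction of predictions within a fixed threshold); the thresholds and the toy prediction arrays are illustrative, not numbers from the paper.

```python
# Hedged sketch of control-signal evaluation: per-signal RMSE plus a tolerance
# accuracy. Thresholds and the toy arrays below are illustrative only.
import numpy as np


def rmse(pred: np.ndarray, target: np.ndarray) -> float:
    return float(np.sqrt(np.mean((pred - target) ** 2)))


def tolerance_accuracy(pred: np.ndarray, target: np.ndarray, threshold: float) -> float:
    return float(np.mean(np.abs(pred - target) <= threshold))


# Toy predicted vs. ground-truth speed (m/s) and course (degrees).
speed_pred, speed_true = np.array([5.1, 3.9, 0.2]), np.array([5.0, 4.2, 0.0])
course_pred, course_true = np.array([1.5, -10.0, 0.3]), np.array([2.0, -12.0, 0.0])

print("speed RMSE:", rmse(speed_pred, speed_true))
print("speed acc (within 0.5 m/s):", tolerance_accuracy(speed_pred, speed_true, 0.5))
print("course RMSE:", rmse(course_pred, course_true))
print("course acc (within 5 deg):", tolerance_accuracy(course_pred, course_true, 5.0))
```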
Discussion and Implications
ADAPT’s contribution lies in its unified approach to action narration and control prediction, offering significant interpretability advantages for autonomous systems. This not only has the potential to increase user trust but also opens avenues for broader applications, such as interactive and accessible transportation solutions.
While the current results are promising, future research could explore scaling to more complex driving environments and integrating additional sensory inputs. Extensions might include more nuanced reasoning models capable of handling unexpected environmental changes.
Conclusion
By synthesizing transformer-based approaches with multi-task learning, ADAPT sets a benchmark for language-based interpretability in autonomous driving. The framework marks a significant step toward transparent, real-time communication of an autonomous system's decision-making, bridging a vital gap between advanced AI technologies and human-centric applications. With further refinement, ADAPT could serve as a cornerstone for enhancing AI interpretability in real-world applications.