Multimodal Explanations: Justifying Decisions and Pointing to the Evidence
The increasing complexity of deep learning models makes them harder to interpret, a serious concern for many real-world applications. The paper "Multimodal Explanations: Justifying Decisions and Pointing to the Evidence" addresses this by proposing a method for generating explanations in both visual and textual modalities, arguing that the two offer complementary strengths for understanding complex models. Concretely, the paper presents the Pointing and Justification Explanation (PJ-X) model, which produces a textual justification for its decision and, through an attention mechanism, points to the visual evidence that supports it.
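To make the idea concrete, the following is a minimal sketch, in PyTorch, of a pointing-and-justification style model: it fuses question and image features, derives a spatial attention map (the visual pointing), and conditions a recurrent decoder on the attended features to generate a textual justification. The layer choices, dimensions, and names are illustrative assumptions, not the authors' exact PJ-X architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PJXSketch(nn.Module):
    """Illustrative pointing-and-justification model (hypothetical dimensions)."""

    def __init__(self, vocab_size, d_img=2048, d_q=512, d_hid=512, n_answers=1000):
        super().__init__()
        self.q_encoder = nn.LSTM(d_q, d_hid, batch_first=True)
        self.img_proj = nn.Linear(d_img, d_hid)
        self.att_score = nn.Linear(d_hid, 1)            # one score per image region
        self.answer_head = nn.Linear(d_hid, n_answers)
        self.word_embed = nn.Embedding(vocab_size, d_hid)
        self.decoder = nn.LSTM(d_hid, d_hid, batch_first=True)
        self.out = nn.Linear(d_hid, vocab_size)

    def forward(self, img_feats, q_embeds, expl_tokens):
        # img_feats: (B, R, d_img) regional CNN features; q_embeds: (B, T, d_q)
        _, (q_h, _) = self.q_encoder(q_embeds)          # (1, B, d_hid)
        q_h = q_h.squeeze(0)                            # (B, d_hid)

        v = self.img_proj(img_feats)                    # (B, R, d_hid)
        fused = torch.tanh(v * q_h.unsqueeze(1))        # simple multiplicative fusion
        att = F.softmax(self.att_score(fused).squeeze(-1), dim=1)  # (B, R) pointing map

        v_att = (att.unsqueeze(-1) * v).sum(dim=1)      # attended visual evidence
        answer_logits = self.answer_head(v_att * q_h)   # answer prediction

        # Decode a textual justification conditioned on the attended evidence.
        w = self.word_embed(expl_tokens)                # (B, L, d_hid)
        dec_in = w + v_att.unsqueeze(1)                 # inject evidence at each step
        dec_out, _ = self.decoder(dec_in)
        expl_logits = self.out(dec_out)                 # (B, L, vocab)
        return answer_logits, att, expl_logits
```

A forward pass returns the answer logits, the attention weights over image regions (which can be upsampled and overlaid on the image as the visual pointing), and per-token logits for the textual justification.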
Central to this paper is the introduction of two newly curated datasets, ACT-X and VQA-X, which are designed to assess the performance of the PJ-X model in multimodal explanation tasks for activity recognition and visual question answering (VQA), respectively. These datasets are crucial as they include human-provided visual and textual justifications that serve as ground truth for training and evaluating explanation models, thus filling a notable gap in existing resources.
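As a rough illustration of what such a resource might contain, the following shows a hypothetical record layout for a VQA-X style example; the field names and format are assumptions for exposition, not the released dataset schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ExplanationExample:
    """Hypothetical record for a VQA-X style example (illustrative fields only)."""
    image_id: str
    question: str                          # empty for ACT-X style activity examples
    answer: str                            # ground-truth answer or activity label
    textual_explanations: List[str]        # human-written justification sentences
    visual_annotation: List[List[float]]   # human-marked evidence map (H x W weights)

example = ExplanationExample(
    image_id="COCO_train2014_000000000123",
    question="Is this a zoo?",
    answer="no",
    textual_explanations=["the animals are roaming freely in an open field"],
    visual_annotation=[[0.0, 0.2], [0.8, 1.0]],  # toy 2x2 map for illustration
)
```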
Key quantitative findings demonstrate the effectiveness of multimodal explanations. The paper reports that training with the datasets' textual explanations improves not only the quality of generated justifications but also the localization of visual evidence. This is validated with standard captioning metrics (BLEU-4, METEOR, CIDEr) for the textual explanations and with metrics that compare the generated attention maps against human-annotated evidence for the visual pointing. The PJ-X model outperforms unimodal approaches and strong baselines, suggesting that generating both forms of explanation jointly improves each.
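The snippet below sketches how such an evaluation could be approximated with off-the-shelf tools: NLTK's sentence-level BLEU as a stand-in for the captioning metrics, and a Spearman rank correlation between a model attention map and a human evidence map for visual pointing. This is a lightweight proxy under those assumptions, not the paper's official evaluation code.

```python
import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

def bleu4(reference_tokens, hypothesis_tokens):
    """Sentence-level BLEU-4 as a rough proxy for textual-justification quality."""
    return sentence_bleu(
        [reference_tokens], hypothesis_tokens,
        weights=(0.25, 0.25, 0.25, 0.25),
        smoothing_function=SmoothingFunction().method1,
    )

def pointing_rank_correlation(model_attention, human_attention):
    """Spearman rank correlation between a model attention map and a human map.

    Both inputs are 2D arrays over the same spatial grid; higher correlation
    means the model points at regions humans also marked as evidence.
    """
    rho, _ = spearmanr(np.ravel(model_attention), np.ravel(human_attention))
    return rho

# Toy usage with made-up data:
print(bleu4("because the animals roam freely".split(),
            "because the animals are roaming free".split()))
print(pointing_rank_correlation(np.random.rand(14, 14), np.random.rand(14, 14)))
```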
Qualitative analyses reinforce these results by contrasting visual explanations with their textual counterparts, showing where each modality is most informative. In some cases the visual pointing conveys evidence about the question at hand that the textual justification alone does not capture, and vice versa, illustrating how multimodal explanations give a richer account of the model's reasoning.
Practically, this work bears on integrating explainability into AI systems used in sensitive contexts such as healthcare, autonomous systems, and legal decision-making. Combining visual pointing with textual justifications helps end users understand model decisions and gives developers a concrete handle for diagnosing and refining model behavior. Theoretically, it sets a precedent for building multimodal explanation into future AI architectures, moving model explanations closer to the way humans justify their own decisions.
Future research can build on this foundation by extending the approach to more complex reasoning tasks, incorporating additional modalities, and refining attention mechanisms for finer-grained visual and textual explanations. This line of work holds promise for AI systems that better communicate their inner workings, fostering trust and adoption in domains where transparency and accountability are paramount.