- The paper presents a novel phrase-critic model that scores AI-generated explanations based on their alignment with actual visual content.
- It extends to counterfactual explanations and the FOIL error-detection task, identifying and correcting ungrounded phrases in image descriptions.
- Evaluations on the CUB dataset demonstrate improved textual explanation quality and enhanced interpretability of AI decisions.
An Expert Overview of "Grounding Visual Explanations"
The paper “Grounding Visual Explanations” by Hendricks, Hu, Darrell, and Akata introduces a phrase-critic model for explainable AI. The model refines generated visual explanations so that they are grounded in the evidence actually present in the image rather than in generic class priors.
Summary of Approach
Current visual explanation systems often generate descriptions that mention attributes common to a class even when those attributes are not present in the particular image. Such ungrounded explanations can undermine trust between AI systems and their human users. To address this, the authors propose a phrase-critic that evaluates how well the visual evidence supports the language produced by the explanation model.
Their approach trains a discriminative phrase-critic on positive image-phrase pairs alongside automatically generated negatives in which attribute or noun phrases are "flipped" to mismatched ones. At inference time, the critic scores candidate explanations by how well each constituent phrase is grounded in the image, so that the selected explanation corresponds directly to the image content.
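To make the inference step concrete, here is a minimal sketch of re-ranking sampled explanations with a phrase-critic score combined with sentence fluency. The helper names (ground_phrase, sentence_logprob, extract_noun_phrases) and the trade-off weight are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch: pick the candidate explanation whose noun phrases are best
# supported by the image. All callables below are hypothetical stand-ins.

from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ScoredExplanation:
    sentence: str
    critic_score: float  # how well the noun phrases are grounded in the image
    fluency: float       # log-probability under the explanation (LSTM) model
    total: float


def rerank_explanations(
    image,                                            # any image representation the models accept
    candidates: Sequence[str],                        # sentences sampled from the explanation model
    ground_phrase: Callable[[object, str], float],    # grounding score for (image, phrase)
    sentence_logprob: Callable[[object, str], float], # log p(sentence | image)
    extract_noun_phrases: Callable[[str], List[str]],
    fluency_weight: float = 0.1,                      # assumed trade-off weight, not the paper's value
) -> ScoredExplanation:
    """Return the candidate with the best combined grounding and fluency score."""
    best = None
    for sentence in candidates:
        phrases = extract_noun_phrases(sentence)
        # Score each phrase independently and average, so a single ungrounded
        # phrase drags the whole sentence down.
        critic = (
            sum(ground_phrase(image, p) for p in phrases) / len(phrases)
            if phrases else 0.0
        )
        fluency = sentence_logprob(image, sentence)
        total = critic + fluency_weight * fluency
        scored = ScoredExplanation(sentence, critic, fluency, total)
        if best is None or scored.total > best.total:
            best = scored
    return best
```

In practice the candidates would come from sampling the explanation LSTM several times for the same image; the critic then acts as a filter that rejects fluent but unsupported sentences.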
Technical Advancements
- Phrase-Critic Model: The critic adds a scoring stage that judges how well an explanation agrees with the visual input. By pairing an LSTM-based explanation model with a natural language grounding model, the approach makes generated explanations more accountable and transparent.
- Counterfactual Explanations: Beyond factual explanations, the system generates counterfactual ones that explain why an image belongs to its predicted class and not to a closely related class, by citing evidence of the contrast class that is absent from the image. This capability improves the interpretability of model decisions.
- FOIL Tasks: The authors further apply their framework to the FOIL task, detecting which word in a caption is inaccurate and correcting it based on visual grounding, outperforming prior methods; a sketch of this detect-and-correct loop follows this list.
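The following is a rough sketch, under stated assumptions, of how a phrase-critic could drive FOIL-style error detection and correction. The helpers (ground_phrase, candidate_replacements) are hypothetical; the actual system relies on a trained grounding/localization model rather than these placeholders.

```python
# Rough sketch: flag the worst-grounded phrase in a caption as the foil and
# replace it with the best-grounded alternative. Helper names are hypothetical.

from typing import Callable, List, Sequence, Tuple


def detect_and_fix_foil(
    image,
    noun_phrases: List[str],                                  # phrases chunked from the caption
    ground_phrase: Callable[[object, str], float],            # grounding score for (image, phrase)
    candidate_replacements: Callable[[str], Sequence[str]],   # e.g. related words from a vocabulary
) -> Tuple[int, str]:
    """Return the index of the suspected foil phrase and a corrected phrase."""
    # 1) Detection: the phrase that grounds worst in the image is the suspect.
    scores = [ground_phrase(image, p) for p in noun_phrases]
    foil_idx = min(range(len(noun_phrases)), key=lambda i: scores[i])

    # 2) Correction: try plausible replacements and keep the best-grounded one.
    best_phrase, best_score = noun_phrases[foil_idx], scores[foil_idx]
    for alt in candidate_replacements(noun_phrases[foil_idx]):
        score = ground_phrase(image, alt)
        if score > best_score:
            best_phrase, best_score = alt, score
    return foil_idx, best_phrase
```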
Quantitative Evaluation
The method was evaluated primarily on the CUB dataset for fine-grained bird classification, where grounding phrases in the image improved the quality of the textual explanations. On the FOIL tasks, the phrase-critic clearly outperformed existing models at identifying and correcting sentence errors caused by inaccurate grounding.
Implications and Future Directions
The research points toward more reliable visual explanation systems that can foster greater user trust in AI. Practically, such a model could be applied in AI-driven domains such as autonomous driving, medical imaging, and intelligent surveillance, where precise and justifiable decisions are essential.
Future work could extend the approach to more varied datasets and more complex scene-understanding tasks, potentially integrating deeper semantic analysis or conversational AI systems. Ultimately, this work reflects a move toward AI systems that are not only strong decision-makers but also transparent and accountable about how those decisions are made. The phrase-critic model lays a foundation for such systems and marks a promising direction in explainable AI.