
A Review of Multimodal Explainable Artificial Intelligence: Past, Present and Future (2412.14056v1)

Published 18 Dec 2024 in cs.CV, cs.AI, cs.CL, cs.LG, and cs.MM

Abstract: AI has rapidly developed through advancements in computational power and the growth of massive datasets. However, this progress has also heightened challenges in interpreting the "black-box" nature of AI models. To address these concerns, eXplainable AI (XAI) has emerged with a focus on transparency and interpretability to enhance human understanding and trust in AI decision-making processes. In the context of multimodal data fusion and complex reasoning scenarios, the proposal of Multimodal eXplainable AI (MXAI) integrates multiple modalities for prediction and explanation tasks. Meanwhile, the advent of LLMs has led to remarkable breakthroughs in natural language processing, yet their complexity has further exacerbated the issue of MXAI. To gain key insights into the development of MXAI methods and provide crucial guidance for building more transparent, fair, and trustworthy AI systems, we review the MXAI methods from a historical perspective and categorize them across four eras: traditional machine learning, deep learning, discriminative foundation models, and generative LLMs. We also review evaluation metrics and datasets used in MXAI research, concluding with a discussion of future challenges and directions. A project related to this review has been created at https://github.com/ShilinSun/mxai_review.

Authors (8)
  1. Shilin Sun (2 papers)
  2. Wenbin An (14 papers)
  3. Feng Tian (122 papers)
  4. Fang Nan (8 papers)
  5. Qidong Liu (36 papers)
  6. Jun Liu (606 papers)
  7. Nazaraf Shah (2 papers)
  8. Ping Chen (123 papers)

Summary

A Comprehensive Overview of Multimodal Explainable Artificial Intelligence

The paper, "A Review of Multimodal Explainable Artificial Intelligence: Past, Present and Future," provides a meticulous analysis of the evolution of Multimodal Explainable Artificial Intelligence (MXAI) over several technological eras. It offers a systematic classification of MXAI methods and highlights the challenges and potential advancements in the field.

Historical Progression

The authors categorize MXAI development into four distinct eras:

  1. Traditional Machine Learning (2000-2009): This phase centered on simpler models such as decision trees and Bayesian frameworks, where interpretability derived largely from manual feature selection. Techniques such as Principal Component Analysis (PCA) were employed for data simplification, with the reduced dimensionality making the resulting representations easier to interpret (a minimal PCA sketch follows this list).
  2. Deep Learning (2010-2016): With the rise of complex neural networks, the challenge transitioned to making these models more transparent. Intrinsic interpretability methods emerged along with visualization techniques for neural activations. Efforts pivoted towards local and global explanation strategies for understanding network decisions.
  3. Discriminative Foundation Models (2017-2021): The advent of foundation models such as Transformers brought large-scale pre-trained models that excel across diverse tasks with minimal task-specific adjustment. The interpretability focus shifted towards understanding and explaining models like CLIP and GNN-based architectures using methods such as attention visualization and counterfactual reasoning.
  4. Generative LLMs (2022-2024): Recent advances highlight generative models such as GPT-4 that can produce natural language explanations of their own predictions. These developments push the boundaries of explainability by integrating information across multiple data modalities and enabling clearer interpretations of model outputs.
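
The era-one idea of interpretability through dimensionality reduction mentioned in item 1 can be made concrete with a short sketch. The snippet below is purely illustrative (the dataset, feature count, and component count are assumptions, not details from the reviewed paper): it standardizes a feature matrix, projects it onto two principal components with scikit-learn, and reads off the component loadings that tie each component back to the original hand-crafted features.

    # Illustrative PCA sketch: project features onto a few components and
    # inspect the loadings that make the reduced space interpretable.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 6))               # stand-in for hand-engineered features
    X[:, 3] = 0.9 * X[:, 0] + rng.normal(scale=0.1, size=200)  # add a correlated feature

    X_std = StandardScaler().fit_transform(X)
    pca = PCA(n_components=2)
    Z = pca.fit_transform(X_std)                # reduced, easier-to-interpret representation

    print("explained variance ratio:", pca.explained_variance_ratio_)
    # Each row of components_ holds the loading of every original feature on
    # one component, linking the compressed axes back to the raw features.
    print("component loadings:\n", pca.components_)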

Evaluation Metrics and Datasets

The paper provides a curated list of metrics and datasets central to evaluating the performance of MXAI methods. These include text explanation metrics (e.g., BLEU, CIDEr, SPICE), visual explanation metrics (e.g., IoU), and multimodal metrics like CLIP Scores. Additionally, datasets like VQA-X, TextVQA-X, and others serve as crucial benchmarks for assessing state-of-the-art models.
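
As a concrete illustration of the visual-explanation side, Intersection-over-Union (IoU) scores how well a binarized explanation map overlaps an annotated ground-truth region. The snippet below is a minimal NumPy sketch (the mask shapes, the 0.5 threshold, and the random saliency map are illustrative assumptions, not values taken from the paper).

    import numpy as np

    def iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
        """Intersection-over-Union between two binary masks."""
        pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
        union = np.logical_or(pred, gt).sum()
        if union == 0:
            return 1.0  # both masks empty: treat as perfect agreement
        return float(np.logical_and(pred, gt).sum() / union)

    # Illustrative use: threshold a saliency map into a binary explanation
    # mask, then score it against a ground-truth region annotation.
    saliency = np.random.rand(224, 224)
    pred_mask = saliency > 0.5
    gt_mask = np.zeros((224, 224), dtype=bool)
    gt_mask[64:160, 64:160] = True
    print(f"IoU = {iou(pred_mask, gt_mask):.3f}")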

Challenges and Future Directions

The review acknowledges several challenges facing MXAI:

  • Hallucination in MLLMs: The paper underlines ongoing efforts to mitigate hallucination in multimodal LLMs (MLLMs), for example through the use of counterfactual samples.
  • Visual Complexity: MLLMs face significant hurdles with high-dimensional visual data, necessitating improved multimodal fusion methods and integration strategies.
  • Alignment with Human Cognition: There’s a pressing need to align AI models more closely with human cognitive processes to enhance interpretability and build trust.
  • Absence of Ground Truths: Establishing reliable ground truths in multimodal contexts is difficult due to the complex and subjective nature of the data, necessitating innovative evaluation approaches.

Conclusion

The authors conclude by emphasizing that MXAI's progress is pivotal for future AI systems that aim to be transparent, fair, and trustworthy. They trace the evolution of explanatory methods across technical epochs and argue that a continual balance between model sophistication and interpretability is essential for future advancements. The paper serves as a crucial resource for researchers navigating the intricacies of AI explainability in an increasingly multimodal world.
