- The paper introduces the AudioMNIST dataset, comprising 30,000 English spoken digit samples for benchmarking audio classification and speaker recognition.
- The paper employs two CNN architectures, processing both waveform and spectrogram data, and achieves up to 95.82% accuracy in digit classification.
- The paper uses Layer-wise Relevance Propagation to generate audible explanations that enhance the interpretability of neural network decisions in audio analysis.
Overview of "AudioMNIST: Exploring Explainable Artificial Intelligence for Audio Analysis on a Simple Benchmark"
The paper "AudioMNIST: Exploring Explainable Artificial Intelligence for Audio Analysis on a Simple Benchmark" explores the intersection of explainable artificial intelligence (XAI) and audio analysis, proposing a novel dataset intended for benchmarking audio classification tasks. The authors present methodologies to enhance the interpretability of deep neural networks in the audio domain, particularly focusing on Layer-wise Relevance Propagation (LRP) as a tool for elucidating model decisions.
Contributions
- AudioMNIST Dataset: The paper introduces the AudioMNIST dataset, featuring 30,000 audio samples of English spoken digits. The dataset is designed to support research in audio classification, providing tasks such as digit classification and speaker-sex recognition, and its structure draws inspiration from MNIST, the well-known computer-vision benchmark.
- Neural Network Architectures: Two distinct model architectures are examined: one operating directly on waveform data and another utilizing spectrogram representations. These architectures serve to demonstrate the versatility and effectiveness of CNNs in processing different forms of audio data.
- Layer-wise Relevance Propagation (LRP): LRP is employed to elucidate the classification strategies of the neural networks. By decomposing the model's output into relevance scores attributed to individual input features, this post-hoc method reveals which parts of the input the model relies on (a minimal sketch of the propagation rule follows this list).
- Audible Explanations: Beyond conventional heatmap visualizations, the paper introduces "audible heatmaps," which translate relevance scores back into an audio format. A user study indicates that human listeners interpret these audible explanations more readily than visual ones (a conversion sketch also follows this list).
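To make the relevance decomposition concrete, here is a minimal NumPy sketch of an LRP epsilon-rule step for a single fully connected layer. It is an illustration under simplifying assumptions (one dense layer, zero bias, a hypothetical `lrp_epsilon` helper), not the paper's actual implementation, which propagates relevance through every layer of the trained CNNs.

```python
import numpy as np

def lrp_epsilon(a, W, b, R_out, eps=1e-6):
    """Redistribute the relevance of a dense layer's outputs onto its inputs (epsilon rule)."""
    z = W @ a + b                                        # pre-activations of the layer
    s = R_out / (z + eps * np.where(z >= 0, 1.0, -1.0))  # stabilized relevance ratios
    return a * (W.T @ s)                                 # relevance of each input neuron

# Toy usage: propagate relevance through one layer of a tiny network.
rng = np.random.default_rng(0)
a = rng.random(8)                        # input activations
W = rng.standard_normal((4, 8))          # layer weights
b = np.zeros(4)                          # zero bias keeps relevance (approximately) conserved
R_out = np.array([0.0, 1.0, 0.0, 0.0])   # all relevance placed on the predicted class
R_in = lrp_epsilon(a, W, b, R_out)
print(R_in, R_in.sum())                  # the sum stays close to R_out.sum()
```

Applied layer by layer from the output back to the input, such rules yield the per-sample (waveform) or per-pixel (spectrogram) relevance scores that the paper visualizes as heatmaps.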
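The step from relevance scores to an audible explanation can be sketched as follows, assuming a per-sample relevance vector aligned with the raw waveform. The thresholding scheme, the `keep` fraction, and the `audible_heatmap` helper are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np
from scipy.io import wavfile

def audible_heatmap(waveform, relevance, sample_rate, out_path, keep=0.1):
    """Silence all but the most relevant samples so the explanation can be listened to."""
    threshold = np.quantile(relevance, 1.0 - keep)         # keep the top `keep` fraction
    mask = (relevance >= threshold).astype(waveform.dtype)
    audible = waveform * mask                              # zero out low-relevance samples
    audible = np.int16(audible / (np.abs(audible).max() + 1e-9) * 32767)
    wavfile.write(out_path, sample_rate, audible)

# Example call with placeholder data: one second of noise at 8 kHz,
# with |amplitude| standing in for relevance scores.
sr = 8000
x = np.random.randn(sr).astype(np.float32)
r = np.abs(x)
audible_heatmap(x, r, sr, "audible_explanation.wav")
```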
Numerical Results
The models achieve high accuracy across both classification tasks, with the spectrogram-based model (AlexNet) slightly outperforming the waveform-based model (AudioNet). For digit classification, AlexNet reaches 95.82% accuracy versus 92.53% for AudioNet; for sex classification, AlexNet reaches 95.87% versus 91.74% for AudioNet.
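For concreteness, the sketch below contrasts the two input pipelines in PyTorch: a small 1D CNN operating on raw waveforms in the spirit of AudioNet, and torchvision's AlexNet applied to spectrograms replicated to three channels. Layer counts, kernel sizes, and input shapes are illustrative assumptions and do not reproduce the paper's exact configurations.

```python
import torch
import torch.nn as nn
from torchvision.models import alexnet

class WaveformCNN(nn.Module):
    """Tiny 1D CNN over raw audio, in the spirit of the waveform-based AudioNet."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=80, stride=4), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=3), nn.ReLU(), nn.MaxPool1d(4),
            nn.AdaptiveAvgPool1d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):                      # x: (batch, 1, num_samples)
        return self.classifier(self.features(x).squeeze(-1))

# Spectrogram branch: reuse AlexNet on spectrograms treated as 3-channel images.
spec_model = alexnet(num_classes=10)

wave = torch.randn(2, 1, 8000)                 # two one-second clips at 8 kHz
spec = torch.randn(2, 3, 227, 227)             # two spectrogram "images"
print(WaveformCNN()(wave).shape, spec_model(spec).shape)  # both: (2, 10)
```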
Bold Claims
The paper highlights the superior interpretability of audible explanations compared to visual ones. This bold claim is backed by a user study in which participants demonstrated a better understanding of the model's decisions when given audible explanations, particularly for incorrect predictions.
Implications and Future Directions
The paper makes a significant impact by proposing a dataset and methodologies that can serve as a foundation for future audio AI research. The AudioMNIST dataset may become a standard benchmark for testing novel audio classification models and XAI techniques.
The development of audible explanations marks an innovative step towards enhancing human-AI interaction in the audio domain. This approach potentially redefines how models can be made transparent, especially in contexts where audio interpretation by non-experts is critical.
In terms of future directions, expanding research into concept-based XAI methods in the audio domain could further enhance interpretability. Additionally, integrating these techniques into real-world applications, such as assistive technologies or voice-activated systems, could offer practical benefits and drive further advancements in AI transparency.
Conclusion
The paper contributes a notable advancement in the field of audio analysis with XAI, facilitating better interpretability and transparency of deep learning models. By introducing the AudioMNIST dataset and proposing innovative explanation formats, it paves the way for deeper exploration into explainable audio AI, encouraging the development of more understandable and trustworthy AI systems.