Listenable Maps for Zero-Shot Audio Classifiers (2405.17615v1)

Published 27 May 2024 in cs.SD, cs.LG, eess.AS, and eess.SP

Abstract: Interpreting the decisions of deep learning models, including audio classifiers, is crucial for ensuring the transparency and trustworthiness of this technology. In this paper, we introduce LMAC-ZS (Listenable Maps for Audio Classifiers in the Zero-Shot context), which, to the best of our knowledge, is the first decoder-based post-hoc interpretation method for explaining the decisions of zero-shot audio classifiers. The proposed method utilizes a novel loss function that maximizes the faithfulness to the original similarity between a given text-and-audio pair. We provide an extensive evaluation using the Contrastive Language-Audio Pretraining (CLAP) model to showcase that our interpreter remains faithful to the decisions in a zero-shot classification context. Moreover, we qualitatively show that our method produces meaningful explanations that correlate well with different text prompts.

Citations (3)

Summary

  • The paper introduces LMAC-ZS—a novel decoder-based method that generates saliency maps to interpret zero-shot audio classifiers.
  • It employs a new loss function and dual conditioning on text and audio to preserve decision-relevant features in the Mel and STFT domains.
  • Empirical evaluations on ESC50 and UrbanSound8K show that LMAC-ZS outperforms methods such as GradCAM++ and Integrated Gradients on faithfulness metrics.

Overview of "Listenable Maps for Zero-Shot Audio Classifiers"

The paper "Listenable Maps for Zero-Shot Audio Classifiers" introduces a novel approach for interpreting the decisions of zero-shot audio classifiers using a post-hoc decoder-based method. The primary contribution of this work is the development of LMAC-ZS (Listenable Maps for Audio Classifiers in the Zero-Shot context), which is, to the authors' knowledge, the inaugural method specifically designed to offer post-hoc interpretability for zero-shot audio classification tasks. This development is particularly significant given the rising prominence of zero-shot classifiers and their requirement of interpreting complex relationships in multi-modal settings, such as the compound space of text and audio.

Key Contributions and Technical Details

  1. Concept and Methodology: The paper proposes LMAC-ZS, a technique in which a decoder generates saliency maps that highlight the regions of the audio input most influential for the zero-shot classification decision. The approach introduces a new loss function that keeps the generated interpretations faithful to the original text-audio similarities produced by the classifier, in this case the CLAP (Contrastive Language-Audio Pretraining) model.
  2. Decoder-Based Interpretability: Unlike traditional saliency methods, LMAC-ZS uses a decoder conditioned on both textual and auditory features to produce interpretation masks in either the Mel or STFT domain (a minimal sketch of such a decoder and its training objective follows this list). The masks are trained to preserve the classifier's audio-text similarity matrix, retaining decision-relevant features while filtering out noise and less critical components. Training the decoder leverages the cross-modal representations learned during large-scale pretraining and requires no explicit class labels.
  3. Empirical Evaluation: The paper reports comprehensive evaluations on ESC50 and UrbanSound8K, using both standard and contaminated audio inputs. LMAC-ZS performs strongly on several faithfulness metrics, including Faithfulness on Spectra (FF), Average Increase (AI), and Average Drop (AD); the AI/AD computation is sketched after this list. The evaluations indicate that LMAC-ZS explanations robustly mirror classifier decisions, outperforming saliency-map approaches such as GradCAM++ and Integrated Gradients in producing relevant and sparse interpretations.
  4. Robustness Checks: The authors run a range of experiments to verify the robustness of LMAC-ZS interpretations, including model randomization tests (sketched after this list) that confirm the saliency maps depend on the learned model parameters rather than arbitrary features. LMAC-ZS proved sensitive to the classifier's weights, further supporting that its interpretation maps faithfully reflect the model's reasoning.
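
To make the decoder-and-loss idea from items 1 and 2 concrete, here is a minimal PyTorch sketch. The architecture, dimensions, and exact form of the objective are illustrative assumptions for this summary, not the paper's specification; the actual LMAC-ZS decoder and loss differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskDecoder(nn.Module):
    """Toy decoder conditioned on both audio and text embeddings.
    Layer sizes and structure are hypothetical, for illustration only."""

    def __init__(self, audio_dim, text_dim, n_mels, n_frames, hidden=512):
        super().__init__()
        self.n_mels, self.n_frames = n_mels, n_frames
        self.net = nn.Sequential(
            nn.Linear(audio_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_mels * n_frames),
            nn.Sigmoid(),  # mask values in [0, 1]
        )

    def forward(self, audio_emb, text_emb):
        # Conditioning on both modalities lets the same clip get different
        # masks for different text prompts.
        h = torch.cat([audio_emb, text_emb], dim=-1)
        return self.net(h).view(-1, self.n_mels, self.n_frames)


def similarity_preservation_loss(audio_encoder, text_emb, spec, mask,
                                 sparsity_weight=1e-3):
    """Illustrative objective: the masked spectrogram should reproduce the
    original audio-text similarity while the mask stays sparse. The paper's
    actual loss may weight or formulate these terms differently."""
    sim_full = F.cosine_similarity(audio_encoder(spec), text_emb, dim=-1)
    sim_masked = F.cosine_similarity(audio_encoder(spec * mask), text_emb, dim=-1)

    faithfulness = (sim_full - sim_masked).pow(2).mean()  # preserve similarity
    sparsity = mask.abs().mean()                          # keep the mask sparse
    return faithfulness + sparsity_weight * sparsity
```

The design choice mirrored here is the dual conditioning: because the decoder sees the text embedding as well as the audio features, the explanation changes with the prompt, which is what makes the method applicable in the zero-shot setting.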
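
The Average Drop and Average Increase metrics mentioned in item 3 have standard definitions in the saliency-map literature; the sketch below assumes those standard forms, and the exact variants used in the paper may differ.

```python
import numpy as np


def average_drop_increase(full_scores, masked_scores):
    """Average Drop (AD) and Average Increase (AI), as commonly defined.

    full_scores:   class scores from the unmodified audio
    masked_scores: class scores when only the explained regions are kept
    """
    full = np.asarray(full_scores, dtype=float)
    masked = np.asarray(masked_scores, dtype=float)

    # AD: mean relative confidence drop on the explanation alone (lower is better)
    ad = np.mean(np.maximum(0.0, full - masked) / np.maximum(full, 1e-12)) * 100.0
    # AI: fraction of samples whose confidence rises on the explanation alone (higher is better)
    ai = np.mean(masked > full) * 100.0
    return ad, ai


# Example: three clips, scores before vs. after masking
print(average_drop_increase([0.9, 0.6, 0.7], [0.85, 0.65, 0.3]))
```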
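
For the model randomization test in item 4, the sketch below follows the usual sanity-check recipe: re-initialize part of the classifier and verify that the explanation changes. The `explain_fn` hook and the layer-selection logic are hypothetical stand-ins for the LMAC-ZS pipeline, and the paper's actual protocol may differ.

```python
import copy

import torch
from scipy.stats import spearmanr


@torch.no_grad()
def randomization_sanity_check(model, explain_fn, audio, text, layer_prefix):
    """Re-initialize one layer's weights and compare explanations.
    `explain_fn(model, audio, text)` is a hypothetical hook returning a saliency map."""
    original_map = explain_fn(model, audio, text).flatten()

    # Re-initialize the chosen layer's parameters in a copy of the model
    randomized = copy.deepcopy(model)
    for name, param in randomized.named_parameters():
        if name.startswith(layer_prefix):
            torch.nn.init.normal_(param)

    randomized_map = explain_fn(randomized, audio, text).flatten()

    # A low rank correlation indicates the map depends on the learned weights
    rho, _ = spearmanr(original_map.cpu().numpy(), randomized_map.cpu().numpy())
    return rho
```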

Implications and Future Directions

The introduction of LMAC-ZS has both theoretical and practical implications. Theoretically, it paves the way for exploring interpretability in multi-modal and zero-shot settings, moving beyond the conventional focus on simpler model architectures. Practically, LMAC-ZS could enhance transparency in sensitive applications of audio classifiers, such as healthcare or autonomous monitoring systems, where understanding classifier decisions is as crucial as accuracy.

Looking forward, this research opens pathways for extending interpretability methods to other modalities and models, potentially refining the decoder architecture for better context adaptability. Similar approaches could also change how zero-shot and transfer-learning models are validated and understood, bridging the gap between model complexity and human interpretability. Future work might extend the method to real-time interpretation for dynamic audio processing, contributing to more adaptable and responsive AI systems.