
Retrieval-Augmented Approach for Unsupervised Anomalous Sound Detection and Captioning without Model Training (2410.22056v1)

Published 29 Oct 2024 in eess.AS and cs.SD

Abstract: This paper proposes a method for unsupervised anomalous sound detection (UASD) and captioning the reason for detection. While there is a method that captions the difference between given normal and anomalous sound pairs, it is assumed to be trained and used separately from the UASD model. Therefore, the obtained caption can be irrelevant to the differences that the UASD model captured. In addition, it requires many caption labels representing differences between anomalous and normal sounds for model training. The proposed method employs a retrieval-augmented approach for captioning of anomalous sounds. Difference captioning in the embedding space output by the pre-trained CLAP (contrastive language-audio pre-training) model makes the anomalous sound detection results consistent with the captions and does not require training. Experiments based on subjective evaluation and a sample-wise analysis of the output captions demonstrate the effectiveness of the proposed method.


Summary

  • The paper introduces a retrieval-augmented approach that unifies unsupervised anomalous sound detection with difference captioning using pre-trained CLAP embeddings.
  • It employs a k-nearest neighbors strategy and text decoding to bypass extensive task-specific training while ensuring coherent anomaly descriptions.
  • Experimental results demonstrate competitive performance, offering practical benefits for predictive maintenance and real-time anomaly reporting.

Assessment of a Retrieval-Augmented Approach for Unsupervised Anomalous Sound Detection and Captioning

In this paper, the authors introduce a methodology for unsupervised anomalous sound detection (UASD) paired with the generation of textual descriptions explaining the detected anomalies. Their approach departs from prior methods by eliminating the need for model training specific to difference captioning, which is achieved by leveraging a pre-trained contrastive language-audio pre-training (CLAP) model.

Methodology

The authors critique existing approaches that separate the UASD model from the captioning of differences between normal and anomalous sound events. Such methods require extensive caption annotations and can produce captions that are disconnected from the differences the detector actually captured. The proposed approach overcomes these limitations with a retrieval-augmented generation (RAG) strategy: detection and captioning are handled jointly in a single CLAP-based framework, ensuring coherence between the UASD outcomes and the generated captions.

The CLAP model serves as the backbone of the methodology. Its audio embeddings underpin both detection, where k-nearest neighbors (k-NN) search against embeddings of normal reference sounds yields the anomaly score, and captioning. Because these embeddings come from a model pre-trained on large-scale audio-text data, no additional training on the industrial audio at hand is required. For captioning, a text decoder generates initial captions for the normal and anomalous sounds, and GPT-4 then compares these outputs to produce a refined caption describing their differences. A minimal sketch of the k-NN scoring step follows.
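The sketch below illustrates k-NN anomaly scoring over CLAP audio embeddings. It is a minimal illustration, not the paper's exact configuration: `embed_audio` is a hypothetical helper assumed to return a CLAP embedding for one clip, and the reference set, value of k, and cosine metric are assumptions made for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def embed_audio(path: str) -> np.ndarray:
    """Placeholder: return a CLAP audio embedding (e.g. 512-dim) for one clip."""
    raise NotImplementedError

def build_normal_index(normal_paths: list[str], k: int = 5) -> NearestNeighbors:
    """Index CLAP embeddings of known-normal clips for retrieval."""
    ref = np.stack([embed_audio(p) for p in normal_paths])
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)  # L2-normalise rows
    return NearestNeighbors(n_neighbors=k, metric="cosine").fit(ref)

def anomaly_score(test_path: str, index: NearestNeighbors) -> float:
    """Mean cosine distance to the k nearest normal embeddings; larger = more anomalous."""
    emb = embed_audio(test_path)
    emb = emb / np.linalg.norm(emb)
    dists, _ = index.kneighbors(emb[None, :])
    return float(dists.mean())
```

In this sketch the same reference index also provides the retrieval step for captioning, since the nearest normal clips are the natural comparison targets for describing what changed.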

Experimental Results

The paper applies the proposed methodology to benchmark datasets and demonstrates competitive performance on UASD tasks relative to state-of-the-art models. Importantly, the method proves effective at generating meaningful captions that reflect real-world anomaly causes, as validated through subjective evaluation. The authors present two captioning approaches: one based on text decoding and another that performs zero-shot classification against predefined descriptors. The latter is useful where the former falls short, particularly when the differences between audio embeddings are not adequately captured by free-form text generation alone. A sketch of the descriptor-based variant is shown below.
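The following is a hedged sketch of the descriptor-based variant: candidate anomaly descriptions are ranked by cosine similarity between their CLAP text embeddings and the audio embedding of the detected anomaly, and the best match is reported. The `embed_text` helper and the descriptor list are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def embed_text(sentence: str) -> np.ndarray:
    """Placeholder: return a CLAP text embedding for one sentence."""
    raise NotImplementedError

# Illustrative descriptors only; the paper's actual descriptor set may differ.
CANDIDATE_DESCRIPTORS = [
    "a rattling or loose part",
    "a high-pitched squeal from a bearing",
    "irregular impacts or knocking",
    "air or fluid leaking",
]

def describe_anomaly(audio_emb: np.ndarray) -> str:
    """Zero-shot selection: return the descriptor most similar to the audio embedding."""
    a = audio_emb / np.linalg.norm(audio_emb)
    sims = []
    for text in CANDIDATE_DESCRIPTORS:
        t = embed_text(text)
        sims.append(float(a @ (t / np.linalg.norm(t))))
    return CANDIDATE_DESCRIPTORS[int(np.argmax(sims))]
```

Because selection happens entirely in the shared embedding space, this variant needs no text decoder and no GPT-4 comparison step, at the cost of being limited to the predefined descriptor vocabulary.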

Implications and Future Work

This research advances the field by integrating anomaly detection and descriptive intelligence for audio anomaly tasks, effectively broadening the application scope for unsupervised models in industrial settings. The potential implications are substantial for predictive maintenance and real-time anomaly reporting systems, where understanding the nature of an anomaly can significantly impact decision-making processes.

Future work could explore extended use cases across diverse audio domains, further refine the integration of audio and textual embeddings, and enhance the retrieval-augmented approach by incorporating more comprehensive multimodal datasets. Additionally, investigations into optimizing computational efficiency could be valuable for real-time implementation.

Conclusion

This paper presents a strategy for merging anomaly detection with descriptive explanation generation in unsupervised settings without additional training. It positions the retrieval-augmented approach built on CLAP embeddings as a robust tool applicable to varied audio tasks, fostering advances at the intersection of machine sound monitoring and natural language generation.
