Retrieval-Augmented Approach for Unsupervised Anomalous Sound Detection and Captioning without Model Training (2410.22056v1)
Abstract: This paper proposes a method for unsupervised anomalous sound detection (UASD) that also captions the reason for each detection. An existing method captions the difference between a given pair of normal and anomalous sounds, but it is trained and used separately from the UASD model, so the resulting caption can be irrelevant to the differences the UASD model actually captured. It also requires many caption labels describing differences between anomalous and normal sounds for training. The proposed method instead takes a retrieval-augmented approach to captioning anomalous sounds: performing difference captioning in the embedding space of the pre-trained CLAP (contrastive language-audio pre-training) model keeps the anomalous sound detection results consistent with the captions and requires no training. Experiments based on subjective evaluation and a sample-wise analysis of the output captions demonstrate the effectiveness of the proposed method.
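The core idea, detection and caption retrieval operating in one shared embedding space, can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: it assumes CLAP audio/text embeddings have already been computed (e.g. with a pre-trained CLAP model), scores anomalies by distance to normal-sound embeddings, and retrieves the candidate caption closest to the test embedding. All function and variable names here are hypothetical.

```python
import numpy as np

def cosine_sim(vec, mat):
    # Cosine similarity between one embedding and each row of a matrix.
    return (mat @ vec) / (np.linalg.norm(mat, axis=1) * np.linalg.norm(vec) + 1e-9)

def detect_and_caption(test_emb, normal_embs, caption_embs, captions, threshold=0.8):
    """Sketch of retrieval-augmented UASD with captioning (assumed setup).

    test_emb:     CLAP audio embedding of the sound under test
    normal_embs:  CLAP audio embeddings of known-normal sounds
    caption_embs: CLAP text embeddings of candidate difference captions
    captions:     the caption strings, aligned with caption_embs

    Anomaly score = 1 - max similarity to any normal embedding. Because the
    caption is retrieved in the same CLAP space used for detection, the
    explanation stays consistent with what the detector responded to.
    """
    score = 1.0 - cosine_sim(test_emb, normal_embs).max()
    if score <= 1.0 - threshold:
        return score, None  # judged normal; no caption needed
    idx = int(np.argmax(cosine_sim(test_emb, caption_embs)))
    return score, captions[idx]
```

Because every step is nearest-neighbor retrieval over frozen embeddings, nothing here is trained, which mirrors the training-free property claimed in the abstract.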