Improving Anomalous Sound Detection with Attribute-aware Representation from Domain-adaptive Pre-training (2509.12845v2)

Published 16 Sep 2025 in cs.SD and cs.AI

Abstract: Anomalous Sound Detection (ASD) is often formulated as a machine attribute classification task, a strategy necessitated by the common scenario where only normal data is available for training. However, the exhaustive collection of machine attribute labels is laborious and impractical. To address the challenge of missing attribute labels, this paper proposes an agglomerative hierarchical clustering method for the assignment of pseudo-attribute labels using representations derived from a domain-adaptive pre-trained model, which are expected to capture machine attribute characteristics. We then apply model adaptation to this pre-trained model through supervised fine-tuning for machine attribute classification, resulting in a new state-of-the-art performance. Evaluation on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2025 Challenge dataset demonstrates that our proposed approach yields significant performance gains, ultimately outperforming our previous top-ranking system in the challenge.

Summary

The paper introduces a two-stage learning strategy that combines universal pre-training with domain-adaptive pre-training to obtain attribute-aware sound representations.
It uses agglomerative hierarchical clustering on these representations to assign pseudo-attribute labels where machine attribute labels are missing.
Results on the DCASE 2025 dataset demonstrate state-of-the-art performance with improved parameter efficiency and robustness.

Introduction

The paper addresses Anomalous Sound Detection (ASD), a task used for machine maintenance and safety monitoring in which annotated anomalous recordings are scarce. Because usually only normal sound data is available for training, ASD is commonly formulated as a Machine Attribute Classification (MAC) problem: a model learns to classify the attributes of normal recordings, and the resulting representation is used to detect anomalies. This formulation depends on machine attribute labels, which are laborious to collect and often missing. The paper addresses the missing labels with domain-adaptive pre-training and pseudo-labeling.

Methodology

Domain-adaptive Pre-training

The proposed method is built on a two-stage learning strategy. First, a universal model is pre-trained on a large-scale dataset such as AudioSet to learn generic audio representations. Second, to reduce domain mismatch and improve fine-grained representation learning, a domain-adaptive pre-training phase is carried out on multiple machine-specific sound datasets. This second stage is intended to yield attribute-aware audio embeddings, which are used both by the clustering process and by the downstream ASD task.

Figure 1: The overall framework of our proposed method.

Pseudo-labeling Strategy

Missing attribute labels are handled with an agglomerative hierarchical clustering method, which assigns pseudo-attribute labels based on embeddings from the domain-adaptive pre-trained model. The clustering exploits variations in the machine sounds that correspond to particular operating conditions, so that recordings made under similar conditions receive the same pseudo-label. Training on these labels lets the model preserve intra-class variation within a machine type that would otherwise be collapsed into a single class.
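The pseudo-labeling step can be illustrated with a minimal, standard-library-only sketch of agglomerative clustering. The summary does not state the linkage criterion, the distance metric, or how the number of clusters is chosen, so average linkage, cosine distance, and a fixed n_clusters are assumptions made here for illustration. The function names are hypothetical, and the toy two-dimensional vectors stand in for embeddings from the domain-adaptive pre-trained model.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def agglomerative_pseudo_labels(embeddings, n_clusters):
    """Average-linkage agglomerative clustering (assumed settings).

    Starts with one cluster per embedding and repeatedly merges the two
    clusters with the smallest mean pairwise distance until n_clusters
    remain. Returns one integer pseudo-label per embedding.
    """
    n = len(embeddings)
    # Precompute the full pairwise distance matrix.
    dist = [[cosine_distance(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]

    while len(clusters) > n_clusters:
        best = None  # (distance, index_a, index_b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                mean = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or mean < best[0]:
                    best = (mean, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])  # merge cluster b into cluster a
        del clusters[b]

    labels = [0] * n
    for cluster_id, members in enumerate(clusters):
        for i in members:
            labels[i] = cluster_id
    return labels


# Toy embeddings: two recordings near one direction, two near another.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
pseudo_labels = agglomerative_pseudo_labels(toy_embeddings, n_clusters=2)
print(pseudo_labels)
```

This naive version recomputes linkage distances at every merge and is only suitable for small inputs. For realistic dataset sizes, a library implementation such as scikit-learn's AgglomerativeClustering would be the more practical choice.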

Figure 2: t-SNE visualization of the embedding distribution of FT (left) and DAP (right) schemes. Different colors represent different real attributes of machine type Polisher in the source domain, such as (pow1, nA) and (pow3, nB). The pow (power) and n (noise) denote the Polisher's working power and background noise, and the characters after pow and n represent their specific attribute values.

Model Adaptation

In the final adaptation stage, the pre-trained model undergoes supervised fine-tuning for machine attribute classification, using both the generated pseudo-attribute labels and any available ground-truth labels. This fine-tuning step transfers the learned representations to the MAC task, so that a single ASD system can handle machines with rich attribute annotations as well as machines with few or none.
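One way to picture how pseudo-attribute labels and ground-truth labels could feed the same classifier is to map both into a single set of integer class targets. The sketch below is an assumption about the bookkeeping, not the authors' implementation: the function name, the dictionary keys, and the machine name "MachineX" are hypothetical, while the Polisher attribute values follow the examples in the Figure 2 caption.

```python
def build_class_targets(clips):
    """Assign each clip an integer class id for attribute classification.

    Each clip is a dict with keys:
      'machine'   - machine type name
      'attribute' - ground-truth attribute string, or None if missing
      'pseudo'    - pseudo-label cluster id, used only when 'attribute' is None
    Ground-truth and pseudo classes are kept distinct per machine type.
    """
    class_ids = {}
    targets = []
    for clip in clips:
        if clip["attribute"] is not None:
            key = (clip["machine"], "gt", clip["attribute"])
        else:
            key = (clip["machine"], "pseudo", clip["pseudo"])
        # Reuse the id of a known class, otherwise allocate the next one.
        targets.append(class_ids.setdefault(key, len(class_ids)))
    return targets, class_ids


clips = [
    {"machine": "Polisher", "attribute": "pow1_nA", "pseudo": None},
    {"machine": "Polisher", "attribute": "pow3_nB", "pseudo": None},
    {"machine": "MachineX", "attribute": None, "pseudo": 0},  # hypothetical
    {"machine": "MachineX", "attribute": None, "pseudo": 1},  # hypothetical
    {"machine": "MachineX", "attribute": None, "pseudo": 0},  # hypothetical
]
targets, class_ids = build_class_targets(clips)
print(targets)
```

Keying classes on the machine type keeps cluster id 0 of one machine from colliding with cluster id 0 of another, which matters if clustering is run separately per machine type.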

Experimental Results

The proposed approach was evaluated on the DCASE 2025 Challenge ASD dataset. With domain-adaptive pre-training and pseudo-labeling combined, the method is reported to achieve state-of-the-art results, outperforming the authors' previous top-ranking challenge system in both detection performance and parameter efficiency.

Figure 3: Comparison among our system and other SOTA models on the DCASE 2025 ASD evaluation dataset.

Conclusion

The paper combines domain-adaptive pre-training with agglomerative hierarchical clustering for pseudo-labeling, and reports significant performance gains for ASD on the DCASE 2025 dataset. The gains are attributed to attribute-aware representations and to model adaptation that works even when attribute labels are missing. The approach offers a basis for further work on ASD systems that rely on unsupervised and self-supervised learning, particularly for industrial settings where attribute annotation is impractical.