HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection
The paper presents HTS-AT, an audio transformer designed to address several limitations of traditional transformer models for audio classification: excessive GPU memory requirements, prolonged training, and dependence on pretrained vision models for high performance. HTS-AT is offered as a scalable solution for audio tasks, pairing a hierarchical structure that reduces model size and training time with a token-semantic module that enables frame-level event detection.
Key Contributions and Results
HTS-AT achieves new state-of-the-art results on the prominent AudioSet and ESC-50 datasets, while matching the best performance on Speech Command V2. On AudioSet it reaches a mean average precision (mAP) of 0.471, surpassing the previous best of 0.459. Beyond classification, HTS-AT also outperforms preceding CNN-based models at event localization. Notably, it accomplishes these results with only 35% of the parameters and 15% of the training time of prior audio transformers, underscoring both its efficacy and its efficiency.
Model Architecture and Innovations
The primary innovation in HTS-AT is its hierarchical transformer architecture combined with window attention. Restricting self-attention to local windows shrinks the attention matrix from the full token sequence down to each window, lowering GPU memory usage and training time. The model splits the audio spectrogram into patch tokens, preserving both temporal and frequency relationships, and patch-merge layers between stages progressively reduce the token sequence length.
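To make these mechanics concrete, below is a minimal PyTorch sketch of the two core ideas, window-restricted attention and 2x2 patch merging, in the Swin Transformer style that HTS-AT builds on. The class names, window size, and dimensions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


def window_partition(x, window_size):
    """Split a (B, H, W, C) token grid into non-overlapping windows of
    shape (num_windows * B, window_size * window_size, C), so attention
    runs inside each window instead of over all H * W tokens."""
    B, H, W, C = x.shape
    x = x.reshape(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)


class WindowAttention(nn.Module):
    """Self-attention restricted to local windows: each attention matrix is
    (ws*ws) x (ws*ws) per window rather than (H*W) x (H*W) for the whole
    spectrogram, which is where the memory savings come from."""

    def __init__(self, dim, window_size, num_heads):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, H, W, C)
        B, H, W, C = x.shape
        ws = self.window_size
        win = window_partition(x, ws)
        out, _ = self.attn(win, win, win)
        # Reverse the partition back to the (B, H, W, C) grid.
        out = out.reshape(B, H // ws, W // ws, ws, ws, C)
        return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)


class PatchMerging(nn.Module):
    """Merge each 2x2 neighborhood of tokens: halves H and W and projects
    4C channels down to 2C, shrinking the token sequence 4x per stage."""

    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):  # x: (B, H, W, C), H and W even
        x = torch.cat(
            [x[:, 0::2, 0::2], x[:, 1::2, 0::2],
             x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))  # (B, H/2, W/2, 2C)
```

For a sense of scale: on a 64 x 64 token grid with 8 x 8 windows, each attention matrix is 64 x 64 per window, versus a single 4096 x 4096 matrix under global attention.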
The token-semantic module distinguishes HTS-AT by enabling frame-level event localization, a capability unavailable to previous audio transformers that predict from a single class token and thereby discard time-wise information. This addition brings the model's localization ability in line with CNN-based approaches, which are typically strong at localization.
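Below is a minimal sketch of how such a token-semantic head can work, assuming the final tokens are reshaped back into a (time, frequency) grid; the single convolution and the pooling scheme are illustrative assumptions rather than the paper's exact configuration.

```python
import torch.nn as nn


class TokenSemanticHead(nn.Module):
    """Map the final (time, frequency) token grid to per-class activations:
    averaging over frequency yields a framewise event map for localization,
    and averaging that over time yields the clip-level prediction."""

    def __init__(self, dim, num_classes):
        super().__init__()
        # A single conv over the token grid produces one map per class.
        self.conv = nn.Conv2d(dim, num_classes, kernel_size=3, padding=1)

    def forward(self, tokens):  # tokens: (B, T, F, C)
        x = tokens.permute(0, 3, 1, 2)      # (B, C, T, F)
        event_map = self.conv(x)            # (B, num_classes, T, F)
        framewise = event_map.mean(dim=3)   # (B, num_classes, T): localization
        clipwise = framewise.mean(dim=2)    # (B, num_classes): classification
        return clipwise, framewise  # a sigmoid would follow for multi-label training
```

The key design point is that the same head serves both tasks: the framewise map is read out directly for event localization, while the clip-level score used for classification is just its temporal average.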
Implications and Future Directions
HTS-AT offers significant advances for audio classification and detection tasks, demonstrating improved accuracy, scalability, and computational efficiency. The hierarchical design and token-semantic capabilities present in HTS-AT may inspire future models to adopt similar architectures, particularly in fields that require efficient processing of long sequential data.
Looking ahead, future work could explore the recently released strongly-labeled subset of AudioSet for more precise localization training. Further integration of HTS-AT into downstream applications, such as music recommendation or keyword spotting, may extend its utility and surface new challenges.
Overall, HTS-AT represents a meaningful stride in audio processing technology, providing a robust framework for real-time applications and efficient deployments in resource-constrained environments. The adoption of such architectural strategies in other domains of AI continues to promise exciting developments in the field.