
AST: Audio Spectrogram Transformer (2104.01778v3)

Published 5 Apr 2021 in cs.SD and cs.AI

Abstract: In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model. However, it is unclear whether the reliance on a CNN is necessary, and if neural networks purely based on attention are sufficient to obtain good performance in audio classification. In this paper, we answer the question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification. We evaluate AST on various audio classification benchmarks, where it achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.

Audio Spectrogram Transformer: A Convolution-Free Approach to Audio Classification

The paper "AST: Audio Spectrogram Transformer" introduces a novel approach to audio classification by eliminating the reliance on convolutional neural networks (CNNs) and adopting a purely attention-based model. This innovative architecture is termed the Audio Spectrogram Transformer (AST) and offers a new perspective on handling audio data, departing from the traditional CNN-based models dominant in the field.

Background and Motivation

For over a decade, CNNs have been extensively used in audio classification, leveraging their spatial locality and translation equivariance properties. Despite their success, recent explorations in attention mechanisms suggest the potential for pure attention models to capture long-range dependencies effectively. Inspired by the success of the Vision Transformer (ViT) in computer vision, the authors question the necessity of CNNs for audio tasks and propose the AST to investigate this hypothesis.

Model Architecture

The AST applies a Transformer directly to audio spectrograms. The input spectrogram is split into a sequence of overlapping 16×16 patches, each of which is linearly projected into an embedding; the resulting sequence is processed by a Transformer encoder whose self-attention spans the entire input. This lets the AST capture long-range context from the earliest layers, unlike CNNs, which rely on localized operations and hierarchical feature extraction.
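
Below is a minimal PyTorch sketch of this front end, not the authors' released code: the input shape (128 mel bins × 1024 frames), patch stride, embedding width, encoder depth, and output size are illustrative assumptions, and the overlapping patches are extracted with a strided convolution for brevity.

```python
import torch
import torch.nn as nn

class ASTPatchEmbed(nn.Module):
    """Overlapping 16x16 patch embedding for a log-mel spectrogram (illustrative)."""
    def __init__(self, n_mels=128, n_frames=1024, patch=16, stride=10, embed_dim=768):
        super().__init__()
        # A convolution with kernel=patch and stride<patch extracts overlapping
        # patches and linearly projects each one to embed_dim in a single step.
        self.proj = nn.Conv2d(1, embed_dim, kernel_size=patch, stride=stride)
        f_patches = (n_mels - patch) // stride + 1      # patches along frequency
        t_patches = (n_frames - patch) // stride + 1    # patches along time
        n_patches = f_patches * t_patches
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, embed_dim))

    def forward(self, spec):                       # spec: (batch, n_frames, n_mels)
        x = spec.unsqueeze(1).transpose(2, 3)      # -> (batch, 1, n_mels, n_frames)
        x = self.proj(x)                           # -> (batch, embed_dim, F', T')
        x = x.flatten(2).transpose(1, 2)           # -> (batch, n_patches, embed_dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed

# A standard Transformer encoder over the patch sequence; classification is
# read off the [CLS] token (527 is the AudioSet label count, for example).
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True),
    num_layers=12,
)
tokens = ASTPatchEmbed()(torch.randn(2, 1024, 128))   # two dummy ~10 s clips
logits = nn.Linear(768, 527)(encoder(tokens)[:, 0])   # (2, 527)
```

Extracting overlapping patches and projecting them is equivalent to a single convolution whose kernel equals the patch size and whose stride equals the patch spacing, which is why the sketch uses `nn.Conv2d` for the split-and-project step.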

ImageNet Pretraining

One of the challenges with Transformer models is their requirement for large datasets to achieve high performance. To overcome this, the authors employ transfer learning, initializing AST with weights from an ImageNet-pretrained Vision Transformer. By adapting the positional embeddings and maintaining architectural compatibility, AST benefits from spatial information learned in the vision domain, which enhances its performance on audio classification tasks even with limited audio data.
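
A hedged sketch of one such adaptation is given below: it keeps the ViT [CLS] positional embedding and bilinearly resizes the square grid of patch positional embeddings to the rectangular AST grid. The function name, shapes, and the use of plain interpolation (rather than the paper's exact cut-and-interpolate procedure) are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def adapt_pos_embed(vit_pos_embed, vit_grid, f_patches, t_patches):
    """Resize a ViT positional-embedding table to the AST patch grid.

    vit_pos_embed: (1, vit_grid * vit_grid + 1, dim), [CLS] embedding first.
    """
    cls_pe, patch_pe = vit_pos_embed[:, :1], vit_pos_embed[:, 1:]
    dim = patch_pe.size(-1)
    # Reshape to a 2-D grid and bilinearly resize to the audio patch grid.
    patch_pe = patch_pe.reshape(1, vit_grid, vit_grid, dim).permute(0, 3, 1, 2)
    patch_pe = F.interpolate(patch_pe, size=(f_patches, t_patches),
                             mode="bilinear", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, f_patches * t_patches, dim)
    return torch.cat([cls_pe, patch_pe], dim=1)

# Example: a 24x24 ViT grid mapped onto a 12x101 AST grid (768-dim embeddings).
new_pe = adapt_pos_embed(torch.randn(1, 24 * 24 + 1, 768), 24, 12, 101)
print(new_pe.shape)  # torch.Size([1, 1213, 768])
```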

Empirical Evaluation

The AST is evaluated using several benchmarks, including AudioSet, ESC-50, and Speech Commands V2. The results are notable:

  • AudioSet: AST achieves a mean average precision (mAP) of 0.485, surpassing previous state-of-the-art systems in both single and ensemble model configurations.
  • ESC-50: AST records a top accuracy of 95.6%, outperforming existing techniques when leveraging AudioSet pretraining.
  • Speech Commands V2: AST achieves an accuracy of 98.1%, setting a new benchmark for speech command recognition.

These results confirm the efficacy of AST in handling diverse audio classification tasks with varying input lengths and content, demonstrating its robustness and flexibility across different domains.

Ablation Studies

Extensive ablation studies are conducted to justify design choices and assess the impacts of different architectural configurations. The findings highlight:

  • ImageNet Pretraining: Provides significant performance gains, particularly with smaller training datasets.
  • Patch Overlap: Enhances performance by increasing the number of patches, thereby capturing more fine-grained information (see the patch-count sketch after this list).
  • Positional Embedding: Proper adaptation of spatial embeddings from ViT is crucial for retaining the transferred knowledge.
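
As a rough illustration of the patch-overlap point above, the snippet below counts patches for a 128-mel × 1024-frame spectrogram split into 16×16 patches; the stride values are assumptions, with stride 16 giving no overlap and stride 10 giving an overlap of 6.

```python
def n_patches(size, patch=16, stride=16):
    # Number of patches that fit along one axis of the spectrogram.
    return (size - patch) // stride + 1

# 128 mel bins x 1024 frames, 16x16 patches.
print(n_patches(128, stride=16) * n_patches(1024, stride=16))  # no overlap: 8 * 64 = 512
print(n_patches(128, stride=10) * n_patches(1024, stride=10))  # overlap 6: 12 * 101 = 1212
```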

Implications and Future Directions

The AST model challenges the prevailing reliance on CNNs in audio classification, presenting a simpler yet highly effective alternative. Its demonstrated ability to outperform CNN-attention hybrid models highlights the potential of Transformers in audio processing tasks. The authors suggest further exploration of purely attention-based architectures for audio-related applications.

Another intriguing direction involves applying AST to broader audio processing tasks like audio event localization and scene analysis, potentially extending its utility beyond classification. Additionally, further research could explore integrating this architecture into real-time and resource-constrained environments to assess its applicability and efficiency.

In conclusion, the Audio Spectrogram Transformer represents a promising advancement in audio classification, offering a new paradigm that leverages the strengths of Transformers. Its success across multiple benchmarks establishes it as a formidable tool for future audio analysis and machine learning tasks.

Authors (3)
  1. Yuan Gong (45 papers)
  2. Yu-An Chung (33 papers)
  3. James Glass (173 papers)
Citations (702)