Self-Supervised Audio Spectrogram Transformer: A Summary
The paper "SSAST: Self-Supervised Audio Spectrogram Transformer" presents an exploration into reducing the dependency of pure Transformer-based audio and speech classification models on large labeled datasets by leveraging self-supervised learning methodologies. This work builds on the Audio Spectrogram Transformer (AST), which demonstrated state-of-the-art results but required significant supervised pretraining on labeled datasets.
Key Contributions
- Masked Spectrogram Patch Modeling (MSPM): The authors propose a pretraining framework that combines discriminative and generative objectives for self-supervised learning. Random spectrogram patches are masked, and the model is trained both to identify and to reconstruct them from the surrounding context, which encourages it to learn the temporal and frequency structure of audio (a minimal sketch follows this list).
- Expanded Data Domains: To ensure generalization across different audio classification tasks, pretraining uses both the AudioSet and LibriSpeech datasets, covering diverse audio events as well as speech.
- Performance Impact: With MSPM pretraining, AST improves substantially across benchmarks, with a reported average improvement of 60.9% over the same model trained from scratch. SSAST approaches or surpasses the performance of prior models that rely on supervised pretraining.
- Model Generalization: The experiments show that combining audio and speech datasets during pretraining improves generalization, yielding strong performance across both domains, whereas models pretrained on a single domain transfer less well.
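The mechanics of MSPM can be summarized in a few lines of PyTorch. The sketch below is illustrative only: the toy encoder, the masking ratio, the equal loss weighting, and the batch-wide negative sampling are assumptions for exposition rather than the paper's implementation (SSAST masks a fixed number of patches and uses its own heads and hyperparameters).

```python
# Illustrative MSPM sketch (not the paper's code): mask random spectrogram
# patches, then train with a generative (reconstruction) and a discriminative
# (patch-matching) objective at the same time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMSPM(nn.Module):
    def __init__(self, patch_dim=256, embed_dim=192, num_patches=512):
        super().__init__()
        self.embed = nn.Linear(patch_dim, embed_dim)            # patch -> token
        self.mask_token = nn.Parameter(torch.zeros(embed_dim))  # learned [MASK] embedding
        self.pos = nn.Parameter(torch.zeros(num_patches, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.cls_head = nn.Linear(embed_dim, embed_dim)   # discriminative head
        self.rec_head = nn.Linear(embed_dim, patch_dim)   # generative head

    def forward(self, patches, mask_ratio=0.15):
        # patches: (batch, num_patches, patch_dim), flattened spectrogram patches
        B, N, _ = patches.shape
        tokens = self.embed(patches)
        mask = torch.rand(B, N, device=patches.device) < mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand(B, N, -1), tokens)
        ctx = self.encoder(tokens + self.pos[:N])

        # Generative objective: reconstruct the masked patches (MSE).
        loss_g = F.mse_loss(self.rec_head(ctx)[mask], patches[mask])

        # Discriminative objective: match each masked position to its true patch,
        # with the other masked patches in the batch acting as negatives
        # (a simplification of the paper's InfoNCE-style loss).
        q = self.cls_head(ctx)[mask]          # predictions at masked positions
        k = self.embed(patches)[mask]         # embeddings of the true patches
        logits = q @ k.t()
        loss_d = F.cross_entropy(logits, torch.arange(logits.size(0), device=logits.device))

        return loss_d + loss_g                # joint pretraining loss

loss = ToyMSPM()(torch.randn(4, 512, 256))    # dummy batch of patch sequences
loss.backward()
```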
Experimental Insights
The paper conducts extensive experiments to validate the proposed framework. The evaluation covers six benchmarks: AudioSet-20K, ESC-50, Speech Commands V1 and V2, VoxCeleb 1, and IEMOCAP, spanning audio event classification, keyword spotting, speaker identification, and speech emotion recognition, for a comprehensive assessment across audio and speech tasks.
- Fine-Tuning Efficiency: Fine-tuning is performed end-to-end, with the experimental setup adapted from prior work for consistency. The improvements of SSAST over AST trained from scratch suggest that the pretrained model not only needs less labeled data but also converges faster during fine-tuning (a toy fine-tuning sketch follows this list).
- Discriminative vs. Generative Pretext Tasks: Combining the discriminative and generative pretext tasks performs better than using either one alone, underscoring the complementary strengths of the two objectives.
- Model Size and Patch Shape: The experiments also examine the effect of model size and spectrogram patch shape. Larger models benefit more from pretraining, highlighting the role of MSPM in unlocking model capacity. Patch shape matters as well: frame-based patches, which span the full frequency range of each time slice, favor speech tasks, while square patch-based models are stronger on audio event tasks (a patch-splitting sketch also follows this list).
- Comparison with Existing Self-Supervised Models: SSAST performs competitively with, or better than, state-of-the-art speech models such as wav2vec 2.0 and HuBERT, further validating its design choices, although the paper notes that these comparisons were made under different compute and data budgets.
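To make the end-to-end fine-tuning recipe concrete, here is a self-contained toy sketch. The stand-in encoder, layer sizes, mean-pooled linear head, and learning rate are assumptions for illustration, not the paper's configuration; the point is only that the pretraining heads are discarded and the whole network, not just a new classifier, is updated on the downstream task.

```python
# Toy end-to-end fine-tuning sketch (illustrative; not the paper's recipe).
import torch
import torch.nn as nn

embed_dim, patch_dim, num_classes = 192, 256, 50     # e.g. ESC-50 has 50 classes

# Stand-in for a pretrained SSAST backbone: patch embedding + transformer encoder.
patch_embed = nn.Linear(patch_dim, embed_dim)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True), num_layers=2
)
head = nn.Linear(embed_dim, num_classes)             # new task-specific head

# End-to-end: the optimizer updates the backbone as well as the new head.
params = list(patch_embed.parameters()) + list(encoder.parameters()) + list(head.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)
criterion = nn.CrossEntropyLoss()

patches = torch.randn(8, 512, patch_dim)             # dummy batch of patch sequences
labels = torch.randint(0, num_classes, (8,))

logits = head(encoder(patch_embed(patches)).mean(dim=1))  # mean-pool patch tokens
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```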
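The patch-shape comparison is easiest to see on a toy spectrogram. The 16x16 and 128x2 shapes below are the ones discussed in the paper; the unfold-based splitting itself is only an illustration of how the two tokenizations differ.

```python
# Splitting a toy 128-bin, 1024-frame log-Mel spectrogram two ways.
import torch
import torch.nn.functional as F

spec = torch.randn(1, 1, 128, 1024)   # (batch, channel, freq bins, time frames)

# Square patches: 16 frequency bins x 16 time frames.
square = F.unfold(spec, kernel_size=(16, 16), stride=(16, 16))
print(square.shape)                   # (1, 256, 512): 512 patches of 16*16 values each

# Frame-based patches: all 128 frequency bins x 2 time frames, analogous to the
# frame-level representations common in speech models.
frames = F.unfold(spec, kernel_size=(128, 2), stride=(128, 2))
print(frames.shape)                   # (1, 256, 512): 512 patches of 128*2 values each
```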
Implications and Future Work
This paper broadens the scope of self-supervised learning in audio and speech processing by making large volumes of unlabeled data usable, reducing the need for expensive manual annotation. Practically, this improves the scalability and applicability of audio classification models across a wide range of tasks and devices. Theoretically, it points to promising avenues for exploring different pretraining tasks, dataset compositions, and architectural choices within Transformer frameworks.
Future directions noted in the paper include scaling up pretraining with larger batches and more compute, further tuning patch sizes and shapes, and exploring cross-domain compatibility. These aim to address limitations imposed by resource constraints and to broaden the applicability of self-supervised learning in audio classification and related fields.