Overview of Masked Spectrogram Modeling using Masked Autoencoders
The paper "Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Audio Representation" addresses a self-supervised learning approach aimed at advancing general-purpose audio representation. The research is focused on improving audio representations through the novel use of Masked Spectrogram Modeling (MSM), implemented via Masked Autoencoders (MAE). This approach diverges from traditional audio contrastive learning methods by utilizing the input audio itself for supervision.
Methodology
The paper introduces MSM as an adaptation of Masked Image Modeling to audio spectrograms. Audio signals are first converted to spectrograms, which are divided into patches much as images are in computer vision tasks. MSM then masks a large fraction of these patches, and the MAE learns by reconstructing the masked spectrogram regions from the visible patches, thereby grounding the learned representation directly in the input data.
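The patchify-mask-reconstruct loop can be illustrated with a short sketch. The spectrogram shape, 16x16 patch size, and 75% mask ratio below are illustrative choices in the spirit of MAE-style training, not necessarily the paper's exact configuration.

```python
import torch

def patchify(spec, patch_f=16, patch_t=16):
    """Split a (freq, time) spectrogram into non-overlapping, flattened patches."""
    F, T = spec.shape  # assumed divisible by patch_f and patch_t
    patches = (spec
               .reshape(F // patch_f, patch_f, T // patch_t, patch_t)
               .permute(0, 2, 1, 3)
               .reshape(-1, patch_f * patch_t))
    return patches

def random_mask(patches, mask_ratio=0.75):
    """Keep a random subset of patches; the rest become reconstruction targets."""
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = torch.randperm(n)
    keep_idx, mask_idx = perm[:n_keep], perm[n_keep:]
    return patches[keep_idx], keep_idx, mask_idx

# Toy example: an 80-band x 208-frame log-mel spectrogram (random values here).
spec = torch.randn(80, 208)
patches = patchify(spec)                            # 5 x 13 = 65 patches of 256 values each
visible, keep_idx, mask_idx = random_mask(patches)  # the encoder would see only ~25% of patches

# An MAE decoder would predict patches[mask_idx] from the encoded visible patches;
# training minimizes a reconstruction loss (e.g., mean squared error) on those masked patches.
```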
Experimental Framework
Experiments were conducted on the HEAR 2021 NeurIPS Challenge benchmark, a suite that assesses performance across a diverse set of audio tasks. The researchers evaluated multiple MSM-MAE configurations, varying the input audio duration and the patch size, and found that both longer inputs and finer time resolution improved performance on several tasks.
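The effect of these design choices on the model's token grid can be made concrete with some simple arithmetic. The numbers below are illustrative, not the paper's exact settings.

```python
# A spectrogram with n_mels frequency bins and `frames` time frames, cut into
# patch_f x patch_t patches, yields an (n_mels / patch_f) x (frames / patch_t)
# grid of tokens fed to the Transformer encoder.

def token_grid(n_mels, frames, patch_f, patch_t):
    assert n_mels % patch_f == 0 and frames % patch_t == 0
    return n_mels // patch_f, frames // patch_t

print(token_grid(80, 208, 16, 16))  # (5, 13) baseline grid
print(token_grid(80, 208, 16, 8))   # (5, 26) smaller time patches -> finer time resolution
print(token_grid(80, 416, 16, 16))  # (5, 26) longer audio input -> more time tokens
```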
Notably, the MSM-MAE models demonstrated superior performance on seven out of fifteen tasks, achieving accuracies of 73.4% on the CREMA-D task and 85.8% on LibriCount. These results indicate the effectiveness of MSM-MAE in deriving representations that are broadly applicable across various audio processing tasks.
Key Contributions
- MSM Development: The paper's primary contribution is the introduction and implementation of Masked Spectrogram Modeling using Masked Autoencoders, setting a foundation for subsequent work in audio representation.
- Performance Validation: Through comparison with existing methods on the HEAR 2021 Challenge tasks, the MSM-MAE models delivered competitive results, outperforming many of the benchmarked approaches in terms of general-purpose applicability.
- Design Insights: The research explored the impact of model design choices—specifically input audio duration and patch size—on task performance, yielding insights that can guide future developments in audio representation learning.
Practical and Theoretical Implications
The MSM-MAE framework holds significant potential for various applications, such as improving automated audio analysis systems used in diverse fields from speech recognition to environmental sound classification. Theoretically, the ability to directly leverage input data for self-supervision may inspire further innovations in self-supervised learning methodologies beyond audio processing.
Future Directions
Building on these results, future research could refine MSM-MAE by exploring alternative masking strategies (one hypothetical example is sketched below) or by incorporating additional context from multi-modal inputs. Expanding the evaluation to more datasets and tasks would further test the robustness and flexibility of the MSM framework across other audio application domains.
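As one hypothetical illustration of a structured masking strategy (not something evaluated in the paper), the sketch below masks whole time columns of the patch grid instead of individual patches, which forces the model to infer longer temporal spans.

```python
import torch

def time_block_mask(grid_f, grid_t, mask_ratio=0.75):
    """Mask entire time columns of the patch grid rather than individual patches.

    A hypothetical structured alternative to unstructured random masking;
    not a strategy from the paper.
    """
    n_mask_cols = int(round(grid_t * mask_ratio))
    cols = torch.randperm(grid_t)[:n_mask_cols]
    mask = torch.zeros(grid_f, grid_t, dtype=torch.bool)
    mask[:, cols] = True
    return mask  # True marks patches the decoder must reconstruct

mask = time_block_mask(grid_f=5, grid_t=13)
print(mask.float().mean())  # roughly the requested mask ratio (~0.77 here)
```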
In summary, this paper makes a substantial contribution to audio representation learning, offering a methodology that draws supervision directly from the audio signal rather than from contrastive comparisons. As self-supervised learning advances, the MSM-MAE framework could become instrumental in shaping future developments in audio processing technologies.