Insightful Overview of "AudioTagging Done Right: 2nd Comparison of Deep Learning Methods for Environmental Sound Classification"
Introduction
The paper "AudioTagging Done Right: 2nd Comparison of Deep Learning Methods for Environmental Sound Classification" by Juncheng B Li et al. explores the current trends and methodologies in audio tagging (AT), focusing specifically on the effectiveness and efficiency of attention-based neural architectures compared to convolutional neural networks (CNNs). This work builds upon the impressive success of attention mechanisms in natural language and vision fields, examining their application to environmental sound classification tasks.
Overview
The researchers investigate several state-of-the-art neural network architectures for AT, comparing traditional CNN variants with attention-based models such as Vision Transformers (ViT). Their experiments use AudioSet, the largest weakly labeled sound event dataset available, as a common benchmark for comprehensive model evaluation. They address crucial factors such as model performance, efficiency, and optimization strategy, offering insights into trade-offs that can inform future audio tagging research.
Experimental Setup and Methodologies
The authors methodically set up a series of experiments to understand the strengths and weaknesses of various neural architectures on the AT task. Key elements of their approach:
- Dataset Utilization: The AudioSet dataset serves as the primary benchmark. It comprises over 2 million 10-second audio clips from YouTube, annotated with 527 sound event labels, and ships with a small balanced subset and a much larger unbalanced one (a feature-extraction sketch for such clips follows this list).
- Model Variants: The paper pits four transformer-based architectures (CNN+Transformer, pure Transformer, Vision Transformer (ViT), and Conformer) against traditional CNN and CRNN models. These models are trained under multiple configurations, varying factors such as pretraining and learning rate schedules (a minimal hybrid CNN+Transformer sketch appears after this list).
- Optimization Strategies: The paper examines critical optimization parameters such as learning rate (LR) scheduling, weight decay, momentum, normalization, and data augmentation, probing their impact on model accuracy and training speed (see the warmup-and-decay schedule sketch after this list).
- Data Quality Analysis: The authors also study how data quality affects model performance. Quantile-based analysis evaluates how models respond to varying label quality in the dataset, providing insight into the robustness of the different architectures (the mechanics are sketched below).
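To make the dataset item concrete, here is one common way to turn a 10-second clip into log-mel features. This is a minimal sketch assuming torchaudio; the 32 kHz sample rate, window, hop, and 128 mel bins are illustrative choices, not the paper's exact front end.

```python
import torch
import torchaudio

def to_log_mel(waveform: torch.Tensor, sample_rate: int = 32000) -> torch.Tensor:
    """Convert a mono waveform to log-mel features of shape (channel, frames, mels)."""
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sample_rate,
        n_fft=1024,       # ~32 ms analysis window at 32 kHz
        hop_length=320,   # 10 ms hop -> ~1000 frames per 10 s clip
        n_mels=128,       # shrinking this is one way to trade accuracy for speed
    )(waveform)
    return torch.log(mel + 1e-6).transpose(-1, -2)  # log-compress, frames first

feats = to_log_mel(torch.randn(1, 320000))  # a 10 s clip -> roughly (1, 1001, 128)
```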
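The hybrid CNN+Transformer idea can be pictured in a few lines of PyTorch: a convolutional front end compresses the spectrogram into a token sequence, and a Transformer encoder models global context before the multi-label head. This is a hedged sketch under assumed layer sizes, not the architecture the paper evaluates.

```python
import torch
import torch.nn as nn

class CnnTransformerTagger(nn.Module):
    """Toy hybrid tagger: CNN front end + Transformer encoder + linear head."""
    def __init__(self, n_classes: int = 527, d_model: int = 256):
        super().__init__()
        self.conv = nn.Sequential(  # downsample time and frequency by 4x each
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.BatchNorm2d(d_model), nn.ReLU(),
        )
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, mels) log-mel features
        h = self.conv(x.unsqueeze(1))       # (batch, d_model, frames/4, mels/4)
        h = h.mean(dim=-1).transpose(1, 2)  # pool frequency -> (batch, frames/4, d_model)
        h = self.encoder(h).mean(dim=1)     # attend over time, then average-pool
        return self.head(h)                 # logits for 527-way multi-label (BCE) training

logits = CnnTransformerTagger()(torch.randn(2, 1001, 128))  # -> (2, 527)
```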
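A minimal version of the warmup-then-decay LR recipe the paper probes, in PyTorch; the warmup length, peak LR, decay horizon, and weight decay below are assumptions chosen for illustration, not the paper's settings.

```python
import torch

model = torch.nn.Linear(128, 527)  # stand-in for a real tagging model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)

# Linear warmup for the first 1,000 steps, then cosine decay for the rest of training.
warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=1000)
decay = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=99000)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, decay], milestones=[1000]
)

for step in range(5):  # tiny demo loop; real training iterates over AudioSet batches
    optimizer.zero_grad()
    loss = model(torch.randn(4, 128)).sum()
    loss.backward()
    optimizer.step()
    scheduler.step()   # advance the schedule once per optimizer step
```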
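The quantile-based analysis can be pictured as bucketing the 527 classes by an estimated label-quality score and comparing per-class accuracy within each bucket. The sketch below shows only the mechanics; `quality` and `ap` are hypothetical placeholder arrays, not real measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
quality = rng.random(527)  # hypothetical per-class label-quality estimates
ap = rng.random(527)       # hypothetical per-class average precision scores

# Split classes into four quality quantiles and compare mean AP per bucket.
edges = np.quantile(quality, [0.25, 0.5, 0.75])
buckets = np.digitize(quality, edges)  # 0 = lowest-quality quartile, 3 = highest
for q in range(4):
    print(f"quality quantile {q}: mean AP = {ap[buckets == q].mean():.3f}")
```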
Primary Findings
- Model Performance: Transformer-based models, particularly hybrid CNN+Transformer architectures, match and in certain setups exceed CNN performance. However, they are notably more sensitive to initialization and optimization hyperparameters.
- Pretraining Benefits: While pretraining can be beneficial, especially for ViTs, it is not universally essential. Some architectures can achieve high performance even when trained from scratch, emphasizing the potential for efficiency gains without extensive pretraining.
- Optimization and Augmentations: The paper underscores the significance of LR scheduling and the role of data augmentations such as Mixup and TimeSpecAug in improving model accuracy (both augmentations are sketched after this list). Custom normalization techniques are also shown to substantially affect training dynamics and final outcomes.
- Data Efficiency: Smaller feature sizes drastically improve training and inference speeds, though at a modest cost to accuracy, making them a viable option for exploratory research.
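As a concrete picture of the augmentations credited above, the sketch below implements Mixup and SpecAugment-style time/frequency masking (one plausible reading of TimeSpecAug) with torchaudio; the Beta parameter and mask widths are illustrative assumptions.

```python
import torch
import torchaudio

def mixup(x: torch.Tensor, y: torch.Tensor, alpha: float = 0.5):
    """Blend a batch with a shuffled copy of itself; y is a multi-hot label matrix."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    idx = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[idx], lam * y + (1 - lam) * y[idx]

# SpecAugment-style masking: zero out random time frames and frequency bands.
spec_aug = torch.nn.Sequential(
    torchaudio.transforms.TimeMasking(time_mask_param=64),       # up to 64 frames
    torchaudio.transforms.FrequencyMasking(freq_mask_param=16),  # up to 16 mel bins
)

x = torch.randn(8, 1, 128, 1001)            # (batch, channel, mels, frames)
y = torch.randint(0, 2, (8, 527)).float()   # multi-hot labels for 527 classes
x_aug, y_aug = mixup(spec_aug(x), y)
```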
Implications and Future Directions
This detailed comparison of CNN and attention-based architectures provides valuable insights for the development of more efficient and effective sound event classification systems. The findings highlight the computational trade-offs and challenges inherent in model training for audio tagging, suggesting that hybrid models might offer a beneficial balance between localized feature extraction and global pattern recognition.
Future research could leverage these insights to refine transformer models further and explore their application across other types of audio data. Additionally, the elucidation of optimization strategies opens avenues for more robust and adaptable AT systems, potentially combining pretraining techniques with innovative data processing methodologies to achieve state-of-the-art performance across diverse environments.