PSLA: Improving Audio Tagging with Pretraining, Sampling, Labeling, and Aggregation (2102.01243v3)

Published 2 Feb 2021 in cs.SD, cs.LG, and eess.AS

Abstract: Audio tagging is an active research area and has a wide range of applications. Since the release of AudioSet, great progress has been made in advancing model performance, which mostly comes from the development of novel model architectures and attention modules. However, we find that appropriate training techniques are equally important for building audio tagging models with AudioSet, but have not received the attention they deserve. To fill the gap, in this work, we present PSLA, a collection of training techniques that can noticeably boost the model accuracy including ImageNet pretraining, balanced sampling, data augmentation, label enhancement, model aggregation and their design choices. By training an EfficientNet with these techniques, we obtain a single model (with 13.6M parameters) and an ensemble model that achieve mean average precision (mAP) scores of 0.444 and 0.474 on AudioSet, respectively, outperforming the previous best system of 0.439 with 81M parameters. In addition, our model also achieves a new state-of-the-art mAP of 0.567 on FSD50K.

Authors (3)
  1. Yuan Gong (45 papers)
  2. Yu-An Chung (33 papers)
  3. James Glass (173 papers)
Citations (132)

Summary

  • The paper introduces PSLA, a framework of pretraining, sampling, labeling, and aggregation techniques that significantly improve audio tagging performance using existing model architectures and large datasets like AudioSet.
  • Key PSLA techniques include leveraging ImageNet pretraining for strong generalization, using balanced sampling and mix-up for robustness, and applying label enhancement to address dataset annotation errors.
  • Ablation studies confirm each PSLA technique contributes to improved performance on AudioSet and FSD50K, highlighting the importance of training paradigms over solely architectural innovations for efficient audio tagging.

Enhancing Audio Tagging through Training Techniques: Insights from "PSLA: Improving Audio Tagging with Pretraining, Sampling, Labeling, and Aggregation"

The paper "PSLA: Improving Audio Tagging with Pretraining, Sampling, Labeling, and Aggregation" by Yuan Gong, Yu-An Chung, and James Glass introduces a comprehensive set of training techniques that demonstratively improve audio tagging performances. It focuses on leveraging AudioSet—a large dataset featuring audio clips tagged with various sound events—through innovative methods that do not necessitate altering existing model architectures.

The PSLA framework centers on four training methodologies: pretraining, sampling, labeling, and aggregation. Notably, the authors emphasize the significance of ImageNet pretraining for convolutional neural networks (CNNs), finding that models pretrained on visual tasks yield surprisingly better generalization and accuracy for audio tagging. Despite AudioSet's substantial size, this technique contributes a clear improvement in mean average precision (mAP), indicating the utility of knowledge transfer across modalities.
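
As a rough illustration of this cross-modal transfer (not the authors' exact pipeline), the sketch below loads ImageNet weights into an EfficientNet-B2 via torchvision, collapses the pretrained RGB stem so the network accepts single-channel spectrograms, and swaps in a 527-way multi-label head. The library choice, input shape, and the stem-averaging step are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import efficientnet_b2, EfficientNet_B2_Weights

NUM_CLASSES = 527  # AudioSet label set

# Start from ImageNet-pretrained weights (the core idea of the pretraining step).
model = efficientnet_b2(weights=EfficientNet_B2_Weights.IMAGENET1K_V1)

# Spectrograms are single-channel, so collapse the RGB stem by averaging its
# filters over the colour dimension (one common adaptation; an assumption here).
old_stem = model.features[0][0]  # first Conv2d (3 -> 32)
new_stem = nn.Conv2d(1, old_stem.out_channels,
                     kernel_size=old_stem.kernel_size,
                     stride=old_stem.stride,
                     padding=old_stem.padding,
                     bias=False)
with torch.no_grad():
    new_stem.weight.copy_(old_stem.weight.mean(dim=1, keepdim=True))
model.features[0][0] = new_stem

# Replace the 1000-way ImageNet head with a multi-label audio-tagging head.
model.classifier[1] = nn.Linear(model.classifier[1].in_features, NUM_CLASSES)

# A batch of log-mel spectrograms: (batch, 1, frames, mel_bins).
logits = model(torch.randn(2, 1, 998, 128))  # -> (2, 527)
probs = torch.sigmoid(logits)                # independent per-class probabilities
```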

Further, recognizing the class imbalance inherent in AudioSet, the paper discusses balanced sampling and data augmentation strategies. The authors propose a random balanced sampling algorithm and combine it with time-frequency masking and mix-up training so that rare classes are seen more evenly during learning. Mix-up in particular increases the variation in the training data, which mitigates overfitting and improves robustness to label noise.
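
A minimal sketch of these ideas follows, using a PyTorch WeightedRandomSampler as a stand-in for the paper's balanced sampler, torchaudio masking transforms for time-frequency masking, and a simple mix-up helper. The weighting scheme, mask parameters, and Beta parameter are illustrative assumptions rather than the paper's exact settings.

```python
import torch
import torchaudio

def clip_weights(labels: torch.Tensor) -> torch.Tensor:
    """labels: (num_clips, num_classes) multi-hot matrix.
    Clips containing rarer classes receive larger sampling weight."""
    class_freq = labels.sum(dim=0).clamp(min=1)   # clips per class
    return (labels / class_freq).sum(dim=1)       # (num_clips,)

# SpecAugment-style masking on (..., mel_bins, frames) spectrograms.
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=24)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=96)

def augment(spec: torch.Tensor) -> torch.Tensor:
    return time_mask(freq_mask(spec))

def mixup(x1, y1, x2, y2, alpha: float = 10.0):
    """Blend two (spectrogram, multi-hot label) pairs with a Beta-sampled weight."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

# Usage (dataset is assumed to yield (spectrogram, multi-hot label) pairs):
# weights = clip_weights(all_labels)
# sampler = torch.utils.data.WeightedRandomSampler(weights, num_samples=len(weights))
# loader = torch.utils.data.DataLoader(dataset, batch_size=32, sampler=sampler)
```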

The researchers also examine the pervasive annotation errors within AudioSet. Focusing on labeling discrepancies, which are especially common in superclass–subclass annotations such as the speech categories, they present a label enhancement technique that leverages the AudioSet ontology. This approach relabels the training data so that it better reflects the audio events actually present in each clip. It noticeably helps on the smaller balanced training set, while gains on the full dataset are modest, possibly because the evaluation labels suffer from the same annotation issues.
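
The sketch below shows one plausible form of ontology-based label completion: propagating each subclass tag up to its ancestors so that superclass labels such as "Speech" are never missing. The toy parent map and single-parent assumption are simplifications; the real AudioSet ontology is a JSON graph, and the paper's enhancement procedure has several variants.

```python
from typing import Dict, Set

# Toy stand-in for the AudioSet ontology (label -> parent label).
parent_of: Dict[str, str] = {
    "Male speech": "Speech",
    "Female speech": "Speech",
    "Speech": "Human sounds",
}

def complete_labels(labels: Set[str], parent_of: Dict[str, str]) -> Set[str]:
    """Add every ancestor of each label so superclass tags are never missing."""
    completed = set(labels)
    for label in labels:
        node = label
        while node in parent_of:      # walk up toward the ontology root
            node = parent_of[node]
            completed.add(node)
    return completed

print(complete_labels({"Male speech"}, parent_of))
# {'Male speech', 'Speech', 'Human sounds'}
```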

Lastly, the paper employs weight averaging and ensembling to push performance further: weight averaging merges the parameters of multiple checkpoints into a single model, while ensembling averages the predictions of models trained with different random seeds or settings. These aggregation strategies exploit the diversity of training trajectories to overcome the limitations of any individual model.
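
Both flavours of aggregation are easy to express in PyTorch, as in the hedged sketch below; the function names and the assumption that each checkpoint file stores a plain state dict are illustrative, not the authors' implementation.

```python
import torch

def average_checkpoints(paths, model):
    """Average the weights of several checkpoints of the same architecture."""
    state_dicts = [torch.load(p, map_location="cpu") for p in paths]
    avg = {k: torch.stack([sd[k].float() for sd in state_dicts]).mean(dim=0)
           for k in state_dicts[0]}
    model.load_state_dict(avg)
    return model

@torch.no_grad()
def ensemble_predict(models, spec):
    """Average the sigmoid outputs of independently trained models."""
    probs = [torch.sigmoid(m(spec)) for m in models]
    return torch.stack(probs).mean(dim=0)
```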

An ablation study confirmed that each training technique is beneficial, yielding improved mAP scores on AudioSet and substantial gains when applied to FSD50K, a second audio tagging dataset. The results show that these training strategies, which integrate seamlessly with existing models such as EfficientNet or ResNet, can match or exceed the gains usually attributed to architectural innovation.

In terms of implications, the PSLA techniques point toward a deeper understanding of training paradigms that can matter as much as architecture design, suggesting new directions for audio AI research. By optimizing a relatively parameter-light model to exceed the performance of much larger systems, the authors set a precedent for efficient large-scale model deployment, which is particularly relevant for real-world applications where computational resources are limited.

This research contributes a valuable framework for researchers aiming to improve audio tagging with existing datasets and models and prompts a reevaluation of dataset preparation and model aggregation advantages in AI. The PSLA insights advance the field by indicating that optimal training procedures can drive innovation just as significantly as novel architectures, heralding potential developments in ensemble learning, transfer learning, and label improvement methodologies for future AI research.
