- The paper introduces PANNs, pretrained on 1.9M AudioSet clips covering 527 sound classes to improve audio pattern recognition.
- It proposes the Wavegram-Logmel-CNN architecture combining learnable Wavegrams and log-mel spectrograms, achieving an mAP of 0.439 on AudioSet tagging.
- Extensive evaluations and transfer learning experiments show that PANNs outperform models trained from scratch on tasks like ESC-50 and GTZAN.
An Investigation into Pretrained Audio Neural Networks for Audio Pattern Recognition
The paper "PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition" presents a comprehensive paper on the development and application of Pretrained Audio Neural Networks (PANNs) trained on the AudioSet dataset. The research primarily focuses on improving audio pattern recognition tasks by leveraging large-scale pretraining and subsequently applying these pretrained models to a variety of audio-related tasks.
Core Contributions
- Introduction of PANNs: The authors propose PANNs, pretrained on the extensive AudioSet dataset of 1.9 million audio clips spanning 527 sound classes. The approach is inspired by the gains seen in image and language processing from systems pretrained on large-scale data such as ImageNet and Wikipedia.
- Wavegram-CNN and Wavegram-Logmel-CNN: A key innovation of this work is the Wavegram, a learnable time-frequency representation computed from the raw waveform. The Wavegram-CNN classifies from Wavegrams alone, while the Wavegram-Logmel-CNN combines Wavegrams with log-mel spectrograms; the latter achieves a mean average precision (mAP) of 0.439, surpassing previous state-of-the-art methods (see the architecture sketch after this list).
- Extensive Evaluation and Comparison: The paper meticulously evaluates different architectures, including conventional CNNs, ResNets, MobileNets, and one-dimensional CNNs on the AudioSet tagging task. The results indicate that deeper networks (e.g., ResNet38) outperform shallower ones, with notable trade-offs in computational complexity.
- Pretraining Benefits and Transfer Learning: A substantial part of the paper is devoted to transferring PANNs to other audio tasks such as ESC-50 (environmental sound classification), DCASE 2019 Task 1 (acoustic scene classification), and GTZAN (music genre classification). The pretrained models, when fine-tuned on these tasks, consistently outperformed models trained from scratch, confirming the effectiveness of pretraining on large-scale datasets.
- Analysis of Data Processing Techniques: The research also quantifies the impact of data augmentation techniques such as mixup and SpecAugment, alongside data balancing strategies, on the performance of PANNs. For instance, applying mixup on log-mel spectrograms improved the mAP from 0.416 to 0.431 (a minimal mixup sketch also follows this list).
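To make the Wavegram-Logmel fusion concrete, below is a minimal PyTorch sketch of the idea: a small stack of strided 1-D convolutions learns a time-frequency map from the raw waveform, which is stacked channel-wise with a log-mel spectrogram and fed to a shared 2-D CNN. All layer sizes, kernel settings, and the toy backbone are illustrative placeholders, not the authors' exact configuration (the paper's front-end and CNN14 backbone are considerably deeper).

```python
import torch
import torch.nn as nn
import torchaudio

class WavegramLogmelSketch(nn.Module):
    """Illustrative sketch of fusing a learnable waveform front-end
    ("Wavegram") with a log-mel spectrogram; sizes are placeholders."""

    def __init__(self, sample_rate=32000, n_mels=64, n_classes=527):
        super().__init__()
        # Learnable time-frequency front-end: strided 1-D convs on raw audio.
        self.wave_frontend = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=11, stride=5, padding=5),
            nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, n_mels, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm1d(n_mels), nn.ReLU(),
        )
        # Fixed log-mel front-end.
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=1024, hop_length=320, n_mels=n_mels)
        self.to_db = torchaudio.transforms.AmplitudeToDB()
        # Shared backbone sees a 2-channel "image": [wavegram, log-mel].
        self.backbone = nn.Sequential(
            nn.Conv2d(2, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, n_classes),  # 527 AudioSet classes
        )

    def forward(self, waveform):                              # (B, samples)
        wavegram = self.wave_frontend(waveform.unsqueeze(1))  # (B, n_mels, T1)
        logmel = self.to_db(self.melspec(waveform))           # (B, n_mels, T2)
        # Align time resolution, then stack the two views as channels.
        t = min(wavegram.shape[-1], logmel.shape[-1])
        x = torch.stack([wavegram[..., :t], logmel[..., :t]], dim=1)
        return self.backbone(x)  # logits; apply sigmoid for multi-label tagging
```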
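The mixup augmentation reported above is simple to implement. The sketch below follows the standard mixup recipe, interpolating pairs of examples and their multi-hot label vectors with a Beta-distributed weight; the variable names and default `alpha` are illustrative, not taken from the paper's code.

```python
import numpy as np
import torch

def mixup(features, targets, alpha=1.0):
    """Convexly combine shuffled pairs of examples and their (multi-hot)
    labels; applied here to batches of log-mel spectrograms."""
    lam = np.random.beta(alpha, alpha)
    perm = torch.randperm(features.size(0))
    mixed_x = lam * features + (1.0 - lam) * features[perm]
    mixed_y = lam * targets + (1.0 - lam) * targets[perm]
    return mixed_x, mixed_y

# Usage with log-mel batches (B, 1, mel_bins, frames) and labels (B, 527):
#   x_mix, y_mix = mixup(logmel_batch, label_batch)
#   loss = F.binary_cross_entropy_with_logits(model(x_mix), y_mix)
```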
Theoretical Implications
The findings suggest that large-scale pretraining on diverse datasets such as AudioSet can significantly enhance the generalization capability of models across various audio pattern recognition tasks. The use of learnable features like Wavegrams demonstrates the potential to replace or augment traditional handcrafted features, leading to better performance.
Practical Implications
Practically, the paper paves the way for employing pretrained audio models in applications requiring robust audio pattern recognition. The MobileNet results are particularly noteworthy for resource-constrained environments, illustrating the trade-off between accuracy and computational cost.
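As an illustration of how such a pretrained model might be adapted in practice, here is a hedged sketch of the common transfer recipe the paper evaluates: replace the final classification layer with a fresh task head (e.g., 50 classes for ESC-50) and optionally freeze the pretrained embedding layers. The attribute name `fc` is an assumption about the model object, not the released PANNs API.

```python
import torch.nn as nn

def adapt_for_transfer(pretrained_model, num_classes, freeze_base=True):
    """Swap in a new task head; optionally freeze the pretrained base.
    Assumes the model's final classifier is stored as `fc` (hypothetical)."""
    if freeze_base:
        for p in pretrained_model.parameters():
            p.requires_grad = False  # "frozen embeddings" transfer variant
    in_features = pretrained_model.fc.in_features   # assumed final linear layer
    pretrained_model.fc = nn.Linear(in_features, num_classes)  # trains from scratch
    return pretrained_model
```

With `freeze_base=True` only the new head is trained (fast, few labels needed); with `freeze_base=False` the whole network is fine-tuned, which the paper finds generally performs best when enough task data is available.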
Future Directions
- Further Exploration of Wavegrams: The interplay between learnable time-frequency representations and traditional spectrograms shows promise. Future research could focus on refining Wavegram architectures and exploring their application in different acoustic domains.
- Expanding Dataset Scope: Given the success with AudioSet, extending the pretraining to even larger and more diverse datasets could further improve the robustness and versatility of PANNs.
- Enhancing Transfer Learning Techniques: Fine-tuning strategies could be optimized to better preserve the general features learned during pretraining while adapting precisely to new tasks.
Conclusion
This paper offers a significant advancement in the field of audio pattern recognition by introducing PANNs and demonstrating their effectiveness through rigorous experimentation. By achieving state-of-the-art results on multiple benchmarks and proposing innovative architectures, it sets a strong foundation for the future development of robust, scalable, and efficient audio recognition systems. The work serves as a benchmark for future research aiming to leverage large-scale pretraining in audio and potentially other sensory domains.