General audio tagging with ensembling convolutional neural network and statistical features (1810.12832v1)

Published 30 Oct 2018 in cs.CV

Abstract: Audio tagging aims to infer descriptive labels from audio clips. Audio tagging is challenging due to the limited size of data and noisy labels. In this paper, we describe our solution for the DCASE 2018 Task 2 general audio tagging challenge. The contributions of our solution include: We investigated a variety of convolutional neural network architectures to solve the audio tagging task. Statistical features are applied to capture statistical patterns of audio features to improve the classification performance. Ensemble learning is applied to ensemble the outputs from the deep classifiers to utilize complementary information. A sample re-weight strategy is employed for ensemble training to address the noisy label problem. Our system achieves a mean average precision (mAP@3) of 0.958, outperforming the baseline system of 0.704. Our system ranked 1st and 4th out of 558 submissions on the public and private leaderboards of the DCASE 2018 Task 2 challenge. Our codes are available at https://github.com/Cocoxili/DCASE2018Task2/.

Citations (30)

Summary

  • The paper demonstrates significant mAP@3 improvements by integrating ensemble CNN models with handcrafted statistical features.
  • It employs a robust two-level ensemble method that combines predictions from various CNN architectures using a GBDT for enhanced accuracy.
  • It introduces a novel sample re-weighting strategy to mitigate noisy labels and improve the overall reliability of audio tagging.

Evaluation of General Audio Tagging through Ensemble Convolutional Neural Networks

The paper "General Audio Tagging with Ensembling Convolutional Neural Networks and Statistical Features" presents a methodological approach to the general audio tagging problem, in the context of the DCASE 2018 Task 2 competition. The proposed system demonstrates significant improvements in mean average precision (mAP@3) over the baseline through strategic use of convolutional neural networks (CNNs) and ensemble learning techniques.

Methodological Contributions

The authors investigate a variety of CNN architectures to enhance the audio tagging task performance. Their exploration includes popular models such as VGG, Inception, ResNet, DenseNet, ResNeXt, SE-ResNeXt, and DPN. Each model derives from methods previously successful in computer vision tasks, leveraging their efficacy in extracting features from audio data. The models were tested both with randomly initialized weights and pre-trained models from image datasets, showing that pre-trained models generally yielded better performance.
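A practical detail when reusing image-pretrained weights is that ImageNet models expect three-channel input, while a log-mel spectrogram has one. A common workaround, shown here as a minimal sketch (the paper does not specify its exact adaptation, so this is an assumption), is to replicate the spectrogram across three channels:

```python
import numpy as np

def to_three_channel(log_mel: np.ndarray) -> np.ndarray:
    """Replicate a single-channel log-mel spectrogram across three
    channels so it matches the (3, H, W) input layout expected by
    CNNs pre-trained on ImageNet."""
    return np.stack([log_mel, log_mel, log_mel], axis=0)

spec = np.random.randn(64, 128)  # 64 mel bins, 128 time frames
x = to_three_channel(spec)
print(x.shape)  # (3, 64, 128)
```

The replicated input lets the first convolutional layer keep its pretrained filters unchanged, at the cost of redundant channels.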

They incorporate statistical features into their system, including skewness and kurtosis of Mel-frequency cepstral coefficients (MFCCs), recognizing that these handcrafted features capture audio characteristics that CNNs might not readily learn. These features provide complementary insights that enhance classification accuracy when combined with deep learning models in the ensemble framework.
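The idea can be sketched as summarizing each MFCC coefficient's trajectory over time with a few moments. This is an illustrative reduction, not the paper's full feature set; the exact statistics and coefficient count here are assumptions:

```python
import numpy as np
from scipy.stats import skew, kurtosis

def mfcc_statistics(mfcc: np.ndarray) -> np.ndarray:
    """Summarize a (n_coeffs, n_frames) MFCC matrix with per-coefficient
    statistics over the time axis: mean, standard deviation, skewness,
    and kurtosis."""
    feats = [mfcc.mean(axis=1), mfcc.std(axis=1),
             skew(mfcc, axis=1), kurtosis(mfcc, axis=1)]
    return np.concatenate(feats)

mfcc = np.random.randn(20, 100)  # e.g. 20 coefficients over 100 frames
stats = mfcc_statistics(mfcc)
print(stats.shape)  # (80,)
```

The resulting fixed-length vector can be fed to a shallow classifier or concatenated with deep features, independent of clip duration.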

Ensemble Learning Approach

The ensemble learning strategy is central to the authors' methodology. They employ a scalable stacked generalization framework where Level 1 includes deep learning predictions from various CNNs, while Level 2 integrates these predictions using a Gradient Boosting Decision Tree (GBDT) to exploit meta-features. This two-level architecture capitalizes on the diversity of single-model predictions, making it computationally scalable and robust to overfitting—a likely issue given the limited data size.
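The stacking scheme can be illustrated with a small sketch: out-of-fold probabilities from Level 1 models become meta-features for a Level 2 GBDT. The base models below are hypothetical stand-ins for the paper's CNNs, and the synthetic data is only for demonstration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Level 1: base classifiers (stand-ins for the CNNs). Out-of-fold
# predicted probabilities become the meta-features, so the Level 2
# model never sees predictions made on a sample's own training fold.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
base_models = [LogisticRegression(max_iter=1000),
               GradientBoostingClassifier(n_estimators=50, random_state=0)]
meta = np.hstack([cross_val_predict(m, X, y, cv=5, method="predict_proba")
                  for m in base_models])

# Level 2: a GBDT combines the stacked meta-features.
stacker = GradientBoostingClassifier(n_estimators=50, random_state=0)
stacker.fit(meta, y)
print(meta.shape)  # (300, 4): two models x two class probabilities
```

Using out-of-fold predictions is what keeps the Level 2 model from simply memorizing Level 1 overfitting.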

Furthermore, the authors introduce a novel sample re-weighting strategy to address the noisy label problem inherent in the dataset, particularly affecting non-manually verified samples. This strategy adjusts the impact of non-verified samples in the loss function through a re-weighting hyper-parameter, thereby enhancing model robustness against mislabelled data points.
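The mechanism can be sketched as a weighted cross-entropy in which non-verified samples contribute less to the loss. The weight value and function names here are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def reweighted_nll(probs, labels, verified, alpha=0.5):
    """Cross-entropy loss where non-verified (possibly mislabelled)
    samples are down-weighted by a hyper-parameter alpha in [0, 1];
    verified samples keep full weight 1.0."""
    weights = np.where(verified, 1.0, alpha)
    nll = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    return np.mean(weights * nll)

probs = np.array([[0.9, 0.1], [0.2, 0.8]])  # predicted class probabilities
labels = np.array([0, 1])                   # asserted labels
verified = np.array([True, False])          # second sample is unverified
loss = reweighted_nll(probs, labels, verified, alpha=0.5)
```

Setting alpha to 1.0 recovers the standard unweighted loss; smaller values trade off the extra data against its label noise.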

Results and Implications

The ensemble system achieved a notable mAP@3 score of 0.958 in the DCASE 2018 Task 2 challenge, demonstrating superior performance over the provided baseline system. Specifically, the system ranked first and fourth on the public and private leaderboards, respectively, among 558 submissions.
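For reference, the mAP@3 metric used in this challenge scores each clip by the reciprocal rank of its single true label within the top-3 predictions. A minimal implementation (label strings are illustrative):

```python
def map_at_3(predictions, truths):
    """mAP@3 as used in DCASE 2018 Task 2: each clip has one true label,
    and a top-3 prediction list scores 1, 1/2, or 1/3 depending on the
    rank at which the true label appears (0 if absent)."""
    total = 0.0
    for preds, truth in zip(predictions, truths):
        for rank, p in enumerate(preds[:3], start=1):
            if p == truth:
                total += 1.0 / rank
                break
    return total / len(truths)

preds = [["dog", "cat", "car"], ["cat", "dog", "car"], ["car", "siren", "bus"]]
score = map_at_3(preds, ["dog", "dog", "horn"])
print(score)  # (1 + 1/2 + 0) / 3 = 0.5
```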

Such results endorse the potential of integrating diverse CNN architectures and leveraging ensemble learning methods to address challenges in audio tagging. The work paves the way for future research applying this approach to larger datasets such as Google's AudioSet, suggesting the scalability and adaptability of the methodology. Future research directions could focus on optimizing ensemble strategies and expanding the training data to improve model robustness and precision.

Conclusion

This paper makes a valuable contribution to the field of audio tagging by demonstrating an effective ensemble approach, incorporating both deep learning and handcrafted statistical features. It provides a foundation for future research in audio classification tasks and offers practical insights into handling noisy datasets through innovative re-weighting techniques. The promising results achieved in the DCASE 2018 Task 2 challenge reflect the efficacy of the proposed methodology.