- The paper demonstrates that CNN architectures, especially Inception V3 and ResNet-50, significantly outperform a DNN baseline on audio classification tasks.
- Training on the massive YouTube-100M dataset, evaluated with AUC, d-prime, and mAP, revealed diminishing returns in performance beyond roughly 7 million training videos.
- The research extends to Acoustic Event Detection, where leveraging CNN embeddings notably improved performance on the Audio Set benchmark.
CNN Architectures for Large-Scale Audio Classification
The paper compares the performance of several Convolutional Neural Network (CNN) architectures on large-scale audio classification. The study uses the massive YouTube-100M dataset, whose 70-million-video training partition totals 5.24 million hours of audio and is annotated with 30,871 distinct video-level labels; at 960 ms per example, that works out to roughly 20 billion training frames. Against this backdrop, the authors investigate how CNNs originally successful in image classification fare on audio-based tasks.
Dataset and Experimental Setup
The YouTube-100M dataset offers a rare opportunity to study audio classification at scale. Each video is annotated with multiple labels drawn from a predefined vocabulary of 30,871 unique tags. Because the labels are generated automatically from metadata and visual content rather than from the audio itself, they are inherently noisy.
The CNN models evaluated include the popular AlexNet, VGG, Inception V3, and ResNet-50 architectures. To create a standardized input for these networks, the audio from each video is divided into non-overlapping 960 ms frames, and each frame is converted into a log-mel spectrogram patch of 96 × 64 bins (96 ten-millisecond temporal frames × 64 mel-frequency bands, computed with 25 ms analysis windows every 10 ms). These spectrogram patches are then fed into the CNN architectures. Training employed the Adam optimizer with asynchronous updates across multiple GPUs, and every model replaces the usual softmax output with independent per-class sigmoids to accommodate multi-label classification.
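To make the front end concrete, here is a minimal sketch in Python using librosa. The 25 ms/10 ms analysis windows, 64 mel bands, and 96-frame (960 ms) patch size come from the paper; the 16 kHz sample rate and the 0.01 log offset are assumptions made for illustration.

```python
import numpy as np
import librosa

SR = 16000                 # assumed sample rate (not pinned down here)
N_FFT = int(0.025 * SR)    # 25 ms analysis window
HOP = int(0.010 * SR)      # 10 ms hop between windows
N_MELS = 64                # mel-frequency bands
FRAMES = 96                # 96 x 10 ms = one 960 ms patch

def log_mel_patches(waveform):
    """Return an array of (96, 64) log-mel patches covering the waveform."""
    mel = librosa.feature.melspectrogram(
        y=waveform, sr=SR, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS)
    log_mel = np.log(mel + 0.01).T        # small offset avoids log(0); (time, 64)
    n = log_mel.shape[0] // FRAMES        # number of whole 960 ms patches
    return log_mel[:n * FRAMES].reshape(n, FRAMES, N_MELS)
```

Each returned patch is one training example; a video contributes as many patches as it has whole 960 ms segments.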
Results and Evaluation
The paper evaluates each architecture using AUC (Area Under the ROC Curve), d-prime, and mean Average Precision (mAP). The results show that the CNNs significantly outperform the fully connected Deep Neural Network (DNN) baseline, with Inception V3 and ResNet-50 performing best.
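For reference, all three metrics are easy to compute with off-the-shelf tools; d-prime in particular is a monotonic transform of AUC, d' = √2 · Φ⁻¹(AUC), where Φ⁻¹ is the inverse standard-normal CDF. A toy sketch using scikit-learn and SciPy, with made-up scores for a single class:

```python
import numpy as np
from scipy.stats import norm
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy scores and labels for one class; the reported numbers average over classes.
y_true = np.array([0, 0, 1, 1, 1, 0])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.70, 0.20])

auc = roc_auc_score(y_true, y_score)            # Area Under the ROC Curve
ap = average_precision_score(y_true, y_score)   # mAP = mean of this over classes
d_prime = np.sqrt(2) * norm.ppf(auc)            # d' = sqrt(2) * inverse-normal(AUC)
print(f"AUC={auc:.3f}  AP={ap:.3f}  d'={d_prime:.3f}")
```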
Key results include:
- Fully connected DNN baseline: AUC 0.851, mAP 0.058.
- AlexNet: AUC 0.894, mAP 0.115.
- VGG: AUC 0.911, mAP 0.161.
- Inception V3: AUC 0.918, mAP 0.181.
- ResNet-50: AUC 0.926, mAP 0.212 (after extended training), the strongest result.
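All five models share the multi-label head described in the experimental setup: independent per-class sigmoids trained with binary cross-entropy, so a single video can carry several of the 30,871 labels at once. A minimal Keras-style sketch of that head follows, with a deliberately tiny placeholder backbone that stands in for any of the paper's architectures:

```python
import tensorflow as tf

N_CLASSES = 30871  # full label vocabulary

model = tf.keras.Sequential([
    tf.keras.Input(shape=(96, 64, 1)),                 # one log-mel patch
    tf.keras.layers.Conv2D(32, 3, activation="relu"),  # placeholder backbone
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(N_CLASSES, activation="sigmoid"),  # sigmoids, not softmax
])
model.compile(optimizer="adam",            # the paper trains with Adam
              loss="binary_crossentropy")  # one binary loss per label
```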
Interestingly, while the metrics improved steadily with more sophisticated architectures, the benefit of additional data plateaued: growing the training set from 7 million to 70 million videos yielded only marginal gains, suggesting diminishing returns at the very largest scales.
Analysis of Label Set and Training Set Size
The authors investigated the impact of label set size by training models with vocabularies of 30,871, 3,087, and 400 labels. Models trained with the larger vocabularies performed marginally better when evaluated on the smaller label subsets, suggesting that the broader label set aids generalization.
When examining the effect of training set size, models trained on larger datasets (up to 70 million videos) clearly outperformed those trained on small subsets (e.g., 23,000 videos). Beyond roughly 7 million videos, however, the gains were minimal: more data continues to help, but with sharply diminishing returns rather than any single optimal size.
Acoustic Event Detection (AED)
One notable extension of this research is its application to Acoustic Event Detection (AED). By leveraging embeddings from a ResNet-50 model pre-trained on YouTube-100M, the authors substantially improved AED performance on the Audio Set benchmark, achieving an mAP of 0.314 and an AUC of 0.959, compared to a baseline mAP of 0.137 and AUC of 0.904 for a model trained without the embeddings. This highlights the value of large-scale pre-training and learned embeddings for AED.
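The transfer recipe is simple to sketch: freeze the YouTube-100M-trained network, use its penultimate-layer activations as embeddings, and fit a small classifier on top for each Audio Set class. In the sketch below, `embed` is a stand-in for the frozen network, the data is random, and logistic regression substitutes for the shallow model the authors trained; all names and shapes are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(patches):
    """Stand-in for a frozen pre-trained network's penultimate-layer output."""
    return patches.reshape(len(patches), -1)  # here: just flattened features

# Random placeholders: 200 clips of (96, 64) log-mel patches and binary
# labels for a single Audio Set class.
X = np.random.randn(200, 96, 64)
y = np.random.randint(0, 2, size=200)

clf = LogisticRegression(max_iter=1000).fit(embed(X), y)
scores = clf.predict_proba(embed(X))[:, 1]  # per-clip scores, rankable for AP/AUC
```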
Implications and Future Directions
The paper substantiates the transfer of image-classification CNNs to audio, demonstrating their efficacy in large-scale settings. The results also underscore the value of very large training sets, while showing that returns diminish beyond a certain scale.
Future research directions could explore more specialized audio-specific architectures, the inclusion of temporal context through recurrent neural networks (RNNs), and advanced regularization techniques to mitigate overfitting on smaller datasets. Furthermore, investigating methods to handle weakly labeled or noisy data could further enhance model robustness and performance.
In conclusion, this work provides a thorough analysis of CNN architectures for large-scale audio classification and sets the stage for further work, both practical and architectural, in audio processing and analysis.