- The paper introduces a gated CNN with GLUs to enhance weakly supervised audio classification by selectively emphasizing relevant time-frequency features.
- The paper incorporates a temporal attention mechanism to accurately localize sound events within long audio sequences from weak labels.
- The study unifies audio tagging and sound event detection in one framework, placing first in audio tagging and second in sound event detection in the DCASE 2017 challenge.
Overview of "Large-Scale Weakly Supervised Audio Classification Using Gated Convolutional Neural Network"
The paper presents a novel approach to weakly supervised audio classification and sound event detection using a gated convolutional recurrent neural network (CRNN). The method was evaluated on Task 4 of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2017 challenge, where it ranked first in audio tagging and second in sound event detection. The paper is particularly notable for its effective use of weakly labelled data, in which clip-level audio tags are provided without precise event timestamps.
Key Contributions
- Gated Convolutional Neural Networks: The authors introduce gated linear units (GLUs) into convolutional neural networks (CNNs) for audio classification. The GLUs replace conventional ReLU activations and act as an internal attention mechanism that controls the information flow through the network layers, helping the model attend to the relevant time-frequency units in the spectrogram and thereby improving performance (a minimal code sketch of a gated block and the attention pooling follows this list).
- Temporal Attention for Localization: A temporal attention mechanism is proposed that lets the model infer the temporal locations of sound events within an audio clip even though training uses only weak, clip-level labels. This addresses the challenge of localizing short-lived events within longer audio sequences.
- Unified Model Architecture: The paper emphasizes a unified framework that handles audio tagging and weakly supervised sound event detection with a single shared model, which reduces complexity and scales well to large audio datasets.
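The following is a minimal PyTorch sketch of the two central ideas, a GLU-gated convolutional block and attention pooling over frames. It is an illustrative reimplementation rather than the authors' original code: the class names, layer sizes, and the example shapes (17 classes, 64 mel bands, 240 frames) are assumptions chosen for the example.

```python
import torch
import torch.nn as nn


class GLUConvBlock(nn.Module):
    """Convolutional block gated by a GLU.

    Two parallel convolutions produce a linear branch and a gate branch;
    the sigmoid of the gate multiplies the linear output element-wise,
    acting as a learned attention over time-frequency units.
    """
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.linear = nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, padding=pad)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):  # x: (batch, channels, time, freq)
        return self.bn(self.linear(x) * torch.sigmoid(self.gate(x)))


class AttentionPooling(nn.Module):
    """Clip-level prediction from frame-level features via temporal attention.

    A sigmoid branch gives per-frame class probabilities (usable for
    localization); a softmax branch weights the frames, so the clip-level
    prediction is an attention-weighted average trainable from weak labels.
    """
    def __init__(self, hidden_dim, n_classes):
        super().__init__()
        self.cla = nn.Linear(hidden_dim, n_classes)
        self.att = nn.Linear(hidden_dim, n_classes)

    def forward(self, h):  # h: (batch, time, hidden_dim)
        frame_prob = torch.sigmoid(self.cla(h))      # per-frame event presence
        weights = torch.softmax(self.att(h), dim=1)  # attention over time
        clip_prob = (frame_prob * weights).sum(dim=1)
        return clip_prob, frame_prob


# Illustrative usage on a batch of log-mel spectrograms (shapes are assumptions).
x = torch.randn(4, 1, 240, 64)                 # (batch, channel, time, freq)
h = GLUConvBlock(1, 64)(x).mean(dim=3)         # pool frequency -> (batch, 64, time)
clip_prob, frame_prob = AttentionPooling(64, 17)(h.transpose(1, 2))
print(clip_prob.shape, frame_prob.shape)       # (4, 17) clip tags, (4, 240, 17) localization
```

In the paper's gated CRNN, several gated convolutional blocks feed a recurrent layer before a pooling stage of this kind; thresholding the frame-level probabilities then yields event onset and offset estimates for detection.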
Results and Implications
The experimental results demonstrate the efficacy of the proposed method. The final system achieved an F1-score of 55.6% for audio tagging and an error rate of 0.73 for sound event detection on the evaluation set, ranking 1st and 2nd respectively. These outcomes underscore the value of GLUs within CNNs and of attention mechanisms for audio event detection, suggesting clear advances over the multilayer perceptron (MLP) based baseline.
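For reference, both metrics follow the standard DCASE conventions: F1 is the harmonic mean of precision and recall, and the sound event detection error rate is the segment-based definition that counts substitutions, deletions, and insertions against the number of reference events.

```latex
F_1 = \frac{2\,P\,R}{P + R}, \qquad ER = \frac{S + D + I}{N}
```

Here P and R are precision and recall, S, D, and I are segment-level substitutions, deletions, and insertions, and N is the total number of active reference events; lower ER is better.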
From a practical standpoint, the proposed method's application extends beyond academic benchmarks. It holds significant promise for real-world uses, such as surveillance and environmental monitoring, where sound events need to be detected with limited labelled data.
Future Prospects
The research indicates several areas worth exploring in future work. Continued development of attention-based models could improve localization accuracy, especially in more complex acoustic environments. Additionally, applying the approach to a broader range of audio datasets beyond the DCASE challenge could further validate its robustness and adaptability.
In conclusion, this paper contributes a significant advancement in weakly supervised audio classification, offering a robust framework that combines gated neural network architectures with temporal attention mechanisms. It represents a substantial step forward in the effective analysis of audio data, with implications for both theoretical research and practical deployment in automated sound detection systems.