- The paper proposes a novel CNN design that learns discriminative spectro-temporal patterns for environmental sound classification.
- The authors demonstrate that targeted data augmentation, including pitch shifting and time stretching, boosts accuracy from 0.73 to 0.79.
- The study underscores the potential for class-conditional augmentations to further improve performance in low-data scenarios.
Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification
This paper presents a systematic study of deep convolutional neural networks (CNNs) and data augmentation for environmental sound classification. The authors, Justin Salamon and Juan Pablo Bello, propose a novel CNN architecture specifically designed to learn discriminative spectro-temporal patterns from audio data. To address the challenge of limited labeled data, the paper extensively explores the influence of various audio data augmentations on model performance.
Methodology and Approach
The proposed CNN architecture consists of three convolutional layers with 5x5 receptive fields, interleaved with pooling layers, followed by two fully connected layers. This design aims to capture localized spectro-temporal patterns that can be mapped to high-level acoustic signatures indicative of different environmental sound classes.
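As a rough sketch of how such a stack of 5x5 convolutions and pooling layers reduces a spectrogram to a feature vector, the tensor shapes can be traced in plain Python. The filter counts, pooling sizes, and 128x128 input used below are illustrative assumptions, not values quoted from the paper:

```python
# Trace output shapes through a small CNN of the kind described:
# three 5x5 conv layers with pooling after the first two, then dense
# layers. All specific numbers here are illustrative assumptions.

def conv2d_shape(h, w, c, filters, k=5):
    """Output shape of a k x k 'valid' (unpadded, stride-1) convolution."""
    return h - k + 1, w - k + 1, filters

def pool_shape(h, w, c, ph, pw):
    """Output shape of non-overlapping (ph, pw) max pooling."""
    return h // ph, w // pw, c

shape = (128, 128, 1)                      # mel bands x frames x channels
shape = conv2d_shape(*shape, filters=24)   # -> (124, 124, 24)
shape = pool_shape(*shape, 4, 2)           # -> (31, 62, 24)
shape = conv2d_shape(*shape, filters=48)   # -> (27, 58, 48)
shape = pool_shape(*shape, 4, 2)           # -> (6, 29, 48)
shape = conv2d_shape(*shape, filters=48)   # -> (2, 25, 48)
flat = shape[0] * shape[1] * shape[2]      # flattened before dense layers
print(shape, flat)                         # (2, 25, 48) 2400
```

The point of the trace is that each 5x5 receptive field sees a localized time-frequency neighborhood, while pooling progressively widens the effective context before the fully connected layers map it to class scores.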
Training such deep models typically demands large amounts of labeled data, which is often scarce for environmental sound classification. To mitigate this, the authors implement several data augmentation techniques, including time stretching, pitch shifting, dynamic range compression, and mixing with background noise. Each technique is carefully parameterized to maintain the semantic integrity of the labeled data.
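The transformations above can be illustrated with a simplified NumPy sketch. This is not the paper's implementation: the authors use pitch-preserving time stretching and studio-style dynamic range compression presets, whereas the stand-ins below only convey each augmentation's intent:

```python
import numpy as np

def add_background(y, noise, weight=0.1):
    """Mix a background-noise recording into the clip at a given weight."""
    noise = np.resize(noise, y.shape)          # tile/trim noise to clip length
    return (1.0 - weight) * y + weight * noise

def compress_dynamic_range(y, exponent=0.7):
    """Power-law compression: quiet samples are boosted relative to loud
    ones (a crude stand-in for the DRC presets used in the paper)."""
    return np.sign(y) * np.abs(y) ** exponent

def naive_time_stretch(y, rate):
    """Resample-based stretch (note: this also shifts pitch; the paper
    uses a pitch-preserving stretch, e.g. a phase vocoder)."""
    n_out = int(round(len(y) / rate))
    return np.interp(np.linspace(0, len(y) - 1, n_out),
                     np.arange(len(y)), y)

# Example: stretch a 1 000-sample clip to twice the playback rate.
y = np.sin(np.linspace(0.0, 10.0, 1000))
stretched = naive_time_stretch(y, rate=2.0)    # half the original length
mixed = add_background(y, np.zeros(300))       # silent "noise", same length
```

The key constraint the authors emphasize is that each transformation must be parameterized conservatively enough that the class label remains valid, e.g. a pitch shift of a few semitones rather than an octave.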
Experimental Evaluation
The evaluation uses the UrbanSound8K dataset, comprising 8732 audio clips across ten environmental sound classes, with 10-fold cross-validation over the dataset's predefined folds. This enables direct comparison with previous approaches, including a dictionary learning model (SKM) and an earlier CNN (PiczakCNN).
Results
The results show that the proposed CNN without augmentation achieves a mean classification accuracy of 0.73, comparable to both SKM and PiczakCNN. However, when the data augmentation techniques are applied, the classification accuracy of the proposed CNN increases significantly to 0.79. This underscores the importance of combining a high-capacity model with augmented training data.
Further analysis reveals that different augmentations affect each sound class differently. Specifically, pitch shifting emerges as the most universally beneficial augmentation, while others like dynamic range compression and background noise have varied effects depending on the class. This insight suggests potential improvements through class-conditional augmentation strategies, which could be tailored to the unique characteristics of each sound class.
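A class-conditional augmentation policy of the kind suggested here could be as simple as a lookup from class label to the set of transformations observed to help that class. The mapping below is hypothetical, for illustration only; the class names come from UrbanSound8K, but the per-class choices are not the paper's measured results:

```python
import random

# Hypothetical per-class augmentation policy (illustrative, not the
# paper's findings). Classes absent from the table fall back to pitch
# shifting, reported as the most universally beneficial augmentation.
CLASS_AUGMENTATIONS = {
    "siren":           ["pitch_shift"],
    "air_conditioner": ["pitch_shift", "background_noise"],
    "drilling":        ["pitch_shift", "time_stretch"],
}
DEFAULT = ["pitch_shift"]

def pick_augmentation(label, rng=random):
    """Choose an augmentation conditioned on the example's class label."""
    return rng.choice(CLASS_AUGMENTATIONS.get(label, DEFAULT))
```

At training time, `pick_augmentation` would be called per example before applying the chosen transformation, so that classes harmed by a given augmentation simply never receive it.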
Implications and Future Directions
The implications of this paper are twofold. Practically, the integration of CNNs with data augmentation techniques provides a robust framework for enhancing environmental sound classification systems. Theoretically, the research demonstrates the synergistic effect of combining deep learning models with carefully chosen augmentations, paving the way for more sophisticated models that can handle limited data scenarios.
Future research could explore class-conditional data augmentation to optimize performance further. Additionally, extending this methodology to other domains with limited annotated data, such as bioacoustics or medical signal processing, could yield valuable insights and applications.
In summary, this paper offers a comprehensive evaluation of CNNs and data augmentation for environmental sound classification, contributing to both methodology and practice in the field. The proposed strategies deliver significant performance gains, establishing a strong baseline for future work on sound classification tasks.