- The paper presents a novel two-stage CNN system that significantly improves acoustic scene classification under device variability.
- It employs diverse data augmentation techniques such as mixup, pitch shifting, and speed changes to enhance model robustness.
- Post-training quantization compressed models to fit the sub-task's 500 KB size limit, enabling deployment on constrained devices with minimal accuracy loss.
Device-Robust Acoustic Scene Classification Using Two-Stage Categorization and Data Augmentation
The paper presents an approach to the acoustic scene classification (ASC) challenges posed by device variability in the DCASE 2020 Challenge. The research, carried out by a collaboration of four groups (Georgia Institute of Technology, University of Science and Technology of China, Tencent, and University of Enna Kore), targets Task 1 of the challenge, which comprises sub-tasks 1a and 1b. Task 1a addresses ASC across multiple real and simulated recording devices with ten fine-grained scene classes, whereas Task 1b calls for low-complexity solutions that classify audio into three broader categories.
Methodology
For Task 1a, the authors devised a two-stage classification system built on convolutional neural networks (CNNs): audio inputs are first assigned to one of three coarse classes (indoor, outdoor, and transportation) and then to one of ten fine-grained categories; a minimal sketch of this scheme appears below. Four CNN architectures (FCNN, fsFCNN, Resnet, and Mobnet) were used to construct the two-stage classifiers. Data augmentation was explored extensively to improve device robustness, in two forms: non-generating methods such as random cropping and mixup, and data-generating methods such as pitch shifting and speed changes (see the second sketch below). This two-stage approach achieved an ASC accuracy of 81.9% on the task's development dataset.
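To make the two-stage idea concrete, here is a minimal sketch of how a three-class (coarse) and a ten-class (fine) classifier could be fused at inference time. The fusion rule, the `coarse_model`/`fine_model` interfaces, and the class-to-group mapping are plausible assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

# DCASE 2020 Task 1 scene labels, mapped to their coarse group
# (0: indoor, 1: outdoor, 2: transportation).
FINE_TO_COARSE = {
    "airport": 0, "shopping_mall": 0, "metro_station": 0,        # indoor
    "park": 1, "public_square": 1,
    "street_pedestrian": 1, "street_traffic": 1,                 # outdoor
    "bus": 2, "metro": 2, "tram": 2,                             # transportation
}
FINE_LABELS = list(FINE_TO_COARSE)

def two_stage_predict(x, coarse_model, fine_model):
    """Fuse coarse (3-class) and fine (10-class) posteriors.

    coarse_model(x) -> np.ndarray of shape (3,)
    fine_model(x)   -> np.ndarray of shape (10,)
    Re-weighting each fine posterior by its parent group's posterior
    is one plausible fusion rule, assumed here for illustration.
    """
    p_coarse = coarse_model(x)
    p_fine = fine_model(x)
    group_weight = np.array([p_coarse[FINE_TO_COARSE[l]] for l in FINE_LABELS])
    p_fused = p_fine * group_weight
    p_fused /= p_fused.sum()  # renormalize to a valid distribution
    return FINE_LABELS[int(np.argmax(p_fused))], p_fused
```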
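The augmentation side can be sketched similarly. Below, mixup and random cropping stand in for the non-generating methods, and librosa's pitch shifting and time stretching stand in for the data-generating ones; all parameter values are illustrative rather than taken from the paper.

```python
import numpy as np
import librosa

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Non-generating: blend two examples and their one-hot labels.
    alpha=0.2 is a common default, assumed here."""
    lam = np.random.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def random_crop(spec, width=400):
    """Non-generating: random crop along the time axis of a spectrogram."""
    start = np.random.randint(0, spec.shape[1] - width + 1)
    return spec[:, start:start + width]

def generate_variants(y, sr):
    """Data-generating: create pitch-shifted and speed-changed copies."""
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
    faster = librosa.effects.time_stretch(y, rate=1.1)
    return shifted, faster
```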
For Task 1b, two efficient CNN architectures (Mobnet and a smaller FCNN variant) were introduced to meet the sub-task's model size constraint of under 500 KB. Post-training quantization was applied to compress the models while preserving ASC accuracy (a sketch of this step follows). An accuracy of 96.7% was attained on this sub-task, indicating that model compression cost little performance.
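As a rough illustration of post-training quantization, the snippet below converts a small Keras CNN to an 8-bit-weight TensorFlow Lite model; the toy architecture and the use of TensorFlow are assumptions for demonstration, not details confirmed by the paper.

```python
import tensorflow as tf

# Toy stand-in for a trained Task 1b classifier (3 coarse classes).
inputs = tf.keras.Input(shape=(64, 64, 1))
x = tf.keras.layers.Conv2D(16, 3, activation="relu")(inputs)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)

# Post-training quantization: weights stored in 8-bit, shrinking the file.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("asc_model.tflite", "wb") as f:
    f.write(tflite_model)
print(f"Quantized model size: {len(tflite_model) / 1024:.1f} KB")
```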
Results and Implications
The described frameworks achieve markedly higher ASC accuracy on both sub-tasks than the challenge baseline models. The two-stage fusion strategy significantly improves classification accuracy, pointing toward better device invariance in ASC applications. Practically, pairing effective CNN architectures with targeted data augmentation offers a concrete path to mitigating device variability in real-world acoustic environments, and the applied model quantization shows that the resulting systems can fit the memory budgets of constrained devices such as smartphones.
Future Directions
The work suggests several future research avenues in AI and ASC. First, extending two-stage classifiers toward more nuanced environmental categorization could improve adaptability to less clear-cut acoustic scenes, which are difficult even for human labelers. Second, as mobile and Internet of Things (IoT) devices proliferate, further work on resource-efficient neural architectures and stronger compression techniques is warranted. Finally, adaptive models that adjust dynamically to varying acoustic conditions without explicit data augmentation could benefit real-time applications.
In conclusion, this research makes substantial progress toward ASC systems that are resilient to device discrepancies, combining CNN-based architectures, purposeful data augmentation, and quantization. These methods and findings offer a useful reference point for future ASC systems developed under similar constraints and challenges.