- The paper introduces a novel Between-Class (BC) learning method that mixes different class sounds to significantly improve feature discriminability in deep sound recognition.
- The methodology uses linear combinations of sound signals to create between-class examples that enhance Fisher's criterion by enlarging the ratio of between-class distance to within-class variance.
- Empirical results demonstrate substantial accuracy improvements across architectures, with EnvNet-v2 achieving a 15.1% error rate on ESC-50, surpassing human performance.
Learning from Between-class Examples for Deep Sound Recognition
The paper "Learning from Between-class Examples for Deep Sound Recognition" introduces a novel method termed Between-Class learning (BC learning) to enhance the performance of deep sound recognition systems. This approach leverages the principle of creating between-class examples by mixing sounds from different classes, trained to predict their mixing ratios. It represents an advancement over traditional methods by focusing not only on augmenting the dataset with more variations but also on improving the discriminative properties of the learned feature space.
Key Contributions and Methodology
BC learning differs from standard training in how it constructs and uses training data for sound recognition. Between-class sounds are synthesized as linear combinations of sounds drawn from different classes, in contrast to typical augmentation techniques, which apply transformations within a single class. The authors report the following theoretical and empirical benefits:
- Enhanced Feature Discriminability: BC learning enlarges Fisher's criterion in the feature space, i.e., the ratio of the between-class distance to the within-class variance (a small numerical sketch of this criterion follows the list). The mixed examples regularize the positional relationships among class feature distributions, encouraging a more discriminative feature space.
- Regularization among Classes: Whereas standard training leaves the relative placement of class distributions in feature space unconstrained, BC learning discourages the features of a third class from appearing between any two classes, thereby organizing the class distributions more effectively.
- Empirical Performance: Experiments show consistent improvements from BC learning across sound recognition architectures and datasets. For instance, EnvNet-v2 trained with BC learning on the ESC-50 dataset reached a 15.1% error rate, surpassing reported human performance.
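The first bullet refers to Fisher's criterion as the measure of feature discriminability. A minimal two-class version of that measure, applied to feature vectors extracted from the network, could look as follows; the function name and the specific normalization are assumptions for illustration, not necessarily the exact form used in the paper's analysis.

```python
import numpy as np

def fisher_criterion(features_a, features_b):
    """Ratio of between-class distance to within-class variance for two
    classes, given feature matrices with one example per row."""
    mu_a, mu_b = features_a.mean(axis=0), features_b.mean(axis=0)
    between = np.linalg.norm(mu_a - mu_b) ** 2        # squared distance between class means
    within = features_a.var(axis=0).sum() + features_b.var(axis=0).sum()
    return between / within                           # larger values = more separable classes
```

The paper's claim is that this ratio grows for features learned with BC learning: class means move farther apart relative to the spread within each class.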
Architectural and Experimental Details
The experimental section validates BC learning across several deep learning architectures, including EnvNet, SoundNet5, M18, and Logmel-CNN. Each architecture showed a marked improvement in sound classification performance when trained with BC learning:
- EnvNet-v2 Development: This newly proposed deep sound recognition architecture is deeper than the original EnvNet and operates at a higher sampling rate (44.1 kHz) to exploit richer high-frequency information. Trained with BC learning, it surpasses human-level recognition accuracy on specific benchmarks.
- Comprehensive Experiments: Evaluations on standard datasets (ESC-50, ESC-10, and UrbanSound8K) and under varied augmentation schemes confirm that the gains from BC learning go beyond what additional training-data variation alone provides (a minimal training-step sketch follows this list).
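Because BC learning changes only the training examples and the loss, it can be plugged into any of the architectures above. The sketch below shows one training step in PyTorch using the KL-divergence loss between the mixing-ratio targets and the model's softmax output; the framework choice and the helper name `bc_training_step` are assumptions, and `model` stands in for any raw-waveform or log-mel network.

```python
import torch
import torch.nn.functional as F

def bc_training_step(model, optimizer, mixed_batch, ratio_targets):
    """One optimization step on a batch of between-class examples.

    mixed_batch   -- mixed input waveforms, shape (batch, 1, samples)
    ratio_targets -- soft labels holding the mixing ratios, shape (batch, classes)
    """
    model.train()
    optimizer.zero_grad()
    logits = model(mixed_batch)                 # (batch, classes) class scores
    log_probs = F.log_softmax(logits, dim=1)
    # BC learning trains the network to predict the mixing ratio, so the loss
    # is the KL divergence between the ratio targets and the predicted distribution.
    loss = F.kl_div(log_probs, ratio_targets, reduction="batchmean")
    loss.backward()
    optimizer.step()
    return loss.item()
```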
Implications and Future Directions
The implications of this work are twofold. First, it provides a refined methodological framework for improving the generalization of sound recognition models, one that could be adapted to other input domains such as images or text. Second, the paper points to future work on mixing more than two classes and on applying BC learning to advanced sound-related tasks such as polyphonic sound event detection or multi-label classification.
Further research could optimize BC learning for different neural network architectures, extend it to applications beyond sound recognition, and examine other modalities, where it may prove to be a general technique for learning more discriminative representations in deep networks.
This paper delivers a substantial contribution to the field of deep learning for sound recognition, introducing a methodological innovation that balances theoretical elegance with practical efficacy.