
A Comparison of Five Multiple Instance Learning Pooling Functions for Sound Event Detection with Weak Labeling (1810.09050v3)

Published 22 Oct 2018 in cs.SD and eess.AS

Abstract: Sound event detection (SED) entails two subtasks: recognizing what types of sound events are present in an audio stream (audio tagging), and pinpointing their onset and offset times (localization). In the popular multiple instance learning (MIL) framework for SED with weak labeling, an important component is the pooling function. This paper compares five types of pooling functions both theoretically and experimentally, with special focus on their performance of localization. Although the attention pooling function is currently receiving the most attention, we find the linear softmax pooling function to perform the best among the five. Using this pooling function, we build a neural network called TALNet. It is the first system to reach state-of-the-art audio tagging performance on Audio Set, while exhibiting strong localization performance on the DCASE 2017 challenge at the same time.

Citations (178)

Summary

  • The paper compares five multiple instance learning pooling functions for weakly labeled sound event detection, finding linear softmax pooling superior for localization and introducing the state-of-the-art TALNet system.
  • Linear softmax pooling effectively balances gradients to push frame probabilities to extremities, avoiding the false negatives of max pooling and false positives of average pooling.
  • Experimental results confirm linear softmax pooling's localization performance, especially in the TALNet system which demonstrates robust audio tagging and segmentation capabilities.

A Comparative Analysis of Multiple Instance Learning Pooling Functions for Sound Event Detection

The paper "A Comparison of Five Multiple Instance Learning Pooling Functions for Sound Event Detection with Weak Labeling" addresses a critical component of the MIL framework for SED with weak labeling: the pooling function. It evaluates the theoretical and experimental performance of five pooling functions, focusing on their localization capabilities. The paper introduces the TALNet system, which employs the linear softmax pooling function and aims for state-of-the-art results in both audio tagging and localization on the Audio Set and DCASE 2017 challenge datasets.

Sound Event Detection (SED) is an essential subfield of machine learning concerning the identification and localization of sound events within audio streams. The research explores the efficacy of weak labeling and emphasizes the importance of pooling functions in the MIL approach. Weak labeling provides only the types of sound events occurring in a recording, omitting temporal details, which is a significant advantage in managing large datasets such as Google's Audio Set.
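
To make the MIL setup concrete, here is a minimal NumPy sketch of training with weak labels (the function names and toy numbers are illustrative, not from the paper): the only supervision is a recording-level label per class, obtained by pooling frame-level probabilities, shown here with max pooling.

```python
import numpy as np

def pool_max(frame_probs):
    """Max pooling: recording-level probability = most confident frame."""
    return frame_probs.max(axis=0)

def weak_label_loss(frame_probs, weak_labels, pool=pool_max):
    """Binary cross-entropy computed on the pooled recording-level probability.

    frame_probs: (n_frames, n_classes) frame-level probabilities.
    weak_labels: (n_classes,) recording-level (weak) labels, 0 or 1.
    """
    rec_probs = pool(frame_probs)                       # (n_classes,)
    eps = 1e-7
    rec_probs = np.clip(rec_probs, eps, 1 - eps)        # numerical safety
    return -np.mean(weak_labels * np.log(rec_probs)
                    + (1 - weak_labels) * np.log(1 - rec_probs))

# Toy recording: 5 frames, 2 classes; class 0 present, class 1 absent.
frame_probs = np.array([[0.1, 0.2],
                        [0.9, 0.1],
                        [0.2, 0.3],
                        [0.1, 0.2],
                        [0.1, 0.1]])
weak_labels = np.array([1.0, 0.0])
print(weak_label_loss(frame_probs, weak_labels))
```

Note that no frame-level (strong) label ever enters the loss; localization quality depends entirely on how the pooling function routes gradients back to individual frames.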

The paper investigates five pooling functions: max, average, linear softmax, exponential softmax, and attention pooling. Through a rigorous comparison, the authors reveal that linear softmax pooling outperforms its contemporaries in localization tasks due to its balanced approach towards gradients flowing through frame-level probabilities, enabling effective detection under the Standard Multiple Instance (SMI) assumption.

Pooling Function Analysis

  1. Max Pooling: Although faithful to the SMI assumption, it propagates gradients through only a single frame per recording, potentially leading to numerous false negatives.
  2. Average Pooling: Distributes gradients evenly, risking false positives by boosting undesired frame probabilities indiscriminately.
  3. Linear Softmax Pooling: Exhibits promising capabilities by pushing frame probabilities to extremities, aligning with SMI assumptions and achieving consistent predictions across recording and frame-level probabilities.
  4. Exponential Softmax Pooling: Resembles average pooling but assigns exponentially larger weights to more confident frames, tempering the gradient distribution while still raising concerns over false positives.
  5. Attention Pooling: Impresses with dynamically learned weights, but can yield inconsistent frame-level probabilities when the learned weights and the frame probabilities become decoupled from each other.

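The softmax-family poolings above are weighted averages of the frame probabilities, differing only in how the weights are chosen. A minimal NumPy sketch of all five for a single event class (the attention weights are passed in by hand here, since in the actual system they come from a separately learned network branch):

```python
import numpy as np

# Each pooling maps a vector of frame-level probabilities y_i to one
# recording-level probability. The softmax-family variants compute
# y = sum(w_i * y_i) / sum(w_i) for different choices of weights w_i.

def max_pooling(y):
    return y.max()

def average_pooling(y):
    # w_i = 1 for every frame.
    return y.mean()

def linear_softmax_pooling(y):
    # w_i = y_i: the frame probabilities weight themselves.
    return (y ** 2).sum() / y.sum()

def exp_softmax_pooling(y):
    # w_i = exp(y_i): confident frames get exponentially larger weights.
    w = np.exp(y)
    return (w * y).sum() / w.sum()

def attention_pooling(y, w):
    # w is produced by a learned branch in the real system; supplied
    # explicitly here for illustration.
    return (w * y).sum() / w.sum()

y = np.array([0.1, 0.9, 0.2, 0.1])
print(max_pooling(y))             # 0.9
print(average_pooling(y))         # mean of all frames
print(linear_softmax_pooling(y))  # weighs confident frames more heavily
```

With uniform weights, attention pooling reduces exactly to average pooling, which is why its behavior hinges on how well the learned weights track the frame probabilities.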
Experimental Outcomes and TALNet

The experimental comparison on the DCASE 2017 dataset corroborates the theoretical assessments, with linear softmax pooling demonstrating superior localization performance by balancing false negatives and false positives effectively. The TALNet system stands out as an innovative model, achieving robust performance in audio tagging and localization concurrently.

The performance metrics, including precision and recall for segment-level localization, confirmed the advantages of linear softmax pooling over the alternatives, especially when coupled with data balancing and batch normalization in the network architecture.
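
The gradient behavior credited for this advantage can be checked numerically. A hedged sketch using finite differences on a binary cross-entropy loss for a positive weak label (toy values, not from the paper): linear softmax pooling pushes confident frames up and quiet frames down, while average pooling pushes every frame up indiscriminately.

```python
import numpy as np

def linear_softmax_pooling(y):
    return (y ** 2).sum() / y.sum()

def average_pooling(y):
    return y.mean()

def bce_positive(rec_prob):
    # Loss for a recording whose weak label says the event is present.
    return -np.log(rec_prob)

def numerical_grad(pool, y, eps=1e-6):
    """Central finite-difference gradient of the loss w.r.t. each frame."""
    g = np.zeros_like(y)
    for i in range(len(y)):
        yp, ym = y.copy(), y.copy()
        yp[i] += eps
        ym[i] -= eps
        g[i] = (bce_positive(pool(yp)) - bce_positive(pool(ym))) / (2 * eps)
    return g

y = np.array([0.9, 0.1, 0.1, 0.1])
print(numerical_grad(linear_softmax_pooling, y))  # mixed signs: extremes
print(numerical_grad(average_pooling, y))         # all negative: all pushed up
```

Under gradient descent a negative entry means that frame's probability is increased; the mixed signs for linear softmax illustrate how it drives frame probabilities toward the extremes, matching the SMI assumption.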

Implications and Future Direction

This paper provides a foundation for future explorations into the integration and optimization of pooling functions within SED frameworks, potentially extending to other domains requiring MIL methodologies. While the linear softmax has shown considerable promise, further innovation could be directed towards adaptive pooling functions integrating adjustable parameters to dynamically handle variabilities in data characteristics.

The adaptability inherent in attention-based models remains appealing, warranting further research into constraints that ensure monotonic alignment between frame-level probabilities and attention weights. By optimizing this interaction, future MIL applications could improve both consistency and performance, particularly on weakly labeled datasets.

In conclusion, this research contributes notably to the sound event detection landscape, providing insights into pooling function performance and introducing TALNet, a system proficient in both tagging and localization. Its findings pave the way for broader models and pooling strategies that could generalize across various types of MIL tasks.