VGG-Sound: An Overview of a Large-Scale Audio-Visual Dataset
The development of robust audio recognition models requires access to comprehensive and diverse datasets. The paper "VGG-Sound: A Large-scale Audio-Visual Dataset" addresses this need by introducing VGG-Sound, a large-scale, automatically curated collection of audio-visual data intended to advance audio recognition. The dataset comprises roughly 200,000 ten-second video clips spanning 309 audio classes, collected "in the wild" from publicly available videos, primarily on YouTube.
Key Contributions
The authors make three main contributions to the field:
- Automated Data Collection Pipeline: The paper presents a scalable, automated pipeline that leverages computer vision techniques to collect and curate an audio-visual dataset. By running image classifiers on the video frames, the pipeline verifies that the audio is paired with its visible source, minimizing label noise while reducing the dependence on human annotation that limits many existing datasets. A minimal illustration of this filtering idea is sketched after this list.
- Creation and Structure of VGG-Sound: Utilizing the aforementioned pipeline, the VGG-Sound dataset was curated, encompassing over 200,000 video clips representing 309 audio classes. This dataset stands out due to its audio-visual correspondence, where the source of the audio is visible in the video clip. The depth and diversity of VGG-Sound make it a valuable asset for training and evaluating models in unconstrained environments, providing a more realistic setting for model development compared to smaller, manually curated datasets.
- Baseline Models and Benchmarks: The authors establish baseline performance metrics using Convolutional Neural Networks (CNNs) to validate the dataset's utility in audio recognition tasks. They explore different neural architectures and pooling strategies, setting a performance standard for future research leveraging VGG-Sound.
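The sketch below illustrates the kind of visual-verification filtering the first contribution describes: a candidate clip is kept only if an image classifier applied to sampled frames predicts a class consistent with the audio label. The function and variable names (passes_visual_check, frame_classifier, class_map) are illustrative assumptions, not the authors' actual pipeline code.

```python
from typing import Callable, List, Sequence


def passes_visual_check(
    frames: Sequence,                 # frames sampled from the candidate clip
    audio_label: str,                 # target audio class, e.g. "dog barking"
    frame_classifier: Callable,       # image classifier: frame -> (class_name, score)
    class_map: dict,                  # maps visual class names to audio class names
    score_threshold: float = 0.5,
    min_positive_frames: int = 1,
) -> bool:
    """Return True if the sound source appears visible in enough sampled frames."""
    positives = 0
    for frame in frames:
        visual_class, score = frame_classifier(frame)
        if score >= score_threshold and class_map.get(visual_class) == audio_label:
            positives += 1
    return positives >= min_positive_frames


def filter_candidates(candidates: List[dict], frame_classifier: Callable, class_map: dict) -> List[dict]:
    """Keep only candidate clips whose visual content corroborates the audio label."""
    return [
        clip for clip in candidates
        if passes_visual_check(clip["frames"], clip["audio_label"], frame_classifier, class_map)
    ]
```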
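For the baseline models just described, the paper trains CNNs on spectrogram inputs. The snippet below is a minimal sketch of such a setup, assuming a log-mel front end and an off-the-shelf ResNet-18 with a single-channel stem; the sample rate, mel resolution, and backbone choice are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torchaudio
import torchvision

# Convert a 10-second clip to a log-mel spectrogram and classify it with a CNN
# whose first convolution accepts a single channel (assumed, illustrative setup).
waveform, sr = torchaudio.load("clip.wav")               # (channels, samples)
waveform = waveform.mean(dim=0, keepdim=True)            # mix down to mono
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=1024, hop_length=512, n_mels=64
)(waveform)
log_mel = torch.log(mel + 1e-6)                          # (1, n_mels, frames)

model = torchvision.models.resnet18(num_classes=309)     # one logit per VGG-Sound class
model.conv1 = torch.nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
logits = model(log_mel.unsqueeze(0))                     # (1, 309) class scores
```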
Numerical Results and Claims
The paper evaluates several CNN architectures trained on VGG-Sound, reporting mean average precision (mAP), area under the ROC curve (AUC), and d-prime, and shows improvements over previous benchmarks. The best-performing model reaches a top-1 accuracy of 51.0% and a top-5 accuracy of 76.4%, underscoring the dataset's effectiveness for state-of-the-art audio recognition.
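Of these metrics, d-prime is a direct transform of AUC: d' = sqrt(2) · Φ⁻¹(AUC), where Φ⁻¹ is the inverse of the standard normal CDF. The snippet below shows that conversion; the example AUC value is illustrative, not a number from the paper.

```python
from math import sqrt

from scipy.stats import norm


def d_prime(auc: float) -> float:
    """Convert an ROC AUC score to d-prime: d' = sqrt(2) * inverse-normal-CDF(AUC)."""
    return sqrt(2.0) * norm.ppf(auc)


# Example with an illustrative AUC value:
print(d_prime(0.97))   # ~2.66
```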
Implications and Future Directions
The introduction of VGG-Sound has significant implications for both practical and theoretical advancements in audio recognition. Practically, the dataset enables the development of models capable of operating in realistic, noisy conditions, an essential requirement for applications such as virtual assistants, security systems, and multimedia indexing. Theoretically, VGG-Sound provides a platform to explore audio-visual learning and multi-modal integration, fostering research that can result in more holistic AI systems capable of understanding and interacting with their environments.
Looking ahead, the scale of VGG-Sound may encourage work on more complex network architectures and on techniques such as self-supervised and transfer learning. Its availability could also spur exploration of unsupervised and semi-supervised learning paradigms to improve model performance and adaptability.
In summary, VGG-Sound represents a substantial contribution to the field of audio recognition, providing a rich resource that addresses scalability and accuracy in dataset compilation, thereby paving the way for future innovations in AI-driven audio analysis.