A Curated Dataset of Urban Scenes for Audio-Visual Scene Analysis

Published 30 Oct 2020 in eess.AS (arXiv:2011.00030v2)

Abstract: This paper introduces a curated dataset of urban scenes for audio-visual scene analysis which consists of carefully selected and recorded material. The data was recorded in multiple European cities, using the same equipment, in multiple locations for each scene, and is openly available. We also present a case study for audio-visual scene recognition and show that joint modeling of audio and visual modalities brings significant performance gain compared to state of the art uni-modal systems. Our approach obtained an 84.8% accuracy compared to 75.8% for the audio-only and 68.4% for the video-only equivalent systems.

Summary

Curated Dataset of Urban Scenes for Audio-Visual Scene Analysis

The paper "A Curated Dataset of Urban Scenes for Audio-Visual Scene Analysis" presents a curated dataset intended to support research in audio-visual scene recognition. Authored by Shanshan Wang, Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen, the work addresses limitations of existing datasets, which often suffer from inconsistencies caused by varied recording equipment and by errors in automatic annotation.

Overview of the Data Collection

The dataset is distinctive in its methodological rigor and scope. It includes data recorded across 12 European cities, ensuring diversity in urban scene representation. The scenes were pre-defined based on acoustic and visual characteristics to maintain homogeneity within the categories. Audio and video were captured with the same synchronized equipment throughout: a GoPro Hero5 for video, and a binaural microphone paired with a Zoom F8 audio recorder for audio. This protocol keeps audio and visual quality consistent, a substantial improvement over automatically collected datasets, such as those derived from YouTube, whose quality is irregular.

Significance and Approach

The authors carry out a case study on audio-visual scene classification (AVSC), demonstrating the efficacy of joint modeling compared to uni-modal systems. On this dataset, the joint model reaches an accuracy of 84.8%, compared to 75.8% for the audio-only and 68.4% for the video-only equivalents, a gain of 9.0 and 16.4 percentage points respectively. This suggests that integrating the audio and visual modalities substantially improves classification.

The audio-visual system, built on OpenL3 embeddings with an early fusion strategy, improves on the uni-modal systems most clearly for certain scene classes, such as shopping malls and pedestrian streets. The joint learning approach underscores the value of cross-modal data for scene classification.
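As a rough illustration of what early fusion means in this setting, the sketch below concatenates per-clip audio and video embedding vectors so that a single classifier sees both modalities at once. It is not the authors' implementation: the random vectors merely stand in for OpenL3 embeddings, and the dimensions, class count, linear scorer, and the function names `early_fusion` and `predict` are all assumptions chosen for illustration.

```python
import random

# Illustrative sizes only (assumptions, not values taken from the paper).
N_CLIPS, AUDIO_DIM, VIDEO_DIM, N_CLASSES = 4, 8, 8, 3


def early_fusion(audio_emb, video_emb):
    """Concatenate per-clip audio and video embeddings into one vector per clip."""
    if len(audio_emb) != len(video_emb):
        raise ValueError("audio and video embeddings must cover the same clips")
    return [a + v for a, v in zip(audio_emb, video_emb)]


def predict(fused, weights, bias):
    """Score each fused vector with one linear layer; return the argmax class index."""
    predictions = []
    for x in fused:
        scores = [
            sum(xi * wi for xi, wi in zip(x, w)) + b
            for w, b in zip(weights, bias)
        ]
        predictions.append(scores.index(max(scores)))
    return predictions


rng = random.Random(0)

# Random stand-ins for precomputed embeddings (hypothetical data).
audio_emb = [[rng.gauss(0, 1) for _ in range(AUDIO_DIM)] for _ in range(N_CLIPS)]
video_emb = [[rng.gauss(0, 1) for _ in range(VIDEO_DIM)] for _ in range(N_CLIPS)]

fused = early_fusion(audio_emb, video_emb)

# Untrained, randomly initialised linear classifier over the fused features.
weights = [
    [rng.gauss(0, 0.1) for _ in range(AUDIO_DIM + VIDEO_DIM)]
    for _ in range(N_CLASSES)
]
bias = [0.0] * N_CLASSES

predictions = predict(fused, weights, bias)
print(len(fused[0]), predictions)
```

The point of the design is that fusion happens at the feature level, before any decision is made, so the classifier can learn interactions between the two modalities. A late fusion scheme would instead train one classifier per modality and combine their outputs.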

Implications and Future Prospects

This curated dataset holds significant potential for advancing research in fields such as sound event detection, source localization, and cross-modal knowledge transfer. The fixed recording position and the absence of motion from the recorder address the confounding issues typically associated with movement in recording devices, allowing clearer analysis of source characteristics. Moreover, the open availability of the dataset is a notable contribution to the research community, providing a foundation for future explorations into audio-visual correlations and scene recognition models.

The paper also points toward more computationally efficient AVSC systems: the proposed system has significantly fewer parameters than state-of-the-art alternatives. This makes deployment in real-world applications more plausible, especially in context-aware systems and intelligent surveillance.

Conclusion

In sum, this paper presents a comprehensive effort to elevate the standards of data collection for audio-visual scene analysis. By addressing the quality and consistency issues prevalent in existing datasets and demonstrating the advantages of joint audio-visual modeling, this research paves the way for advancements in pattern recognition and multimedia analysis. The dataset's availability invites further inquiry and innovation, fostering developments that could extend beyond scene classification into broader applications within artificial intelligence and urban analytics.
