- The paper introduces a novel AR dataset with synchronized audio-visual data to mitigate the cocktail party effect in dynamic environments.
- The dataset supports advanced beamforming and speech enhancement through detailed annotations and egocentric recordings.
- Initial evaluations using maximum directivity index (DI) beamforming demonstrate significant improvements in speech intelligibility and overall communication quality.
Overview of the EasyCom Dataset Paper
The paper introduces EasyCom, an augmented reality (AR) dataset designed to support research on multi-modal solutions to the cocktail party effect. The dataset provides synchronized multi-channel audio and video recordings for developing algorithms that must operate in demanding auditory environments. It stands out for its rare combination of egocentric audio-visual data captured under natural, dynamic user movement, filling a notable gap in the resources available for studying AR-based conversational technologies.
Data Composition
The EasyCom dataset includes more than five hours of audio-visual recordings captured from the perspective of AR glasses, providing the egocentric viewpoint needed to develop beamforming and speech enhancement algorithms. Each session pairs multi-channel microphone-array audio with wide field-of-view RGB video, along with annotations such as voice activity labels, speech transcriptions, and more.
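To make the data composition concrete, the following is a minimal Python sketch of how one recording session might be loaded and its annotations queried. The directory layout, file names (`array_audio.wav`, `annotations.json`), and JSON fields used here are illustrative assumptions, not the actual EasyCom release format, which should be checked against the official documentation.

```python
import json
from pathlib import Path

import soundfile as sf  # third-party: pip install soundfile


def load_session(session_dir: Path):
    """Load one recording session from a hypothetical EasyCom-style layout.

    Assumes `array_audio.wav` holds the multi-channel microphone-array signal
    and `annotations.json` holds per-segment labels (voice activity, speaker
    IDs, transcriptions). Adjust paths/fields to match the real release.
    """
    audio, sample_rate = sf.read(session_dir / "array_audio.wav")  # shape: (samples, channels)
    with open(session_dir / "annotations.json") as f:
        annotations = json.load(f)
    return audio, sample_rate, annotations


def active_speech_segments(annotations, speaker_id):
    """Yield (start_s, end_s) intervals where the given speaker is marked active."""
    for segment in annotations.get("voice_activity", []):
        if segment.get("speaker") == speaker_id:
            yield segment["start"], segment["end"]


if __name__ == "__main__":
    audio, sr, ann = load_session(Path("easycom/session_01"))
    print(f"{audio.shape[1]} channels, {audio.shape[0] / sr:.1f} s of audio")
    for start, end in active_speech_segments(ann, speaker_id=2):
        print(f"speaker 2 active: {start:.2f}-{end:.2f} s")
```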
Methodological Contributions
EasyCom addresses the lack of comprehensive datasets with dynamic, multi-modal information for scenarios that mirror real-world auditory environments, particularly those with multiple overlapping talkers and background noise. It differs from prior datasets by providing synchronized egocentric views together with the annotations needed to train and test AR algorithms that mitigate the cocktail party effect, such as speech intelligibility enhancement and noise reduction.
Evaluation and Benchmarking
To establish a benchmark, the authors evaluate a baseline method that uses maximum directivity index (DI) beamforming to enhance the speech of conversation partners. The evaluation reports quantitative metrics showing improved speech intelligibility and quality, serving as a reference point for future work. The paper also outlines potential uses of the dataset across several tasks, including voice activity detection, audio-visual noise reduction, and audio-visual speech separation.
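As background on this kind of baseline, the sketch below shows a generic frequency-domain superdirective beamformer, which maximizes the directivity index under a spherically isotropic (diffuse) noise model. The array geometry, look direction, frequency, and diagonal loading value are illustrative assumptions; it is not a reproduction of the paper's actual baseline implementation or parameters.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s


def steering_vector(mic_positions, direction, freq):
    """Free-field steering vector for a plane wave from unit vector `direction` (sign convention illustrative)."""
    delays = mic_positions @ direction / SPEED_OF_SOUND           # seconds, per mic
    return np.exp(-2j * np.pi * freq * delays)                    # shape: (num_mics,)


def diffuse_coherence(mic_positions, freq):
    """Spherically isotropic noise coherence: sinc of inter-microphone distances."""
    dist = np.linalg.norm(mic_positions[:, None, :] - mic_positions[None, :, :], axis=-1)
    return np.sinc(2 * freq * dist / SPEED_OF_SOUND)              # np.sinc(x) = sin(pi x)/(pi x)


def max_di_weights(mic_positions, direction, freq, diag_load=1e-3):
    """Superdirective (maximum-DI) weights: w = Gamma^-1 d / (d^H Gamma^-1 d), with diagonal loading."""
    d = steering_vector(mic_positions, direction, freq)
    gamma = diffuse_coherence(mic_positions, freq) + diag_load * np.eye(len(d))
    gi_d = np.linalg.solve(gamma, d)
    return gi_d / (d.conj() @ gi_d)


if __name__ == "__main__":
    # Illustrative 4-mic linear array (metres) and a look direction straight ahead.
    mics = np.array([[0.00, 0.0, 0.0],
                     [0.05, 0.0, 0.0],
                     [0.10, 0.0, 0.0],
                     [0.15, 0.0, 0.0]])
    look = np.array([0.0, 1.0, 0.0])
    w = max_di_weights(mics, look, freq=1000.0)
    # For one STFT bin x of shape (num_mics,), the beamformed output is np.conj(w) @ x.
    # A signal arriving exactly from the look direction passes with ~unit gain:
    print(np.abs(np.conj(w) @ steering_vector(mics, look, 1000.0)))  # ~1.0
```

In practice such weights are computed per frequency bin and applied to short-time Fourier transform frames of the array signal; the diagonal loading term keeps the solution well conditioned at low frequencies, where superdirective designs are otherwise sensitive to noise and mismatch.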
Implications and Future Directions
Practically, the EasyCom dataset could significantly advance the development of AR systems that function effectively in noisy environments, potentially reshaping how audio processing tasks are approached in AR applications. Theoretically, it pushes the field towards richer, data-driven models that integrate auditory and visual information to improve user interaction.
The release of the EasyCom dataset is likely to inspire future research dedicated to refining dual-modality interactions, leading to advancements in seamless human-computer interaction paradigms that reflect naturalistic communication conditions. This rich dataset opens new avenues for exploring adaptive AR functionalities, particularly in scenarios combining complex auditory tasks with real-time processing requirements.
In conclusion, EasyCom is a foundational step towards AR solutions that improve communication in noisy settings. By combining high-quality audio-visual data with thorough annotations, it provides a strong basis for the next generation of AR speech enhancement research.