- The paper presents "Hollywood in Homes", a crowdsourcing method used to collect Charades, a novel dataset of 9,848 annotated videos of household activities.
- The methodology uses Amazon Mechanical Turk for script generation, video recording, and multi-step annotation, ensuring both the diversity and the quality of the data.
- Baseline evaluations, where the strongest method reaches only 17.2% mAP for action classification, underline the dataset's difficulty and its potential to advance real-world activity understanding research.
Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding
"Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding" by Sigurdsson et al. introduces an innovative method for collecting extensive datasets of everyday activities, termed the "Hollywood in Homes" approach. This research addresses the need for diverse, real-world video data to train computer vision models, focusing specifically on mundane, daily scenarios that are often underrepresented in current datasets.
The core contribution of the paper is the creation of a new dataset, named Charades, which comprises 9,848 annotated videos of common household activities, averaging 30 seconds each and performed by 267 participants on three continents. In total, the dataset carries 27,847 free-text video descriptions, 66,500 temporally localized action intervals, and 41,104 labels of interacted objects. This is a substantial volume of data for tasks such as action recognition and description generation.
Methodology
The "Hollywood in Homes" approach decentralizes the traditional model of data collection by harnessing the power of crowdsourcing. The process involves three distinct steps:
- Script Generation: Using Amazon Mechanical Turk (AMT), participants generate scripts based on a given scene and a selection of objects and actions.
- Video Recording: Participants then act out these scripts in their own homes, ensuring diversity in backgrounds, interactions, and environments.
- Annotation: The recorded videos are annotated by multiple workers to verify and label interactions, ensuring high-quality data.
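To make the first stage concrete, here is a minimal Python sketch of the script-generation data flow: sample a scene plus a few objects and actions, then render them into the prompt a worker would answer. This is a hedged illustration; the small vocabularies below and the `ScriptPrompt`/`sample_prompt` names are hypothetical placeholders, not the paper's actual word lists or tooling.

```python
import random
from dataclasses import dataclass

# Illustrative fragments only; the actual Charades scene/object/action
# vocabularies are far larger than these placeholders.
SCENES = ["Living room", "Kitchen", "Bedroom", "Home office"]
OBJECTS = ["cup", "laptop", "blanket", "book", "phone", "broom"]
ACTIONS = ["holding", "putting somewhere", "opening", "closing", "drinking from"]

@dataclass
class ScriptPrompt:
    scene: str
    objects: list
    actions: list

    def to_hit_text(self) -> str:
        """Render the text a worker would see in a script-writing HIT."""
        return (
            f"Scene: {self.scene}\n"
            f"Write a short script in which one person uses "
            f"{', '.join(self.objects)} while {' and '.join(self.actions)}."
        )

def sample_prompt(rng: random.Random, n_objects: int = 2, n_actions: int = 2) -> ScriptPrompt:
    """Sample a scene plus a few objects/actions, mirroring stage 1 of the pipeline."""
    return ScriptPrompt(
        scene=rng.choice(SCENES),
        objects=rng.sample(OBJECTS, n_objects),
        actions=rng.sample(ACTIONS, n_actions),
    )

if __name__ == "__main__":
    print(sample_prompt(random.Random(0)).to_hit_text())
```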
This process contrasts with previous datasets that often rely on controlled environments or curated online content, which tend to skew towards more engaging or sports-related activities.
Dataset Characteristics
The Charades dataset includes 157 action classes and 46 object classes captured across 15 types of indoor scenes, with rigorous labeling ensuring comprehensive data coverage. Actions are temporally segmented, providing the fine-grained detail needed for various computer vision tasks. The magnitude and variety of the data collected allow for robust baseline evaluations.
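The temporal segmentation makes per-frame supervision straightforward to derive. The sketch below parses interval annotations in the style of the publicly released Charades CSVs, where, as far as I can tell from the distribution, each video's actions field is a semicolon-separated list of `class start end` triples; that field format is an assumption about the release, and `ActionInterval`/`parse_actions` are illustrative names.

```python
from dataclasses import dataclass

@dataclass
class ActionInterval:
    action_class: str  # e.g. "c092", one of the 157 action classes
    start: float       # seconds
    end: float         # seconds

def parse_actions(field: str) -> list:
    """Parse a 'c092 11.90 21.20;c147 0.00 12.60' style field into intervals."""
    intervals = []
    for chunk in filter(None, field.split(";")):
        cls, start, end = chunk.split()
        intervals.append(ActionInterval(cls, float(start), float(end)))
    return intervals

def active_classes(intervals, t: float):
    """Return the classes active at time t; overlapping intervals are common."""
    return [iv.action_class for iv in intervals if iv.start <= t <= iv.end]

if __name__ == "__main__":
    ivs = parse_actions("c092 11.90 21.20;c147 0.00 12.60")
    print(active_classes(ivs, 12.0))  # -> ['c092', 'c147']
```

Note that intervals routinely overlap, which is part of what makes the classification task multi-label.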
Baseline Evaluations
The paper evaluates several state-of-the-art algorithms for action recognition and sentence prediction using the Charades dataset. The main methods tested include:
- Improved Dense Trajectories (IDT): The strongest single method, at 17.2% mean average precision (mAP) for action classification.
- Static CNN Features and Two-Stream Networks: Use neural networks pre-trained on large image datasets as benchmarks for action recognition (a minimal late-fusion sketch follows this list).
- Sentence Prediction Models: including S2VT, an LSTM-based sequence-to-sequence model that combines spatial and temporal information for video captioning.
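As a rough illustration of the two-stream idea referenced above, the sketch below averages class scores from an RGB stream and a stacked-optical-flow stream (late fusion). The tiny `StreamNet` backbone is a stand-in for the pre-trained CNNs of the paper's era; only the fusion structure is meant to be faithful.

```python
import torch
import torch.nn as nn

NUM_CLASSES = 157  # Charades action classes

class StreamNet(nn.Module):
    """Toy stand-in for a pre-trained backbone; not the paper's architecture."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, NUM_CLASSES)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

class TwoStream(nn.Module):
    """Late fusion: average class scores from the RGB and optical-flow streams."""
    def __init__(self, flow_stack: int = 10):
        super().__init__()
        self.rgb = StreamNet(3)                # a single RGB frame
        self.flow = StreamNet(2 * flow_stack)  # stacked x/y flow fields

    def forward(self, rgb_frame, flow_frames):
        return 0.5 * (self.rgb(rgb_frame) + self.flow(flow_frames))

if __name__ == "__main__":
    model = TwoStream()
    scores = model(torch.randn(1, 3, 224, 224), torch.randn(1, 20, 224, 224))
    print(scores.shape)  # torch.Size([1, 157])
```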
The low absolute scores highlight the dataset's difficulty: even the best baselines leave substantial room for improvement in action recognition.
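For readers wanting to benchmark their own models, the mAP metric above is standard multi-label evaluation: average precision over the ranked list of videos, computed per class and then averaged across the 157 classes. A minimal sketch, assuming this standard protocol matches the paper's:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true: np.ndarray, y_score: np.ndarray) -> float:
    """Multi-label mAP: mean over classes of the average precision of the
    ranked video list. y_true is a binary (videos x classes) matrix."""
    # Skip classes with no positive videos, for which AP is undefined.
    valid = y_true.sum(axis=0) > 0
    per_class_ap = average_precision_score(
        y_true[:, valid], y_score[:, valid], average=None
    )
    return float(np.mean(per_class_ap))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_videos, n_classes = 200, 157
    y_true = (rng.random((n_videos, n_classes)) < 0.05).astype(int)
    y_score = rng.random((n_videos, n_classes))
    print(f"mAP (random scores): {mean_average_precision(y_true, y_score):.3f}")
```

Random scores land near each class's positive rate, giving a low chance baseline against which the reported 17.2% should be read.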
Implications and Future Directions
The Charades dataset sets a new standard for computer vision research by providing a rich source of real-world data. Its diversity and realistic scenarios are critical for developing models that generalize well to everyday environments. Potential future developments include:
- Enhanced Model Training: Leveraging the depth of Charades to improve models' understanding of complex interactions and context-based recognition.
- Expansion of Data Collection: Applying the "Hollywood in Homes" approach to other domains, such as outdoor activities or specific professional environments, to gather more varied training data.
- Robust Action-Object Interactions: Exploring the nuanced relationships between actions and objects to develop more sophisticated reasoning capabilities in AI systems.
Conclusion
The research presented by Sigurdsson et al. proposes a framework for dataset collection that harnesses the reach of crowdsourcing to close the gaps in how everyday life is represented in current datasets. Charades, the outcome, offers an extensively annotated video dataset poised to drive forward action recognition and automatic description generation in computer vision. By emphasizing the mundane and ordinary, Charades helps ensure that AI systems trained on it are attuned and adaptable to real-world settings, and the work opens new avenues for research built on realistic data.