Recognizing Fine-Grained and Composite Activities using Hand-Centric Features and Script Data
This paper addresses two major challenges in activity recognition: (1) detecting fine-grained activities characterized by low inter-class variability and subtle differences in body motion, particularly hand motion, and (2) recognizing composite activities that are composed of multiple smaller activities and are often described by textual narratives, or scripts. The paper's contributions include a richly annotated dataset, MPII Cooking 2, and methodological innovations such as hand-centric features and the use of script data.
Dataset and Problem Context
The MPII Cooking 2 dataset offers a comprehensive and detailed video corpus of cooking activities. It is distinguished by its realistic scenarios, captured as monocular video recordings of human subjects preparing a variety of dishes. These composite activities comprise multiple fine-grained actions such as chopping, peeling, and stirring, often performed on specific objects. The dataset is annotated with activity and participating-object labels, along with human pose annotations and textual scripts describing the activities.
Methodological Contributions
The paper's methodological contributions are twofold. First, it introduces hand-centric features for fine-grained activity recognition. The authors integrate a hand detector into pose estimation so that hand positions can be captured reliably even under partial occlusion. Dense Trajectories and color-based features extracted from the resulting hand regions then capture the subtle hand movements and object manipulations that distinguish fine-grained activities.
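To make the idea concrete, the following minimal sketch pools Dense Trajectory descriptors only from trajectories that pass near detected hands. The function names, the pooling radius, and the simple mean pooling are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

def hand_centric_mask(traj_points, hand_centers, radius=40.0):
    """Keep trajectories whose mean position lies within `radius` pixels
    of any detected hand center (a simplified proxy for hand-region
    pooling; the radius and helper names are hypothetical)."""
    # traj_points: (N, T, 2) array of N trajectories tracked over T frames
    # hand_centers: (H, 2) array of detected hand positions in the clip
    means = traj_points.mean(axis=1)  # (N, 2) mean position per trajectory
    dists = np.linalg.norm(means[:, None, :] - hand_centers[None, :, :], axis=-1)
    return dists.min(axis=1) <= radius  # (N,) boolean mask

def hand_centric_feature(traj_points, traj_desc, hand_centers):
    """Pool only hand-local descriptors (e.g., HOG/HOF/MBH rows of
    traj_desc, shape (N, D)) into a single clip-level feature."""
    keep = hand_centric_mask(traj_points, hand_centers)
    if not keep.any():
        return np.zeros(traj_desc.shape[1])  # no trajectory near a hand
    return traj_desc[keep].mean(axis=0)
```

In practice the paper uses richer encodings than mean pooling, but the key design choice is the same: restrict the spatial support of the features to the hand neighborhood rather than the whole frame.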
Second, the paper presents an attribute-based model for recognizing composite activities. Attributes here denote fine-grained activities and their participating objects. These attributes are linked to composite activities through script data: textual descriptions that provide semantic associations between attributes and composites. By mining the script data with techniques such as tf-idf, the authors leverage textual information to improve recognition performance and to enable zero-shot learning of composite activities without direct training data.
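The sketch below illustrates this mining step with scikit-learn's TfidfVectorizer: one document per composite activity (the example dishes, descriptions, and attribute list are invented for illustration), with the vocabulary restricted to attribute labels so each tf-idf weight scores how strongly an attribute is associated with a composite.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical script corpus: one document per composite activity,
# concatenating the textual descriptions collected for that dish.
scripts = {
    "prepare salad": "wash lettuce chop cucumber slice tomato mix bowl",
    "make pizza":    "knead dough spread sauce grate cheese bake oven",
}

attributes = ["chop", "slice", "wash", "knead", "bake", "grate"]

# Restricting the vocabulary to attribute labels yields a
# (composites x attributes) association matrix.
vec = TfidfVectorizer(vocabulary=attributes)
W = vec.fit_transform(list(scripts.values()))

for dish, row in zip(scripts, W.toarray()):
    print(dish, dict(zip(attributes, row.round(2))))
```

The resulting weight matrix plays the role of the semantic association between attributes and composite activities; because it is derived from text alone, a row can be built for a composite that has no training videos.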
Empirical Performance
The experimental evaluations show that hand-centric features outperform holistically extracted features such as state-of-the-art Dense Trajectories in recognizing fine-grained activities, since they focus specifically on hand activity, which is central to these tasks. Comparison against pose-based features further demonstrates the need for targeted hand-centric methods in scenarios with low inter-class variability.
For composite activity recognition, the attribute-based approach is shown to be highly effective. Incorporating noisy yet comprehensive script data yields significant improvements in recognition accuracy, particularly when training data is sparse. The results indicate the robustness of attribute-based modeling over low-level video descriptors, validating the approach of leveraging shared intermediate representations across activities.
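A schematic stand-in for how such a model scores composites, including unseen ones, is a weighted combination of per-video attribute classifier outputs with the script-derived association weights; the shapes and placeholder values below are assumptions for illustration.

```python
import numpy as np

def composite_scores(attr_probs, script_weights):
    """Score each composite activity as a weighted sum of attribute
    classifier confidences (a schematic version of attribute-based
    recognition, not the paper's exact formulation)."""
    # attr_probs: (A,) attribute classifier confidences for one video
    # script_weights: (C, A) script-derived attribute/composite associations
    return script_weights @ attr_probs  # (C,) one score per composite

# Zero-shot usage: a composite with no training videos can still be
# ranked, since its script-derived weight row is available.
attr_probs = np.array([0.9, 0.1, 0.7, 0.0, 0.2, 0.1])  # toy confidences
W = np.random.rand(3, 6)  # placeholder weights for 3 composites
print(composite_scores(attr_probs, W).argmax())  # highest-scoring composite
```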
Implications and Future Directions
The paper sets a precedent for integrating textual script data into activity recognition, opening pathways for future work in domains with complex, composite activities. The technique can be extended to fields such as assistive robotics, healthcare, and human-computer interaction, where understanding complex human activities is essential. Future work could enhance temporal modeling to incorporate the sequential knowledge encoded in textual scripts, yielding a more holistic understanding of composite activities.
In summary, the paper effectively advances the domain of activity recognition by proposing hand-centric features for fine-grained activities and utilizing script data for composite activity recognition. This combination not only improves recognition accuracy but also facilitates attribute sharing and learning across novel scenarios, demonstrating significant potential for evolving AI applications in practical contexts.