Compositional Action Recognition with Spatial-Temporal Interaction Networks
The paper "Something-Else: Compositional Action Recognition with Spatial-Temporal Interaction Networks" introduces a novel approach to action recognition in videos, focusing on the compositionality of human actions. This paper emphasizes the natural ability of humans to perform and recognize actions regardless of the specific objects involved, an ability that current models lack. The authors propose a novel model, the Spatial-Temporal Interaction Network (STIN), which explicitly reasons about geometric relationships between agents and objects involved in an action, a step forward in understanding compositionality in action recognition.
Key Contributions and Methodology
The paper makes several significant contributions to the field of action recognition:
- Introduction of a Compositional Action Recognition Task: The authors define a new benchmark on the Something-Something V2 dataset, termed Something-Else, that evaluates models on their ability to generalize to unseen combinations of verbs (actions) and nouns (objects). A model must therefore learn the action itself rather than memorize the objects it was trained with (a minimal split sketch follows this list).
- Spatial-Temporal Interaction Networks (STIN): The proposed STIN model leverages geometric relations between agents (e.g., hands) and objects to interpret actions. It builds sparse, semantically grounded object graphs tracked over time, in contrast to existing models that rely heavily on object appearance. The framework combines spatial interaction reasoning within each frame with temporal reasoning across frames to capture the relational dynamics that characterize an action (see the model sketch after this list).
- Few-Shot Compositional Action Recognition: The paper further introduces a few-shot learning setting in which only a limited number of examples of each novel class are available during training. The authors show that STIN generalizes well in this regime even with minimal training data, indicating the model's robustness and flexibility.
- Integration with Appearance-Based Models: Although STIN focuses on relational reasoning, it can be combined with appearance-based models such as I3D to further improve performance, particularly on actions where appearance cues are significant (a toy fusion example follows below).
- Introduction of Ground-Truth Annotations: The authors annotate frames of the Something-Something V2 dataset with dense per-frame bounding boxes for hands and objects, providing the supervision needed to train and evaluate models in the proposed compositional settings.
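To make the compositional split concrete, the sketch below partitions verb-object combinations so that every pairing seen at test time is unseen during training. The half-and-half grouping rule mirrors the spirit of the Something-Else splits, but the function name, toy data format, and grouping details are illustrative assumptions rather than the authors' actual protocol.

```python
# A minimal sketch of a compositional split: verb (action) categories and
# noun (object) categories are each divided into two halves, and the
# train/test sets pair them in opposite ways, so every test combination of
# verb and noun is unseen during training.
import random

def compositional_split(samples, verbs, nouns, seed=0):
    """samples: iterable of (verb, noun, video_id) triples."""
    rng = random.Random(seed)
    verbs, nouns = sorted(verbs), sorted(nouns)
    rng.shuffle(verbs)
    rng.shuffle(nouns)
    verb_a = set(verbs[: len(verbs) // 2])  # verb group A (rest is group B)
    noun_1 = set(nouns[: len(nouns) // 2])  # noun group 1 (rest is group 2)
    train, test = [], []
    for verb, noun, vid in samples:
        # Train on (A, 1) and (B, 2) pairings; test on (A, 2) and (B, 1),
        # so no verb-noun combination appears in both sets.
        if (verb in verb_a) == (noun in noun_1):
            train.append(vid)
        else:
            test.append(vid)
    return train, test
```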
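The model sketch below illustrates STIN-style spatial-temporal reasoning over tracked boxes in PyTorch: box coordinates are embedded per object per frame, objects exchange information within each frame, and each object's track is then reasoned over across time. The layer sizes, the mean-based message passing, and the pooling choices are simplifying assumptions, not the authors' reference implementation.

```python
# A minimal PyTorch sketch of STIN-style reasoning over tracked boxes.
import torch
import torch.nn as nn

class STINSketch(nn.Module):
    def __init__(self, num_classes, dim=256, num_frames=8):
        super().__init__()
        # Embed each box's normalized (cx, cy, w, h) coordinates.
        self.coord_embed = nn.Sequential(nn.Linear(4, dim), nn.ReLU())
        # Spatial interaction: fuse an object's feature with the mean of
        # all objects' features in the same frame.
        self.spatial = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        # Temporal reasoning: an MLP over each object's track, i.e. its
        # per-frame features concatenated along time.
        self.temporal = nn.Sequential(
            nn.Linear(num_frames * dim, dim), nn.ReLU()
        )
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, boxes):
        # boxes: (B, T, N, 4) normalized coordinates of N tracked entities
        # (hands + objects) over T frames.
        B, T, N, _ = boxes.shape
        x = self.coord_embed(boxes)                     # (B, T, N, D)
        # Per-frame interaction: a cheap stand-in for pairwise relations.
        ctx = x.mean(dim=2, keepdim=True).expand_as(x)  # (B, T, N, D)
        x = self.spatial(torch.cat([x, ctx], dim=-1))   # (B, T, N, D)
        # Per-object temporal reasoning over each track, then pool objects.
        x = x.permute(0, 2, 1, 3).reshape(B, N, -1)     # (B, N, T*D)
        x = self.temporal(x)                            # (B, N, D)
        video_feat = x.mean(dim=1)                      # (B, D)
        return self.classifier(video_feat)
```

As a usage example, `STINSketch(num_classes=174)(torch.rand(2, 8, 4, 4))` produces class logits for a batch of two clips, each with 8 frames and 4 tracked boxes (Something-Something V2 has 174 action classes). Note that the model sees only box geometry, never pixels, which is what forces it to learn object-agnostic motion patterns.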
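Finally, combining STIN with an appearance stream can be as simple as late fusion of class scores, as in the hypothetical snippet below; the equal weighting is an assumption, and the paper's actual fusion scheme may differ.

```python
# Hypothetical late fusion of a geometry stream (STIN) with an appearance
# stream (e.g., I3D): average the two models' class probabilities.
import torch

def fuse_predictions(stin_logits, i3d_logits, alpha=0.5):
    p_stin = torch.softmax(stin_logits, dim=-1)
    p_i3d = torch.softmax(i3d_logits, dim=-1)
    return alpha * p_stin + (1.0 - alpha) * p_i3d
```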
Results
The experimental results show notable improvements over baseline models such as I3D and STRG, particularly on tasks demanding compositional generalization. STIN achieved a substantial gain in the few-shot compositional setting and performed well in a setting where each training action involved only a single object type, confirming its capacity for action generalization.
Implications for Future Research
This work opens several pathways for future research. The compositional action recognition setting is a promising direction for developing models with more human-like generalization abilities. STIN's framework points toward explicitly modeling agent-object interactions, offering fertile ground for exploring relationships in multi-agent or multi-object scenarios. Future work could optimize the object detection module or incorporate more sophisticated graph representations of interactions, potentially improving the model's robustness to detection noise.
In conclusion, the STIN model marks a shift in action recognition toward modeling geometric transformations rather than static object appearance. It lays a foundation for further exploration of generalized action recognition, confronting challenges of scalability and adaptability across domains and applications in video-based AI systems.