What Would You Expect? Anticipating Egocentric Actions with Rolling-Unrolling LSTMs and Modality Attention (1905.09035v2)

Published 22 May 2019 in cs.CV and cs.AI

Abstract: Egocentric action anticipation consists in understanding which objects the camera wearer will interact with in the near future and which actions they will perform. We tackle the problem proposing an architecture able to anticipate actions at multiple temporal scales using two LSTMs to 1) summarize the past, and 2) formulate predictions about the future. The input video is processed considering three complementary modalities: appearance (RGB), motion (optical flow) and objects (object-based features). Modality-specific predictions are fused using a novel Modality ATTention (MATT) mechanism which learns to weigh modalities in an adaptive fashion. Extensive evaluations on two large-scale benchmark datasets show that our method outperforms prior art by up to +7% on the challenging EPIC-Kitchens dataset including more than 2500 actions, and generalizes to EGTEA Gaze+. Our approach is also shown to generalize to the tasks of early action recognition and action recognition. Our method is ranked first in the public leaderboard of the EPIC-Kitchens egocentric action anticipation challenge 2019. Please see our web pages for code and examples: http://iplab.dmi.unict.it/rulstm - https://github.com/fpv-iplab/rulstm.

Authors (2)
  1. Antonino Furnari (46 papers)
  2. Giovanni Maria Farinella (50 papers)
Citations (165)

Summary

Overview of "What Would You Expect? Anticipating Egocentric Actions With Rolling-Unrolling LSTMs and Modality Attention"

The paper "What Would You Expect? Anticipating Egocentric Actions With Rolling-Unrolling LSTMs and Modality Attention" by Antonino Furnari and Giovanni Maria Farinella presents an advanced approach for egocentric action anticipation. This task involves predicting the actions and object interactions a user will engage in, focusing on the field of First Person Vision. The authors propose a novel architecture that leverages Rolling-Unrolling LSTMs along with a Modality Attention (MATT) mechanism.

Methodology

The core innovation is a pair of interconnected Long Short-Term Memory networks (LSTMs) organized in a Rolling-Unrolling framework. The networks serve distinct purposes: the Rolling LSTM encodes streaming past observations, while the Unrolling LSTM, initialized with the Rolling LSTM's latest hidden and cell states, is iterated into the future to produce predictions at multiple anticipation times. This division of labor lets the model learn to summarize the past and to anticipate the future as separate sub-tasks.
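To make the rolling/unrolling idea concrete, below is a minimal PyTorch sketch of a single-modality branch. The feature dimension, number of unrolling steps, and the choice to re-feed the last observed feature at each future step are illustrative assumptions, not the authors' exact configuration (their code is available at the repository linked above).

```python
import torch
import torch.nn as nn

class RollingUnrollingLSTM(nn.Module):
    """Sketch of a rolling-unrolling LSTM branch for one modality."""

    def __init__(self, feat_dim=1024, hidden_dim=1024, num_classes=2513):
        super().__init__()
        self.rolling = nn.LSTMCell(feat_dim, hidden_dim)    # summarizes the past
        self.unrolling = nn.LSTMCell(feat_dim, hidden_dim)  # anticipates the future
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, feats, n_unroll=4):
        # feats: (batch, time, feat_dim) pre-extracted features of observed frames
        b = feats.size(0)
        h = feats.new_zeros(b, self.rolling.hidden_size)
        c = feats.new_zeros(b, self.rolling.hidden_size)
        # "Rolling" phase: encode the streaming past observations.
        for t in range(feats.size(1)):
            h, c = self.rolling(feats[:, t], (h, c))
        # "Unrolling" phase: start from the rolling state and iterate into
        # the future, emitting one prediction per anticipation step.
        # (Re-feeding the last observed feature is an assumption here.)
        preds = []
        last = feats[:, -1]
        for _ in range(n_unroll):
            h, c = self.unrolling(last, (h, c))
            preds.append(self.classifier(h))
        return torch.stack(preds, dim=1)  # (batch, n_unroll, num_classes)
```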

The authors process video through three complementary modalities: appearance (RGB), motion (optical flow), and objects (object-based features). Each modality is handled by a separate branch, and the branch-level predictions are fused through the MATT mechanism, which adaptively weighs the modalities according to how informative each is in the current context. This multi-modal processing helps the model cope with the uncertainty and rapid scene dynamics characteristic of egocentric video.
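The following sketch shows MATT-style late fusion under the assumption that each branch exposes a summary state vector and a vector of class scores; in the paper the attention weights are computed from the rolling LSTMs' internal states, which the generic `states` inputs stand in for here, and the hidden layer size is a placeholder.

```python
import torch
import torch.nn as nn

class ModalityAttention(nn.Module):
    """Sketch of MATT-style adaptive fusion over M modality branches."""

    def __init__(self, state_dim, num_modalities=3):
        super().__init__()
        # Small MLP scoring how much to trust each modality (sizes are placeholders).
        self.score = nn.Sequential(
            nn.Linear(num_modalities * state_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_modalities),
        )

    def forward(self, states, modality_scores):
        # states: list of M tensors (batch, state_dim) summarizing each branch
        # modality_scores: list of M tensors (batch, num_classes) of branch predictions
        w = torch.softmax(self.score(torch.cat(states, dim=-1)), dim=-1)  # (batch, M)
        stacked = torch.stack(modality_scores, dim=1)  # (batch, M, num_classes)
        return (w.unsqueeze(-1) * stacked).sum(dim=1)  # adaptively weighted fusion
```

The key design choice is that the weights are recomputed per sample, so, for example, the object branch can dominate when a relevant object is clearly visible while the motion branch takes over during strong camera motion.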

Empirical Evaluation

Comprehensive evaluations are conducted on two large-scale datasets: EPIC-Kitchens and EGTEA Gaze+. The authors report that their approach outperforms prior state-of-the-art models by notable margins, improving on the challenging EPIC-Kitchens dataset by up to 7%. The architecture also generalizes to the related tasks of early action recognition and action recognition, maintaining strong performance across these settings.

The experimental results report Top-5 Accuracy and Mean Top-5 Recall for verbs, nouns, and actions; the latter averages recall over classes and therefore accounts for class imbalance. The paper additionally evaluates "Mean Time to Action," which characterizes how far in advance actions can be correctly anticipated.
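As a concrete reference for these metrics, the sketch below computes Top-5 Accuracy and a class-averaged Top-5 Recall from a raw score matrix. This is a generic implementation for illustration, not the official EPIC-Kitchens evaluation script (which, among other details, restricts mean recall to a designated subset of classes).

```python
import torch

def top5_accuracy(scores, labels):
    """Fraction of samples whose true class is among the 5 highest scores."""
    top5 = scores.topk(5, dim=1).indices               # (N, 5)
    return (top5 == labels.unsqueeze(1)).any(dim=1).float().mean().item()

def mean_top5_recall(scores, labels, num_classes):
    """Top-5 recall averaged over classes, so rare classes count equally."""
    top5 = scores.topk(5, dim=1).indices
    hits = (top5 == labels.unsqueeze(1)).any(dim=1).float()
    per_class = [hits[labels == c].mean()
                 for c in range(num_classes) if (labels == c).any()]
    return torch.stack(per_class).mean().item()
```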

Implications and Future Directions

The presented methodology holds substantial implications for developing more responsive and intelligent wearable systems that anticipate user actions to facilitate interaction or safety measures. This is particularly pertinent for applications like augmented reality and assisted living. The paper also prompts further research into enhancing egocentric action anticipation through deeper glimpses into temporal dynamics and interactions.

Possible future developments include refining the modality attention mechanism to further personalize anticipation and extending the method to a broader range of input modalities. Moreover, exploring end-to-end learning could address the limitations the authors note regarding the reliance on pre-extracted, modality-specific features.

In summary, the authors provide a significant step forward in action anticipation research, advancing theoretical and practical understanding through an effective combination of sequential learning and modality fusion techniques. The paper’s contributions illuminate promising trajectories for the evolution of anticipatory systems in the domain of artificial intelligence.
