Multimodal Egocentric Action Recognition with Temporal Context
The paper "With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition" presents an innovative method for recognizing actions in egocentric video streams by leveraging multimodal temporal context. The primary challenge addressed by this research is the accurate recognition of fine-grained, rapidly occurring actions commonly found in egocentric perspectives, such as those captured in cooking activities in the EPIC-KITCHENS and EGTEA datasets.
Methodology
The authors propose a transformer-based model that exploits the temporal context of actions in video streams by integrating three modalities: vision, audio, and language. The central hypothesis is that the sequence of actions provides valuable contextual information that can enhance the recognition of individual actions. This is particularly relevant for egocentric video streams where actions, though brief, often occur as part of longer, predictable sequences.
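As a concrete illustration of this idea, the minimal sketch below assembles a temporal window of neighboring action features around a target action. The window size, boundary padding strategy, and feature dimensionality are illustrative assumptions, not the paper's exact pipeline.

```python
import torch

def build_temporal_window(features, center_idx, window_size=5):
    """Gather the features of the actions surrounding a target action.

    `features` is a (num_actions, feat_dim) tensor of per-action clip
    features from an untrimmed video; indices falling outside the video
    are clamped to the nearest valid action (a simple padding choice).
    """
    half = window_size // 2
    idx = torch.arange(center_idx - half, center_idx + half + 1)
    idx = idx.clamp(0, features.shape[0] - 1)
    return features[idx]  # (window_size, feat_dim)

# Toy usage: 20 actions with 2304-d SlowFast-style visual features.
video_feats = torch.randn(20, 2304)
window = build_temporal_window(video_feats, center_idx=7, window_size=5)
print(window.shape)  # torch.Size([5, 2304])
```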
Components of the Model
- Audio-Visual Transformer: The model uses a transformer architecture to process a temporally ordered sequence of visual and audio inputs. Each input is augmented with both positional and modality-specific encodings so that sequence order and modality identity are preserved. Notably, the model employs separate summary embeddings for verbs and nouns, allowing independent attention to actions and the objects they involve (a sketch of these components follows this list).
- Language Model Integration: A masked language model (MLM) trained on sequences of action labels captures the statistical structure of how actions follow one another. At inference, this learned prior is used to re-score the predictions of the audio-visual transformer, filtering out improbable action sequences.
- Auxiliary Loss Function: The model is trained with an auxiliary loss that uses the ground-truth labels of the surrounding actions, further refining the prediction of the target action at the center of the temporal window.
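The PyTorch sketch below shows how these pieces could fit together: visual and audio features from the temporal window are projected to a shared dimension, tagged with positional and modality encodings, and attended to by separate verb and noun summary tokens, with per-position auxiliary heads for the surrounding-action supervision. Class and parameter names, feature dimensionalities, and layer counts are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AudioVisualContextTransformer(nn.Module):
    """Transformer over a temporal window of visual and audio action features.

    Both modalities are projected to a shared dimension and tagged with
    learned positional and modality encodings; two summary tokens attend
    over the whole window to predict the verb and noun of the center action.
    """

    def __init__(self, vis_dim=2304, aud_dim=1024, d_model=512,
                 window_size=5, num_verbs=97, num_nouns=300,
                 num_layers=4, num_heads=8):
        super().__init__()
        self.proj_vis = nn.Linear(vis_dim, d_model)
        self.proj_aud = nn.Linear(aud_dim, d_model)
        # Positional encodings over window positions, shared across modalities.
        self.pos_emb = nn.Parameter(torch.zeros(window_size, d_model))
        # One learned encoding per modality (0: visual, 1: audio).
        self.mod_emb = nn.Parameter(torch.zeros(2, d_model))
        # Separate summary tokens for verb and noun prediction.
        self.verb_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.noun_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.verb_head = nn.Linear(d_model, num_verbs)
        self.noun_head = nn.Linear(d_model, num_nouns)
        # Per-position heads used for the auxiliary loss on neighboring actions.
        self.aux_verb_head = nn.Linear(d_model, num_verbs)
        self.aux_noun_head = nn.Linear(d_model, num_nouns)

    def forward(self, vis_feats, aud_feats):
        # vis_feats: (B, W, vis_dim), aud_feats: (B, W, aud_dim)
        B, W = vis_feats.shape[0], vis_feats.shape[1]
        vis = self.proj_vis(vis_feats) + self.pos_emb + self.mod_emb[0]
        aud = self.proj_aud(aud_feats) + self.pos_emb + self.mod_emb[1]
        tokens = torch.cat([self.verb_token.expand(B, -1, -1),
                            self.noun_token.expand(B, -1, -1),
                            vis, aud], dim=1)
        out = self.encoder(tokens)
        verb_logits = self.verb_head(out[:, 0])  # center-action verb
        noun_logits = self.noun_head(out[:, 1])  # center-action noun
        # Auxiliary per-position predictions over the visual tokens, supervised
        # at train time with the ground-truth labels of the surrounding actions.
        aux_verbs = self.aux_verb_head(out[:, 2:2 + W])
        aux_nouns = self.aux_noun_head(out[:, 2:2 + W])
        return verb_logits, noun_logits, aux_verbs, aux_nouns
```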
Results and Implications
The paper reports state-of-the-art performance on both the EPIC-KITCHENS and EGTEA datasets. The proposed Multimodal Temporal Context Network (MTCN) demonstrates superior accuracy in recognizing actions by integrating both past and future context. Noteworthy numerical results include a top-1 action accuracy improvement of up to 8% over previous methods on EPIC-KITCHENS-100 when using visual features extracted from the SlowFast network.
The authors analyze the impact of window size on model performance, finding that larger windows generally improve accuracy because more temporal context is available for recognition. Moreover, the inclusion of audio as a modality and the application of the language model at the output prediction stage are shown to further boost performance, particularly for verb recognition.
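As a rough illustration of the re-scoring step described above, the sketch below combines the audio-visual model's per-action log-probabilities with a language-model prior over candidate sequences. The exhaustive enumeration over top-k candidates and the `lm_score_fn` callable are simplifying assumptions for illustration; the paper's actual decoding procedure may differ.

```python
import itertools
import torch

def rescore_with_language_model(av_log_probs, lm_score_fn, beam_size=5):
    """Re-rank candidate action sequences with a language-model prior.

    av_log_probs: (window, num_actions) log-probabilities from the
    audio-visual model, one row per action in the temporal window.
    lm_score_fn: maps a list of action indices to a scalar log-probability
    under a language model trained on action sequences (hypothetical hook).
    Only the top `beam_size` candidates per position are considered, and a
    sequence's score is the sum of audio-visual and language-model terms.
    """
    topk = av_log_probs.topk(beam_size, dim=-1)
    best_seq, best_score = None, float("-inf")
    for combo in itertools.product(range(beam_size),
                                   repeat=av_log_probs.shape[0]):
        seq = [topk.indices[t, k].item() for t, k in enumerate(combo)]
        av_score = sum(topk.values[t, k].item() for t, k in enumerate(combo))
        score = av_score + lm_score_fn(seq)
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq, best_score

# Toy usage with a flat (uninformative) prior standing in for a trained MLM.
window, num_actions = 5, 300
av = torch.log_softmax(torch.randn(window, num_actions), dim=-1)
seq, score = rescore_with_language_model(av, lm_score_fn=lambda seq: 0.0)
```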
Future Directions
This research may pave the way for advancements in real-time, context-aware action recognition systems. Future development could explore extending the model's applicability to other forms of sequential data or untrimmed video streams without predefined action boundaries. The inclusion of additional modalities, such as depth information or sensor data from wearable devices, could potentially enhance the robustness of the recognition system.
The implications of this work suggest significant potential for improving automated video analysis in domains like surveillance, autonomous navigation, and human-computer interaction, where understanding the sequence and context of actions is critical. As multimodal learning and temporal context modeling continue to evolve, they are likely to play a pivotal role in the development of more sophisticated AI systems capable of understanding complex human activities.