- The paper introduces a two-step temporal action localization approach that combines robust clip-level feature extraction with transformer-based sequence modeling.
- The paper reports large gains over the challenge baseline, raising average test-set mAP from 5.68% to 21.76% and reaching 42.54% Recall@1x at tIoU=0.5 through combined features and careful fusion.
- The paper demonstrates that egocentric pre-training via EgoVLP features complements backbones trained on third-person data, improving localization on first-person video.
Overview of ActionFormer for Ego4D Moment Queries Challenge
This paper presents a submission to the Ego4D Moment Queries (MQ) Challenge 2022, advancing temporal action localization by pairing ActionFormer, a transformer-based backbone, with strong video features from SlowFast, Omnivore, and EgoVLP. The combination highlights the synergy between a strong architecture and enriched video features, especially in egocentric video.
Methodological Strengths
The paper details the two-step approach common in temporal action localization. First, clip-level features are extracted from the videos using pre-trained networks such as SlowFast, pre-trained on Kinetics, and EgoVLP, pre-trained on the Ego4D dataset. These feature sequences then feed the second stage, where ActionFormer models them to localize action onsets and offsets together with their categories.
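To make the two-step structure concrete, here is a minimal, self-contained sketch in PyTorch. It assumes clip-level features have already been extracted offline (one vector per fixed-stride clip); the `SequenceLocalizer` class, its default dimensions, and the plain transformer encoder are illustrative stand-ins for the ActionFormer backbone, not the authors' implementation.

```python
# Minimal sketch of the two-step pipeline (assumption: features precomputed offline).
import torch
import torch.nn as nn


class SequenceLocalizer(nn.Module):
    """Transformer encoder over clip features with per-timestep prediction heads."""

    def __init__(self, feat_dim: int, model_dim: int = 256,
                 num_classes: int = 110, num_layers: int = 4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, model_dim)            # map features to model width
        enc_layer = nn.TransformerEncoderLayer(
            d_model=model_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)
        self.cls_head = nn.Linear(model_dim, num_classes)     # action category per timestep
        self.reg_head = nn.Linear(model_dim, 2)               # distances to onset / offset

    def forward(self, feats: torch.Tensor):
        # feats: (batch, T, feat_dim) sequence of clip-level features
        x = self.encoder(self.proj(feats))
        return self.cls_head(x), self.reg_head(x).relu()      # class logits, non-negative boundaries


# Toy forward pass: 1 video, 128 clips, 2304-dim fused features (hypothetical sizes).
feats = torch.randn(1, 128, 2304)
cls_logits, boundaries = SequenceLocalizer(feat_dim=2304)(feats)
print(cls_logits.shape, boundaries.shape)  # (1, 128, 110), (1, 128, 2)
```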
Feature Extraction and Fusion
The authors meticulously select multiple pre-trained networks for feature extraction:
- SlowFast, pre-trained on Kinetics, contributes general third-person action dynamics.
- Omnivore and EgoVLP further enrich the feature set, with EgoVLP giving a distinct boost owing to its egocentric pre-training on Ego4D.
For feature fusion, each input feature stream is first projected to a common dimensionality before the streams are concatenated. This reportedly outperforms naive concatenation of the raw features, underscoring the importance of managing feature dimensionality.
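One possible reading of this fusion scheme, sketched in PyTorch: each feature stream passes through its own projection to a shared width, and the results are concatenated along the channel dimension. The `ProjectedFusion` module, the per-stream dimensions, and the shared width of 512 are assumptions for illustration, not values taken from the paper.

```python
# Hedged sketch of projection-then-concatenation fusion (dimensions are illustrative).
import torch
import torch.nn as nn


class ProjectedFusion(nn.Module):
    """Project each feature stream to a shared width, then concatenate."""

    def __init__(self, in_dims, shared_dim: int = 512):
        super().__init__()
        # One linear projection (with LayerNorm and ReLU) per input feature stream.
        self.projs = nn.ModuleList(
            nn.Sequential(nn.Linear(d, shared_dim), nn.LayerNorm(shared_dim), nn.ReLU())
            for d in in_dims
        )

    def forward(self, streams):
        # streams: list of (batch, T, d_i) tensors, one per feature network,
        # temporally aligned on the same T clips.
        projected = [proj(x) for proj, x in zip(self.projs, streams)]
        return torch.cat(projected, dim=-1)  # (batch, T, shared_dim * num_streams)


# Toy usage with three hypothetical feature streams.
slowfast = torch.randn(1, 128, 2304)
omnivore = torch.randn(1, 128, 1536)
egovlp = torch.randn(1, 128, 256)
fused = ProjectedFusion([2304, 1536, 256])([slowfast, omnivore, egovlp])
print(fused.shape)  # torch.Size([1, 128, 1536])
```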
Performance Evaluation
The experimental results show substantial improvements over the baseline. Combining SlowFast, Omnivore, and EgoVLP features, the model reaches an average mAP of 21.76% on the test set, compared with 5.68% for the baseline. Recall@1x at tIoU=0.5 climbs to 42.54%, likewise exceeding the baseline and reported as a 1.41 percentage point advantage over the top-ranked solution.
The addition of EgoVLP notably elevates both mAP and Recall, reinforcing the hypothesis that egocentric-specific pre-training profoundly influences performance on relevant downstream tasks.
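For reference, the tIoU threshold used in these metrics is the temporal overlap between a predicted segment and a ground-truth segment divided by their union. A minimal sketch follows; the function name and the `(start, end)` representation are illustrative.

```python
# Temporal IoU between two segments given as (start, end) pairs in seconds.
def temporal_iou(pred, gt):
    """tIoU between two segments, each a (start, end) pair with end >= start."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


# A prediction counts as correct at tIoU=0.5 if it overlaps a ground-truth
# moment of the same class with tIoU >= 0.5.
print(temporal_iou((2.0, 8.0), (4.0, 10.0)))  # 0.5
```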
Contributions and Implications
This work offers several implications for future research in action localization:
- Egocentric Video Adaptation: Demonstrating that architectures traditionally applied to third-person data can benefit from egocentric adaptations using pre-trained models like EgoVLP.
- Feature Fusion Strategies: Providing evidence that careful fusion, here projection to a common dimension before concatenation, extracts more from complementary feature sets than naive concatenation.
- Scalability and Flexibility: Showcasing a model that adapts across video contexts by leveraging the transformer backbone's capacity for temporal reasoning.
Future Directions
The paper acknowledges potential gains from incorporating LLMs to exploit textual descriptions of, and relationships within, video content. Additionally, integrating cues unique to the egocentric perspective, such as ego-motion or object presence, could refine contextual understanding.
As multimodal learning for video matures, the methodology outlined in this work provides a foundation for extending temporal action localization, particularly in egocentric vision applications.
In conclusion, the work combines a state-of-the-art backbone with strong feature networks into a compelling solution for the Ego4D MQ task, marking a clear step forward in both the method and its application.