Egocentric Video-Language Pretraining @ EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022 (2207.01334v2)

Published 4 Jul 2022 in cs.CV

Abstract: In this report, we propose a video-language pretraining (VLP) based solution \cite{kevin2022egovlp} for the EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge. Especially, we exploit the recently released Ego4D dataset \cite{grauman2021ego4d} to pioneer Egocentric VLP from pretraining dataset, pretraining objective, and development set. Based on the above three designs, we develop a pretrained video-language model that is able to transfer its egocentric video-text representation to MIR benchmark. Furthermore, we devise an adaptive multi-instance max-margin loss to effectively fine-tune the model and equip the dual-softmax technique for reliable inference. Our best single model obtains strong performance on the challenge test set with 47.39% mAP and 61.44% nDCG. The code is available at https://github.com/showlab/EgoVLP.

Authors (12)
  1. Kevin Qinghong Lin (28 papers)
  2. Alex Jinpeng Wang (20 papers)
  3. Rui Yan (250 papers)
  4. Eric Zhongcong Xu (6 papers)
  5. Rongcheng Tu (9 papers)
  6. Yanru Zhu (2 papers)
  7. Wenzhe Zhao (11 papers)
  8. Weijie Kong (11 papers)
  9. Chengfei Cai (10 papers)
  10. Hongfa Wang (29 papers)
  11. Wei Liu (1135 papers)
  12. Mike Zheng Shou (165 papers)
Citations (1)

Summary

Egocentric Video-Language Pretraining for Multi-Instance Retrieval Challenge

The report "Egocentric Video-Language Pretraining @ EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2022" tackles the challenge of adapting video-language pretraining (VLP) models to egocentric video data. Unlike traditional third-person datasets, egocentric footage captured by wearable cameras poses distinct challenges and opportunities for video-language learning. The paper leverages the large-scale Ego4D dataset to build video-language models tailored to egocentric applications, with the primary aim of improving performance on the EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge.

Methodological Contributions

The authors introduce a pretraining framework built on three components: the EgoClip pretraining dataset, the EgoNCE pretraining objective, and an adaptive MI-MM fine-tuning loss, designed together to bridge the gap between existing VLP models and the requirements of egocentric data.

  1. Pretraining Dataset Utilization: The use of the EgoClip dataset, a subset of Ego4D containing 3.85 million video-text pairs, provides diverse and rich egocentric inputs, which are crucial for robust VLP.
  2. Pretraining Objectives: The paper introduces EgoNCE, an extension of the standard InfoNCE loss tailored to egocentric video-text alignment. The objective adapts positive and negative sampling to the egocentric setting, improving learning through domain-specific choices (a simplified loss sketch follows this list).
  3. Adaptive Loss Functions: For task-specific fine-tuning on the EPIC-KITCHENS-100 MIR task, the authors propose an Adaptive Multi-Instance Max-Margin (MI-MM) loss that adjusts the loss margin according to the semantic relevance between instances, improving retrieval performance (also sketched below).
  4. Evaluation and Inference Techniques: At inference time, the dual-softmax technique re-normalizes the cross-modal similarity matrix along both the text and video dimensions, sharpening scores for mutually best-matching pairs and yielding more reliable retrieval (see the re-scoring sketch below).
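
To make the pretraining objective concrete, the following is a minimal PyTorch-style sketch of a symmetric InfoNCE-style contrastive loss with an optional positive mask standing in for EgoNCE's egocentric-aware sampling. The function name, the pos_mask argument, and the temperature value are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def egonce_style_loss(video_emb, text_emb, pos_mask=None, temperature=0.05):
    """Symmetric InfoNCE-style contrastive loss with multiple positives per anchor.

    video_emb, text_emb: (B, D) L2-normalized embeddings for B clip-text pairs.
    pos_mask: optional (B, B) boolean mask marking extra positives (e.g. clips
              that share the same action); the diagonal is always positive.
              This mask is a stand-in for EgoNCE's egocentric-aware sampling.
    """
    B = video_emb.size(0)
    sim = video_emb @ text_emb.t() / temperature             # (B, B) similarity matrix

    eye = torch.eye(B, dtype=torch.bool, device=sim.device)
    pos = eye if pos_mask is None else (pos_mask | eye)
    pos = pos.float()

    # cross-entropy over the batch, averaged over all positives of each anchor
    log_p_v2t = F.log_softmax(sim, dim=1)                    # video -> text
    log_p_t2v = F.log_softmax(sim.t(), dim=1)                # text  -> video

    loss_v2t = -(log_p_v2t * pos).sum(1) / pos.sum(1)
    loss_t2v = -(log_p_t2v * pos.t()).sum(1) / pos.t().sum(1)
    return 0.5 * (loss_v2t.mean() + loss_t2v.mean())
```

With pos_mask=None this reduces to the standard symmetric InfoNCE used in most VLP work; the egocentric adaptation lives in how the positive mask and the negative set are constructed from the Ego4D annotations.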
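Similarly, one plausible reading of an "adaptive" max-margin objective is a hinge loss whose margin shrinks when a mismatched pair is itself partially relevant. The sketch below assumes a dense relevance matrix (as provided by the MIR annotations) and a hypothetical base_margin hyperparameter; it illustrates the idea rather than the authors' exact formulation.

```python
import torch

def adaptive_max_margin_loss(sim, relevance, base_margin=0.2):
    """Hinge (max-margin) loss whose margin adapts to semantic relevance.

    sim:       (B, B) cross-modal similarity matrix (videos as rows, texts as columns).
    relevance: (B, B) relevance scores in [0, 1] for every video-text pair, with
               1.0 on the diagonal (assumed to come from the MIR relevance annotations).
    The required margin between a matched pair and a mismatched pair shrinks when
    the mismatched pair is itself partially relevant.
    """
    B = sim.size(0)
    pos = sim.diag().unsqueeze(1)                           # (B, 1) matched-pair similarity
    margin = base_margin * (1.0 - relevance)                # smaller margin for related pairs
    off_diag = ~torch.eye(B, dtype=torch.bool, device=sim.device)

    v2t = torch.clamp(margin + sim - pos, min=0.0)          # rank texts for each video
    t2v = torch.clamp(margin.t() + sim.t() - pos, min=0.0)  # rank videos for each text
    return 0.5 * (v2t[off_diag].mean() + t2v[off_diag].mean())
```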
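Finally, dual-softmax re-scoring is commonly implemented by softmax-normalizing the test-time similarity matrix along each dimension and combining the two views; the temperature and the exact combination used in the report may differ from this sketch.

```python
import torch
import torch.nn.functional as F

def dual_softmax_rescoring(sim, temperature=100.0):
    """Re-score a test-time similarity matrix with dual softmax.

    sim: (num_videos, num_texts) raw cosine-similarity matrix.
    Softmax over texts (per video) and over videos (per text) are combined,
    boosting pairs that are each other's mutual best match before ranking.
    """
    v2t = F.softmax(sim * temperature, dim=1)   # normalize each video's row
    t2v = F.softmax(sim * temperature, dim=0)   # normalize each text's column
    return v2t * t2v * sim                      # element-wise re-weighted scores
```

Retrieval metrics are then computed on the re-weighted matrix exactly as they would be on the raw similarities.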

Empirical Results

Empirically, the proposed egocentric VLP model delivers substantial improvements in mean Average Precision (mAP) and normalized Discounted Cumulative Gain (nDCG) on the EPIC-KITCHENS-100 MIR challenge. The best single model reaches 47.39% mAP and 61.44% nDCG on the challenge test set, underscoring the benefits of egocentric-tailored pretraining and the adaptive loss used for fine-tuning.

Implications and Future Directions

This paper's theoretical and practical implications are significant:

  • Enhancement of Egocentric VLP: The contributions significantly boost the efficacy of VLP in egocentric domains, which is beneficial for applications in augmented reality, robotics, and human-computer interaction.
  • Generalization of VLP Approaches: By addressing domain-specific gaps, such methods could extend to other specialized video-analysis fields, offering a template for adaptation across varied domains.
  • Towards Unsupervised and Continual Learning: Future research might expand on this work by exploring more autonomous systems capable of handling dynamic domain variations without the need for extensive labeled datasets.

This paper exemplifies the advances possible in video-language integration, paving the way for the study and application of AI across diverse video contexts, particularly those requiring an egocentric perspective.
