
GPT4Ego: Unleashing the Potential of Pre-trained Models for Zero-Shot Egocentric Action Recognition (2401.10039v2)

Published 18 Jan 2024 in cs.CV

Abstract: Vision-Language Models (VLMs), pre-trained on large-scale datasets, have shown impressive performance in various visual recognition tasks. This advancement paves the way for notable performance in Zero-Shot Egocentric Action Recognition (ZS-EAR). Typically, VLMs handle ZS-EAR as a global video-text matching task, which often leads to suboptimal alignment of vision and linguistic knowledge. We propose a refined approach for ZS-EAR using VLMs, emphasizing fine-grained concept-description alignment that capitalizes on the rich semantic and contextual details in egocentric videos. In this paper, we introduce GPT4Ego, a straightforward yet remarkably potent VLM framework for ZS-EAR, designed to enhance the fine-grained alignment of concept and description between vision and language. Extensive experiments demonstrate GPT4Ego significantly outperforms existing VLMs on three large-scale egocentric video benchmarks, i.e., EPIC-KITCHENS-100 (33.2%, +9.4%), EGTEA (39.6%, +5.5%), and CharadesEgo (31.5%, +2.6%).
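
For context on the baseline the abstract argues against, below is a minimal sketch of the "global video-text matching" formulation of ZS-EAR: a frozen CLIP-style VLM pools frame embeddings into a single clip-level vector and scores it against plain class-name prompts. This is not GPT4Ego's fine-grained concept-description alignment; the model checkpoint and prompt template are assumptions chosen for illustration.

```python
# Sketch of global video-text matching for zero-shot egocentric action recognition.
# Assumes a frozen CLIP checkpoint from Hugging Face; not the paper's GPT4Ego method.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def zero_shot_action_scores(frames: list[Image.Image], class_names: list[str]) -> torch.Tensor:
    """Return a [num_classes] similarity vector for one egocentric clip."""
    prompts = [f"a photo of a person {c}" for c in class_names]  # assumed prompt template
    inputs = processor(text=prompts, images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    video = img.mean(dim=0, keepdim=True)                # global (clip-level) pooling
    video = video / video.norm(dim=-1, keepdim=True)
    return (video @ txt.T).squeeze(0)                    # cosine similarity per class
```

Because the whole clip is collapsed into one embedding and each class into one short prompt, this matching is coarse; the paper's contribution is to replace it with finer-grained alignment between visual concepts and richer textual descriptions.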

Authors (5)
  1. Guangzhao Dai (4 papers)
  2. Xiangbo Shu (39 papers)
  3. Wenhao Wu (71 papers)
  4. Rui Yan (250 papers)
  5. Jiachao Zhang (6 papers)
Citations (4)