Retrieval-Augmented Egocentric Video Captioning (2401.00789v4)
Abstract: Understanding human actions from first-person (egocentric) videos poses significant challenges. Most prior approaches learn representations from egocentric videos alone, overlooking the potential benefit of exploiting existing large-scale third-person videos. In this paper, (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the captioning of egocentric videos. (2) To train the cross-view retrieval module, we devise an automatic pipeline that discovers ego-exo video pairs from distinct large-scale egocentric and exocentric datasets. (3) We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features describing similar actions. (4) Through extensive experiments, our cross-view retrieval module demonstrates superior performance across seven benchmarks, and on egocentric video captioning, EgoInstructor achieves significant improvements by leveraging third-person videos as references. Project page: https://jazzcharles.github.io/Egoinstructor/
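As a rough illustration of the idea behind the EgoExoNCE objective described in the abstract, the sketch below implements a symmetric InfoNCE-style loss in which egocentric and exocentric video features are each contrasted against shared text features, which implicitly pulls the two views of the same action together. The function name, tensor shapes, temperature value, and the simple one-to-one positive assignment are assumptions made for illustration; the paper's actual loss defines positives via text describing similar actions, so this is a minimal sketch rather than the authors' implementation.

```python
# Hypothetical sketch of an EgoExoNCE-style objective (not the paper's code).
import torch
import torch.nn.functional as F


def ego_exo_nce(ego_feat, exo_feat, text_feat, temperature=0.07):
    """Symmetric InfoNCE aligning ego and exo video features to shared text.

    ego_feat, exo_feat, text_feat: (B, D) tensors; row i of each tensor is
    assumed to describe the same action (a simplification of the paper's
    similar-action positives).
    """
    ego = F.normalize(ego_feat, dim=-1)
    exo = F.normalize(exo_feat, dim=-1)
    txt = F.normalize(text_feat, dim=-1)

    targets = torch.arange(ego.size(0), device=ego.device)

    def nce(video, text):
        logits = video @ text.t() / temperature  # (B, B) similarity matrix
        # video-to-text and text-to-video cross-entropy against the diagonal
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Aligning both views to the same text embeddings draws egocentric and
    # exocentric features of the same action closer in the joint space.
    return 0.5 * (nce(ego, txt) + nce(exo, txt))


if __name__ == "__main__":
    B, D = 8, 256
    loss = ego_exo_nce(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
    print(loss.item())
```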
Authors: Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie