Retrieval-Augmented Egocentric Video Captioning (2401.00789v4)
Abstract: Understanding human actions from first-person (egocentric) videos poses significant challenges. Most prior approaches learn representations from egocentric videos alone, overlooking the potential benefit of exploiting existing large-scale third-person videos. In this paper, (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the captioning of egocentric videos. (2) To train the cross-view retrieval module, we devise an automatic pipeline that discovers ego-exo video pairs from distinct large-scale egocentric and exocentric datasets. (3) We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features describing similar actions. (4) Through extensive experiments, our cross-view retrieval module demonstrates superior performance across seven benchmarks, and on egocentric video captioning, EgoInstructor achieves significant improvements by leveraging third-person videos as references. Project page: https://jazzcharles.github.io/Egoinstructor/
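As a rough illustration of the idea behind the EgoExoNCE objective described in the abstract, the sketch below implements a symmetric InfoNCE-style loss in which egocentric and exocentric video features are each contrasted against shared text features, which implicitly pulls the two views of the same action together. The function name, tensor shapes, temperature value, and the simple one-to-one positive assignment are assumptions made for illustration; the paper's actual loss defines positives via text describing similar actions, so this is a minimal sketch rather than the authors' implementation.

```python
# Hypothetical sketch of an EgoExoNCE-style objective (not the paper's code).
import torch
import torch.nn.functional as F


def ego_exo_nce(ego_feat, exo_feat, text_feat, temperature=0.07):
    """Symmetric InfoNCE aligning ego and exo video features to shared text.

    ego_feat, exo_feat, text_feat: (B, D) tensors; row i of each tensor is
    assumed to describe the same action (a simplification of the paper's
    similar-action positives).
    """
    ego = F.normalize(ego_feat, dim=-1)
    exo = F.normalize(exo_feat, dim=-1)
    txt = F.normalize(text_feat, dim=-1)

    targets = torch.arange(ego.size(0), device=ego.device)

    def nce(video, text):
        logits = video @ text.t() / temperature  # (B, B) similarity matrix
        # video-to-text and text-to-video cross-entropy against the diagonal
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Aligning both views to the same text embeddings draws egocentric and
    # exocentric features of the same action closer in the joint space.
    return 0.5 * (nce(ego, txt) + nce(exo, txt))


if __name__ == "__main__":
    B, D = 8, 256
    loss = ego_exo_nce(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
    print(loss.item())
```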
Authors: Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie