Retrieval-Augmented Egocentric Video Captioning (2401.00789v4)

Published 1 Jan 2024 in cs.CV

Abstract: Understanding human actions from videos of first-person view poses significant challenges. Most prior approaches explore representation learning on egocentric videos only, while overlooking the potential benefit of exploiting existing large-scale third-person videos. In this paper, (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the video captioning of egocentric videos. (2) For training the cross-view retrieval module, we devise an automatic pipeline to discover ego-exo video pairs from distinct large-scale egocentric and exocentric datasets. (3) We train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions. (4) Through extensive experiments, our cross-view retrieval module demonstrates superior performance across seven benchmarks. Regarding egocentric video captioning, EgoInstructor exhibits significant improvements by leveraging third-person videos as references. Project page is available at: https://jazzcharles.github.io/Egoinstructor/
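The EgoExoNCE loss described above aligns egocentric and exocentric video features through shared text features of the action they both depict. The snippet below is a minimal sketch of such a loss, assuming batched, paired (ego, exo, text) embeddings; it is an illustration of the general idea, not the authors' released implementation, and the function name and tensor shapes are assumptions for this example.

```python
# Sketch of an EgoExoNCE-style contrastive loss (illustrative, not the paper's code).
# Both ego and exo clip features are pulled toward the text features of the shared
# narration, so cross-view clips of similar actions end up close in the joint space.
import torch
import torch.nn.functional as F

def ego_exo_nce(ego_feats, exo_feats, text_feats, temperature=0.07):
    """ego_feats, exo_feats, text_feats: (B, D) embeddings where row i of each
    tensor corresponds to the same action description."""
    ego = F.normalize(ego_feats, dim=-1)
    exo = F.normalize(exo_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)

    targets = torch.arange(ego.size(0), device=ego.device)

    def nce(video, text):
        # Symmetric InfoNCE between one video view and the shared text anchor.
        logits = video @ text.t() / temperature
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # Averaging the two view-to-text terms implicitly draws ego and exo features
    # of the same action toward each other via the shared text features.
    return 0.5 * (nce(ego, txt) + nce(exo, txt))
```

In use, a batch of paired egocentric clips, exocentric clips, and their captions would be encoded by the respective video and text encoders, and this loss would be minimized during training of the cross-view retrieval module.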

Authors (7)
  1. Jilan Xu (32 papers)
  2. Yifei Huang (71 papers)
  3. Junlin Hou (19 papers)
  4. Guo Chen (107 papers)
  5. Yuejie Zhang (31 papers)
  6. Rui Feng (67 papers)
  7. Weidi Xie (132 papers)
Citations (15)

