Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties (2311.17041v4)

Published 28 Nov 2023 in cs.CV, cs.AI, and cs.CL

Abstract: A major reason behind the recent success of LLMs is their in-context learning capability, which makes it possible to rapidly adapt them to downstream text-based tasks by prompting them with a small number of relevant demonstrations. While large vision-language models (VLMs) have recently been developed for tasks requiring both text and images, they largely lack in-context learning over visual information, especially in understanding and generating text about videos. In this work, we implement Emergent In-context Learning on Videos (EILEV), a novel training paradigm that induces in-context learning over video and text by capturing key properties of pre-training data found by prior work to be essential for in-context learning in transformers. In our experiments, we show that EILEV-trained models outperform other off-the-shelf VLMs in few-shot video narration for novel, rare actions. Furthermore, we demonstrate that these key properties of bursty distributions, skewed marginal distributions, and dynamic meaning each contribute to varying degrees to VLMs' in-context learning capability in narrating procedural videos. Our results, analysis, and EILEV-trained models yield numerous insights about the emergence of in-context learning over video and text, creating a foundation for future work to optimize and scale VLMs for open-domain video understanding and reasoning. Our code and demo are available at https://github.com/yukw777/EILEV.

References (38)
  1. Flamingo: a visual language model for few-shot learning. In Advances in Neural Information Processing Systems, pages 23716–23736. Curran Associates, Inc., 2022.
  2. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023.
  3. Can foundation models watch, talk and guide you step by step to make a cake? arXiv preprint arXiv:2311.00738, 2023.
  4. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  5. Data distributional properties drive emergent in-context learning in transformers. Advances in Neural Information Processing Systems, 35:18878–18891, 2022.
  6. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
  7. You-do, i-learn: Egocentric unsupervised discovery of objects and their modes of interaction towards video-based guidance. Computer Vision and Image Understanding, 149:98–112, 2016.
  8. Scaling egocentric vision: The epic-kitchens dataset. In Proceedings of the European Conference on Computer Vision (ECCV), pages 720–736, 2018.
  9. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100. International Journal of Computer Vision (IJCV), 130:33–55, 2022.
  10. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  11. Social interactions: A first-person perspective. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1226–1233. IEEE, 2012.
  12. Ego4d: Around the World in 3,000 Hours of Egocentric Video. In IEEE/CVF Computer Vision and Pattern Recognition (CVPR), 2022.
  13. Language models are general-purpose interfaces. arXiv preprint arXiv:2206.06336, 2022.
  14. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023.
  15. Discovering important people and objects for egocentric video summarization. In 2012 IEEE conference on computer vision and pattern recognition, pages 1346–1353. IEEE, 2012.
  16. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425, 2023a.
  17. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023b.
  18. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
  19. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023c.
  20. In the eye of beholder: Joint learning of gaze and actions in first person video. In Proceedings of the European conference on computer vision (ECCV), pages 619–635, 2018.
  21. Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004.
  22. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
  23. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  24. Metaicl: Learning to learn in context. arXiv preprint arXiv:2110.15943, 2021.
  25. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
  26. Detecting activities of daily living in first-person camera views. In 2012 IEEE conference on computer vision and pattern recognition, pages 2847–2854. IEEE, 2012.
  27. Wearable-assisted localization and inspection guidance system using egocentric stereo cameras. IEEE Sensors Journal, 18(2):809–821, 2017.
  28. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2019.
  29. Assembly101: A large-scale multi-view video dataset for understanding procedural activities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21096–21106, 2022.
  30. Charades-ego: A large-scale dataset of paired third and first person videos. arXiv preprint arXiv:1804.09626, 2018.
  31. Detecting engagement in egocentric video. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14, pages 454–471. Springer, 2016.
  32. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  33. Holoassist: an egocentric human interaction dataset for interactive ai assistants in the real world. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20270–20281, 2023.
  34. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
  35. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
  36. Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675, 2019.
  37. Learning video representations from large language models. In CVPR, 2023.
  38. Multimodal c4: An open, billion-scale corpus of images interleaved with text. arXiv preprint arXiv:2304.06939, 2023.

Summary

  • The paper introduces EILEV, a novel approach that improves in-context learning in vision-language models for egocentric videos.
  • It adapts VLM architectures to process interleaved video and text through a frozen language model and uses targeted data sampling to reproduce the distributional properties linked to in-context learning in text.
  • Results demonstrate superior few-shot narration and generalization to novel, rare actions while reducing dependence on massive training datasets.

Efficient In-Context Learning in Vision-Language Models for Egocentric Videos

The paper "Efficient In-Context Learning in Vision-LLMs for Egocentric Videos" by Keunwoo Peter Yu et al. presents a paper on enhancing in-context learning capabilities within vision-LLMs (VLMs), specifically targeting the domain of egocentric videos. This investigation recognizes the limitations inherent in current methods that require vast collections of naturalistic data, which are both cost-prohibitive and time-consuming to acquire. The paper proposes a novel method named EILEV (Efficient In-Context Learning on Egocentric Videos) designed to elicit in-context learning capabilities in VLMs without necessitating massive datasets.

Methodological Advancements

EILEV introduces architectural adaptations and data-oriented strategies to enhance the in-context learning capabilities of VLMs, especially for egocentric video content. Key components of this approach include:

  1. Context Processing: The paper adapts existing VLM architectures to process interleaved video and text data, leveraging a frozen LLM as a universal interface so that the model can draw on context from both modalities (a minimal sketch of this interleaving appears at the end of this subsection).
  2. Data Sampling Strategy: A critical component of EILEV is a method for constructing training data with specific distributional characteristics known to support in-context learning (illustrated in the toy sampler directly after this list). These include:
    • Clusters of verbs and nouns to reproduce bursty distributions.
    • Marginal distributions with a long tail of infrequent items to emphasize less common actions.
    • Inclusion of homonyms and synonyms to introduce ambiguity and reliance on context for disambiguation.
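
To make these properties concrete, the following is a minimal Python sketch of a distribution-aware sampler. It is not the paper's actual data pipeline: it merely assembles training sequences in which one (verb, noun) action "bursts" while the remaining clips follow a long-tailed, Zipf-like marginal. The clip fields (`verb`, `noun`, `narration`), the Zipf exponent, and the toy corpus are assumptions made for illustration.

```python
# Hedged sketch of a distribution-aware sampler over annotated clips.
# Field names and the Zipf exponent are illustrative choices, not the paper's.
import random
from collections import defaultdict

def zipfian_weights(n: int, s: float = 1.0):
    """Skewed marginal: weight of the rank-r action is proportional to 1/r**s."""
    return [1.0 / (rank ** s) for rank in range(1, n + 1)]

def sample_bursty_sequence(clips, seq_len: int = 8, burst_size: int = 3, s: float = 1.0):
    """Build one training sequence in which a single (verb, noun) action
    'bursts' (appears several times) while the remaining slots are drawn
    from a Zipf-skewed marginal over all actions."""
    by_action = defaultdict(list)
    for c in clips:
        by_action[(c["verb"], c["noun"])].append(c)

    # Rank actions by frequency so the skewed marginal favors common ones.
    actions = sorted(by_action, key=lambda a: -len(by_action[a]))
    weights = zipfian_weights(len(actions), s)

    # Bursty part: several clips sharing one action.
    burst_action = random.choices(actions, weights=weights, k=1)[0]
    seq = random.choices(by_action[burst_action], k=burst_size)

    # Background part: remaining slots follow the long-tailed marginal.
    for a in random.choices(actions, weights=weights, k=seq_len - burst_size):
        seq.append(random.choice(by_action[a]))

    random.shuffle(seq)  # interleave burst and background clips
    return seq

# Toy corpus; a real "narration" field would hold text like "#C C cuts the onion".
toy_clips = [{"verb": v, "noun": n, "narration": f"#C C {v}s the {n}"}
             for v in ("cut", "open", "pour", "whisk")
             for n in ("onion", "jar", "milk", "egg")]
print([(c["verb"], c["noun"]) for c in sample_bursty_sequence(toy_clips)])
```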

These sampling strategies aim to replicate the data distributional properties that prior work has linked to the emergence of in-context learning in text-only transformers.
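
The context-processing adaptation from item 1 above can be pictured with the following hedged PyTorch sketch, which is not the paper's actual implementation: an assumed perceiver/Q-Former-style stand-in module compresses each clip's frame features into a few LM-sized tokens, and these are interleaved with narration token embeddings to form one sequence that a frozen language model could consume through its input embeddings. All dimensions, module names, and the toy inputs are illustrative assumptions.

```python
# Hedged sketch: interleaving video and text features for a frozen LM.
# Shapes and module names are illustrative, not the paper's actual code.
import torch
import torch.nn as nn

D_LM = 512  # hidden size of the (frozen) language model, assumed

class VideoResampler(nn.Module):
    """Maps per-frame video features to a fixed number of LM-sized tokens
    (a stand-in for a Q-Former / perceiver-style module)."""
    def __init__(self, d_video: int = 768, n_query: int = 8):
        super().__init__()
        self.proj = nn.Linear(d_video, D_LM)
        self.queries = nn.Parameter(torch.randn(n_query, D_LM))
        self.attn = nn.MultiheadAttention(D_LM, num_heads=8, batch_first=True)

    def forward(self, video_feats: torch.Tensor) -> torch.Tensor:
        # video_feats: (num_frames, d_video) for a single clip
        kv = self.proj(video_feats).unsqueeze(0)   # (1, frames, D_LM)
        q = self.queries.unsqueeze(0)              # (1, n_query, D_LM)
        out, _ = self.attn(q, kv, kv)              # (1, n_query, D_LM)
        return out.squeeze(0)                      # (n_query, D_LM)

def build_interleaved_context(clips, texts, resampler, text_embedder):
    """Interleave [video tokens][narration tokens]... into one embedding
    sequence that a frozen LM could consume via its input embeddings."""
    assert len(clips) == len(texts)
    segments = []
    for clip_feats, text_ids in zip(clips, texts):
        segments.append(resampler(clip_feats))     # video "tokens"
        segments.append(text_embedder(text_ids))   # narration token embeddings
    return torch.cat(segments, dim=0)              # (total_tokens, D_LM)

# Toy usage with random tensors in place of real video features / token ids.
resampler = VideoResampler()
text_embedder = nn.Embedding(32000, D_LM)
clips = [torch.randn(16, 768) for _ in range(3)]            # 3 demo clips
texts = [torch.randint(0, 32000, (12,)) for _ in range(3)]  # 3 narrations
context = build_interleaved_context(clips, texts, resampler, text_embedder)
print(context.shape)  # torch.Size([60, 512]) = (3 * (8 + 12), D_LM)
```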

Evaluation and Results

The performance of the EILEV-trained models is evaluated against off-the-shelf VLMs such as Kosmos-2 and Otter, which are trained on substantially larger datasets. Notable outcomes of this evaluation include:

  • Superior In-Context Learning Performance: The EILEV-trained models consistently outperform larger VLMs in generating narrative descriptions for unseen egocentric video clips across shot levels from one to sixteen in-context examples (a toy version of this k-shot evaluation loop is sketched after this list).
  • Generalization to Out-of-Distribution and Novel Actions: The models not only synthesize action narratives for the original dataset but also generalize effectively to out-of-distribution data such as EPIC-KITCHENS-100, demonstrating their ability to leverage in-context examples to adapt to new, unseen tasks.
  • Minimal Dependence on In-Weights Learning: Although training focuses on common actions, the models' behavior as novel, rare actions are introduced indicates that they rely on contextual information rather than solely on knowledge encoded in their weights.
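
As a rough picture of how such shot-level comparisons can be run, here is a hedged sketch of a k-shot evaluation loop. `model_generate` stands in for a hypothetical wrapper around a trained VLM (not an actual EILEV API), and the token-level F1 used here is only a lightweight stand-in for narration metrics such as ROUGE or BERTScore.

```python
# Hedged sketch of a k-shot evaluation loop over narration examples.
# `model_generate` and the data fields ("clip", "narration") are assumptions.
import random

def token_f1(prediction: str, reference: str) -> float:
    """Simple token-overlap F1 as a stand-in for ROUGE-L / BERTScore."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
    if not common:
        return 0.0
    p, r = common / len(pred), common / len(ref)
    return 2 * p * r / (p + r)

def evaluate_k_shot(model_generate, support_set, eval_set, k: int) -> float:
    """Average score when each query clip is preceded by k in-context examples."""
    scores = []
    for query in eval_set:
        shots = random.sample(support_set, k)   # pick k demonstration clips
        prediction = model_generate(
            context_clips=[s["clip"] for s in shots],
            context_texts=[s["narration"] for s in shots],
            query_clip=query["clip"],
        )
        scores.append(token_f1(prediction, query["narration"]))
    return sum(scores) / len(scores)

# Usage (pseudo-data): sweep shot counts as in the 1- to 16-shot comparison.
# for k in (1, 2, 4, 8, 16):
#     print(k, evaluate_k_shot(generate_narration, support, held_out, k))
```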

Implications and Future Directions

The paper highlights the potential of deploying EILEV-trained VLMs in resource-constrained settings such as interactive task-guidance systems, where rapid adaptation and a modest demand for training data are advantageous. This has significant implications for embodied AI applications and real-time systems that require adaptability and contextual understanding.

Moreover, the results advocate for more nuanced approaches to integrating in-context learning in VLMs, moving beyond mere scale and data volume. Future research could delve further into optimizing these methods for broader applications or extending them to other modalities.

In conclusion, this paper contributes to the advancement of efficient learning paradigms within the field of vision-language integration, emphasizing strategic data and architecture configurations that may reshape the future landscape of adaptive AI systems.
