Distilling Vision-Language Models on Millions of Videos (2401.06129v2)

Published 11 Jan 2024 in cs.CV

Abstract: The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models, but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video model by video-instruction-tuning (VIIT) is then used to auto-label millions of videos to generate high-quality captions. We show the adapted video-language model performs well on a wide range of video-language benchmarks. For instance, it surpasses the best prior result on open-ended NExT-QA by 2.8%. Besides, our model generates detailed descriptions for previously unseen videos, which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also leverages vision-language models. Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%. As a side product, we generate the largest video caption dataset to date.

Overview of the Paper

This paper presents an approach for adapting image-based vision-language models (VLMs) to video. The researchers address the scarcity of human-labeled video data by generating high-quality pseudo-captions for millions of web-scraped videos. The approach first fine-tunes a VLM on video-captioning data, then uses the adapted model to auto-generate video descriptions, which in turn train a video-language dual-encoder model. The dual-encoder model achieves state-of-the-art performance on benchmarks such as MSR-VTT zero-shot text-to-video retrieval.

Methodology

The adaptation proceeds in two stages. First, the visual component of the VLM is fine-tuned on video captions, shifting its focus from static appearance to scene dynamics; the language model is kept frozen to avoid degradation from the simple, repetitive patterns in video text data. Second, the language model is tuned on instruction-following data (questions and answers prompted from another LLM) while the visual encoder remains frozen.
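
To make this two-stage schedule concrete, here is a minimal PyTorch sketch of alternating which component is trainable. ToyVLM, set_trainable, the layer sizes, and the learning rate are illustrative assumptions, not the authors' architecture or hyperparameters.

```python
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Hypothetical stand-in for an image-language model with a visual encoder and a language model."""
    def __init__(self, dim: int = 256, vocab_size: int = 1000):
        super().__init__()
        self.visual_encoder = nn.Sequential(nn.Linear(768, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.language_model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))

def set_trainable(module: nn.Module, trainable: bool) -> None:
    # Freeze or unfreeze every parameter of the given component.
    for p in module.parameters():
        p.requires_grad = trainable

vlm = ToyVLM()

# Stage 1: adapt the visual encoder on video-caption data; the language model stays frozen.
set_trainable(vlm.visual_encoder, True)
set_trainable(vlm.language_model, False)
stage1_opt = torch.optim.AdamW((p for p in vlm.parameters() if p.requires_grad), lr=1e-4)

# Stage 2: tune the language model on instruction-following data; the visual encoder stays frozen.
set_trainable(vlm.visual_encoder, False)
set_trainable(vlm.language_model, True)
stage2_opt = torch.optim.AdamW((p for p in vlm.parameters() if p.requires_grad), lr=1e-4)
```

Freezing one component per stage keeps the number of trainable parameters small and prevents the frozen component from drifting on the other stage's data distribution.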

The instruction-following data emphasize causal and temporal reasoning, which both enriches the model's inference capabilities and ensures diversity and detail in the generated video pseudo-captions.

Benefits of Pseudo-Captions

The pseudo-captioning process offers several advantages. The generated captions are grounded in the video content and capture the temporal dynamics that image-based captions miss. Moreover, the model can generate captions for many videos in parallel, making the annotation process scalable. The resulting detailed descriptions provide substantially better textual supervision than existing methods.
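
As a rough illustration of how such auto-labeling could be batched, the sketch below loops over batches of videos and decodes one caption per clip. The encode_video and generate methods and the prompt string are hypothetical placeholders for whatever interface the captioning model exposes; this is not the paper's pipeline.

```python
import torch

@torch.no_grad()
def pseudo_caption(model, tokenizer, video_batches,
                   prompt: str = "Describe the video in detail."):
    """Auto-label batches of videos with detailed pseudo-captions (illustrative interface)."""
    captions = []
    for frames in video_batches:                      # frames: (batch, time, channels, H, W)
        visual_tokens = model.encode_video(frames)    # hypothetical: map frames to visual tokens
        output_ids = model.generate(visual_tokens,    # hypothetical: caption conditioned on the prompt
                                    prompt=prompt,
                                    max_new_tokens=128)
        captions.extend(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
    return captions
```

Because each batch is independent, the same loop can be sharded across many accelerators, which is what makes annotating millions of videos tractable.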

Evaluating the Adapted Model

The adapted model's effectiveness was assessed on a range of video-language benchmarks, showing improvements across the board. When the pseudo-captions were used to pre-train a dual-encoder model, performance scaled with the amount of data. Under contrastive pre-training, models trained on pseudo-captions significantly outperformed those trained on the original video dataset captions for text-to-video retrieval and video classification.
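
For reference, dual-encoder contrastive pre-training of this kind typically uses a symmetric InfoNCE (CLIP-style) objective over a batch of (video, pseudo-caption) embedding pairs. The sketch below is a generic PyTorch version; the batch size, embedding dimension, and temperature are chosen for illustration rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive_loss(video_emb: torch.Tensor,
                               text_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """CLIP-style InfoNCE loss: matched (video, caption) pairs sit on the diagonal."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature            # (B, B) cosine similarities
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    loss_v2t = F.cross_entropy(logits, targets)                # video-to-text direction
    loss_t2v = F.cross_entropy(logits.t(), targets)            # text-to-video direction
    return 0.5 * (loss_v2t + loss_t2v)

# Random embeddings stand in for dual-encoder outputs on a batch of 8 pairs.
loss = symmetric_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```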

Summary and Impact

The technique developed in this paper for adapting VLMs to video yields substantial gains in video-language understanding, reflected in notable improvements on zero-shot video retrieval and classification tasks. Given the scarcity of curated video-text data, this advancement paves the way for more capable multimodal systems that can analyze and understand video content at scale.

Authors (12)
  1. Yue Zhao
  2. Long Zhao
  3. Xingyi Zhou
  4. Jialin Wu
  5. Chun-Te Chu
  6. Hui Miao
  7. Florian Schroff
  8. Hartwig Adam
  9. Ting Liu
  10. Boqing Gong
  11. Philipp Krähenbühl
  12. Liangzhe Yuan