InternVideo2: Scaling Foundation Models for Multimodal Video Understanding (2403.15377v4)

Published 22 Mar 2024 in cs.CV

Abstract: We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our core design is a progressive training approach that unifies the masked video modeling, crossmodal contrastive learning, and next token prediction, scaling up the video encoder size to 6B parameters. At the data level, we prioritize spatiotemporal consistency by semantically segmenting videos and generating video-audio-speech captions. This improves the alignment between video and text. Through extensive experiments, we validate our designs and demonstrate superior performance on over 60 video and audio tasks. Notably, our model outperforms others on various video-related dialogue and long video understanding benchmarks, highlighting its ability to reason and comprehend longer contexts. Code and models are available at https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2/.
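At the data level, the abstract highlights two steps: segmenting videos into semantically consistent clips and fusing video, audio, and speech descriptions into a single training caption per clip. The snippet below is a minimal, hedged sketch of the fusion step only; the `Clip` fields and the `fuse_clip_caption` helper are hypothetical stand-ins for exposition, not the paper's actual pipeline.

```python
# Toy illustration of fusing per-clip modality descriptions into one caption.
# All names and formats here are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Clip:
    start: float                 # clip start time in seconds
    end: float                   # clip end time in seconds
    video_caption: str           # description from a visual captioner
    audio_caption: str           # description from an audio captioner
    speech_transcript: str       # transcript from a speech recognizer


def fuse_clip_caption(clip: Clip) -> str:
    """Merge visual, audio, and speech descriptions into one video-text training caption."""
    parts = [clip.video_caption]
    if clip.audio_caption:
        parts.append(f"Sound: {clip.audio_caption}.")
    if clip.speech_transcript:
        parts.append(f'Speech: "{clip.speech_transcript}".')
    return " ".join(parts)


if __name__ == "__main__":
    clip = Clip(
        start=0.0,
        end=8.5,
        video_caption="A chef slices vegetables on a wooden cutting board.",
        audio_caption="rhythmic chopping sounds",
        speech_transcript="now we dice the onions finely",
    )
    print(fuse_clip_caption(clip))
```

In the actual pipeline, the per-modality descriptions would come from automatic captioning and speech-recognition models applied to each segmented clip.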

InternVideo2: A Comprehensive Video Foundation Model for Enhanced Multimodal Understanding

Introduction

Rapid advances in video understanding have enabled models that comprehend complex video content across multiple dimensions. The paper introduces InternVideo2, a state-of-the-art video foundation model (ViFM) designed for a broad range of video understanding tasks. The model employs a progressive training framework that integrates masked video token reconstruction, cross-modal contrastive learning, and next-token prediction to cultivate a deep understanding of video semantics. This combination enables InternVideo2 to perform strongly across a wide spectrum of video and audio tasks.

Methodology and Innovations

InternVideo2 distinguishes itself through a progressive learning scheme that successively strengthens its spatiotemporal perception, cross-modal semantic alignment, and world-modeling abilities.

  • Progressive Learning Scheme: At its core, InternVideo2's training is segmented into distinct stages, each focusing on a different aspect of video understanding. Initially, the model is trained to reconstruct masked video tokens, enhancing its spatiotemporal perception. Subsequently, the model is exposed to multimodal learning, incorporating audio and text for richer semantic understanding. Lastly, it undergoes next-token prediction training to polish its generative capabilities and dialogue understanding.
  • In-depth Spatiotemporal Understanding: By employing a vision transformer (ViT) backbone and different pretext tasks at each stage, InternVideo2 develops the robust spatiotemporal understanding that is crucial for processing video inputs effectively.
  • Cross-modal Contrastive Learning and Semantic Alignment: The inclusion of audio and text modalities in training not only improves the model's alignment between video and auxiliary data but also broadens its applicability across various tasks.

Taken together, these stages ensure that InternVideo2 learns not only from visual cues but also from audio and textual context, making it well suited to complex multimodal understanding tasks; a minimal sketch of the staged training objectives follows.
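As a concrete, hedged illustration of the three training stages described above, the sketch below pairs each stage with a toy objective: reconstruction of masked video tokens, a symmetric video-text contrastive loss, and next-token prediction. The module sizes, masking ratio, and loss forms are simplifying assumptions for exposition, not InternVideo2's actual implementation.

```python
# Hedged sketch of the three-stage progressive training schedule; shapes,
# loss weights, and module names are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyVideoEncoder(nn.Module):
    """Stand-in for the ViT-based video encoder (illustrative only)."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.embed = nn.Linear(dim, dim)
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, dim) video tokens
        return self.blocks(self.embed(x))


def stage1_masked_reconstruction(encoder, video_tokens, mask_ratio=0.8):
    """Stage 1: reconstruct features of masked video tokens (masked-modeling style)."""
    B, T, _ = video_tokens.shape
    mask = torch.rand(B, T) < mask_ratio               # True = token is masked out
    corrupted = video_tokens.masked_fill(mask.unsqueeze(-1), 0.0)
    pred = encoder(corrupted)
    target = video_tokens.detach()                      # a real setup would use teacher features
    return F.mse_loss(pred[mask], target[mask])


def stage2_contrastive_alignment(video_emb, text_emb, temperature=0.07):
    """Stage 2: symmetric cross-modal contrastive loss aligning video and text embeddings."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature
    labels = torch.arange(v.size(0))                    # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2


def stage3_next_token_prediction(lm_logits, target_ids):
    """Stage 3: standard next-token prediction loss for the language-model head."""
    return F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        target_ids[:, 1:].reshape(-1),
    )


if __name__ == "__main__":
    enc = TinyVideoEncoder()
    video = torch.randn(4, 16, 64)                      # (batch, tokens, dim) dummy video tokens
    print("stage 1 loss:", stage1_masked_reconstruction(enc, video).item())
    v_emb, t_emb = torch.randn(4, 64), torch.randn(4, 64)
    print("stage 2 loss:", stage2_contrastive_alignment(v_emb, t_emb).item())
    logits, ids = torch.randn(4, 8, 100), torch.randint(0, 100, (4, 8))
    print("stage 3 loss:", stage3_next_token_prediction(logits, ids).item())
```

In the actual model, these objectives are applied in successive stages to a much larger ViT video encoder (scaled up to 6B parameters) together with audio, text, and language-model components.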

Empirical Validation and Performance

Through extensive experiments, InternVideo2 demonstrates strong performance on over 60 video and audio tasks. Notably, it achieves state-of-the-art results in action recognition, video-text understanding, and video-centric dialogue. These outcomes indicate the model's ability to capture, analyze, and comprehend long temporal contexts and complex multimodal data.

  • Action Recognition: InternVideo2 sets new state-of-the-art results on action recognition benchmarks; its architecture and training methodology let it recognize and categorize actions more accurately than prior video foundation models.
  • Video-Text Understanding: In video-text tasks such as retrieval and captioning, the model's ability to semantically align and reason over visual and textual content yields contextually relevant outputs (a generic retrieval-scoring sketch follows this list).
  • Video-Centric Dialogue: The model performs strongly on video-centric dialogue benchmarks, supporting interactive systems that can hold meaningful exchanges grounded in video content.
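To make the video-text evaluation concrete, the following is a generic, hedged sketch of how contrastively aligned embeddings are typically scored for text-to-video retrieval with recall@k. The random stand-in embeddings and the `recall_at_k` helper are illustrative assumptions, not InternVideo2's released evaluation code.

```python
# Generic recall@k scoring for text-to-video retrieval over aligned embeddings.
import torch
import torch.nn.functional as F


def recall_at_k(video_emb: torch.Tensor, text_emb: torch.Tensor, k: int = 1) -> float:
    """Text-to-video recall@k: for each caption, is its paired video among the top-k matches?"""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sims = t @ v.T                                      # (num_texts, num_videos) cosine similarities
    topk = sims.topk(k, dim=-1).indices                 # indices of the k best-matching videos
    targets = torch.arange(t.size(0)).unsqueeze(-1)     # caption i is paired with video i
    return (topk == targets).any(dim=-1).float().mean().item()


if __name__ == "__main__":
    # Random stand-in embeddings; real use would take outputs of the video/text towers.
    videos = torch.randn(100, 512)
    texts = videos + 0.1 * torch.randn(100, 512)        # loosely "aligned" captions
    print(f"R@1: {recall_at_k(videos, texts, k=1):.2f}")
    print(f"R@5: {recall_at_k(videos, texts, k=5):.2f}")
```

With real embeddings from the model's video and text towers, the same similarity matrix would yield the retrieval metrics reported on standard video-text benchmarks.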

Implications and Future Work

The development of InternVideo2 marks a significant advance in video understanding, offering a versatile model that handles a wide array of multimodal tasks. Its success opens up applications ranging from improved content recommendation systems to more sophisticated interactive agents.

Looking forward, the potential for further refining InternVideo2's training process and extending its applications is vast. Future work could explore more intricate multimodal interactions or delve into unsolved challenges within video understanding, leveraging the strong foundation laid by InternVideo2.

Conclusion

InternVideo2 represents a pivotal advancement in video foundation models, characterized by its progressive learning scheme and robust multimodal understanding capabilities. Its exemplary performance across diverse tasks underscores its effectiveness as a comprehensive tool for video understanding, promising significant contributions to both theoretical research and practical applications in the AI domain.

Authors (20)
  1. Yi Wang
  2. Xinhao Li
  3. Jiashuo Yu
  4. Yinan He
  5. Guo Chen
  6. Baoqi Pei
  7. Rongkun Zheng
  8. Jilan Xu
  9. Zun Wang
  10. Yansong Shi
  11. Tianxiang Jiang
  12. Songze Li
  13. Hongjie Zhang
  14. Yifei Huang
  15. Yu Qiao
  16. Yali Wang
  17. Limin Wang
  18. KunChang Li
  19. Chenting Wang
  20. Ziang Yan