
Video Understanding with Large Language Models: A Survey (2312.17432v4)

Published 29 Dec 2023 in cs.CV and cs.CL

Abstract: With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of LLMs in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (general, temporal, and spatiotemporal) reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer x LLM, Video Embedder x LLM, and (Analyzer + Embedder) x LLM. Furthermore, we identify five sub-types based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. The survey also presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs, and it explores the expansive applications of Vid-LLMs across various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are encouraged to visit the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.

Introduction

LLMs have secured a prominent place in recent AI advances, and their convergence with video content has created a new interdisciplinary field that combines language and imagery for comprehensive video understanding. This comes at a pivotal time: online video has become the dominant form of media consumption, pushing traditional analysis technologies beyond their limits. The appeal of LLMs for video analysis (Video LLMs, or Vid-LLMs) lies in their ability to absorb spatial-temporal context and reason over it with broad knowledge, driving progress across video understanding tasks.

Foundations and Taxonomy

Vid-LLMs have emerged from the rich history of video understanding, transcending conventional methods and neural network models, exploiting self-supervised pretraining, and, most recently, integrating the broad contextual understanding offered by LLMs into video analysis. Vid-LLMs continue to improve rapidly and can be structurally categorized into three main types: Video Analyzer x LLM, Video Embedder x LLM, and (Analyzer + Embedder) x LLM, with five sub-types defined by the LLM's functional role (Summarizer, Manager, Text Decoder, Regressor, and Hidden Layer). One of these patterns is sketched below.
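
To make the Video Analyzer x LLM pattern concrete, here is a minimal Python sketch of the LLM-as-Summarizer role: an off-the-shelf analyzer produces per-frame captions, and an LLM fuses them into a video-level summary. The caption_frame and llm_summarize functions are hypothetical stand-ins, not components of any specific model surveyed.

```python
# Minimal sketch of "Video Analyzer x LLM" with the LLM acting as a Summarizer.
# caption_frame() and llm_summarize() are hypothetical stand-ins for a real
# image captioner and a real LLM API call.
from typing import List


def caption_frame(frame_id: int) -> str:
    """Stand-in for an off-the-shelf per-frame captioning model."""
    return f"frame {frame_id}: a person interacts with an object"


def llm_summarize(prompt: str) -> str:
    """Stand-in for an LLM call (e.g. a chat/completions request)."""
    return "A person performs a short sequence of actions with an object."


def summarize_video(num_frames: int, stride: int = 30) -> str:
    # 1) Video analyzer: turn sampled frames into textual evidence.
    captions: List[str] = [caption_frame(i) for i in range(0, num_frames, stride)]
    # 2) LLM as Summarizer: fuse per-frame text into one video-level description.
    prompt = "Summarize the video described by these frame captions:\n" + "\n".join(captions)
    return llm_summarize(prompt)


if __name__ == "__main__":
    print(summarize_video(num_frames=300))
```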

The Role of Language and Adapters in Video Understanding

Language, the bedrock of LLMs, plays a dual role: encoding and decoding. Adapters are pivotal in connecting the video modality to LLMs; their task is to translate inputs from other modalities into a common language (token) space. These adapters range from simple projection layers to more complex cross-attention mechanisms, making them crucial for coupling LLMs with video content efficiently. Both ends of that spectrum are sketched below.
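
As a rough sketch (with assumed feature dimensions and module names, not those of any particular Vid-LLM), the two designs look like this in PyTorch: a plain linear projection, and a query-based cross-attention adapter that compresses many visual tokens into a fixed number of LLM-ready tokens.

```python
# Illustrative adapter designs for mapping visual features into an LLM's
# embedding space. Dimensions (1024, 4096) and module names are assumptions.
import torch
import torch.nn as nn


class LinearProjectionAdapter(nn.Module):
    """Simplest design: one learned projection from visual to LLM embeddings."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_visual_tokens, vision_dim)
        return self.proj(visual_tokens)  # -> (batch, num_visual_tokens, llm_dim)


class CrossAttentionAdapter(nn.Module):
    """Query-based design: learned queries attend over all visual tokens,
    compressing them into a fixed number of tokens for the LLM."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096, num_queries: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim))
        self.attn = nn.MultiheadAttention(vision_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        q = self.queries.unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
        fused, _ = self.attn(q, visual_tokens, visual_tokens)  # cross-attention
        return self.proj(fused)  # -> (batch, num_queries, llm_dim)
```

For instance, CrossAttentionAdapter()(torch.randn(2, 256, 1024)) would return a (2, 32, 4096) tensor, i.e. 256 visual tokens compressed into 32 LLM-ready tokens per clip.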

Vid-LLMs: Models in Action

Recent implementations of Vid-LLMs showcase their utility in tasks such as video captioning, action recognition, and more. These models combine visual encoders with adapters, not only synthesizing detailed text descriptions but also answering intricate questions about video content. This marks a major shift from classical methods, which focused narrowly on categorizing video into predefined labels, toward versatile approaches capable of processing hundreds of frames for nuanced generation and contextual comprehension. A hypothetical inference pipeline is sketched below.
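
Under the assumption of HuggingFace-style tokenizer and llm objects and an adapter like the one sketched earlier, a video question answering forward pass might compose roughly as follows; the function, its arguments, and the shapes are hypothetical rather than the recipe of any specific model.

```python
# Hypothetical Vid-LLM inference flow: sample frames -> visual encoder ->
# adapter -> concatenate with question embeddings -> LLM generation.
# Assumes HuggingFace-style tokenizer/llm interfaces; shapes are illustrative.
import torch


def answer_video_question(frames, question, vision_encoder, adapter, llm, tokenizer):
    # frames: (num_frames, 3, H, W) tensor of sampled video frames
    with torch.no_grad():
        feats = vision_encoder(frames)                    # (num_frames, patches, vision_dim)
        feats = feats.flatten(0, 1).unsqueeze(0)          # (1, num_frames * patches, vision_dim)
        visual_tokens = adapter(feats)                    # (1, k, llm_dim)

        text_ids = tokenizer(question, return_tensors="pt").input_ids
        text_embeds = llm.get_input_embeddings()(text_ids)  # (1, seq_len, llm_dim)

        # Prepend visual tokens to the question embeddings and decode.
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        output_ids = llm.generate(inputs_embeds=inputs, max_new_tokens=64)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```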

Evaluating Performance and Applications

Several tasks form the crux of video understanding: recognition, captioning, grounding, retrieval, and question answering. A wide spectrum of datasets caters to these tasks, ranging from user-generated content to finely annotated movie descriptions. Evaluation metrics for Vid-LLMs are borrowed from both the computer vision and NLP domains, including accuracy, BLEU, METEOR, and others; a toy illustration follows.
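
As a toy illustration only (not the official protocol of any benchmark), the snippet below computes exact-match accuracy for video question answering and a simplified, BLEU-style clipped unigram precision for captioning; real evaluations rely on the full BLEU and METEOR implementations.

```python
# Toy evaluation sketch: exact-match accuracy for video QA and a clipped
# unigram precision in the spirit of BLEU-1. Not an official benchmark protocol.
from collections import Counter


def qa_accuracy(predictions, references):
    correct = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predictions, references))
    return correct / len(references)


def unigram_precision(candidate: str, reference: str) -> float:
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = Counter(cand) & Counter(ref)   # clipped counts, as in BLEU
    return sum(overlap.values()) / max(len(cand), 1)


if __name__ == "__main__":
    print(qa_accuracy(["a dog"], ["A dog"]))                               # 1.0
    print(unigram_precision("a dog runs in the park",
                            "a dog is running through the park"))         # ~0.67
```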

Future Trajectories and Current Limitations

Despite remarkable progress, challenges remain: fine-grained understanding, handling long video durations, and ensuring that model responses genuinely reflect video content rather than hallucination are pressing issues. Applications of advanced Vid-LLMs span domains from media and entertainment to healthcare and security, highlighting their transformative potential across industries. As research moves forward, mitigating hallucination and deepening multimodal integration are identified as fertile ground for expanding the capabilities and applications of Vid-LLMs.

In summary, Vid-LLMs stand on the cusp of revolutionizing video understanding, making large strides in task-solving capability to meet the deluge of video content in today's digital age. They hold the promise of transforming video analysis from a labor-intensive manual process into a sophisticated orchestration of artificial intelligence technologies.

Authors (20)
  1. Yunlong Tang (32 papers)
  2. Jing Bi (26 papers)
  3. Siting Xu (3 papers)
  4. Luchuan Song (21 papers)
  5. Susan Liang (24 papers)
  6. Teng Wang (92 papers)
  7. Daoan Zhang (24 papers)
  8. Jie An (36 papers)
  9. Jingyang Lin (16 papers)
  10. Rongyi Zhu (10 papers)
  11. Ali Vosoughi (18 papers)
  12. Chao Huang (244 papers)
  13. Zeliang Zhang (34 papers)
  14. Feng Zheng (117 papers)
  15. Jianguo Zhang (97 papers)
  16. Ping Luo (340 papers)
  17. Jiebo Luo (355 papers)
  18. Chenliang Xu (114 papers)
  19. Pinxin Liu (18 papers)
  20. Mingqian Feng (14 papers)