
LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs (2402.13546v2)

Published 21 Feb 2024 in cs.CL and cs.CV

Abstract: Long video understanding is a significant and ongoing challenge at the intersection of multimedia and artificial intelligence. Employing LLMs for video comprehension is an emerging and promising approach. However, it incurs high computational costs due to the extensive number of video tokens, suffers reduced visual clarity as a consequence of token aggregation, and faces interference from irrelevant visual tokens when answering video-related questions. To alleviate these issues, we present an Interactive Visual Adapter (IVA) within LLMs, designed to enhance interaction with fine-grained visual elements. Specifically, we first transform long videos into temporal video tokens by leveraging a visual encoder alongside a pretrained causal transformer, and then feed them into the LLM together with the video instructions. We then integrate the IVA, which contains a lightweight temporal frame selector and a spatial feature interactor, within the internal blocks of the LLM to capture instruction-aware and fine-grained visual signals. Consequently, the proposed video-LLM facilitates a comprehensive understanding of long video content through appropriate long video modeling and precise visual interactions. We conducted extensive experiments on nine video understanding benchmarks, and the results show that our interactive visual adapter significantly improves the performance of video LLMs on long video QA tasks. Ablation studies further verify the effectiveness of IVA in understanding both long and short videos.
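The abstract describes the IVA as a block inside the LLM that first selects instruction-relevant frames and then lets the LLM hidden states attend to them. The sketch below illustrates that two-stage idea in PyTorch; the module names, dimensions, top-k selection, and cross-attention design are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of an Interactive Visual Adapter (IVA)-style block, assuming PyTorch.
# All names, dimensions, and the top-k selection strategy are assumptions for illustration.
import torch
import torch.nn as nn


class TemporalFrameSelector(nn.Module):
    """Scores frames against a pooled instruction state and keeps the top-k (assumed design)."""

    def __init__(self, dim: int, k: int = 8):
        super().__init__()
        self.k = k
        self.score = nn.Linear(dim, 1)

    def forward(self, frame_feats: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, frames, dim); query: (batch, dim) pooled instruction state
        logits = self.score(frame_feats + query.unsqueeze(1)).squeeze(-1)   # (batch, frames)
        topk = logits.topk(min(self.k, frame_feats.size(1)), dim=1).indices
        idx = topk.unsqueeze(-1).expand(-1, -1, frame_feats.size(-1))
        return frame_feats.gather(1, idx)                                   # (batch, k, dim)


class SpatialFeatureInteractor(nn.Module):
    """Cross-attention from LLM hidden states to the selected frame features (assumed design)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hidden: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(hidden, visual, visual)
        return self.norm(hidden + attended)   # residual update of the LLM states


class InteractiveVisualAdapter(nn.Module):
    """IVA-style block inserted inside an LLM layer: select frames, then interact spatially."""

    def __init__(self, dim: int, k: int = 8):
        super().__init__()
        self.selector = TemporalFrameSelector(dim, k)
        self.interactor = SpatialFeatureInteractor(dim)

    def forward(self, hidden: torch.Tensor, frame_feats: torch.Tensor) -> torch.Tensor:
        query = hidden.mean(dim=1)                    # pooled, instruction-aware query
        selected = self.selector(frame_feats, query)  # instruction-relevant frames only
        return self.interactor(hidden, selected)      # fine-grained visual interaction


if __name__ == "__main__":
    iva = InteractiveVisualAdapter(dim=512, k=4)
    hidden = torch.randn(2, 32, 512)   # LLM hidden states: (batch, text tokens, dim)
    frames = torch.randn(2, 64, 512)   # temporal video tokens: (batch, frames, dim)
    print(iva(hidden, frames).shape)   # torch.Size([2, 32, 512])
```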

Authors (4)
  1. Yunxin Li (29 papers)
  2. Xinyu Chen (65 papers)
  3. Baotian Hu (1 paper)
  4. Min Zhang (630 papers)