LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (2311.17043v1)

Published 28 Nov 2023 in cs.CV and cs.CL

Abstract: In this work, we present a novel method to tackle the token generation challenge in Vision Language Models (VLMs) for video and image understanding, called LLaMA-VID. Current VLMs, while proficient in tasks like image captioning and visual question answering, face computational burdens when processing long videos due to the excessive number of visual tokens. LLaMA-VID addresses this issue by representing each frame with two distinct tokens, namely a context token and a content token. The context token encodes the overall image context based on user input, whereas the content token encapsulates visual cues in each frame. This dual-token strategy significantly reduces the overload of long videos while preserving critical information. Generally, LLaMA-VID empowers existing frameworks to support hour-long videos and pushes their upper limit with an extra context token. It is shown to surpass previous methods on most video- and image-based benchmarks. Code is available at https://github.com/dvlab-research/LLaMA-VID.

LLaMA-VID: An Image is Worth 2 Tokens in LLMs

The paper introduces LLaMA-VID, a method for optimizing token generation within Vision Language Models (VLMs) to enhance video and image comprehension. The approach targets a central bottleneck in current VLM architectures: the computational burden of processing the large number of visual tokens produced by long video sequences. By leveraging a dual-token strategy, LLaMA-VID condenses each video frame into just two tokens, substantially improving computational efficiency while preserving critical information.

Framework and Methodology

LLaMA-VID represents each image or video frame with two types of tokens: a context token and a content token. The context token encapsulates the overall context of the frame, guided by the user's instruction, whereas the content token retains detailed visual cues. Decoupling these roles allows the framework to compress per-frame information aggressively, which is what makes processing hour-long videos feasible.
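To see the scale of the saving, the back-of-the-envelope calculation below compares per-video token budgets. The 1 fps sampling rate and the 256-tokens-per-frame baseline are illustrative assumptions, not figures taken from the paper.

```python
# Back-of-the-envelope token budget for a one-hour video.
# Assumptions (illustrative, not the paper's exact settings):
#   - frames sampled at 1 fps
#   - a conventional VLM spends 256 visual tokens per frame
frames = 60 * 60 * 1                 # one hour sampled at 1 fps -> 3600 frames
baseline_tokens = frames * 256       # 921,600 tokens: far beyond typical context windows
llama_vid_tokens = frames * 2        # 7,200 tokens: one context + one content token per frame

print(f"baseline:  {baseline_tokens:,} visual tokens")
print(f"LLaMA-VID: {llama_vid_tokens:,} visual tokens")
```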

For token generation, LLaMA-VID couples a visual encoder with a text decoder, building on transformer-based components such as ViT and QFormer. The context token is derived through context attention, a mechanism that aggregates instruction-related visual features and condenses the frame's text-relevant information into a single token. The content token, in turn, is obtained by pooling the frame's visual embeddings; the pooling granularity can be kept fine for single images or reduced to one token per frame for long videos. Together, these steps preserve the most pertinent visual cues while drastically reducing the number of tokens required per frame of a prolonged video sequence.
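The sketch below illustrates this dual-token computation in PyTorch. It is a minimal reconstruction from the description above, not the released implementation: the `frame_to_two_tokens` helper, the tensor shapes, and the attention scaling factor are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def frame_to_two_tokens(visual_feats: torch.Tensor,
                        text_queries: torch.Tensor,
                        pool_size: int = 1) -> torch.Tensor:
    """Condense one frame into a context token plus a (pooled) content token.

    visual_feats: (N_v, C) patch features from the visual encoder (e.g. a ViT).
    text_queries: (N_t, C) instruction-conditioned query embeddings (e.g. from a
        QFormer-style text decoder); assumed to share the channel dimension C.
    """
    # Context attention: the text queries attend over the visual features, and the
    # attended outputs are averaged into a single text-guided context token.
    scale = visual_feats.shape[-1] ** 0.5
    attn = F.softmax(text_queries @ visual_feats.T / scale, dim=-1)   # (N_t, N_v)
    context_token = (attn @ visual_feats).mean(dim=0, keepdim=True)   # (1, C)

    # Content token: pool the raw visual features; for long videos the pool size
    # can be pushed down to a single token per frame.
    content_token = F.adaptive_avg_pool1d(
        visual_feats.T.unsqueeze(0), pool_size                        # (1, C, N_v) -> (1, C, pool)
    ).squeeze(0).T                                                    # (pool, C)

    return torch.cat([context_token, content_token], dim=0)           # (1 + pool, C)


# Example: 256 ViT patch features and 32 text queries, both 768-dimensional.
tokens = frame_to_two_tokens(torch.randn(256, 768), torch.randn(32, 768))
print(tokens.shape)  # torch.Size([2, 768]) -> two tokens per frame
```

Concatenating these per-frame outputs across all sampled frames yields the compact visual sequence that is fed to the LLM alongside the text tokens.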

Experimental Results

LLaMA-VID demonstrated its efficacy through extensive empirical evaluations, outperforming preceding methods across numerous video- and image-based benchmarks. In video-based zero-shot QA datasets, such as MSVD-QA and MSRVTT-QA, the proposed method achieved superior performance, showcasing its potential in handling video data with minimal tokens. Notably, this efficiency does not come at the cost of accuracy or visual comprehension, as evidenced by its leading scores in both video summarization and detailed reasoning tasks.

With image-based inputs, LLaMA-VID also shows promise by extending the upper limit of VLMs through the novel utilization of context tokens. The results indicate considerable improvements across a range of visual question answering and understanding benchmarks, highlighting the generality and robustness of the proposed approach.

Implications and Future Work

LLaMA-VID's ability to significantly compress video content into minimal tokens without sacrificing critical information has important implications for the practical deployment of VLMs in real-world applications, such as video analytics and multimedia content understanding. This advancement is crucial for scenarios requiring the efficient processing of extensive datasets, which are common in industrial settings.

Theoretically, LLaMA-VID contributes to the growing body of research on efficient data representation in large-scale AI systems. By demonstrating the feasibility of such a dual-token strategy, it opens avenues for exploring further token optimization techniques and their impact on other domains of AI.

Future developments may explore the dynamic adaptability of token compression levels, allowing models to adjust based on resource availability and task complexity. Additionally, the integration of more nuanced user instructions could further refine the context token's efficacy, enhancing its precision in applications where understanding context-specific cues is imperative.

In summary, LLaMA-VID presents a sophisticated approach to token generation in VLMs, providing meaningful advancements in both computational efficiency and comprehensive understanding of visual content. The strategic design and empirical validation position it as a significant contribution to the field of AI-driven video and image comprehension.

Authors: Yanwei Li, Chengyao Wang, Jiaya Jia