LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding (2407.15754v1)

Published 22 Jul 2024 in cs.CV, cs.CL, and cs.LG

Abstract: Large multimodal models (LMMs) are processing increasingly longer and richer inputs. Despite this progress, few public benchmarks are available to measure such development. To mitigate this gap, we introduce LongVideoBench, a question-answering benchmark that features video-language interleaved inputs up to an hour long. Our benchmark includes 3,763 varying-length web-collected videos with their subtitles across diverse themes, designed to comprehensively evaluate LMMs on long-term multimodal understanding. To achieve this, we interpret the primary challenge as accurately retrieving and reasoning over detailed multimodal information from long inputs. As such, we formulate a novel video question-answering task termed referring reasoning. Specifically, each question contains a referring query that references related video contexts, called the referred context. The model is then required to reason over relevant video details from the referred context. Following the paradigm of referring reasoning, we curate 6,678 human-annotated multiple-choice questions in 17 fine-grained categories, establishing one of the most comprehensive benchmarks for long-form video understanding. Evaluations suggest that LongVideoBench presents significant challenges even for the most advanced proprietary models (e.g. GPT-4o, Gemini-1.5-Pro, GPT-4-Turbo), while their open-source counterparts show an even larger performance gap. In addition, our results indicate that model performance on the benchmark improves only when models are capable of processing more frames, positioning LongVideoBench as a valuable benchmark for evaluating future-generation long-context LMMs.
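
The referring-reasoning task described above reduces to multiple-choice QA over a long interleaved video-subtitle input: each question embeds a referring query pointing at a referred context, the model selects one candidate option, and scoring is plain accuracy. Below is a minimal sketch of how such an item and its scoring might be represented; the record layout and names (`ReferringReasoningItem`, `answer_index`, etc.) are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ReferringReasoningItem:
    """One multiple-choice question in the referring-reasoning format.

    Field names are illustrative, not the benchmark's actual schema.
    """
    video_id: str          # identifier of the source video
    subtitles: List[str]   # subtitle lines interleaved with the video frames
    question: str          # includes the referring query pointing at the referred context
    options: List[str]     # candidate answers (multiple choice)
    answer_index: int      # index of the correct option


def accuracy(items: List[ReferringReasoningItem], predictions: List[int]) -> float:
    """Fraction of questions where the predicted option index matches the ground truth."""
    assert len(items) == len(predictions)
    correct = sum(1 for item, pred in zip(items, predictions) if pred == item.answer_index)
    return correct / len(items) if items else 0.0


if __name__ == "__main__":
    # Toy example with a single made-up item; real items reference hour-long videos.
    item = ReferringReasoningItem(
        video_id="example_000",
        subtitles=["[00:12:03] The presenter switches to the second slide."],
        question="When the presenter switches to the second slide, what color is their shirt?",
        options=["Red", "Blue", "Green", "White"],
        answer_index=1,
    )
    print(f"accuracy = {accuracy([item], [1]):.2f}")  # -> 1.00
```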

Authors (4)
  1. Haoning Wu (68 papers)
  2. Dongxu Li (40 papers)
  3. Bei Chen (56 papers)
  4. Junnan Li (56 papers)