LLMVA-GEBC: Large Language Model with Video Adapter for Generic Event Boundary Captioning (2306.10354v1)

Published 17 Jun 2023 in cs.CV and cs.CL

Abstract: This paper details our winning entry for the CVPR 2023 Generic Event Boundary Captioning (GEBC) competition. Unlike conventional video captioning, GEBC requires the captioning model to understand immediate changes in status around a designated video boundary, which makes it a difficult task. We propose an effective model, LLMVA-GEBC (Large Language Model with Video Adapter for Generic Event Boundary Captioning): (1) we utilize a pretrained LLM to generate high-quality, human-like captions; (2) to adapt the model to the GEBC task, we use a video Q-Former as an adapter and train it while keeping the visual feature extractors and the LLM frozen. Our method achieved a score of 76.14 on the test set and won first place in the challenge. Our code is available at https://github.com/zjr2000/LLMVA-GEBC .
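The abstract describes a two-part design: frozen visual feature extractors and a frozen pretrained LLM, with only a video Q-Former adapter trained to map frame features into the LLM's embedding space. The sketch below illustrates that training setup in PyTorch. It is a minimal illustration under stated assumptions, not the paper's implementation: the module names, dimensions, and the use of a generic transformer decoder in place of the actual BLIP-2-style Q-Former are all placeholders.

```python
import torch
import torch.nn as nn


class VideoQFormerAdapter(nn.Module):
    """Sketch of a Q-Former-style video adapter.

    Learnable query tokens cross-attend to frozen frame features, and the
    outputs are projected into the LLM's embedding space. Hyperparameters
    and module choices are illustrative assumptions, not the paper's exact
    configuration.
    """

    def __init__(self, vis_dim=1024, llm_dim=2560, num_queries=32,
                 num_layers=2, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(
            d_model=vis_dim, nhead=num_heads, batch_first=True)
        # Generic transformer decoder as a stand-in for a BLIP-2-style Q-Former.
        self.qformer = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(vis_dim, llm_dim)  # map to LLM embedding space

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, vis_dim) from a frozen visual encoder
        b = frame_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        q = self.qformer(tgt=q, memory=frame_feats)  # queries attend to frames
        return self.proj(q)                          # (batch, num_queries, llm_dim)


def freeze(module):
    # Freeze all parameters so only the adapter is trained.
    for p in module.parameters():
        p.requires_grad = False
    return module.eval()


# Usage: only the adapter's parameters receive gradients; the visual encoder
# and the LLM stay frozen. The linear layer below is a toy placeholder for a
# real frozen visual backbone (e.g., a ViT/CLIP-style encoder).
visual_encoder = freeze(nn.Linear(3 * 224 * 224, 1024))
adapter = VideoQFormerAdapter()

frames = torch.randn(2, 8, 3 * 224 * 224)   # 2 clips, 8 flattened frames each (toy input)
frame_feats = visual_encoder(frames)        # (2, 8, 1024)
video_tokens = adapter(frame_feats)         # (2, 32, 2560): soft visual tokens for the frozen LLM
```

In this setup, the projected query outputs would be prepended to the caption prompt as soft visual tokens for the frozen LLM, and only the adapter is optimized during training.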

