PG-Video-LLaVA: Pixel Grounding Large Video-Language Models (2311.13435v2)

Published 22 Nov 2023 in cs.CV and cs.AI

Abstract: Extending image-based Large Multimodal Models (LMMs) to videos is challenging due to the inherent complexity of video data. The recent approaches extending image-based LMMs to videos either lack the grounding capabilities (e.g., VideoChat, Video-ChatGPT, Video-LLaMA) or do not utilize the audio-signals for better video understanding (e.g., Video-ChatGPT). Addressing these gaps, we propose PG-Video-LLaVA, the first LMM with pixel-level grounding capability, integrating audio cues by transcribing them into text to enrich video-context understanding. Our framework uses an off-the-shelf tracker and a novel grounding module, enabling it to spatially localize objects in videos following user instructions. We evaluate PG-Video-LLaVA using video-based generative and question-answering benchmarks and introduce new benchmarks specifically designed to measure prompt-based object grounding performance in videos. Further, we propose the use of Vicuna over GPT-3.5, as utilized in Video-ChatGPT, for video-based conversation benchmarking, ensuring reproducibility of results which is a concern with the proprietary nature of GPT-3.5. Our framework builds on SoTA image-based LLaVA model and extends its advantages to the video domain, delivering promising gains on video-based conversation and grounding tasks. Project Page: https://github.com/mbzuai-oryx/Video-LLaVA

PG-Video-LLaVA: A Novel Approach to Video-LLMs with Pixel-Level Grounding

The paper "PG-Video-LLaVA: Pixel Grounding Large Video-Language Models" introduces PG-Video-LLaVA, a model that extends image-based Large Multimodal Models (LMMs) to video data. The work focuses on adding pixel-level grounding to video LLMs while incorporating audio cues to improve overall video comprehension. Unlike existing models such as VideoChat, Video-LLaMA, and Video-ChatGPT, which either lack grounding capabilities or do not fully exploit audio signals, PG-Video-LLaVA can spatially localize objects in videos according to user instructions.

The proposed model builds on the LLaVA-1.5 framework, complemented by spatio-temporal encodings derived from a CLIP-based visual encoder adapted for video interpretation. Frame-level features are averaged over the temporal and spatial dimensions and aligned with an LLM through a learnable Multi-Layer Perceptron (MLP) projector. These enhancements give the model notable gains on video-based conversation and grounding tasks.
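To make the pooling-and-projection step concrete, below is a minimal PyTorch sketch of spatio-temporal pooling over CLIP frame features followed by an MLP projector. The tensor shapes, hidden sizes, and module names are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of spatio-temporal pooling of CLIP frame features followed by
# a learnable MLP projector into the LLM embedding space.
# Dimensions and hidden sizes are illustrative assumptions, not the paper's values.
import torch
import torch.nn as nn

class SpatioTemporalProjector(nn.Module):
    def __init__(self, clip_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Two-layer MLP mapping pooled visual features to LLM token embeddings.
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (T frames, N patches, D) CLIP features for one video clip.
        temporal_feats = frame_feats.mean(dim=0)   # average over time    -> (N, D)
        spatial_feats = frame_feats.mean(dim=1)    # average over patches -> (T, D)
        video_feats = torch.cat([temporal_feats, spatial_feats], dim=0)  # (N + T, D)
        return self.mlp(video_feats)               # (N + T, llm_dim) visual tokens

# Example: 8 frames, 256 patches, 1024-dim CLIP features.
tokens = SpatioTemporalProjector()(torch.randn(8, 256, 1024))
print(tokens.shape)  # torch.Size([264, 4096])
```

The resulting visual tokens can then be interleaved with text tokens in the LLM input, which is the standard LLaVA-style alignment strategy.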

An important aspect of PG-Video-LLaVA is its modular design, which allows the grounding components to be swapped out as visual grounding technologies improve. Audio context is incorporated by transcribing the audio track into text with a Whisper-based pipeline. This multimodal approach lets the model leverage auditory information, which is particularly useful for speech-rich content such as dialogues or video conferences.
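As a rough illustration of how audio is folded into the textual context, the snippet below transcribes an audio track with the open-source whisper package and prepends the transcript to the user query. The paper's pipeline (built on Whisper-family models, with additional filtering of irrelevant segments) is more involved, so treat this as a simplified sketch; the function name and prompt wording are hypothetical.

```python
# Simplified sketch: transcribe the video's audio and prepend it to the prompt.
# Uses the open-source `whisper` package as a stand-in for the paper's pipeline.
import whisper

def build_prompt(video_audio_path: str, user_query: str) -> str:
    model = whisper.load_model("base")  # small model chosen for illustration
    transcript = model.transcribe(video_audio_path)["text"].strip()
    # The transcript is simply added as extra textual context for the LLM.
    return (
        f"Audio transcript: {transcript}\n"
        f"Question: {user_query}"
    )

# Hypothetical usage:
# prompt = build_prompt("clip_audio.wav", "What is the person talking about?")
```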

In terms of evaluation, the authors adopt a comprehensive benchmarking approach covering video-based generative and question-answering performance. Notably, they introduce new benchmarks specifically targeting prompt-based object grounding in videos. For conversation benchmarking, they adopt the open-source Vicuna model in place of proprietary GPT-3.5 as the evaluation judge, ensuring reproducibility and transparency of reported scores.
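A hedged sketch of this open-source evaluation setup is shown below: a locally hosted Vicuna model is prompted to judge a predicted answer against the reference. The model identifier, prompt wording, and score parsing are assumptions made for illustration, not the authors' exact evaluation script.

```python
# Sketch of open-source benchmarking: a locally hosted Vicuna model acts as the
# judge instead of GPT-3.5. Model id, prompt, and parsing are illustrative only.
from transformers import pipeline

judge = pipeline("text-generation", model="lmsys/vicuna-13b-v1.5", device_map="auto")

def score_answer(question: str, reference: str, prediction: str) -> str:
    prompt = (
        "You are evaluating a video question-answering model.\n"
        f"Question: {question}\n"
        f"Correct answer: {reference}\n"
        f"Predicted answer: {prediction}\n"
        "Rate correctness on a scale of 0-5 and reply with the number only."
    )
    out = judge(prompt, max_new_tokens=8, do_sample=False)[0]["generated_text"]
    return out[len(prompt):].strip()  # the judge's reply after the prompt
```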

The experimental results highlight PG-Video-LLaVA's strong performance in both qualitative and quantitative assessments. On video conversation benchmarks, the model consistently outperforms earlier iterations, particularly excelling in contextual and temporal understanding, as demonstrated by improved scores in metrics like correctness, detail orientation, and consistency.

In addition to conversation capabilities, PG-Video-LLaVA demonstrates robust performance in spatial grounding tasks. Benchmarked on the VidSTG and HC-STVG datasets, the model aligns video regions closely with the semantic content of its responses, enabling more accurate and contextually relevant outputs. It also holds up in zero-shot evaluation on established video question-answering datasets such as MSRVTT-QA and MSVD-QA.
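For context, spatial grounding on such benchmarks is typically scored by the overlap between predicted and annotated boxes. The sketch below computes per-frame IoU and averages it over a clip; the (x1, y1, x2, y2) box format and function names are assumptions for illustration rather than the benchmarks' official tooling.

```python
# Minimal sketch of a spatial-grounding metric for VidSTG/HC-STVG-style data:
# per-frame IoU between predicted and ground-truth boxes, averaged over the clip.
# Box format (x1, y1, x2, y2) is an assumption for illustration.
from typing import List, Tuple

Box = Tuple[float, float, float, float]

def iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def mean_spatial_iou(preds: List[Box], gts: List[Box]) -> float:
    # Assumes one predicted and one ground-truth box per annotated frame.
    return sum(iou(p, g) for p, g in zip(preds, gts)) / max(len(gts), 1)
```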

The implications of PG-Video-LLaVA are multifaceted. Practically, the incorporation of pixel-level grounding and audio inputs can greatly enhance interactive and intelligent video applications, from advanced virtual assistants to automated video editing tools. Theoretically, this advancement sets a precedent for future developments in video-LLMs, encouraging the exploration of even richer multimodal datasets and grounding techniques.

Overall, PG-Video-LLaVA represents a sophisticated advancement in the field of LMMs for videos, striking a balance between innovative feature integration and practical applicability. Its techniques and results not only push the envelope in video language processing but also pave the way for subsequent research to build on a foundation that emphasizes comprehensive, multimodal understanding of complex video data.

References (50)
  1. Pyscenedetect. https://github.com/Breakthrough/PySceneDetect, 2023.
  2. Whisperx: Time-accurate speech transcription of long-form audio. arXiv preprint arXiv:2303.00747, 2023.
  3. Bridging the gap between object and image-level representations for open-vocabulary detection. In NeurIPS, 2022.
  4. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  5. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
  6. Tracking anything with decoupled video segmentation. In ICCV, 2023.
  7. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. https://vicuna.lmsys.org, 2023.
  8. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
  9. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv:2304.15010, 2023.
  10. Whisper-at: Noise-robust automatic speech recognizers are also strong audio event taggers. In Proc. Interspeech 2023, 2023.
  11. Activitynet: A large-scale video benchmark for human activity understanding. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 961–970, 2015.
  12. Segment anything. arXiv:2304.02643, 2023.
  13. Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692, 2023.
  14. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023a.
  15. Videochat: Chat-centric video understanding. arXiv:2305.06355, 2023b.
  16. TGIF: A New Dataset and Benchmark on Animated GIF Description. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  17. Open-vocabulary semantic segmentation with mask-adapted clip. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7061–7070, 2023.
  18. Visual instruction tuning. ArXiv, abs/2304.08485, 2023a.
  19. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023b.
  20. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv:2303.05499, 2023c.
  21. Internchat: Solving vision-centric tasks by interacting with chatbots beyond language. arXiv preprint arXiv:2305.05662, 2023d.
  22. Video-chatgpt: Towards detailed video understanding via large vision and language models. arXiv:2306.05424, 2023.
  23. Expanding language-image pretrained models for general video recognition. In European Conference on Computer Vision, pages 1–18. Springer, 2022.
  24. OpenAI. Whisper. https://openai.com/research/whisper, 2022.
  25. OpenAI. Gpt-4v(ision) system card. https://openai.com/research/gpt-4v-system-card, 2023a.
  26. OpenAI. Chatgpt: Large language model for human-style conversation. https://chat.openai.com, 2023b.
  27. OpenLMLab. MOSS: Codebase for MOSS Project. https://github.com/OpenLMLab/MOSS, 2023.
  28. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  29. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
  30. Learning transferable visual models from natural language supervision. In ICML, 2021.
  31. Fine-tuned clip models are efficient video learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6545–6554, 2023a.
  32. Glamm: Pixel grounding large multimodal model. arXiv preprint arXiv:2311.03356, 2023b.
  33. Language-grounded indoor 3d semantic segmentation in the wild. In European Conference on Computer Vision, pages 125–141. Springer, 2022.
  34. Human-centric spatio-temporal video grounding with visual transformers. IEEE Transactions on Circuits and Systems for Video Technology, 2021.
  35. Stanford alpaca: An instruction-following llama model, 2023.
  36. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  37. Actionclip: A new paradigm for video action recognition. arXiv preprint arXiv:2109.08472, 2021.
  38. The all-seeing project: Towards panoptic visual recognition and understanding of the open world. arXiv preprint arXiv:2308.01907, 2023.
  39. Video question answering via gradually refined attention over appearance and motion. In ACM Multimedia, 2017.
  40. Msr-vtt: A large video description dataset for bridging video and language. 2016.
  41. Zero-shot video question answering via frozen bidirectional language models. In NeurIPS, 2022.
  42. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
  43. Ferret: Refer and ground anything anywhere at any granularity. arXiv preprint arXiv:2310.07704, 2023.
  44. Activitynet-qa: A dataset for understanding complex web videos via question answering. In AAAI, pages 9127–9134, 2019.
  45. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv:2306.02858, 2023a.
  46. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
  47. Recognize anything: A strong image tagging model. arXiv preprint arXiv:2306.03514, 2023b.
  48. Where does it exist: Spatio-temporal video grounding for multi-form sentences. In CVPR, 2020.
  49. Bubogpt: Enabling visual grounding in multi-modal llms. arXiv preprint arXiv:2307.08581, 2023.
  50. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
Authors (7)
  1. Shehan Munasinghe
  2. Rusiru Thushara
  3. Muhammad Maaz
  4. Hanoona Abdul Rasheed
  5. Salman Khan
  6. Mubarak Shah
  7. Fahad Khan