PG-Video-LLaVA: Pixel Grounding Large Video-Language Models
Abstract: Extending image-based Large Multimodal Models (LMMs) to videos is challenging due to the inherent complexity of video data. Recent approaches that extend image-based LMMs to videos either lack grounding capabilities (e.g., VideoChat, Video-ChatGPT, Video-LLaMA) or do not utilize audio signals for better video understanding (e.g., Video-ChatGPT). Addressing these gaps, we propose PG-Video-LLaVA, the first LMM with pixel-level grounding capability, which integrates audio cues by transcribing them into text to enrich video-context understanding. Our framework uses an off-the-shelf tracker and a novel grounding module, enabling it to spatially localize objects in videos following user instructions. We evaluate PG-Video-LLaVA on video-based generative and question-answering benchmarks and introduce new benchmarks specifically designed to measure prompt-based object grounding performance in videos. Further, we propose the use of Vicuna over GPT-3.5, as utilized in Video-ChatGPT, for video-based conversation benchmarking, ensuring reproducibility of results, which is a concern given the proprietary nature of GPT-3.5. Our framework builds on the SoTA image-based LLaVA model and extends its advantages to the video domain, delivering promising gains on video-based conversation and grounding tasks. Project Page: https://github.com/mbzuai-oryx/Video-LLaVA
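To make the pipeline described above concrete, the sketch below outlines one plausible way the pieces fit together in Python: frames and an audio transcript (obtained here with OpenAI's Whisper) are passed to the video-LMM, and the object phrases mentioned in its answer are grounded frame by frame with an open-vocabulary detector and propagated across the clip by a tracker. This is a minimal illustrative sketch, not the released implementation; `video_llm`, `extract_phrases`, `detect_phrase`, and `track` are hypothetical callables to be supplied by the user (e.g., backed by an open-vocabulary detector and an off-the-shelf video tracker), while the OpenCV and Whisper calls are standard APIs.

```python
# Illustrative sketch (not the authors' released code) of the pipeline the
# abstract describes: sample frames, transcribe audio to text, query a
# video-LMM with both, then spatially ground the objects mentioned in the
# answer using an open-vocabulary detector plus an off-the-shelf tracker.
# `video_llm`, `extract_phrases`, `detect_phrase`, and `track` are
# hypothetical placeholders to be supplied by the caller.

from typing import Callable, Dict, List

import cv2       # frame extraction (OpenCV)
import whisper   # audio transcription (openai-whisper)


def sample_frames(video_path: str, num_frames: int = 8) -> List:
    """Uniformly sample up to `num_frames` frames from the video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(total // num_frames, 1)
    frames = []
    for idx in range(0, total, step):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames[:num_frames]


def transcribe_audio(video_path: str) -> str:
    """Transcribe the audio track; the text serves as extra video context."""
    model = whisper.load_model("base")
    return model.transcribe(video_path)["text"]


def answer_and_ground(
    video_path: str,
    question: str,
    video_llm: Callable[[List, str, str], str],      # (frames, transcript, question) -> answer
    extract_phrases: Callable[[str], List[str]],     # answer -> referred object phrases
    detect_phrase: Callable[[object, str], object],  # (frame, phrase) -> bounding box
    track: Callable[[str, List], object],            # (video_path, boxes) -> per-frame boxes/masks
) -> Dict:
    """Answer a question about the video and localize the objects it mentions."""
    frames = sample_frames(video_path)
    transcript = transcribe_audio(video_path)
    answer = video_llm(frames, transcript, question)

    grounding = {}
    for phrase in extract_phrases(answer):
        boxes = [detect_phrase(f, phrase) for f in frames]
        grounding[phrase] = track(video_path, boxes)
    return {"answer": answer, "grounding": grounding}
```

Passing the model, detector, and tracker in as callables keeps the sketch library-agnostic: any open-vocabulary detector and tracker combination can be slotted in without changing the orchestration logic.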
- PySceneDetect. https://github.com/Breakthrough/PySceneDetect, 2023.
- WhisperX: Time-accurate speech transcription of long-form audio. arXiv preprint arXiv:2303.00747, 2023.
- Bridging the gap between object and image-level representations for open-vocabulary detection. In NeurIPS, 2022.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Shikra: Unleashing multimodal LLM's referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
- Tracking anything with decoupled video segmentation. In ICCV, 2023.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
- InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
- LLaMA-Adapter V2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023.
- Whisper-AT: Noise-robust automatic speech recognizers are also strong audio event taggers. In Proc. Interspeech, 2023.
- ActivityNet: A large-scale video benchmark for human activity understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 961–970, 2015.
- Segment Anything. arXiv preprint arXiv:2304.02643, 2023.
- LISA: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692, 2023.
- Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023a.
- VideoChat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023b.
- TGIF: A New Dataset and Benchmark on Animated GIF Description. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- Open-vocabulary semantic segmentation with mask-adapted CLIP. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7061–7070, 2023.
- Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
- Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023c.
- InternChat: Solving vision-centric tasks by interacting with chatbots beyond language. arXiv preprint arXiv:2305.05662, 2023d.
- Video-ChatGPT: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.
- Expanding language-image pretrained models for general video recognition. In European Conference on Computer Vision, pages 1–18. Springer, 2022.
- OpenAI. Whisper. https://openai.com/research/whisper, 2022.
- OpenAI. GPT-4V(ision) system card. https://openai.com/research/gpt-4v-system-card, 2023a.
- OpenAI. ChatGPT: Large language model for human-style conversation. https://chat.openai.com, 2023b.
- OpenLMLab. MOSS: Codebase for MOSS Project. https://github.com/OpenLMLab/MOSS, 2023.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
- Learning transferable visual models from natural language supervision. In ICML, 2021.
- Fine-tuned CLIP models are efficient video learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6545–6554, 2023a.
- GLaMM: Pixel grounding large multimodal model. arXiv preprint arXiv:2311.03356, 2023b.
- Language-grounded indoor 3d semantic segmentation in the wild. In European Conference on Computer Vision, pages 125–141. Springer, 2022.
- Human-centric spatio-temporal video grounding with visual transformers. IEEE Transactions on Circuits and Systems for Video Technology, 2021.
- Stanford Alpaca: An instruction-following LLaMA model, 2023.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- ActionCLIP: A new paradigm for video action recognition. arXiv preprint arXiv:2109.08472, 2021.
- The all-seeing project: Towards panoptic visual recognition and understanding of the open world. arXiv preprint arXiv:2308.01907, 2023.
- Video question answering via gradually refined attention over appearance and motion. In ACM Multimedia, 2017.
- MSR-VTT: A large video description dataset for bridging video and language. In CVPR, 2016.
- Zero-shot video question answering via frozen bidirectional language models. In NeurIPS, 2022.
- mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
- Ferret: Refer and ground anything anywhere at any granularity. arXiv preprint arXiv:2310.07704, 2023.
- ActivityNet-QA: A dataset for understanding complex web videos via question answering. In AAAI, pages 9127–9134, 2019.
- Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023a.
- OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
- Recognize anything: A strong image tagging model. arXiv preprint arXiv:2306.03514, 2023b.
- Where does it exist: Spatio-temporal video grounding for multi-form sentences. In CVPR, 2020.
- BuboGPT: Enabling visual grounding in multi-modal LLMs. arXiv preprint arXiv:2307.08581, 2023.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.