PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning (2404.16994v2)

Published 25 Apr 2024 in cs.CV

Abstract: Vision-language pre-training has significantly elevated performance across a wide range of image-language applications. Yet, the pre-training process for video-related tasks demands exceptionally large computational and data resources, which hinders the progress of video-LLMs. This paper investigates a straight-forward, highly efficient, and resource-light approach to adapting an existing image-language pre-trained model for dense video understanding. Our preliminary experiments reveal that directly fine-tuning pre-trained image-LLMs with multiple frames as inputs on video datasets leads to performance saturation or even a drop. Our further investigation reveals that it is largely attributed to the bias of learned high-norm visual features. Motivated by this finding, we propose a simple but effective pooling strategy to smooth the feature distribution along the temporal dimension and thus reduce the dominant impacts from the extreme features. The new model is termed Pooling LLaVA, or PLLaVA in short. PLLaVA achieves new state-of-the-art performance on modern benchmark datasets for both video question-answer and captioning tasks. Notably, on the recent popular VideoChatGPT benchmark, PLLaVA achieves a score of 3.48 out of 5 on average of five evaluated dimensions, exceeding the previous SOTA results from GPT4V (IG-VLM) by 9%. On the latest multi-choice benchmark MVBench, PLLaVA achieves 58.1% accuracy on average across 20 sub-tasks, 14.5% higher than GPT4V (IG-VLM). Code is available at https://pllava.github.io/

Enhancing Dense Video Understanding with a Pooling Strategy in LLMs

Introduction and Motivation

Adapting image-based multimodal LLMs (image-LLMs) to the video domain presents unique challenges, primarily due to the inherent complexity and resource demands of video data. Conventional approaches often struggle with computational efficiency and require extensive data annotation. This paper introduces Pooling LLaVA (PLLaVA), which leverages a simple pooling strategy to adapt pre-trained image-LLMs for dense video understanding. The proposed method overcomes the limitations of directly fine-tuning on multi-frame inputs and sets new performance benchmarks on video question-answering and captioning tasks.

Key Findings and Methodology

PLLaVA introduces a simple yet effective pooling operation over the temporal dimension of video features, addressing the bias toward high-norm visual features that hampers model performance. This pooling strategy preserves the richness of frame-level information while significantly reducing computational overhead.

Technical Challenges Identified

  • Direct application of image MLLMs to video tasks using multiple frames as inputs often leads to performance saturation or decline.
  • Fine-tuning with multiple video frames frequently biases the model towards dominant high-norm visual features, yielding shorter and less descriptive captions (a minimal norm-inspection sketch follows this list).
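
To make the high-norm bias concrete, the following is a minimal PyTorch sketch (an illustration, not the paper's code) of how one might inspect the distribution of per-token feature norms across frames; the tensor shapes and the random features standing in for real encoder outputs are assumptions.

```python
import torch

# Minimal sketch (illustrative assumptions, not the paper's code): given
# per-frame visual features of shape (T, N, D) -- T frames, N patch tokens,
# D channels -- summarize the L2-norm distribution to see whether a handful
# of high-norm tokens dominate.
def token_norm_stats(frame_features: torch.Tensor) -> dict:
    norms = frame_features.norm(dim=-1)            # (T, N) per-token L2 norms
    return {
        "mean": norms.mean().item(),
        "p99": norms.flatten().quantile(0.99).item(),
        "max": norms.max().item(),
    }

# Random features stand in for encoder outputs: 16 frames, 24x24 patches, D=1024.
feats = torch.randn(16, 576, 1024)
print(token_norm_stats(feats))
```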

Pooling Strategy

  • The pooling approach is designed to smooth the feature distribution along the temporal dimension, minimizing the influence of extreme features and enhancing the overall video description capability of the model.
  • PLLaVA utilizes an adaptive pooling module that efficiently condenses the video features without sacrificing critical spatial or temporal information, facilitating a more robust understanding of video content (a minimal sketch of this pooling follows the list).
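
The following is a minimal PyTorch sketch of the adaptive spatio-temporal pooling idea; the module name, target pooled grid sizes, and tensor shapes are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TemporalSpatialPool(nn.Module):
    """Hypothetical pooling module illustrating the idea: adaptive average
    pooling over (time, height, width) smooths the feature distribution and
    caps the number of visual tokens handed to the LLM."""

    def __init__(self, out_t: int = 16, out_h: int = 12, out_w: int = 12):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d((out_t, out_h, out_w))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, H, W, D) patch features from a frozen image encoder.
        x = x.permute(0, 4, 1, 2, 3)       # (B, D, T, H, W) for 3D pooling
        x = self.pool(x)                   # (B, D, out_t, out_h, out_w)
        x = x.permute(0, 2, 3, 4, 1)       # (B, out_t, out_h, out_w, D)
        return x.flatten(1, 3)             # (B, out_t*out_h*out_w, D) tokens

# Usage: 16 frames of 24x24 patches with D=1024 -> 16*12*12 = 2304 tokens.
feats = torch.randn(1, 16, 24, 24, 1024)
print(TemporalSpatialPool()(feats).shape)  # torch.Size([1, 2304, 1024])
```

Because the pooled grid is fixed regardless of the number of input frames, the visual token count passed to the LLM stays bounded, which is what keeps the adaptation parameter-free and computationally light.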

Experimental Validation

PLLaVA's effectiveness is demonstrated through extensive experiments on multiple standard video understanding benchmarks. Notably, it surpasses previous state-of-the-art models on the VideoChatGPT benchmark by a significant margin, achieving superior performance in detailed video captioning and question-answering tasks.

Key Results

  • On the VideoChatGPT benchmark, PLLaVA achieved a score of 3.48 out of 5 averaged over the five evaluated dimensions, exceeding the previous SOTA from GPT4V (IG-VLM) by 9%.
  • On the MVBench multi-choice question-answering benchmark, PLLaVA achieved an average accuracy of 58.1% across 20 sub-tasks, 14.5% higher than the nearest competitor, GPT4V (IG-VLM).

Implications and Future Work

The success of PLLaVA on dense video understanding tasks indicates a promising direction for further exploration of video-LLM training. The pooling strategy effectively addresses the challenge of feature dominance and opens new avenues for efficient video data processing within the constraints of current computational resources. Future studies may explore the adaptability of the pooling approach to different types of video content and its integration with other multimodal training frameworks.

Conclusion

PLLaVA represents a significant step forward in adapting image-LLMs to video understanding tasks. By introducing an efficient pooling strategy, the model not only sets new state-of-the-art results in video question-answering and captioning but also enhances the ability to handle dense and complex video data. This work provides a solid foundation for future advances in video-language modeling, promoting deeper and more efficient multimodal interactions.

Authors (6)
  1. Lin Xu (46 papers)
  2. Yilin Zhao (17 papers)
  3. Daquan Zhou (47 papers)
  4. Zhijie Lin (30 papers)
  5. See Kiong Ng (10 papers)
  6. Jiashi Feng (295 papers)