- The paper introduces VISTA, a framework using seven novel augmentation techniques to synthesize long-duration and high-resolution video-QA training data for large multimodal models.
- Fine-tuning video LMMs on the VISTA-400K dataset yields an average 3.3% improvement across four long-video benchmarks and a 6.5% improvement on HRVideoBench, a new benchmark for high-resolution video understanding.
- VISTA provides a scalable method for creating high-quality training data, addressing the data scarcity that limits video LMMs in handling detailed and extended video content.
The paper "VISTA: Enhancing Long-Duration and High-Resolution Video Understanding by Video SpatioTemporal Augmentation" tackles the limitations predominant in large multimodal models (LMMs) concerning their processing of long-duration and high-resolution videos. These limitations largely stem from the unavailability of quality datasets necessary for such domains. To mitigate this, the paper introduces a specialized framework known as VISTA (VIdeo SpatioTemporal Augmentation) that synthesizes instructional video pairs with augmented durations and resolutions by leveraging existing video-caption datasets.
The VISTA framework employs seven novel video augmentation techniques that combine video clips spatially and temporally, yielding synthetic videos with longer durations and higher resolutions. From each synthesized video, VISTA then generates question-answer (QA) pairs to form enriched training data. As a concrete application, the authors curate VISTA-400K, a dataset of such video instruction pairs. Fine-tuning multiple video LMMs on VISTA-400K yields an average gain of 3.3% across four long-video comprehension benchmarks. The authors also develop HRVideoBench, a new benchmark for high-resolution video understanding, on which VISTA-tuned models improve by 6.5%.
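This summary does not reproduce the seven augmentation recipes, but the core operation of combining clips along time and space can be sketched minimally. The function names, grid layout, and array shapes below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

# Minimal sketch of VISTA-style spatiotemporal combination, assuming
# clips are uint8 arrays of shape (T, H, W, C). The paper's seven
# concrete augmentation recipes are not reproduced here.

def temporal_concat(clips):
    """Stitch clips end-to-end to synthesize a longer video; QA pairs
    built from the source captions then span the extended timeline."""
    return np.concatenate(clips, axis=0)  # shape (sum of T, H, W, C)

def spatial_grid(clips, rows=2, cols=2):
    """Tile clips into a rows x cols grid to synthesize a
    higher-resolution video; questions about a single cell then probe
    fine-grained, region-level understanding."""
    t = min(c.shape[0] for c in clips)  # truncate to a common duration
    clips = [c[:t] for c in clips]
    grid_rows = [np.concatenate(clips[r * cols:(r + 1) * cols], axis=2)  # along width
                 for r in range(rows)]
    return np.concatenate(grid_rows, axis=1)  # stack rows along height

# Example: four 8-frame 320x240 clips -> one long and one high-res video.
clips = [np.random.randint(0, 256, (8, 240, 320, 3), dtype=np.uint8)
         for _ in range(4)]
long_video = temporal_concat(clips)  # (32, 240, 320, 3)
hires_video = spatial_grid(clips)    # (8, 480, 640, 3)
```

Pairing each synthetic video with QA targets derived from the source captions is what turns these combined clips into instruction data.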
Key Contributions:
- VISTA-400K Dataset: A high-quality synthetic dataset of video instruction pairs with diverse QA formats, built to improve video LMMs' handling of long-duration and high-resolution content (a hypothetical record layout is sketched after this list).
- HRVideoBench: The first comprehensive benchmark dedicated to high-resolution video understanding. This benchmark challenges models to recognize fine object details and subtle actions within high-resolution video content.
- Video Augmentation Methods: Seven techniques for synthesizing longer and higher-resolution video samples from existing video-caption datasets, designed to strengthen models' temporal and spatial understanding.
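To make the resulting training data concrete, a single instruction pair might look like the record below; the field names and values are hypothetical illustrations, not the actual VISTA-400K schema:

```python
# Hypothetical layout of one video instruction pair; field names and
# values are illustrative, not the actual VISTA-400K schema.
record = {
    "video": "synthetic/spatial_grid_000123.mp4",  # augmented video file
    "augmentation": "spatial_grid",                # which of the 7 methods produced it
    "question": "What is the person in the top-left quadrant of the video holding?",
    "answer": "A red umbrella.",
}
```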
Experiments show that models fine-tuned on VISTA-400K outperform their vanilla counterparts. In particular, they perform robustly on the newly introduced benchmark's challenging tasks, such as recognizing small objects and subtle actions in visually rich videos. An ablation study further shows that removing the proposed augmentations significantly degrades performance.
In summary, VISTA provides a scalable framework for augmenting existing video datasets, facilitating the synthesis of long and high-resolution video data that is crucial for advancing the capabilities of multimodal video understanding systems.