
Tarsier: Recipes for Training and Evaluating Large Video Description Models (2407.00634v2)

Published 30 Jun 2024 in cs.CV and cs.LG

Abstract: Generating fine-grained video descriptions is a fundamental challenge in video understanding. In this work, we introduce Tarsier, a family of large-scale video-LLMs designed to generate high-quality video descriptions. Tarsier employs CLIP-ViT to encode frames separately and then uses an LLM to model temporal relationships. Despite its simple architecture, we demonstrate that with a meticulously designed two-stage training procedure, the Tarsier models exhibit substantially stronger video description capabilities than any existing open-source model, showing a $+51.4\%$ advantage in human side-by-side evaluation over the strongest model. Additionally, they are comparable to state-of-the-art proprietary models, with a $+12.3\%$ advantage against GPT-4V and a $-6.7\%$ disadvantage against Gemini 1.5 Pro. When upgraded to Tarsier2 by building upon SigLIP and Qwen2-7B, it further improves significantly with a $+4.8\%$ advantage against GPT-4o. Besides video description, Tarsier proves to be a versatile generalist model, achieving new state-of-the-art results across nine public benchmarks, including multi-choice VQA, open-ended VQA, and zero-shot video captioning. Our second contribution is the introduction of a new benchmark -- DREAM-1K (https://tarsier-vlm.github.io/) for evaluating video description models, consisting of a new challenging dataset featuring videos from diverse sources and varying complexity, along with an automatic method specifically designed to assess the quality of fine-grained video descriptions. We make our models and evaluation benchmark publicly available at https://github.com/bytedance/tarsier.

Citations (16)

Summary

  • The paper's main contribution is Tarsier, a scalable video-language model that achieves a +51.4% preference in human side-by-side evaluations over the strongest open-source baseline.
  • It details a two-stage training process, starting with multi-task video-to-text pre-training followed by instruction tuning for multi-grained video descriptions.
  • The work introduces DREAM-1K and the AutoDQ evaluation method to rigorously assess fine-grained video description capabilities.

An Analytical Review of "Tarsier: Recipes for Training and Evaluating Large Video Description Models"

The paper "Tarsier: Recipes for Training and Evaluating Large Video Description Models" presents a comprehensive framework for developing and evaluating models that are designed to generate detailed video descriptions. At the core of this work is the novel Tarsier family of large-scale video-LLMs (LVLMs), which demonstrate markedly improved performance in generating video descriptions compared to existing open-source models, and even show comparability to some state-of-the-art proprietary solutions such as GPT-4V and Gemini 1.5 Pro.

Tarsier Models and Architecture

The Tarsier models employ a straightforward architecture: a CLIP-ViT encoder processes frames individually, and an LLM captures temporal dynamics across the resulting frame tokens. This modular design lets Tarsier model the relationships between frames without overcomplicating the model's structure. Notably, the 34-billion-parameter variant outperforms its competitors, demonstrating a +51.4% preference in human evaluations over the strongest open-source baseline. These results highlight the model's capacity to deliver detailed descriptions with minimal hallucination, even for videos containing multiple subjects and events.
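
The sketch below illustrates this frame-wise encoding scheme: each frame passes through the vision encoder separately, the visual tokens are projected into the LLM's embedding space, and temporal reasoning is left to the language model. The class name, module interfaces, and dimensions are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a Tarsier-style architecture (illustrative, not the official code).
import torch
import torch.nn as nn

class FrameLevelVideoLM(nn.Module):
    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder             # e.g. a CLIP-ViT image tower (assumed interface)
        self.projector = nn.Linear(vision_dim, llm_dim)  # maps visual tokens into the LLM embedding space
        self.llm = llm                                   # decoder-only language model operating on embeddings

    def forward(self, frames: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, 3, H, W); each frame is encoded independently.
        b, t = frames.shape[:2]
        frame_feats = self.vision_encoder(frames.flatten(0, 1))       # (b*t, n_patches, vision_dim), assumed output shape
        frame_tokens = self.projector(frame_feats)                    # (b*t, n_patches, llm_dim)
        frame_tokens = frame_tokens.reshape(b, -1, frame_tokens.shape[-1])  # flatten frames along the sequence axis
        # Temporal modeling happens inside the LLM over the concatenated visual + text token sequence.
        inputs = torch.cat([frame_tokens, text_embeds], dim=1)
        return self.llm(inputs)
```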

Recipe for Training

The training procedure for Tarsier follows a two-stage strategy. First, a large-scale, multi-task video-to-text pre-training stage exposes the model to diverse video scenarios drawn from public datasets and in-house data; this stage is crucial for instilling a robust understanding of dynamic video content. A subsequent instruction-tuning stage then hones the model's ability to generate multi-grained video descriptions, further improving its versatility across benchmarks.
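
A schematic of this recipe is sketched below. The stage names, data-source labels, and the choice of which modules to update are placeholders for illustration; the paper's actual data mixture and hyper-parameters are not reproduced here.

```python
# Illustrative two-stage training recipe for a Tarsier-like model (placeholder values).
TRAINING_STAGES = [
    {
        "name": "stage1_multitask_pretraining",
        "objective": "next-token prediction over mixed video-to-text tasks",
        "data": ["video_captioning", "video_qa", "action_recognition", "in_house_video_text"],
        "trainable": ["vision_encoder", "projector", "llm"],  # assumption: broad update during pre-training
    },
    {
        "name": "stage2_instruction_tuning",
        "objective": "next-token prediction on multi-grained description instructions",
        "data": ["brief_caption_instructions", "detailed_description_instructions"],
        "trainable": ["projector", "llm"],                    # assumption: focus tuning on description behaviour
    },
]

def run_stage(stage: dict) -> None:
    """Placeholder driver: build the stage's data mixture and fine-tune the listed modules."""
    print(f"Running {stage['name']} on {len(stage['data'])} data sources")

for stage in TRAINING_STAGES:
    run_stage(stage)
```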

DREAM-1K Benchmark

In tandem with the Tarsier models, the paper introduces DREAM-1K, a benchmark designed to rigorously assess fine-grained video description capabilities. This dataset includes videos of varying complexity sourced from multiple platforms like YouTube and TikTok. DREAM-1K demands that models not only detect subtle actions but also interpret high-level events, posing a significant challenge to existing video-LLMs. To evaluate these descriptions, the paper proposes the AutoDQ method, which relies on event extraction and entailment to measure precision and recall, offering a more nuanced evaluation compared to traditional metrics.
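
The following sketch shows how an AutoDQ-style score can be assembled from the two ingredients named above: an event extractor and an entailment check. Both `extract_events` and `entails` are stand-ins for the paper's event-extraction and entailment models, and the exact aggregation used by AutoDQ may differ.

```python
# Compact sketch of an AutoDQ-style precision/recall computation (illustrative).
from typing import Callable, List

def autodq_scores(reference: str, candidate: str,
                  extract_events: Callable[[str], List[str]],
                  entails: Callable[[str, str], bool]) -> dict:
    ref_events = extract_events(reference)    # atomic events in the human-written description
    cand_events = extract_events(candidate)   # atomic events in the model's description

    # Precision: fraction of generated events that the reference entails (penalizes hallucination).
    precision = sum(entails(reference, e) for e in cand_events) / max(len(cand_events), 1)
    # Recall: fraction of reference events that the generation entails (rewards coverage).
    recall = sum(entails(candidate, e) for e in ref_events) / max(len(ref_events), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return {"precision": precision, "recall": recall, "f1": f1}
```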

Implications and Future Directions

The implications of this research are expansive, both in theoretical understanding and practical applications. On a theoretical level, the findings suggest that scaling Tarsier models, alongside employing extensive, multi-task pre-training, significantly closes the performance gap between open-source and proprietary models. Practically, this translates to improved video description models capable of applications ranging from automated content labeling to assistive technologies for visually impaired users.

Looking forward, the research presents opportunities for further exploration. Scaling the pre-training data and model components, refining instruction following, and enhancing dataset diversity are identified as avenues for improving Tarsier's performance. These efforts could lead to a more nuanced understanding of, and further advances in, video-language modeling.

In summary, Tarsier represents a noteworthy progression in the field of video understanding, offering high-quality, nuanced video descriptions while balancing simplicity in model design. This paper sets a foundation for future endeavors and provides significant contributions to the collective efforts in advancing LVLMs.
