
LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding (2501.08282v2)

Published 14 Jan 2025 in cs.CV

Abstract: Recent advancements in multimodal LLMs (MLLMs) have shown promising results, yet existing approaches struggle to effectively handle both temporal and spatial localization simultaneously. This challenge stems from two key issues: first, incorporating spatial-temporal localization introduces a vast number of coordinate combinations, complicating the alignment of linguistic and visual coordinate representations; second, encoding fine-grained temporal and spatial information during video feature compression is inherently difficult. To address these issues, we propose LLaVA-ST, an MLLM for fine-grained spatial-temporal multimodal understanding. In LLaVA-ST, we propose Language-Aligned Positional Embedding, which embeds the textual coordinate special tokens into the visual space, simplifying the alignment of fine-grained spatial-temporal correspondences. Additionally, we design the Spatial-Temporal Packer, which decouples the feature compression of temporal and spatial resolutions into two distinct point-to-region attention processing streams. Furthermore, we propose the ST-Align dataset with 4.3M training samples for fine-grained spatial-temporal multimodal understanding. With ST-Align, we present a progressive training pipeline that aligns the visual and textual features through sequential coarse-to-fine stages. Additionally, we introduce the ST-Align benchmark to evaluate spatial-temporal interleaved fine-grained understanding tasks, which include Spatial-Temporal Video Grounding (STVG), Event Localization and Captioning (ELC), and Spatial Video Grounding (SVG). LLaVA-ST achieves outstanding performance on 11 benchmarks requiring fine-grained temporal, spatial, or spatial-temporal interleaved multimodal understanding. Our code, data, and benchmark will be released at https://github.com/appletea233/LLaVA-ST.

Summary

  • The paper introduces LLaVA-ST, a multimodal large language model designed for fine-grained spatial-temporal understanding in videos through novel embedding and packing techniques.
  • Key methods include Language-Aligned Positional Embedding (LAPE) for text-visual alignment and Spatial-Temporal Packer (STP) for preserving detailed features during compression.
  • The authors also introduce the ST-Align dataset and benchmark, and demonstrate LLaVA-ST's state-of-the-art performance on 11 benchmarks requiring spatial-temporal video comprehension.

The paper presents LLaVA-ST, a multimodal LLM (MLLM) tailored for fine-grained spatial-temporal understanding. The core challenges the model addresses are aligning linguistic and visual coordinate representations once spatial-temporal localization is introduced, and encoding fine-grained temporal and spatial information during video feature compression.

Key Contributions:

  1. Language-Aligned Positional Embedding (LAPE): Rather than forcing the model to learn a separate mapping over the vast space of coordinate combinations, LAPE embeds the textual coordinate special tokens directly into the visual feature space, so fine-grained spatial-temporal correspondences between language and vision are aligned by construction (a minimal sketch follows this list).
  2. Spatial-Temporal Packer (STP): STP decouples video feature compression into two point-to-region attention streams, one compressing spatial resolution and one compressing temporal resolution. This preserves detailed spatial-temporal relationships in the compressed video features (a sketch follows this list).
  3. ST-Align Dataset and Benchmark: The authors introduce the ST-Align dataset, comprising 4.3 million training samples crafted for fine-grained spatial-temporal multimodal understanding, together with the ST-Align benchmark, which evaluates spatial-temporal interleaved tasks: Spatial-Temporal Video Grounding (STVG), Event Localization and Captioning (ELC), and Spatial Video Grounding (SVG).
  4. Progressive Training Pipeline: LLaVA-ST is trained through a three-stage pipeline, consisting of content alignment, coordinate alignment, and multi-task instruction tuning, designed to progressively align visual and textual features from coarse to fine granularity (an illustrative stage schedule follows this list).
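
To make LAPE concrete, here is a minimal sketch of the underlying idea: normalized coordinates are discretized into a fixed number of special tokens, and those tokens' embeddings are derived from a visual positional-embedding grid rather than learned as ordinary word embeddings. The class name, `num_bins`, the grid shape, and the interpolation scheme are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageAlignedCoordEmbedding(nn.Module):
    """Sketch of the LAPE idea: coordinate special tokens on the text side
    reuse the visual positional-embedding space instead of independent
    word embeddings, tying textual and visual coordinates together."""

    def __init__(self, num_bins: int = 100, vis_grid: int = 24, dim: int = 1024):
        super().__init__()
        # Positional embeddings of the vision encoder on a vis_grid x vis_grid grid.
        self.visual_pos_embed = nn.Parameter(torch.randn(vis_grid, vis_grid, dim))
        self.num_bins = num_bins  # coordinate special tokens <0> ... <num_bins - 1>

    def coord_token_embeddings(self) -> torch.Tensor:
        """Embeddings for the num_bins coordinate tokens, obtained by resampling
        one axis of the visual positional grid to the coordinate resolution."""
        axis = self.visual_pos_embed.mean(dim=0)        # (vis_grid, dim)
        axis = axis.permute(1, 0).unsqueeze(0)          # (1, dim, vis_grid)
        bins = F.interpolate(axis, size=self.num_bins,
                             mode="linear", align_corners=True)
        return bins.squeeze(0).permute(1, 0)            # (num_bins, dim)
```

In this sketch, the returned embeddings would back the coordinate special tokens in the LLM's vocabulary, so that predicting a coordinate token is grounded in the corresponding visual position.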
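
STP can likewise be illustrated with a small sketch that uses standard cross-attention as a stand-in for the paper's point-to-region attention. It assumes one stream keeps every frame while compressing the spatial patches and the other keeps every spatial location while compressing the frames; the query counts, dimensions, and module names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class SpatialTemporalPackerSketch(nn.Module):
    """Sketch of the STP idea: compress video features with two decoupled
    streams, one reducing spatial resolution and one reducing temporal
    resolution, each via cross-attention from a small set of query tokens."""

    def __init__(self, dim: int = 1024, n_spatial_q: int = 16,
                 n_temporal_q: int = 8, n_heads: int = 8):
        super().__init__()
        self.spatial_queries = nn.Parameter(torch.randn(n_spatial_q, dim))
        self.temporal_queries = nn.Parameter(torch.randn(n_temporal_q, dim))
        self.spatial_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, feats: torch.Tensor):
        """feats: (B, T, HW, D) patch features for T frames of HW patches each."""
        B, T, HW, D = feats.shape

        # Stream 1: keep every frame, compress HW patches -> n_spatial_q tokens.
        frame_feats = feats.reshape(B * T, HW, D)
        sq = self.spatial_queries.unsqueeze(0).expand(B * T, -1, -1)
        spatial_out, _ = self.spatial_attn(sq, frame_feats, frame_feats)
        spatial_out = spatial_out.reshape(B, T, -1, D)     # (B, T, n_spatial_q, D)

        # Stream 2: keep every location, compress T frames -> n_temporal_q tokens.
        loc_feats = feats.permute(0, 2, 1, 3).reshape(B * HW, T, D)
        tq = self.temporal_queries.unsqueeze(0).expand(B * HW, -1, -1)
        temporal_out, _ = self.temporal_attn(tq, loc_feats, loc_feats)
        temporal_out = temporal_out.reshape(B, HW, -1, D)  # (B, HW, n_temporal_q, D)

        return spatial_out, temporal_out
```

The two compressed streams could then be flattened and passed to the LLM together; how LLaVA-ST actually merges them is not specified in this summary.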
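
Finally, the coarse-to-fine pipeline can be summarized as a simple stage schedule. The stage names follow the summary above; the per-stage goals, trainable-module lists, and function names below are invented placeholders rather than the paper's actual configuration.

```python
# Illustrative coarse-to-fine stage schedule; the "trainable" entries are assumptions.
TRAINING_STAGES = [
    {"name": "content_alignment",
     "goal": "align global visual content with captions",
     "trainable": ["projector"]},
    {"name": "coordinate_alignment",
     "goal": "teach coordinate special tokens with grounding data",
     "trainable": ["projector", "coordinate_embeddings"]},
    {"name": "multitask_instruction_tuning",
     "goal": "fine-grained spatial-temporal instruction data from ST-Align",
     "trainable": ["projector", "coordinate_embeddings", "llm"]},
]

def run_pipeline(model, load_stage_data, train_one_stage):
    """Run the stages sequentially, carrying weights from one stage to the next."""
    for stage in TRAINING_STAGES:
        train_one_stage(model, load_stage_data(stage["name"]), stage)
```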

Experiments and Results:

LLaVA-ST is evaluated on 11 benchmarks spanning fine-grained temporal, spatial, and spatial-temporal interleaved multimodal understanding, and achieves strong results across them. It is particularly effective on the spatial-temporal interleaved tasks, where it outperforms existing models by substantial margins, demonstrating its efficacy in complex video-language tasks.

Overall, LLaVA-ST integrates fine-grained spatial and temporal localization within a single MLLM, setting a new standard for complex video-language comprehension.