LITA: Language Instructed Temporal-Localization Assistant (2403.19046v1)

Published 27 Mar 2024 in cs.CV and cs.AI

Abstract: There has been tremendous progress in multimodal LLMs. Recent works have extended these models to video input with promising instruction following capabilities. However, an important missing piece is temporal localization. These models cannot accurately answer the "When?" questions. We identify three key aspects that limit their temporal localization capabilities: (i) time representation, (ii) architecture, and (iii) data. We address these shortcomings by proposing Language Instructed Temporal-Localization Assistant (LITA) with the following features: (1) We introduce time tokens that encode timestamps relative to the video length to better represent time in videos. (2) We introduce SlowFast tokens in the architecture to capture temporal information at fine temporal resolution. (3) We emphasize temporal localization data for LITA. In addition to leveraging existing video datasets with timestamps, we propose a new task, Reasoning Temporal Localization (RTL), along with the dataset, ActivityNet-RTL, for learning and evaluating this task. Reasoning temporal localization requires both the reasoning and temporal localization of Video LLMs. LITA demonstrates strong performance on this challenging task, nearly doubling the temporal mean intersection-over-union (mIoU) of baselines. In addition, we show that our emphasis on temporal localization also substantially improves video-based text generation compared to existing Video LLMs, including a 36% relative improvement of Temporal Understanding. Code is available at: https://github.com/NVlabs/LITA

LITA: Enhancing Temporal Localization in Video LLMs

Introduction

The evolution of LLMs has extended their capabilities to multimodal inputs, including videos, opening new avenues for understanding and generating content based on video data. Despite these advancements, a critical challenge persists for video-based models: temporal localization, the ability to accurately pinpoint "when" specific events occur within a video. This paper introduces the Language Instructed Temporal-Localization Assistant (LITA), a novel approach designed to address the limitations in temporal localization observed in current video large language models (Video LLMs).

Key Challenges in Temporal Localization

Temporal localization in videos is an essential aspect that distinguishes video data from static images. Accurately identifying the timing of events within videos is crucial for various applications, yet existing Video LLMs face significant challenges in this area, primarily due to limitations in time representation, architectural design, and the nature of the data they are trained on. LITA addresses these issues through innovative solutions in each of these domains.

LITA's Contributions

LITA introduces several key innovations to enhance temporal localization in Video LLMs:

  • Time Tokens: A novel method of encoding timestamps relative to the video length, allowing for more precise temporal localization without relying on absolute time representations.
  • SlowFast Tokens: An architectural innovation that captures temporal information at fine temporal resolution, enabling accurate event localization within videos (a minimal sketch of the time-token mapping and SlowFast pooling follows this list).
  • Data Emphasis on Temporal Localization: A focused approach to training data, incorporating existing video datasets with accurate timestamps and introducing a new dataset and task specifically designed for temporal localization training and evaluation.
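
To make the first two ideas concrete, here is a minimal, illustrative sketch rather than the official LITA implementation; the 100-token vocabulary size, tensor shapes, and helper names are assumptions chosen for illustration. It shows how absolute timestamps can be discretized into time tokens relative to video length, and how per-frame features can be split into temporally dense "fast" tokens and spatially richer "slow" tokens:

```python
# Illustrative sketch (not the official LITA code). Assumptions: 100 discrete
# time tokens, and per-frame features of shape (num_frames, H*W, dim) from a
# frame encoder with a square spatial grid.
import torch
import torch.nn.functional as F

NUM_TIME_TOKENS = 100  # video duration is split into this many equal chunks

def timestamp_to_time_token(t_sec: float, duration_sec: float) -> int:
    """Map an absolute timestamp to a relative time-token index in [0, NUM_TIME_TOKENS - 1]."""
    frac = min(max(t_sec / duration_sec, 0.0), 1.0)
    return min(int(frac * NUM_TIME_TOKENS), NUM_TIME_TOKENS - 1)

def time_token_to_timestamp(token_idx: int, duration_sec: float) -> float:
    """Decode a time-token index back to seconds (center of its chunk)."""
    return (token_idx + 0.5) / NUM_TIME_TOKENS * duration_sec

def slowfast_tokens(frame_feats: torch.Tensor, slow_stride: int = 4, slow_grid: int = 2):
    """Split per-frame features (T, H*W, D) into slow and fast token sets.

    Fast tokens: one spatially averaged token per frame -> fine temporal resolution.
    Slow tokens: every `slow_stride`-th frame, pooled to a slow_grid x slow_grid
    spatial grid -> coarser temporal sampling but richer spatial detail.
    """
    T, HW, D = frame_feats.shape
    fast = frame_feats.mean(dim=1)                                # (T, D)

    side = int(HW ** 0.5)                                         # square feature grid assumed
    slow_frames = frame_feats[::slow_stride].reshape(-1, side, side, D)
    slow_frames = slow_frames.permute(0, 3, 1, 2)                 # (T/stride, D, side, side)
    pooled = F.avg_pool2d(slow_frames, kernel_size=side // slow_grid)
    slow = pooled.flatten(2).permute(0, 2, 1).reshape(-1, D)      # (T/stride * slow_grid^2, D)
    return slow, fast

# Example: second 42.0 of a 120 s video maps to time token 35,
# which decodes back to roughly 42.6 s.
```

Because the time tokens are relative to video length, converting a predicted token back to seconds only requires knowing the video's duration.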

Reasoning Temporal Localization (RTL) Task and Dataset

One of the most significant contributions of LITA is the proposal of the Reasoning Temporal Localization (RTL) task, accompanied by a new dataset, ActivityNet-RTL. This task challenges models not only to localize events in time but also to reason their way to answers for complex queries. LITA demonstrates strong performance on this challenging task, nearly doubling the temporal mean intersection-over-union (mIoU) of baseline models, while also substantially improving video-based text generation.
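
For reference, temporal mIoU measures the overlap between a predicted time interval and the ground-truth interval, averaged over questions. Below is a minimal sketch of the standard formulation (the paper's evaluation script may differ in edge-case handling):

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """IoU of two time intervals given as (start_sec, end_sec)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_temporal_iou(preds, gts) -> float:
    """Average temporal IoU over paired predictions and ground truths."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)

# Example: predicting [10 s, 20 s] against a ground truth of [15 s, 30 s] gives IoU 0.25.
```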

Implications and Future Directions

The innovations introduced by LITA have several implications for the field of AI and LLMs:

  • Improved Temporal Localization: LITA's methodology for representing time and its architecture for processing video data significantly enhance temporal localization capabilities in Video LLMs.
  • Enhanced Video Understanding: Beyond temporal localization, LITA has been shown to improve general video understanding, as evidenced by its performance on various video-based text generation tasks.
  • Potential for Wider Applications: LITA's advancements open new possibilities for applications requiring precise understanding of events in videos, from content creation and summarization to surveillance and activity recognition.

Looking ahead, the concepts and methodologies introduced by LITA could inspire further research in the field of Video LLMs, particularly in improving temporal understanding and reasoning. Additionally, the promising results of the RTL task and the ActivityNet-RTL dataset suggest avenues for expanding and refining training data and tasks in this domain.

Conclusion

LITA represents a significant step forward in addressing the current limitations of temporal localization in Video LLMs. Through innovative approaches to time representation, architectural design, and focused training data, LITA not only enhances temporal localization capabilities but also improves overall video understanding. The introduction of the RTL task and the ActivityNet-RTL dataset further underscore the potential for LLMs to tackle increasingly complex video-based challenges, paving the way for future developments in this rapidly evolving field.

Authors (7)
  1. De-An Huang (45 papers)
  2. Shijia Liao (5 papers)
  3. Subhashree Radhakrishnan (7 papers)
  4. Hongxu Yin (49 papers)
  5. Pavlo Molchanov (70 papers)
  6. Zhiding Yu (94 papers)
  7. Jan Kautz (215 papers)

GitHub

  1. GitHub - NVlabs/LITA (179 stars)