LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling (2206.07160v1)

Published 14 Jun 2022 in cs.CV

Abstract: Unified vision-language frameworks have greatly advanced in recent years, most of which adopt an encoder-decoder architecture to unify image-text tasks as sequence-to-sequence generation. However, existing video-language (VidL) models still require task-specific designs in model architecture and training objectives for each task. In this work, we explore a unified VidL framework LAVENDER, where Masked Language Modeling (MLM) is used as the common interface for all pre-training and downstream tasks. Such unification leads to a simplified model architecture, where only a lightweight MLM head, instead of a decoder with much more parameters, is needed on top of the multimodal encoder. Surprisingly, experimental results show that this unified framework achieves competitive performance on 14 VidL benchmarks, covering video question answering, text-to-video retrieval and video captioning. Extensive analyses further demonstrate the advantage of LAVENDER over existing VidL methods in: (i) supporting all downstream tasks with just a single set of parameter values when multi-task finetuned; (ii) few-shot generalization on various downstream tasks; and (iii) enabling zero-shot evaluation on video question answering tasks. Code is available at https://github.com/microsoft/LAVENDER.

Unifying Video-Language Understanding with Lavender

The paper "Lavender: Unifying Video-Language Understanding as Masked LLMing" seeks to streamline the architecture and training objectives of video-language (VidL) models by employing a unified framework where Masked LLMing (MLM) serves as the common interface for both pre-training and downstream tasks. This approach markedly simplifies the model architecture: instead of employing a complex encoder-decoder structure, a lightweight MLM head is used on top of the multimodal encoder. The experimental evidence presented in the paper indicates that this unified approach delivers competitive performance across 14 VidL benchmarks, including video question answering, text-to-video retrieval, and video captioning.

Methodology and Results

Lavender leverages MLM as the sole objective for both pre-training and all downstream adaptations. This departs from existing VidL models, which require distinct architectures and objectives for each task, such as a separate head for Video Text Matching (VTM) during pre-training and task-specific heads during downstream adaptation. Lavender instead folds VTM into the MLM framework by reusing the same [MASK] token employed in MLM and asking the model to fill it in, thereby removing the need for the separate binary classification head typically used for VTM.
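
As a hedged illustration of this idea (not the authors' exact code), the snippet below casts VTM as filling an appended [MASK] token with the word "true" or "false". The helper name, the Hugging Face-style tokenizer, and the model interface from the sketch above are assumptions.

```python
import torch

def vtm_as_mlm_loss(model, tokenizer, video_feats, captions, labels):
    """Hypothetical VTM-as-MLM loss. `labels` is a bool tensor:
    True if the caption matches the video, False for a mismatched pair."""
    mask_id = tokenizer.mask_token_id
    true_id = tokenizer.convert_tokens_to_ids("true")
    false_id = tokenizer.convert_tokens_to_ids("false")

    # Append one [MASK] to every caption; only that position carries a target.
    enc = tokenizer(list(captions), padding=True, return_tensors="pt")
    mask_col = torch.full((enc.input_ids.size(0), 1), mask_id, dtype=torch.long)
    input_ids = torch.cat([enc.input_ids, mask_col], dim=1)

    # Target word at the [MASK] slot: "true" for matched pairs, "false" otherwise.
    targets = torch.where(labels, torch.tensor(true_id), torch.tensor(false_id))

    logits = model(video_feats, input_ids)            # (B, Nt+1, vocab)
    # Cross-entropy only on the appended [MASK] position.
    return torch.nn.functional.cross_entropy(logits[:, -1], targets)
```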

The paper reports numerical results that demonstrate Lavender's effectiveness. Notably, Lavender achieves state-of-the-art or competitive performance on several benchmarks, with gains including TGIF-Action (+2.6%), TGIF-Transition (+2.9%), MSVD-QA (+10.3%), and LSMDC-FiB (+4.2%). Its design allows a single set of parameters to serve all VidL tasks under multi-task fine-tuning, and it further supports few-shot generalization and zero-shot evaluation. In particular, Lavender is robust in zero-shot video question answering, outperforming comparable models on several QA datasets without additional supervised tuning.
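
Zero-shot video QA fits the same interface: the question is followed by a [MASK] slot, and the answer is read off as the highest-scoring word among a set of candidate answers. The sketch below is an assumed evaluation recipe reusing the hypothetical model and tokenizer from the earlier snippets; the exact prompt format and answer-vocabulary handling in the paper may differ.

```python
import torch

@torch.no_grad()
def zero_shot_vqa(model, tokenizer, video_feats, question, answer_vocab):
    """Hypothetical zero-shot QA: predict the word filling an appended [MASK],
    restricted to a candidate answer vocabulary."""
    mask_id = tokenizer.mask_token_id
    enc = tokenizer(question, return_tensors="pt")
    input_ids = torch.cat([enc.input_ids, torch.tensor([[mask_id]])], dim=1)

    logits = model(video_feats, input_ids)[:, -1]     # scores at the [MASK] slot
    cand_ids = tokenizer.convert_tokens_to_ids(answer_vocab)
    best = torch.tensor(cand_ids)[logits[0, cand_ids].argmax()]
    return tokenizer.convert_ids_to_tokens(best.item())
```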

Implications and Future Work

The findings carry significant implications for both practical application and theoretical exploration of VidL frameworks. Practically, the unification promotes efficiency by reducing the need for task-specific engineering, thereby simplifying model deployment and enabling more economical utilization of computational resources. Theoretically, the approach challenges the necessity of complex model heads and objectives for multimodal tasks, proposing instead that unified LLMing could sufficiently address diverse VidL challenges.

Future research directions could extend Lavender's capabilities to tasks requiring finer-grained temporal alignment (e.g., moment retrieval in videos) and enhance prompt-based methods to further improve task generalization in low-resource scenarios. Moreover, while Lavender already demonstrates strong performance, further investigation into in-context learning or prompt tuning could yield additional gains in adaptability.

As the paper acknowledges, data-driven models such as Lavender carry inherent risks associated with data bias and energy consumption; however, advancements in unified frameworks like Lavender may mitigate these issues by enhancing model adaptability and resource efficiency.

Overall, "Lavender: Unifying Video-Language Understanding as Masked LLMing" advances the VidL field with a streamlined and efficient approach, demonstrating the potential of MLM as a versatile framework across multimodal tasks.

Authors (7)
  1. Linjie Li (89 papers)
  2. Zhe Gan (135 papers)
  3. Kevin Lin (98 papers)
  4. Chung-Ching Lin (36 papers)
  5. Zicheng Liu (153 papers)
  6. Ce Liu (51 papers)
  7. Lijuan Wang (133 papers)
Citations (72)