
i-SRT: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective Judgment (2406.11280v1)

Published 17 Jun 2024 in cs.CV

Abstract: Aligning Video Large Multimodal Models (VLMMs) faces challenges such as modality misalignment and verbose responses. Although iterative approaches such as self-rewarding or iterative direct preference optimization (DPO) have recently shown significant improvement in LLM alignment, particularly on reasoning tasks, self-aligned models applied to large video-LLMs often produce lengthy and irrelevant responses. To address these challenges, we propose a novel method that employs self-retrospection to enhance both response generation and preference modeling, which we call iterative self-retrospective judgment (i-SRT). By revisiting and evaluating already generated content and preferences in a loop, i-SRT improves the alignment between textual and visual modalities, reduces verbosity, and enhances content relevance. Our empirical evaluations across diverse video question answering benchmarks demonstrate that i-SRT significantly outperforms prior art. We are committed to open-sourcing our code, models, and datasets to encourage further investigation.
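The loop the abstract describes — sample candidate responses, have the model retrospectively judge its own outputs, and feed the resulting preference pairs into a DPO update — can be sketched as follows. This is a minimal toy sketch, not the paper's implementation: the generator, the retrospective judge (here a length penalty standing in for a relevance/verbosity score), and the DPO step are all hypothetical stubs.

```python
# Hypothetical sketch of an iterative self-retrospective judgment (i-SRT) loop.
# All three components below are toy stand-ins, not the paper's actual code.

def generate_candidates(model, prompt, n=4):
    # Stand-in for sampling n candidate responses from the video-LLM.
    return [f"{prompt} -> response {model['step']}.{i}" for i in range(n)]

def self_retrospective_judge(prompt, response):
    # Stand-in: the model revisits its own output and scores it; here we
    # use negative length as a toy proxy for "relevant and non-verbose".
    return -len(response)

def dpo_update(model, chosen, rejected):
    # Stand-in for one direct preference optimization step on a pair.
    model["step"] += 1
    model["pairs"].append((chosen, rejected))
    return model

def i_srt(model, prompts, iterations=2):
    # Outer loop: regenerate, re-judge, and re-optimize in each iteration.
    for _ in range(iterations):
        for prompt in prompts:
            candidates = generate_candidates(model, prompt)
            ranked = sorted(
                candidates,
                key=lambda r: self_retrospective_judge(prompt, r),
                reverse=True,
            )
            # Best-judged response is "chosen", worst is "rejected".
            model = dpo_update(model, chosen=ranked[0], rejected=ranked[-1])
    return model

model = i_srt({"step": 0, "pairs": []}, ["what happens in the video?"])
```

Each pass regenerates responses with the updated model, so the judge's preferences shape later generations; the paper's contribution is applying this retrospection to both the generation and the preference-modeling side for video inputs.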

Authors (6)
  1. Daechul Ahn (4 papers)
  2. Yura Choi (2 papers)
  3. San Kim (4 papers)
  4. Youngjae Yu (72 papers)
  5. Dongyeop Kang (72 papers)
  6. Jonghyun Choi (50 papers)
Citations (2)