i-SRT: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective Judgment (2406.11280v1)
Abstract: Aligning Video Large Multimodal Models (VLMMs) faces challenges such as modality misalignment and verbose responses. Although iterative approaches such as self-rewarding or iterative direct preference optimization (DPO) have recently shown significant improvements in LLM alignment, particularly on reasoning tasks, self-aligned models applied to large video-LLMs often produce lengthy and irrelevant responses. To address these challenges, we propose a novel method that employs self-retrospection to enhance both response generation and preference modeling, which we call iterative self-retrospective judgment (i-SRT). By revisiting and evaluating previously generated content and preferences in a loop, i-SRT improves the alignment between textual and visual modalities, reduces verbosity, and enhances content relevance. Our empirical evaluations across diverse video question answering benchmarks demonstrate that i-SRT significantly outperforms prior art. We are committed to open-sourcing our code, models, and datasets to encourage further investigation.
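To make the loop described in the abstract concrete, below is a minimal, hypothetical Python sketch of an iterative self-retrospective preference-optimization cycle: the current model generates candidate responses, retrospectively judges them, forms chosen/rejected pairs, and is updated with DPO before the next iteration. All names here (`generate_candidates`, `self_judge`, `dpo_update`, the scoring logic) are illustrative placeholders under assumed interfaces, not the authors' released implementation.

```python
# Hypothetical sketch of an iterative self-retrospective judgment loop.
# Every function body below is a stand-in, not the paper's actual code.
from dataclasses import dataclass
import random

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str

def generate_candidates(model, prompt, video, n=4):
    # Placeholder: sample n candidate responses from the current model.
    return [f"candidate {i} for '{prompt}'" for i in range(n)]

def self_judge(model, prompt, video, responses):
    # Placeholder: the model retrospectively scores its own responses,
    # e.g. penalizing verbosity and rewarding visual grounding.
    return [random.random() for _ in responses]

def dpo_update(model, pairs):
    # Placeholder: one round of direct preference optimization on the pairs.
    return model

def i_srt_round(model, dataset):
    pairs = []
    for prompt, video in dataset:
        responses = generate_candidates(model, prompt, video)
        scores = self_judge(model, prompt, video, responses)
        ranked = [r for _, r in sorted(zip(scores, responses), reverse=True)]
        # Best-scored response becomes "chosen", worst becomes "rejected".
        pairs.append(PreferencePair(prompt, ranked[0], ranked[-1]))
    return dpo_update(model, pairs)

# Iterate: after each round the updated model both generates and judges,
# so generation quality and the preference signal improve together.
model = object()
dataset = [("What happens in the video?", "example_video.mp4")]
for _ in range(3):
    model = i_srt_round(model, dataset)
```

The key design point conveyed by the abstract is that the same model closes the loop: it both produces candidates and retrospectively judges them, so each DPO round is trained on preferences that reflect the model's current weaknesses (e.g. verbosity or weak grounding in the video).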
- Daechul Ahn (4 papers)
- Yura Choi (2 papers)
- San Kim (4 papers)
- Youngjae Yu (72 papers)
- Dongyeop Kang (72 papers)
- Jonghyun Choi (50 papers)