i-SRT: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective Judgment (2406.11280v1)
Abstract: Aligning Video Large Multimodal Models (VLMMs) faces challenges such as modality misalignment and verbose responses. Although iterative approaches such as self-rewarding or iterative direct preference optimization (DPO) have recently shown significant improvements in LLM alignment, particularly on reasoning tasks, self-aligned models applied to large video-LLMs often produce lengthy and irrelevant responses. To address these challenges, we propose a novel method that employs self-retrospection to enhance both response generation and preference modeling, which we call iterative self-retrospective judgment (i-SRT). By revisiting and evaluating previously generated content and preferences in a loop, i-SRT improves the alignment between textual and visual modalities, reduces verbosity, and enhances content relevance. Our empirical evaluations across diverse video question answering benchmarks demonstrate that i-SRT significantly outperforms prior art. We are committed to open-sourcing our code, models, and datasets to encourage further investigation.
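To make the loop described in the abstract concrete, below is a minimal, hypothetical Python sketch of an iterative self-retrospective preference-optimization cycle: the current model generates candidate responses, retrospectively judges them, forms chosen/rejected pairs, and is updated with DPO before the next iteration. All names here (`generate_candidates`, `self_judge`, `dpo_update`, the scoring logic) are illustrative placeholders under assumed interfaces, not the authors' released implementation.

```python
# Hypothetical sketch of an iterative self-retrospective judgment loop.
# Every function body below is a stand-in, not the paper's actual code.
from dataclasses import dataclass
import random

@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str

def generate_candidates(model, prompt, video, n=4):
    # Placeholder: sample n candidate responses from the current model.
    return [f"candidate {i} for '{prompt}'" for i in range(n)]

def self_judge(model, prompt, video, responses):
    # Placeholder: the model retrospectively scores its own responses,
    # e.g. penalizing verbosity and rewarding visual grounding.
    return [random.random() for _ in responses]

def dpo_update(model, pairs):
    # Placeholder: one round of direct preference optimization on the pairs.
    return model

def i_srt_round(model, dataset):
    pairs = []
    for prompt, video in dataset:
        responses = generate_candidates(model, prompt, video)
        scores = self_judge(model, prompt, video, responses)
        ranked = [r for _, r in sorted(zip(scores, responses), reverse=True)]
        # Best-scored response becomes "chosen", worst becomes "rejected".
        pairs.append(PreferencePair(prompt, ranked[0], ranked[-1]))
    return dpo_update(model, pairs)

# Iterate: after each round the updated model both generates and judges,
# so generation quality and the preference signal improve together.
model = object()
dataset = [("What happens in the video?", "example_video.mp4")]
for _ in range(3):
    model = i_srt_round(model, dataset)
```

The key design point conveyed by the abstract is that the same model closes the loop: it both produces candidates and retrospectively judges them, so each DPO round is trained on preferences that reflect the model's current weaknesses (e.g. verbosity or weak grounding in the video).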
- Daechul Ahn (4 papers)
- Yura Choi (2 papers)
- San Kim (4 papers)
- Youngjae Yu (72 papers)
- Dongyeop Kang (72 papers)
- Jonghyun Choi (50 papers)