
Direct Preference Optimization of Video Large Multimodal Models from Language Model Reward (2404.01258v2)

Published 1 Apr 2024 in cs.CV and cs.AI

Abstract: Preference modeling techniques, such as direct preference optimization (DPO), have proven effective in enhancing the generalization abilities of LLMs. However, in tasks involving video instruction-following, providing informative feedback, especially for detecting hallucinations in generated responses, remains a significant challenge. Previous studies have explored using large multimodal models (LMMs) as reward models to guide preference modeling, but their ability to accurately assess the factuality of generated responses against the corresponding videos has not been conclusively established. This paper introduces a novel framework that utilizes detailed video captions as a proxy for video content, enabling LLMs to incorporate this information as supporting evidence when scoring video Question Answering (QA) predictions. The approach demonstrates robust alignment with the reward mechanism of OpenAI's GPT-4V model, which takes video frames as input directly. Furthermore, applying this tailored reward through DPO significantly improves the performance of video LMMs on video QA tasks.

Enhancing Video Large Multimodal Models with Direct Preference Optimization from LLM Rewards

Introduction

Researchers have developed a novel framework that leverages Direct Preference Optimization (DPO) to substantially improve the performance of video large multimodal models (video LMMs) on Video Question Answering (Video QA) tasks. The work introduces a reward mechanism that uses detailed video captions as a proxy for video content, enabling LLMs to assess the factual accuracy of responses generated by video LMMs more effectively.

The Challenge

As demand for video content understanding grows, enabling video LMMs to follow video instructions accurately has become a significant challenge. Traditional Reinforcement Learning (RL) and DPO approaches, while effective in text-based domains, have struggled in multimodal contexts such as video, primarily because robust reward signals are difficult to obtain. To address the high cost of collecting human preference data and the scalability issues of reinforcement learning reward models, the paper proposes a new approach that leverages detailed video captions to improve model alignment and performance on video-based tasks.

Dataset and Methodology

To address the challenges in evaluating video LMMs, the researchers devised a comprehensive dataset named ShareGPTVideo. The dataset contains 900k detailed video captions, capturing a wide range of video content elements such as temporal dynamics and spatial relationships. These captions serve as a foundation for the proposed reward mechanism by providing a rich source of information for LLMs to assess the factual alignment of video LMM responses.
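
To make the caption-as-proxy reward concrete, here is a minimal sketch of how an LLM judge could score a candidate answer when given a detailed caption in place of raw video frames. The prompt wording, the 1-to-5 scale, and the `query_llm` client are illustrative assumptions, not the paper's exact implementation.

```python
def build_reward_prompt(caption: str, question: str, answer: str) -> str:
    """Assemble a judging prompt that substitutes a detailed caption for raw frames."""
    return (
        "You are grading a video question-answering response.\n"
        f"Detailed video caption (proxy for the video content):\n{caption}\n\n"
        f"Question: {question}\n"
        f"Candidate answer: {answer}\n\n"
        "Rate how well the answer is factually supported by the caption "
        "on a scale of 1 to 5, and reply with the number only."
    )


def caption_proxy_reward(query_llm, caption: str, question: str, answer: str) -> float:
    """Score one response; `query_llm` is any text-in/text-out LLM client (hypothetical)."""
    reply = query_llm(build_reward_prompt(caption, question, answer))
    return float(reply.strip())
```

Responses sampled from the video LMM can then be ranked by this score to form the chosen/rejected pairs used in the DPO stage described below.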

The paper outlines a three-stage training pipeline for the proposed framework:

  1. Caption Pre-training Stage: Utilizes the newly introduced video caption data for pre-training, enriching the model's understanding of video content.
  2. Supervised Fine-Tuning (SFT) Stage: Involves fine-tuning with video instruction-following data generated from the detailed video captions, ensuring the model's responses are grounded in the video content.
  3. Direct Preference Optimization (DPO) Stage: Applies the DPO algorithm to refine the model's responses further, using rewards derived from an LLM's assessment of each response's factual alignment with the caption (a minimal sketch of the loss follows this list).
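
The DPO stage optimizes the standard DPO objective over preference pairs ranked by the caption-based reward. Below is a minimal PyTorch sketch of that loss; the function name and the default beta value are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective over a batch of preference pairs.

    Each argument is the summed log-probability of a full response under the
    trainable policy or the frozen reference (SFT) model, shape (batch,).
    """
    # Implicit rewards are the scaled log-ratios of policy vs. reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Minimizing -log sigmoid of the margin pushes the policy to prefer the
    # response the caption-based reward ranked higher.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```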

Experimental Results

The experimental evaluation demonstrates the effectiveness of the proposed framework in enhancing video LMMs' performance on video QA tasks. Notably, the LLaVA-Hound-DPO model, which incorporates the DPO training stage, achieved an 8.1% improvement in accuracy over its SFT counterpart. This significant performance enhancement illustrates the value of utilizing video captions as proxies for video content in the DPO process.

Implications and Future Work

This research represents a significant advancement in the alignment and performance of video LMMs on video QA tasks. The introduction of a cost-effective and scalable reward mechanism using detailed video captions as proxies offers a promising direction for future work in multimodal model training and evaluation. The work also opens up new possibilities for exploring other domains where video content understanding is critical. Future research might include expanding the dataset to cover a broader range of video types and exploring other model architectures to further improve performance and alignment in video-based tasks.

Conclusion

In conclusion, this paper presents a novel approach to improving video LMMs through a detailed video caption dataset and a tailored DPO method. The proposed framework not only enhances model performance on video QA tasks but also addresses the scalability challenges associated with training multimodal models. This work lays a solid foundation for further research in video content understanding and model alignment, marking a notable contribution to the field of AI and multimodal learning.

Authors (11)
  1. Ruohong Zhang
  2. Liangke Gui
  3. Zhiqing Sun
  4. Yihao Feng
  5. Keyang Xu
  6. Yuanhan Zhang
  7. Di Fu
  8. Chunyuan Li
  9. Alexander Hauptmann
  10. Yonatan Bisk
  11. Yiming Yang