Self-Supervised Visual Preference Alignment (2404.10501v2)
Abstract: This paper makes the first attempt towards unsupervised preference alignment in Vision-Language Models (VLMs). We generate chosen and rejected responses from original and augmented image pairs, respectively, and conduct preference alignment with direct preference optimization. The approach rests on a core idea: a properly designed augmentation of the image input induces the VLM to generate false but hard-negative responses, which the model can learn from to produce more robust and powerful answers. The whole pipeline no longer hinges on supervision from GPT-4 or human involvement during alignment, and is highly efficient, requiring only a few lines of code. With only 8k randomly sampled unsupervised data, it achieves a 90% relative score to GPT-4 on complex reasoning in LLaVA-Bench, and improves LLaVA-7B/13B by 6.7%/5.6% on the complex multi-modal benchmark MM-Vet. Visualizations show its improved ability to align with user intentions. A series of ablations reveals the latent mechanism of the approach and indicates its potential for further scaling. Code is available at https://github.com/Kevinz-code/SeVa.
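To make the recipe concrete, below is a minimal sketch of the pipeline the abstract describes, under stated assumptions: `vlm_answer` is a hypothetical generate-and-decode helper (not from the paper's released code), and the specific `hard_augment` pipeline and `beta` value are illustrative choices rather than the paper's exact settings. The DPO objective itself is the standard one from Rafailov et al.

```python
# Sketch of a SeVa-style unsupervised preference pipeline.
# Assumptions: `vlm_answer(model, image, question)` is a hypothetical
# helper standing in for a real VLM's generation API; the augmentation
# below is a generic torchvision stand-in, not the paper's exact choice.

import torch
import torch.nn.functional as F
from torchvision import transforms

# A "hard" augmentation: strong enough that the VLM's answer to the
# augmented image drifts from its answer to the clean image.
hard_augment = transforms.Compose([
    transforms.RandomResizedCrop(336, scale=(0.3, 0.8)),
    transforms.GaussianBlur(kernel_size=21, sigma=(2.0, 5.0)),
])

def build_preference_pair(model, image, question):
    """Self-supervised pair: the clean-image answer is 'chosen'; the
    augmented-image answer serves as a hard-negative 'rejected' response."""
    chosen = vlm_answer(model, image, question)                  # hypothetical helper
    rejected = vlm_answer(model, hard_augment(image), question)  # hypothetical helper
    return chosen, rejected

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss. Each argument is a batch of per-sequence
    log-probabilities log pi(y | image, question), summed over the
    response tokens, under the policy or the frozen reference model."""
    logits = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    return -F.logsigmoid(logits).mean()

# Shape check with dummy log-probabilities (batch of 4 preference pairs):
if __name__ == "__main__":
    lp = [torch.randn(4) for _ in range(4)]
    print(dpo_loss(*lp).item())
```

Only the construction of the chosen/rejected pair differs from ordinary DPO training, which is consistent with the abstract's claim that the method adds just a few lines of code on top of an existing DPO setup.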
- Explicit knowledge integration for knowledge-aware visual question answering about named entities. In Proceedings of the 2023 ACM International Conference on Multimedia Retrieval, pages 29–38, 2023.
- Flamingo: A visual language model for few-shot learning. In Advances in Neural Information Processing Systems, 2022.
- Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073, 2022.
- ShareGPT4V: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023.
- A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, 2023.
- OpenCompass Contributors. OpenCompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
- InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In Advances in Neural Information Processing Systems, pages 49250–49267, 2023.
- DreamLLM: Synergistic multimodal comprehension and creation. In The Twelfth International Conference on Learning Representations, 2024.
- DTL: Disentangled transfer learning for visual recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 12082–12090, 2024.
- MART: Improving LLM safety with multi-round automatic red-teaming. arXiv preprint arXiv:2311.07689, 2023.
- Bootstrap your own latent: A new approach to self-supervised learning. In Advances in Neural Information Processing Systems, pages 21271–21284, 2020.
- Reinforced self-training (ReST) for language modeling. arXiv preprint arXiv:2308.08998, 2023.
- Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9729–9738, 2020.
- LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2021.
- GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.
- Mitigating object hallucinations in large vision-language models through visual contrastive decoding. arXiv preprint arXiv:2311.16922, 2023.
- SEED-Bench: Benchmarking multimodal LLMs with generative comprehension. arXiv preprint arXiv:2307.16125, 2023.
- Multi-modal preference alignment remedies regression of visual instruction tuning on language model. arXiv preprint arXiv:2402.10884, 2024.
- Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 292–305, 2023.
- LLaMA-VID: An image is worth 2 tokens in large language models. arXiv preprint arXiv:2311.17043, 2023.
- StableLLaVA: Enhanced visual instruction tuning with synthesized image-dialogue data. arXiv preprint arXiv:2308.10253, 2023.
- VILA: On pre-training for visual language models. arXiv preprint arXiv:2312.07533, 2023.
- Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
- Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
- MMBench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023.
- Learn to explain: Multimodal reasoning via thought chains for science question answering. In Advances in Neural Information Processing Systems, 2022.
- OCR-VQA: Visual question answering by reading text in images. In 2019 International Conference on Document Analysis and Recognition (ICDAR), pages 947–952. IEEE, 2019.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
- Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Towards VQA models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8317–8326, 2019.
- Learning from expert: Vision-language knowledge distillation for unsupervised cross-modal hashing retrieval. In Proceedings of the 2023 ACM International Conference on Multimedia Retrieval, pages 499–507, 2023.
- Aligning large multimodal models with factually augmented RLHF. arXiv preprint arXiv:2309.14525, 2023.
- To see is to believe: Prompting GPT-4V for better visual instruction tuning. arXiv preprint arXiv:2311.07574, 2023.
- Vary: Scaling up the vision vocabulary for large vision-language models. arXiv preprint arXiv:2312.06109, 2023.
- Merlin: Empowering multimodal LLMs with foresight minds. arXiv preprint arXiv:2312.00589, 2023.
- MM-Vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
- Video retrieval for everyday scenes with common objects. In Proceedings of the 2023 ACM International Conference on Multimedia Retrieval, pages 565–570, 2023.
- Direct preference optimization of video large multimodal models from language model reward. arXiv preprint arXiv:2404.01258, 2024.
- SVIT: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087, 2023.
- ChatSpot: Bootstrapping multimodal LLMs via precise referring instruction tuning. arXiv preprint arXiv:2307.09474, 2023.
- Beyond hallucinations: Enhancing LVLMs through hallucination-aware direct preference optimization. arXiv preprint arXiv:2311.16839, 2023.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36, 2024.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
- Multi-label self-supervised learning with scene images. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6694–6703, 2023.
- Coarse is better? A new pipeline towards self-supervised learning with uncurated images. arXiv preprint arXiv:2306.04244, 2023.
- Safety fine-tuning at (almost) no cost: A baseline for vision large language models. arXiv preprint arXiv:2402.02207, 2024.