RITUAL: Random Image Transformations as a Universal Anti-hallucination Lever in Large Vision Language Models (2405.17821v2)
Abstract: Recent advancements in Large Vision-Language Models (LVLMs) have revolutionized how machines understand and generate textual responses based on visual inputs, yet they often produce "hallucinatory" outputs that misinterpret visual information, posing challenges to reliability and trustworthiness. We propose RITUAL, a simple decoding method that reduces hallucinations by leveraging randomly transformed images as complementary inputs during decoding, adjusting the output probability distribution without additional training or external models. Our key insight is that random transformations expose the model to diverse visual perspectives, enabling it to correct misinterpretations that lead to hallucinations. Specifically, when a model hallucinates based on the original image, the transformed images -- altered in aspects such as orientation, scale, or color -- provide alternative viewpoints that help recalibrate the model's predictions. By integrating the probability distributions from both the original and transformed images, RITUAL effectively reduces hallucinations. To further improve reliability and address potential instability from arbitrary transformations, we introduce RITUAL+, an extension that selects image transformations based on self-feedback from the LVLM. Instead of applying transformations randomly, RITUAL+ uses the LVLM to evaluate and choose transformations that are most beneficial for reducing hallucinations in a given context. This self-adaptive approach mitigates the potential negative impact of certain transformations on specific tasks, ensuring more consistent performance across different scenarios. Experiments demonstrate that RITUAL and RITUAL+ significantly reduce hallucinations across several object hallucination benchmarks.
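As a rough illustration of the decoding scheme the abstract describes, the sketch below combines next-token distributions from the original image and a randomly transformed copy. It is a minimal sketch under stated assumptions, not the paper's implementation: the `model.logits` interface, the mixing weight `alpha`, the particular transformation set, and the log-space fusion rule are all hypothetical choices made for illustration.

```python
# Minimal sketch of a RITUAL-style decoding step: fuse predictions from the
# original image and a randomly transformed image at each token.
# Assumptions (not from the paper text): the model exposes per-step logits via
# `model.logits(image, prompt, generated)`, and the fusion is a simple
# log-space mixture weighted by a hypothetical `alpha`.
import random
import torch
import torchvision.transforms as T

TRANSFORMS = [                       # transformation types named in the abstract
    T.RandomRotation(degrees=30),    # orientation
    T.RandomResizedCrop(size=336),   # scale
    T.ColorJitter(0.4, 0.4, 0.4),    # color
    T.RandomHorizontalFlip(p=1.0),
]

def ritual_next_token(model, image, prompt, generated, alpha=0.3):
    """Pick the next token using both the original and a randomly transformed image."""
    transform = random.choice(TRANSFORMS)
    transformed = transform(image)

    # Hypothetical interface: per-step logits conditioned on an image and the text so far.
    logits_orig = model.logits(image, prompt, generated)
    logits_aug = model.logits(transformed, prompt, generated)

    # Combine the two predictive distributions (assumed fusion rule).
    log_probs = (
        torch.log_softmax(logits_orig, dim=-1)
        + alpha * torch.log_softmax(logits_aug, dim=-1)
    )
    return int(log_probs.argmax(dim=-1))
```

RITUAL+ would differ only in how the transformation is chosen: instead of `random.choice`, the LVLM itself is queried to rank candidate transformations and the most beneficial one is used for the given context.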
- Sangmin Woo
- Jaehyuk Jang
- Yubin Choi
- Changick Kim
- Donguk Kim