Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models (2401.09861v1)
Abstract: Recent advancements in Multimodal LLMs (MLLMs) have significantly enhanced the comprehension of multimedia content, bringing together diverse modalities such as text, images, and videos. However, a critical challenge faced by these models, especially when processing video inputs, is the occurrence of hallucinations: erroneous perceptions or interpretations, particularly at the event level. This study introduces a method for addressing event-level hallucinations in MLLMs, focusing on temporal understanding of video content. Our approach leverages a novel framework that extracts event-specific information from both the event query and the provided video to refine MLLMs' responses. We propose a mechanism that decomposes on-demand event queries into iconic actions, and we then employ models such as CLIP and BLIP-2 to predict specific timestamps for event occurrences. Our evaluation, conducted on the Charades-STA dataset, demonstrates a significant reduction in temporal hallucinations and an improvement in the quality of event-related responses. This research not only provides a new perspective on a critical limitation of MLLMs but also contributes a quantitatively measurable method for evaluating MLLMs on temporal questions.
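The abstract describes a pipeline of decomposing an event query into iconic actions and grounding each action in time with vision-language models. The paper does not include code; the following minimal sketch illustrates only the frame-scoring step, assuming the Hugging Face `transformers` implementation of CLIP and pre-sampled PIL frames. The function `localize_action`, its signature, and the checkpoint name are illustrative assumptions, not the authors' implementation, and the query decomposition and BLIP-2 pass are omitted.

```python
# Hypothetical sketch: score sampled video frames against one decomposed
# "iconic action" phrase with CLIP and read off the best-matching timestamp.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def localize_action(frames, action_phrase, fps=1.0):
    """Return the timestamp (seconds) of the frame most similar to the phrase.

    frames: list of PIL.Image objects sampled from the video at `fps`
            frames per second.
    action_phrase: one iconic action, e.g. "a person opens the refrigerator".
    """
    inputs = processor(text=[action_phrase], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image has shape (num_frames, 1): frame-to-text similarity.
    scores = out.logits_per_image.squeeze(-1)
    best_frame = int(scores.argmax())
    return best_frame / fps, scores
```

In practice one would likely smooth or threshold the per-frame scores to recover a start-end interval rather than a single peak frame, and feed the predicted timestamps back into the MLLM prompt to constrain its answer.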
- “Learning transferable visual models from natural language supervision,” in Proceedings of the International Conference on Machine Learning, 2021, pp. 8748–8763.
- “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in Proceedings of the International Conference on Machine Learning, 2022, pp. 12888–12900.
- “BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in Proceedings of the International Conference on Machine Learning, 2023, pp. 19730–19742.
- “Language models are few-shot learners,” in Annual Conference on Neural Information Processing Systems, NeurIPS, 2020, pp. 1877–1901.
- “Training language models to follow instructions with human feedback,” in Annual Conference on Neural Information Processing Systems, NeurIPS, 2022, pp. 27730–27744.
- “Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality,” https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
- “LLaMA: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
- “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
- “MiniGPT-4: Enhancing vision-language understanding with advanced large language models,” arXiv preprint arXiv:2304.10592, 2023.
- “MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning,” arXiv preprint arXiv:2310.09478, 2023.
- “Visual instruction tuning,” arXiv preprint arXiv:2304.08485, 2023.
- “Improved baselines with visual instruction tuning,” arXiv preprint arXiv:2310.03744, 2023.
- “Video-LLaMA: An instruction-tuned audio-visual language model for video understanding,” arXiv preprint arXiv:2306.02858, 2023.
- “A survey of hallucination in large foundation models,” arXiv preprint arXiv:2309.05922, 2023.
- “Siren’s song in the AI ocean: A survey on hallucination in large language models,” arXiv preprint arXiv:2309.01219, 2023.
- “Woodpecker: Hallucination correction for multimodal large language models,” arXiv preprint arXiv:2310.16045, 2023.
- “Evaluating object hallucination in large vision-language models,” arXiv preprint arXiv:2305.10355, 2023.
- “A survey on multimodal large language models,” arXiv preprint arXiv:2306.13549, 2023.
- “Chain-of-verification reduces hallucination in large language models,” arXiv preprint arXiv:2309.11495, 2023.
- “Retrieval augmentation reduces hallucination in conversation,” in Findings of the Association for Computational Linguistics: EMNLP, 2021, pp. 3784–3803.
- “FacTool: Factuality detection in generative AI - a tool-augmented framework for multi-task and multi-domain scenarios,” arXiv preprint arXiv:2307.13528, 2023.
- “Verify-and-edit: A knowledge-enhanced chain-of-thought framework,” arXiv preprint arXiv:2305.03268, 2023.
- “Test-time distribution normalization for contrastively learned visual-language models,” in Annual Conference on Neural Information Processing Systems, NeurIPS, 2023.
- “TALL: Temporal activity localization via language query,” in Proceedings of the IEEE International Conference on Computer Vision, ICCV, 2017, pp. 5267–5275.
- Li Sun
- Liuan Wang
- Jun Sun
- Takayuki Okatani