Temporal Insight Enhancement: Mitigating Temporal Hallucination in Multimodal Large Language Models (2401.09861v1)

Published 18 Jan 2024 in cs.CV and cs.AI

Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced the comprehension of multimedia content, bringing together diverse modalities such as text, images, and videos. However, a critical challenge faced by these models, especially when processing video inputs, is the occurrence of hallucinations: erroneous perceptions or interpretations, particularly at the event level. This study introduces a method to address event-level hallucinations in MLLMs, focused on temporal understanding of video content. Our approach leverages a framework that extracts and uses event-specific information from both the event query and the provided video to refine the MLLM's responses. We propose a mechanism that decomposes on-demand event queries into iconic actions, then employs models such as CLIP and BLIP2 to predict specific timestamps for event occurrences. Our evaluation, conducted on the Charades-STA dataset, demonstrates a significant reduction in temporal hallucinations and an improvement in the quality of event-related responses. This research not only offers a new perspective on a critical limitation of MLLMs but also contributes a quantitatively measurable method for evaluating MLLMs on temporal questions.
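
To make the described pipeline concrete, the sketch below illustrates the timestamp-prediction step: uniformly sampled video frames are scored against one decomposed "iconic action" phrase with CLIP, and the highest-scoring frame times are returned as candidate event timestamps. This is a minimal illustration assuming the Hugging Face transformers CLIP checkpoint and uniform frame sampling; the function name ground_action and the top-k selection heuristic are hypothetical, not the authors' released code.

```python
# Minimal sketch: ground one iconic-action phrase in a video via CLIP
# frame-text similarity. Assumes frames were sampled uniformly at `fps`
# frames per second; model choice and top-k heuristic are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def ground_action(frames: list[Image.Image], fps: float, action: str, top_k: int = 3):
    """Return (timestamp_seconds, similarity) for the top_k frames matching `action`."""
    inputs = processor(text=[action], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_image has shape (num_frames, 1): similarity of each frame
    # to the single action phrase.
    scores = out.logits_per_image.squeeze(-1)
    best = torch.topk(scores, k=min(top_k, len(frames)))
    return [(i.item() / fps, s.item()) for i, s in zip(best.indices, best.values)]
```

In the full method as summarized above, each iconic action obtained by decomposing the event query would be grounded this way, and the predicted timestamps would be fed back to the MLLM to refine its event-related answer.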

References (24)
  1. “Learning transferable visual models from natural language supervision,” in Proceedings of the International Conference on Machine Learning, 2021, pp. 8748–8763.
  2. “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in Proceedings of the International Conference on Machine Learning, 2022, pp. 12888–12900.
  3. “BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in Proceedings of the International Conference on Machine Learning, 2023, pp. 19730–19742.
  4. “Language models are few-shot learners,” in Annual Conference on Neural Information Processing Systems, NeurIPS, 2020, pp. 1877–1901.
  5. “Training language models to follow instructions with human feedback,” in Annual Conference on Neural Information Processing Systems, NeurIPS, 2022, pp. 27730–27744.
  6. “Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality,” see https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
  7. “LLaMA: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
  8. “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
  9. “MiniGPT-4: Enhancing vision-language understanding with advanced large language models,” arXiv preprint arXiv:2304.10592, 2023.
  10. “MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning,” arXiv preprint arXiv:2310.09478, 2023.
  11. “Visual instruction tuning,” arXiv preprint arXiv:2304.08485, 2023.
  12. “Improved baselines with visual instruction tuning,” arXiv preprint arXiv:2310.03744, 2023.
  13. “Video-LLaMA: An instruction-tuned audio-visual language model for video understanding,” arXiv preprint arXiv:2306.02858, 2023.
  14. “A survey of hallucination in large foundation models,” arXiv preprint arXiv:2309.05922, 2023.
  15. “Siren’s song in the AI ocean: A survey on hallucination in large language models,” arXiv preprint arXiv:2309.01219, 2023.
  16. “Woodpecker: Hallucination correction for multimodal large language models,” arXiv preprint arXiv:2310.16045, 2023.
  17. “Evaluating object hallucination in large vision-language models,” arXiv preprint arXiv:2305.10355, 2023.
  18. “A survey on multimodal large language models,” arXiv preprint arXiv:2306.13549, 2023.
  19. “Chain-of-verification reduces hallucination in large language models,” arXiv preprint arXiv:2309.11495, 2023.
  20. “Retrieval augmentation reduces hallucination in conversation,” in Findings of the Association for Computational Linguistics: EMNLP, 2021, pp. 3784–3803.
  21. “FacTool: Factuality detection in generative AI: a tool-augmented framework for multi-task and multi-domain scenarios,” arXiv preprint arXiv:2307.13528, 2023.
  22. “Verify-and-edit: A knowledge-enhanced chain-of-thought framework,” arXiv preprint arXiv:2305.03268, 2023.
  23. “Test-time distribution normalization for contrastively learned visual-language models,” in Annual Conference on Neural Information Processing Systems, NeurIPS, 2023.
  24. “TALL: Temporal activity localization via language query,” in Proceedings of the IEEE International Conference on Computer Vision, ICCV, 2017, pp. 5267–5275.
Authors (4)
  1. Li Sun
  2. Liuan Wang
  3. Jun Sun
  4. Takayuki Okatani