LightHouse: A Survey of AGI Hallucination (2401.06792v2)
Abstract: As artificial intelligence advances, large-scale models have become increasingly capable. However, numerous studies indicate that hallucinations in these large models remain a bottleneck for AI research. In the pursuit of artificial general intelligence (AGI), substantial research effort is being devoted to the study of hallucination. Prior work has explored hallucinations in LLMs, but for multimodal AGI, hallucination research is still at an early stage. To advance research on hallucinatory phenomena, we present a bird's-eye view of hallucination in AGI, summarizing current work on AGI hallucinations and proposing directions for future research.