Benchmarking Large Language Models in Complex Question Answering Attribution using Knowledge Graphs (2401.14640v1)
Abstract: Attribution in question answering means providing citations that support generated statements, and it has attracted wide research attention. Current methods for automatically evaluating attribution, which are often based on LLMs, remain inadequate, particularly at recognizing subtle differences between attributions and complex relationships between citations and statements. To compare these attribution evaluation methods and develop new ones, we introduce a set of fine-grained categories (supportive, insufficient, contradictory, and irrelevant) for measuring attribution, and we develop a Complex Attributed Question Answering (CAQA) benchmark that leverages knowledge graphs (KGs) to automatically generate attributions of different categories for question-answer pairs. Our analysis reveals that existing evaluators perform poorly under fine-grained attribution settings and exhibit weaknesses in complex citation-statement reasoning. Our CAQA benchmark, validated with human annotations, emerges as a promising tool for selecting and developing LLM attribution evaluators.
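As a rough illustration of what evaluating under the fine-grained setting could look like, the sketch below encodes the four categories named in the abstract and scores an evaluator's predictions per category. The `Attribution` enum, the `per_category_f1` helper, and the choice of F1 as the metric are assumptions made for this example; they are not taken from the CAQA benchmark's code or data format.

```python
from collections import Counter
from enum import Enum


class Attribution(Enum):
    """The four fine-grained attribution categories named in the abstract."""
    SUPPORTIVE = "supportive"
    INSUFFICIENT = "insufficient"
    CONTRADICTORY = "contradictory"
    IRRELEVANT = "irrelevant"


def per_category_f1(gold, predicted):
    """Return {category: F1} for parallel lists of gold and predicted labels.

    F1 is an assumed metric for this sketch, not necessarily the benchmark's.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted category gains a false positive
            fn[g] += 1  # gold category gains a false negative
    scores = {}
    for cat in Attribution:
        denom = 2 * tp[cat] + fp[cat] + fn[cat]
        scores[cat] = 2 * tp[cat] / denom if denom else 0.0
    return scores


if __name__ == "__main__":
    # Toy labels, invented purely for illustration.
    gold = [
        Attribution.SUPPORTIVE,
        Attribution.SUPPORTIVE,
        Attribution.INSUFFICIENT,
        Attribution.CONTRADICTORY,
        Attribution.IRRELEVANT,
    ]
    predicted = [
        Attribution.SUPPORTIVE,
        Attribution.INSUFFICIENT,  # a subtle supportive/insufficient confusion
        Attribution.INSUFFICIENT,
        Attribution.CONTRADICTORY,
        Attribution.IRRELEVANT,
    ]
    for category, score in per_category_f1(gold, predicted).items():
        print(f"{category.value}: {score:.2f}")
```

Reporting a score per category, rather than a single accuracy figure, makes it visible when an evaluator handles clear-cut cases well but confuses neighbouring categories such as supportive and insufficient.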
Authors:
- Nan Hu
- Jiaoyan Chen
- Yike Wu
- Guilin Qi
- Sheng Bi
- Tongtong Wu
- Jeff Z. Pan