Can LLMs Evaluate Complex Attribution in QA? Automatic Benchmarking using Knowledge Graphs

Published 26 Jan 2024 in cs.CL (arXiv:2401.14640v2)

Abstract: Attributed Question Answering (AQA) has attracted wide attention, but evaluating attributions remains limited in several ways: existing approaches lack fine-grained attribution categories, rely on manual annotation, and fail to distinguish attributions with only subtle differences. To bridge these gaps, we introduce Complex Attributed Question Answering (CAQA), a large-scale benchmark that covers comprehensive attribution categories and complex attribution scenarios, generated automatically from Knowledge Graphs (KGs). We conduct extensive experiments to verify the effectiveness of CAQA, including benchmarking 25 automatic evaluators, comparing them with human evaluators, and testing LLM evaluators fine-tuned on CAQA. These experiments yield a series of findings that can benefit future research on AQA. All code and data are publicly available at https://github.com/HuuuNan/CAQA-Benchmark.
