
Towards Verifiable Generation: A Benchmark for Knowledge-aware Language Model Attribution (2310.05634v2)

Published 9 Oct 2023 in cs.CL

Abstract: Despite their great success, LLMs often suffer from unreliable hallucinations. Although attribution is a potential solution, there are no suitable benchmarks or evaluation metrics for attributing LLMs to structured knowledge. In this paper, we define a new task, Knowledge-aware LLM Attribution (KaLMA), that addresses three core concerns with conventional attributed LMs. First, we extend the attribution source from unstructured text to Knowledge Graphs (KGs), whose rich structure benefits both attribution performance and working scenarios. Second, we propose a new "Conscious Incompetence" setting that accounts for an incomplete knowledge repository: the model must identify when an answer requires supporting knowledge beyond the provided KG. Third, we propose a comprehensive automatic evaluation metric covering text quality, citation quality, and text-citation alignment. To implement these innovations, we build BioKaLMA, a dataset in the biography domain, constructed via an evolutionary question generation strategy that controls question complexity and the knowledge required to answer. For evaluation, we develop a baseline solution and show substantial room for improvement in LLMs' citation generation, underscoring the importance of the "Conscious Incompetence" setting and the critical role of retrieval accuracy.

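To make the citation-quality part of the evaluation concrete, below is a minimal sketch of how precision and recall over cited KG triples might be computed per answer sentence, with a literal "NA" marker standing in for the paper's "Conscious Incompetence" flag. The function, triple IDs, and marker convention are illustrative assumptions, not the authors' actual implementation or data format.

```python
# Hypothetical sketch of KaLMA-style citation scoring. Each answer sentence
# is represented by the set of KG triple IDs it cites; the marker "NA"
# (assumed here) flags a sentence that needs knowledge missing from the KG,
# mirroring the "Conscious Incompetence" setting.

def citation_scores(predicted, gold):
    """Micro-averaged citation precision and recall over answer sentences.

    predicted, gold: lists of sets of cited triple IDs, one set per sentence.
    """
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        tp += len(pred & ref)   # citations that match the reference
        fp += len(pred - ref)   # spurious citations
        fn += len(ref - pred)   # reference citations the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Example: the second sentence correctly flags missing knowledge with "NA".
pred = [{"Q937_birthplace"}, {"NA"}]
gold = [{"Q937_birthplace", "Q937_occupation"}, {"NA"}]
print(citation_scores(pred, gold))  # (1.0, 0.666...)
```

A full KaLMA evaluation would also score text quality and text-citation alignment (e.g., whether each cited triple actually entails its sentence); this sketch covers only the set-matching component.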
Authors (5)
  1. Xinze Li (34 papers)
  2. Liangming Pan (59 papers)
  3. Yubo Ma (22 papers)
  4. Aixin Sun (99 papers)
  5. Yixin Cao (138 papers)
Citations (19)
