AttributionBench: How Hard is Automatic Attribution Evaluation? (2402.15089v1)

Published 23 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Modern generative search engines enhance the reliability of LLM responses by providing cited evidence. However, evaluating the answer's attribution, i.e., whether every claim within the generated responses is fully supported by its cited evidence, remains an open problem. This verification, traditionally dependent on costly human evaluation, underscores the urgent need for automatic attribution evaluation methods. To bridge the gap in the absence of standardized benchmarks for these methods, we present AttributionBench, a comprehensive benchmark compiled from various existing attribution datasets. Our extensive experiments on AttributionBench reveal the challenges of automatic attribution evaluation, even for state-of-the-art LLMs. Specifically, our findings show that even a fine-tuned GPT-3.5 only achieves around 80% macro-F1 under a binary classification formulation. A detailed analysis of more than 300 error cases indicates that a majority of failures stem from the model's inability to process nuanced information and from the discrepancy between the information available to the model and that available to human annotators.
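The task formulation described in the abstract can be read as a plain binary classification problem: given a generated claim and its cited evidence, predict whether the evidence fully supports the claim, then score the predictor with macro-F1. The following is a minimal sketch of that setup; the example data, the `judge_attribution` heuristic, and the overlap threshold are illustrative assumptions, not the paper's actual models, prompts, or data.

```python
# Minimal sketch of binary attribution evaluation scored with macro-F1.
# All names and data here are hypothetical, for illustration only.
from sklearn.metrics import f1_score

# Each example pairs a generated claim with its cited evidence and a gold
# label (1 = fully supported / attributable, 0 = not attributable).
examples = [
    {"claim": "The Eiffel Tower is in Paris.",
     "evidence": "The Eiffel Tower is a landmark on the Champ de Mars in Paris.",
     "label": 1},
    {"claim": "The Eiffel Tower was completed in 1900.",
     "evidence": "The Eiffel Tower is a landmark on the Champ de Mars in Paris.",
     "label": 0},
]

def judge_attribution(claim: str, evidence: str) -> int:
    """Placeholder judge: in practice this would be a fine-tuned LLM or an
    NLI model deciding whether the evidence fully supports the claim."""
    # Naive lexical-overlap heuristic, purely for illustration.
    claim_tokens = set(claim.lower().split())
    evidence_tokens = set(evidence.lower().split())
    overlap = len(claim_tokens & evidence_tokens) / max(len(claim_tokens), 1)
    return int(overlap > 0.8)

gold = [ex["label"] for ex in examples]
pred = [judge_attribution(ex["claim"], ex["evidence"]) for ex in examples]

# Macro-F1 averages the per-class F1 of the "attributable" and
# "not attributable" classes, the metric reported in the abstract.
print("macro-F1:", f1_score(gold, pred, average="macro"))
```

In the benchmark's framing, the interesting question is how well an LLM-based judge performs in place of the toy heuristic above; the abstract reports that even a fine-tuned GPT-3.5 reaches only around 80% macro-F1.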

Authors (4)
  1. Yifei Li (92 papers)
  2. Xiang Yue (72 papers)
  3. Zeyi Liao (14 papers)
  4. Huan Sun (88 papers)
Citations (9)

