
WebCiteS: Attributed Query-Focused Summarization on Chinese Web Search Results with Citations (2403.01774v2)

Published 4 Mar 2024 in cs.CL

Abstract: Enhancing attribution in LLMs is a crucial task. One feasible approach is to enable LLMs to cite external sources that support their generations. However, existing datasets and evaluation methods in this domain still exhibit notable limitations. In this work, we formulate the task of attributed query-focused summarization (AQFS) and present WebCiteS, a Chinese dataset featuring 7k human-annotated summaries with citations. WebCiteS derives from real-world user queries and web search results, offering a valuable resource for model training and evaluation. Prior works in attribution evaluation do not differentiate between groundedness errors and citation errors. They also fall short in automatically verifying sentences that draw partial support from multiple sources. We tackle these issues by developing detailed metrics and enabling the automatic evaluator to decompose sentences into sub-claims for fine-grained verification. Our comprehensive evaluation of both open-source and proprietary models on WebCiteS highlights the challenge LLMs face in correctly citing sources, underscoring the necessity for further improvement. The dataset and code will be open-sourced to facilitate further research in this crucial field.
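The abstract's evaluation idea can be illustrated with a minimal sketch: decompose each summary sentence into sub-claims, then check every sub-claim against the union of its cited sources, so a sentence drawing partial support from multiple sources can still be verified claim-by-claim. All names and logic below are illustrative assumptions, not the paper's actual evaluator; in particular, the toy `decompose` and substring-based `entails` stand in for the LLM-based decomposition and NLI-style verification the paper describes.

```python
def decompose(sentence):
    """Toy sub-claim decomposition: split on ' and '.
    (The paper uses an automatic evaluator for this; this split is
    only a placeholder.)"""
    return [c.strip() for c in sentence.split(" and ") if c.strip()]


def entails(source_text, claim):
    """Stand-in for an entailment/NLI model: naive substring check."""
    return claim.lower() in source_text.lower()


def verify_sentence(sentence, cited_ids, sources):
    """Return the fraction of sub-claims supported by the cited sources.

    A sub-claim counts as supported if ANY cited source entails it, so
    a sentence whose parts come from different sources is not unfairly
    penalized -- the fine-grained verification the abstract motivates."""
    claims = decompose(sentence)
    if not claims:
        return 1.0
    supported = sum(
        any(entails(sources[i], claim) for i in cited_ids)
        for claim in claims
    )
    return supported / len(claims)


sources = {
    1: "Green tea contains caffeine.",
    2: "Green tea is rich in antioxidants.",
}
# A sentence drawing partial support from two different sources:
score = verify_sentence(
    "green tea contains caffeine and is rich in antioxidants",
    [1, 2],
    sources,
)
# Sentence-level verification succeeds (score 1.0) even though neither
# single source supports the whole sentence.
```

Verifying the whole sentence against each source individually would fail here, which is exactly the limitation of prior attribution evaluators that the abstract points out.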

Authors (9)
  1. Haolin Deng
  2. Chang Wang
  3. Xin Li
  4. Dezhang Yuan
  5. Junlang Zhan
  6. Tianhua Zhou
  7. Jin Ma
  8. Jun Gao
  9. Ruifeng Xu
