Enabling Large Language Models to Generate Text with Citations (2305.14627v2)

Published 24 May 2023 in cs.CL, cs.IR, and cs.LG

Abstract: LLMs have emerged as a widely-used tool for information seeking, but their generated outputs are prone to hallucination. In this work, our aim is to allow LLMs to generate text with citations, improving their factual correctness and verifiability. Existing work mainly relies on commercial search engines and human evaluation, making it challenging to reproduce and compare different modeling approaches. We propose ALCE, the first benchmark for Automatic LLMs' Citation Evaluation. ALCE collects a diverse set of questions and retrieval corpora and requires building end-to-end systems to retrieve supporting evidence and generate answers with citations. We develop automatic metrics along three dimensions -- fluency, correctness, and citation quality -- and demonstrate their strong correlation with human judgements. Our experiments with state-of-the-art LLMs and novel prompting strategies show that current systems have considerable room for improvement -- For example, on the ELI5 dataset, even the best models lack complete citation support 50% of the time. Our analyses further highlight promising future directions, including developing better retrievers, advancing long-context LLMs, and improving the ability to synthesize information from multiple sources.


Summary

The paper introduces ALCE, a benchmark for evaluating how well LLMs generate text with citations, with the aim of improving factual correctness and verifiability.
It defines automatic metrics for fluency, correctness, and citation quality, and shows that they correlate strongly with human judgment.
Experiments show that current systems leave considerable room for improvement, and the analyses point to better retrievers, long-context LLMs, and multi-source synthesis as promising directions.

Enabling LLMs to Generate Text with Citations

Introduction

LLMs are now widely used for information seeking and text generation. A persistent problem is that they can produce content that is not factually accurate, a failure known as "hallucination". One response is to require LLMs to generate text with citations, which aims to improve factual correctness and makes the output verifiable. ALCE (Automatic LLMs' Citation Evaluation) is a benchmark designed to evaluate how well LLMs generate text with citations.

The ALCE Benchmark

ALCE is presented as the first reproducible benchmark for automatically evaluating LLM text generation with citations. Each task instance pairs a natural-language question with a retrieval corpus, and a system must work end to end: retrieve relevant passages, generate a response, and cite the passages that support it. The benchmark defines automatic metrics along three dimensions (fluency, correctness, and citation quality) and shows that they correlate strongly with human judgment. It draws on three datasets, ASQA, QAMPARI, and ELI5, each of which poses different challenges.
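Evaluating citations first requires splitting a generated answer into statements and recovering which passages each statement cites. The following is a minimal sketch of that step, not the benchmark's own code. It assumes citations are written as bracketed passage numbers such as `[1]` placed before the sentence-ending punctuation, and the function names `split_statements` and `parse_statement` are hypothetical.

```python
import re

# Assumption: citations look like "[1]" and refer to retrieved passage numbers.
CITATION = re.compile(r"\[(\d+)\]")


def split_statements(answer):
    """Naively split an answer into sentences on ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [p for p in parts if p]


def parse_statement(statement):
    """Return (text without citation markers, list of cited passage numbers)."""
    ids = [int(m) for m in CITATION.findall(statement)]
    text = CITATION.sub("", statement)
    text = re.sub(r"\s+", " ", text).strip()
    # Remove the space left before the final punctuation mark, if any.
    text = re.sub(r"\s+([.!?])$", r"\1", text)
    return text, ids


answer = "Paris is the capital of France [1][2]. It hosts the Louvre [3]."
parsed = [parse_statement(s) for s in split_statements(answer)]
```

A real system would need a more robust sentence splitter, since abbreviations and decimal numbers defeat this regular expression.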

Evaluation Metrics

ALCE employs a multifaceted approach to evaluation, measuring:

Fluency through MAUVE, as a check that generated text is coherent and well-formed.
Correctness through tailored metrics such as EM recall for ASQA and claim recall for ELI5, assessing the accuracy and coverage of generated responses.
Citation Quality through metrics that evaluate both the recall and precision of citations in the generated text, assessed automatically using an NLI model.
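As an illustration of the correctness dimension, the sketch below computes a simplified exact-match recall: the fraction of gold short answers that appear verbatim in the generated response after light normalization. The normalization steps and the flat list of gold answers are assumptions made for this example, and the names `normalize` and `em_recall` are hypothetical. The benchmark's actual implementation may differ.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def em_recall(answer, gold_answers):
    """Fraction of gold short answers found as substrings of the answer."""
    if not gold_answers:
        return 0.0
    norm_answer = normalize(answer)
    hits = sum(1 for gold in gold_answers if normalize(gold) in norm_answer)
    return hits / len(gold_answers)


score = em_recall(
    "The film was released in 1994 and again in 2001.",
    ["1994", "2001", "2010"],
)
```

Substring matching is deliberately lenient: it rewards mentioning an answer anywhere in the response, so it measures coverage rather than whether the surrounding claim is true.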

Because these metrics correlate strongly with human judgment, gains on ALCE are intended to reflect real gains in the usefulness and reliability of LLM-generated content.
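The citation-quality metrics can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's exact definition: a statement counts toward recall when its cited passages jointly entail it, and a citation counts toward precision when the statement is supported and the citation either supports it alone or is needed by the remaining citations. The `entails` callable stands in for an NLI model. `toy_entails` is a hypothetical word-overlap placeholder used only so the example runs.

```python
import re
from typing import Callable, Dict, List, Tuple

Entails = Callable[[str, str], bool]  # (premise, hypothesis) -> bool


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Placeholder for an NLI model: true if every hypothesis word is in the premise."""
    words = lambda s: set(re.findall(r"\w+", s.lower()))
    return words(hypothesis) <= words(premise)


def citation_scores(
    statements: List[Tuple[str, List[int]]],
    passages: Dict[int, str],
    entails: Entails,
) -> Tuple[float, float]:
    """Return (citation recall, citation precision) under the simplified rules above."""
    supported = 0
    relevant = 0
    total_citations = 0
    for text, ids in statements:
        total_citations += len(ids)
        valid = [i for i in ids if i in passages]  # unknown ids are never relevant
        if not valid:
            continue
        joint = " ".join(passages[i] for i in valid)
        if not entails(joint, text):
            continue  # unsupported statement: none of its citations count
        supported += 1
        for i in valid:
            rest = " ".join(passages[j] for j in valid if j != i)
            alone = entails(passages[i], text)
            rest_ok = bool(rest) and entails(rest, text)
            if alone or not rest_ok:
                relevant += 1
    recall = supported / len(statements) if statements else 0.0
    precision = relevant / total_citations if total_citations else 0.0
    return recall, precision


passages = {
    1: "paris is the capital of france",
    2: "the louvre is in paris",
    3: "berlin is the capital of germany",
}
statements = [
    ("paris is the capital of france", [1, 3]),  # passage 3 is an irrelevant citation
    ("the louvre is in paris", [2]),
    ("rome is in italy", []),  # no citation, so it cannot be supported
]
recall, precision = citation_scores(statements, passages, toy_entails)
```

Passing the entailment check as a parameter keeps the scoring logic independent of any particular NLI model, so the placeholder can be swapped for a real one without changing `citation_scores`.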

Modeling Approaches and Future Directions

The paper explores several modeling strategies, from dense retrievers such as GTR for information retrieval to novel prompting strategies aimed at better synthesizing and citing information. All systems generated fluent responses, but correctness and citation quality lagged behind: on ELI5, even the best models lack complete citation support 50% of the time. Using summarization and snippet generation as intermediate steps improved correctness, which suggests that optimizing how information is synthesized for citation is a promising research direction.
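The retrieval step with a dense retriever amounts to ranking passages by the similarity between a query embedding and passage embeddings. The sketch below shows that ranking with cosine similarity over hand-written toy vectors. It does not call GTR or any other real model, and the embeddings, passage ids, and the function names `cosine` and `top_k` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, passage_vecs, k):
    """Return the ids of the k passages most similar to the query, best first."""
    ranked = sorted(
        passage_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [pid for pid, _ in ranked[:k]]


# Toy 2-dimensional embeddings; a real encoder would produce these vectors.
passage_vecs = {"a": [1.0, 0.0], "b": [0.8, 0.6], "c": [0.0, 1.0]}
retrieved = top_k([1.0, 0.1], passage_vecs, k=2)
```

The retrieved passages would then be placed in the prompt, either in full or after a summarization or snippet-extraction step as described above.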

The analysis also underscores the importance of stronger retrieval and of LLMs that can make effective use of longer contexts, marking both as important areas for future research. The results suggest that LLMs tailored toward better citation and better use of retrieved information could be considerably more useful in information-seeking tasks.

Conclusion

The ALCE benchmark is a step toward reliable and verifiable text generation with LLMs. Its evaluations show that current LLMs generate fluent and coherent responses but fall short on citation quality and factual correctness. This opens research directions in better retrieval integration, long-context LLMs, and methods for synthesizing information from multiple sources. As the field advances, ALCE can serve as a tool for measuring progress in informative and verifiable text generation.
