Enabling Large Language Models to Generate Text with Citations (2305.14627v2)

Published 24 May 2023 in cs.CL, cs.IR, and cs.LG

Abstract: LLMs have emerged as a widely-used tool for information seeking, but their generated outputs are prone to hallucination. In this work, our aim is to allow LLMs to generate text with citations, improving their factual correctness and verifiability. Existing work mainly relies on commercial search engines and human evaluation, making it challenging to reproduce and compare different modeling approaches. We propose ALCE, the first benchmark for Automatic LLMs' Citation Evaluation. ALCE collects a diverse set of questions and retrieval corpora and requires building end-to-end systems to retrieve supporting evidence and generate answers with citations. We develop automatic metrics along three dimensions -- fluency, correctness, and citation quality -- and demonstrate their strong correlation with human judgements. Our experiments with state-of-the-art LLMs and novel prompting strategies show that current systems have considerable room for improvement -- For example, on the ELI5 dataset, even the best models lack complete citation support 50% of the time. Our analyses further highlight promising future directions, including developing better retrievers, advancing long-context LLMs, and improving the ability to synthesize information from multiple sources.

Summary

  • The paper introduces ALCE, a benchmark that evaluates LLMs’ ability to generate text with accurate citations, enhancing factual correctness.
  • It develops automatic metrics for fluency, correctness, and citation quality over datasets built from diverse sources such as Wikipedia and Reddit.
  • Experiments with models such as ChatGPT, GPT-4, and LLaMA highlight the critical role of retrieval quality and effective prompting strategies.

ALCE: Evaluating LLMs for Text Generation with Citations

The paper "Enabling LLMs to Generate Text with Citations" (2305.14627) introduces ALCE, a benchmark designed for automatically evaluating the ability of LLMs to generate text accompanied by relevant citations. It addresses the critical issue of hallucination in LLM-generated content by requiring models to provide citations for their claims, thereby enhancing factual correctness and verifiability. The authors compile a diverse set of datasets, develop automatic metrics, and conduct extensive experiments to assess the performance of state-of-the-art LLMs in this challenging task.

Task Formulation and Datasets

The ALCE benchmark formulates the task as follows: given a query $q$ and a corpus of text passages $\mathcal{D}$, a system must generate an output $\mathcal{S}$ comprising $n$ statements $s_1, \ldots, s_n$, where each statement $s_i$ cites a list of passages $\mathcal{C}_i = \{c_{i,1}, c_{i,2}, \ldots\}$ with $c_{i,j} \in \mathcal{D}$ (a code sketch of this interface follows the dataset list below). The benchmark is built upon three datasets: ASQA, QAMPARI, and ELI5.

  • ASQA: A long-form factoid dataset using the 2018-12-20 Wikipedia snapshot as the corpus.
  • QAMPARI: A factoid QA dataset, also constructed from Wikipedia, where the answer is a list of entities drawn from different passages.
  • ELI5: A long-form QA dataset based on the Reddit forum "Explain Like I'm Five," using Sphere, a filtered version of Common Crawl, as the corpus.
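
To make the input/output contract concrete, the sketch below lays out the task interface in Python; the data structures, field names, and `render` helper are hypothetical illustrations, not code from the benchmark.

```python
from dataclasses import dataclass, field

# Hypothetical data structures sketching the ALCE task interface: given a
# query q and passages drawn from a corpus D, the system returns statements
# s_1, ..., s_n, each citing a list of passages C_i from D.

@dataclass
class Passage:
    doc_id: str   # identifier of the passage within the corpus D
    title: str
    text: str

@dataclass
class Statement:
    text: str                                            # statement s_i
    citations: list[int] = field(default_factory=list)   # indices of the cited passages C_i

@dataclass
class AlceOutput:
    query: str
    passages: list[Passage]       # retrieved passages shown to the model
    statements: list[Statement]   # the generated answer S = (s_1, ..., s_n)

def render(output: AlceOutput) -> str:
    """Render statements with inline [k] markers, the citation format that
    the benchmark's evaluation parses out of model outputs."""
    parts = []
    for s in output.statements:
        marks = "".join(f"[{c}]" for c in s.citations)
        parts.append(f"{s.text} {marks}".strip())
    return " ".join(parts)
```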

Figure 1: The task setup of ALCE, showing how a system generates text with citations from a retrieval corpus in response to a question.

These datasets span a wide range of question types and corpora, making ALCE a comprehensive benchmark for evaluating LLMs' citation capabilities.

Automatic Evaluation Metrics

ALCE employs automatic evaluation metrics across three dimensions: fluency, correctness, and citation quality.

  • Fluency: Measured using MAUVE, which assesses whether the generated long-form text is fluent and coherent.
  • Correctness: Assessed using tailored metrics for each dataset, such as exact match recall for ASQA and claim recall for ELI5, ensuring the answer is accurate and covers all relevant aspects.
  • Citation Quality: Evaluated through citation recall and precision, using a natural language inference (NLI) model to verify whether cited passages support the generated statements. Citation recall determines if the output is entirely supported by cited passages, while citation precision identifies irrelevant citations.

The combination of these metrics ensures a robust evaluation, preventing systems from exploiting shortcuts.
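
As a rough illustration of how citation quality can be scored per statement, consider the sketch below; `entails` is a placeholder for the NLI model, and the precision rule is a simplified reading of the paper's definition rather than the released evaluation code.

```python
from typing import Callable, Sequence

# Sketch of per-statement citation recall/precision. `entails(premise, hypothesis)`
# stands in for the NLI model used to check whether cited passages support a
# statement; the precision rule below is a simplified reading of the paper's
# definition, not the released evaluation code.

def citation_recall(statement: str, cited: Sequence[str],
                    entails: Callable[[str, str], bool]) -> int:
    """1 if the concatenation of all cited passages entails the statement, else 0."""
    if not cited:
        return 0
    return int(entails(" ".join(cited), statement))

def citation_precision(statement: str, cited: Sequence[str],
                       entails: Callable[[str, str], bool]) -> float:
    """Fraction of citations not judged irrelevant. A citation is treated as
    irrelevant when it does not support the statement on its own and removing
    it leaves the support provided by the remaining citations unchanged."""
    if not cited:
        return 0.0
    full_support = bool(entails(" ".join(cited), statement))
    kept = 0
    for j, passage in enumerate(cited):
        alone = entails(passage, statement)
        rest = [p for i, p in enumerate(cited) if i != j]
        rest_support = bool(entails(" ".join(rest), statement)) if rest else False
        irrelevant = (not alone) and (rest_support == full_support)
        kept += 0 if irrelevant else 1
    return kept / len(cited)
```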

Modeling Approaches

The paper explores various modeling components for an ALCE system, including retrieval, synthesis, and post-editing.

  • Retrieval: Utilizes off-the-shelf dense retrievers like GTR and DPR for Wikipedia and BM25 for Sphere.
  • Synthesis: Focuses on prompting LLMs to synthesize and cite evidence, considering the limited context window of existing LLMs. Strategies include:
    • Vanilla prompting, where the model is provided with the top-$k$ passages and instructed to cite them accordingly (see the prompt sketch after this list).
    • Summarization and Snippet prompting, which use summaries or snippets of passages to allow for more passages to fit within the context window.
    • Interactive prompting schemes, such as InlineSearch, which allow the model to call "search" during the generation process.
    • Closed-book baselines where the model generates answers without accessing any retrieved documents.
  • Post-editing: Employs strategies such as reranking and post-hoc citation to refine the output.
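
For concreteness, a vanilla-prompting setup might be assembled along the lines of the sketch below; the instruction wording and the `llm.generate` call are illustrative placeholders, not the paper's exact prompt template.

```python
def build_vanilla_prompt(question: str, passages: list[dict], k: int = 5) -> str:
    """Place the top-k retrieved passages in the prompt with numbered headers
    and ask the model to cite them inline with [1][2]-style markers. The
    instruction wording here is illustrative, not the paper's exact template."""
    instruction = (
        "Instruction: Write an accurate and concise answer to the question "
        "using only the provided documents, and cite them with [1][2][3].\n\n"
    )
    docs = ""
    for i, p in enumerate(passages[:k], start=1):
        docs += f"Document [{i}] (Title: {p['title']}): {p['text']}\n"
    return instruction + docs + f"\nQuestion: {question}\nAnswer:"

# Usage with any completion-style API (the `llm.generate` call is a placeholder):
# prompt = build_vanilla_prompt(question, retrieved_passages, k=5)
# answer = llm.generate(prompt)
```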

Experimental Results

Experiments were conducted using state-of-the-art LLMs, including ChatGPT, GPT-4, and LLaMA, along with various prompting strategies. The key findings include:

  • Vanilla prompting achieves strong performance despite its simplicity.
  • Using summaries or snippets improves correctness but may compromise citation quality.
  • Retrieving text on the fly does not consistently improve performance.
  • Reranking boosts citation quality, as validated by human evaluation.
  • Closed-book models with post-hoc citation deliver strong correctness but poor citation quality.

The results also indicate that GPT-4 offers only limited improvement over ChatGPT overall, though it is better at utilizing long contexts.

Retrieval Analysis

The paper emphasizes the crucial role of retrieval quality, demonstrating that better retrievers lead to improved correctness and citation quality. However, the analysis reveals that LLMs often struggle to utilize accurate information present in the retrieved passages, indicating a limitation in their ability to synthesize information from multiple sources. Retrieval recall serves as an upper bound for model performance.
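
A simple string-match recall illustrates the upper-bound argument, in the spirit of (though not identical to) the paper's analysis: if a gold answer never appears in the top-k retrieved passages, a system that grounds every claim in retrieved evidence cannot produce and correctly cite it.

```python
def recall_at_k(gold_answers: list[str], retrieved_passages: list[str], k: int) -> float:
    """Fraction of gold short answers found verbatim in the top-k retrieved
    passages. If an answer never appears, a system that grounds every claim in
    retrieved evidence cannot produce and correctly cite it, so this recall
    caps achievable correctness."""
    if not gold_answers:
        return 0.0
    top_k_text = " ".join(retrieved_passages[:k]).lower()
    hits = sum(1 for ans in gold_answers if ans.lower() in top_k_text)
    return hits / len(gold_answers)
```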

Human Evaluation

To validate the automatic evaluation metrics, the authors conducted human evaluations, revealing a strong correlation between the automatic metrics and human judgments. This confirms the reliability and effectiveness of ALCE as a benchmark for evaluating LLMs' citation capabilities.
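
Such a validation typically reports a rank correlation between automatic scores and human ratings over the same outputs; a minimal sketch with hypothetical score lists might look as follows.

```python
from scipy.stats import spearmanr

# Hypothetical per-example scores: an automatic citation metric vs. human
# judgments of support, collected over the same set of generated answers.
automatic_scores = [0.9, 0.4, 0.7, 1.0, 0.2, 0.6]
human_scores = [1.0, 0.5, 0.5, 1.0, 0.0, 0.5]

rho, p_value = spearmanr(automatic_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```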

Implications and Future Directions

The ALCE benchmark highlights several promising research directions:

  • Enhancing retrieval and refining retrieval integration in LLMs.
  • Developing LLMs with longer context windows.
  • Advancing LLMs' ability to synthesize information from multiple sources.

These directions extend beyond the ALCE setup, with implications for numerous applications of LLMs.

Conclusion

ALCE provides a valuable framework for evaluating and improving the ability of LLMs to generate text with citations, addressing a critical need for factual correctness and verifiability in LLM-generated content. The benchmark's comprehensive evaluation metrics, diverse datasets, and insightful analyses offer a strong foundation for future research in this area.
