Emergent Mind

Enabling Large Language Models to Generate Text with Citations

Published May 24, 2023 in cs.CL, cs.IR, and cs.LG


Large language models (LLMs) have emerged as a widely used tool for information seeking, but their generated outputs are prone to hallucination. In this work, our aim is to allow LLMs to generate text with citations, improving their factual correctness and verifiability. Existing work mainly relies on commercial search engines and human evaluation, making it challenging to reproduce and compare different modeling approaches. We propose ALCE, the first benchmark for Automatic LLMs' Citation Evaluation. ALCE collects a diverse set of questions and retrieval corpora and requires building end-to-end systems to retrieve supporting evidence and generate answers with citations. We develop automatic metrics along three dimensions -- fluency, correctness, and citation quality -- and demonstrate their strong correlation with human judgements. Our experiments with state-of-the-art LLMs and novel prompting strategies show that current systems have considerable room for improvement. For example, on the ELI5 dataset, even the best models lack complete citation support 50% of the time. Our analyses further highlight promising future directions, including developing better retrievers, advancing long-context LLMs, and improving the ability to synthesize information from multiple sources.
Figure: ALCE setup. Given a question, the system generates text and cites the passages that support it.


  • The paper introduces a new paradigm that requires LLMs to generate text with citations, addressing the challenge of factual accuracy through the ALCE benchmark.

  • ALCE is a novel benchmark designed to evaluate LLMs on text generation with citations using metrics that assess fluency, correctness, and citation quality, showing a strong correlation with human judgment.

  • Various modeling strategies, including dense retrievers and novel prompting strategies, were explored to improve correctness and citation quality; the results indicate substantial room for improvement.

  • The ALCE benchmark marks a significant step towards generating reliable and verifiable text by LLMs, highlighting the need for advancements in information retrieval, long-context understanding, and synthesis from multiple sources.

Enabling LLMs to Generate Text with Citations


LLMs are used for a range of tasks, including information seeking and text generation. A persistent challenge is their tendency to generate content that is not factually accurate, known as "hallucination". One response is to require LLMs to generate text with citations, which aims to improve factual correctness and makes the generated output verifiable. In light of this, the paper introduces ALCE (Automatic LLMs' Citation Evaluation), a benchmark designed to evaluate how well LLMs generate text with citations.

The ALCE Benchmark

ALCE is presented as the first reproducible benchmark for automatically evaluating LLM text generation with citations. Each task instance pairs a natural-language question with a retrieval corpus and requires an end-to-end system that retrieves relevant passages, generates a response, and cites the passages that support it. The benchmark provides automatic metrics along three dimensions -- fluency, correctness, and citation quality -- and these metrics correlate strongly with human judgment. It covers the datasets ASQA, QAMPARI, and ELI5, each of which poses its own challenges and stresses the importance of citations in generated text.
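To make the task output concrete, the following is a minimal Python sketch of how a cited response could be split into statements with their citations. It assumes citations are written as bracketed, one-based numbers such as [1][3] placed inside each sentence; that format, the `Statement` class, and the `split_statements` helper are illustrative assumptions rather than details stated in this summary.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Statement:
    text: str             # sentence text with citation markers removed
    citations: List[int]  # zero-based indices into the retrieved passage list


# Assumed citation format: one-based numbers in square brackets, e.g. "[2]".
_CITE = re.compile(r"\[(\d+)\]")


def split_statements(response: str) -> List[Statement]:
    """Split a response into sentences and collect each sentence's citations."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", response.strip())
    statements = []
    for part in parts:
        if not part:
            continue
        cited = sorted({int(n) - 1 for n in _CITE.findall(part)})
        text = _CITE.sub("", part)
        text = re.sub(r"\s+([.,;!?])", r"\1", text)  # drop space left before punctuation
        text = " ".join(text.split())
        statements.append(Statement(text, cited))
    return statements


example = ("Paris is the capital of France [1][3]. "
           "It hosts the Eiffel Tower [2]. Many tourists visit.")
for s in split_statements(example):
    print(s.citations, s.text)
```

A sentence with no bracketed markers ends up with an empty citation list, which is the case a citation-quality metric would need to penalize.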

Evaluation Metrics

ALCE employs a multifaceted approach to evaluation, measuring:

  • Fluency through MAUVE, which compares generated text with human-written text as a check that outputs are coherent and well-formed.
  • Correctness through tailored metrics such as EM recall for ASQA and claim recall for ELI5, assessing the accuracy and coverage of generated responses.
  • Citation Quality through metrics that evaluate both the recall and precision of citations in the generated text, assessed automatically using an NLI model.
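As a rough illustration of the correctness bullet, EM recall can be read as the fraction of gold short answers that appear verbatim in the generated text. The sketch below is a minimal Python version under that reading; the normalization rules, the `em_recall` name, and the alias-list input format are assumptions made for illustration, not details taken from the benchmark's code.

```python
import re
import string
from typing import List


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def em_recall(generation: str, gold_answers: List[List[str]]) -> float:
    """Fraction of gold answers found in the generation.

    gold_answers holds one list of acceptable aliases per expected answer;
    an answer counts as recalled if any alias occurs as a substring.
    """
    if not gold_answers:
        return 0.0
    norm_gen = normalize(generation)
    hits = sum(
        any(normalize(alias) in norm_gen for alias in aliases)
        for aliases in gold_answers
    )
    return hits / len(gold_answers)


generation = "The film was released in 1994 in the U.S. and in 1995 in Japan."
print(em_recall(generation, [["1994"], ["1995"], ["1996"]]))  # two of three answers found
```

Substring matching is deliberately lenient: it rewards coverage of the expected answers but says nothing about whether the surrounding text is supported, which is why citation quality is scored separately.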

This evaluation framework is intended to make improvements on the ALCE benchmark reflect gains in the practical utility and reliability of LLM-generated content.
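Citation recall and precision can be sketched in the same spirit. The version below assumes statement-level definitions: a statement counts toward recall if its cited passages, taken together, entail it, and a citation counts toward precision if its statement is supported and the citation either supports the statement on its own or cannot be dropped without losing support. These definitions are one plausible reading, and `toy_entails` is a crude word-overlap stand-in for a real NLI model, used only so the example runs.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: entails(premise, hypothesis) -> bool.
Entails = Callable[[str, str], bool]


def toy_entails(premise: str, hypothesis: str) -> bool:
    """Stand-in for an NLI model: every hypothesis word occurs in the premise."""
    premise_words = set(premise.lower().split())
    return all(word in premise_words for word in hypothesis.lower().split())


def citation_scores(
    statements: List[str],
    citations: List[List[int]],  # per statement, indices into passages
    passages: List[str],
    entails: Entails,
) -> Tuple[float, float]:
    """Return (citation recall, citation precision) under the assumed definitions."""
    supported = 0
    relevant = 0
    total_citations = 0
    for statement, cited in zip(statements, citations):
        joint = " ".join(passages[i] for i in cited)
        fully_supported = bool(cited) and entails(joint, statement)
        supported += fully_supported
        for i in cited:
            total_citations += 1
            if not fully_supported:
                continue  # citations on unsupported statements earn no credit
            alone = entails(passages[i], statement)
            rest = " ".join(passages[j] for j in cited if j != i)
            still_supported = bool(rest) and entails(rest, statement)
            if alone or not still_supported:
                relevant += 1
    recall = supported / len(statements) if statements else 0.0
    precision = relevant / total_citations if total_citations else 0.0
    return recall, precision


passages = [
    "paris is the capital of france",
    "berlin is the capital of germany",
    "the eiffel tower is in paris",
]
statements = ["paris is the capital of france", "berlin is in germany"]
citations = [[0, 2], [1]]
print(citation_scores(statements, citations, passages, toy_entails))
```

In this toy run the first statement is supported but carries one superfluous citation, and the second statement is not supported by its citation, so recall is 1/2 and precision is 1/3.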

Modeling Approaches and Future Directions

Various modeling strategies were explored, from dense retrievers such as GTR for information retrieval to novel prompting strategies aimed at better synthesizing and citing information. While all systems generated fluent responses, there was a clear gap in correctness and citation quality, leaving substantial room for improvement. Notably, using summarization and snippet generation as intermediate steps improved correctness, which suggests that optimizing information synthesis for citation is a promising direction for future research.

The exploration also underscored the importance of stronger retrieval and of LLMs that can make effective use of longer contexts, marking both as essential areas for future research. The results suggest that LLMs tailored toward better citation and better use of retrieved information could be considerably more useful in information-seeking tasks.
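The retrieve-then-prompt pattern discussed in this section can be sketched as follows. This is a minimal Python illustration, not the paper's pipeline: the embeddings are toy vectors standing in for the output of a dense encoder such as GTR, and the prompt wording, `top_k`, and `build_prompt` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k(query_vec: Sequence[float],
          passage_vecs: Sequence[Sequence[float]],
          k: int) -> List[int]:
    """Indices of the k passages most similar to the query, best first."""
    ranked = sorted(
        range(len(passage_vecs)),
        key=lambda i: cosine(query_vec, passage_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, passages: List[str]) -> str:
    """Number the retrieved passages and ask for an answer with citations."""
    lines = [
        "Answer the question using only the documents below.",
        "Cite supporting documents with bracketed numbers such as [1].",
        "",
    ]
    for n, passage in enumerate(passages, start=1):
        lines.append(f"Document [{n}]: {passage}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


corpus = ["Cats are mammals.", "Paris is in France.", "France is in Europe."]
vectors = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]  # toy embeddings, not real encoder output
best = top_k([1.0, 0.0], vectors, k=2)
print(build_prompt("Where is Paris?", [corpus[i] for i in best]))
```

The number of passages that fit into the prompt is bounded by the model's context window, which is one reason longer-context models and intermediate summarization steps are relevant here.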


The ALCE benchmark represents a significant step toward generating reliable and verifiable text with LLMs. Its evaluations show that current LLMs generate fluent and coherent responses, and they also point to critical areas requiring improvement, such as citation quality and factual correctness. This opens up research directions aimed at better integration of information retrieval, long-context LLMs, and methods for synthesizing information from multiple sources. As the field advances, ALCE can serve as a tool for measuring progress in the use of LLMs for informative and verifiable text generation.
