- The paper introduces ALCE, a benchmark that evaluates LLMs’ ability to generate text with accurate citations, enhancing factual correctness.
- It develops automatic metrics for fluency, correctness, and citation quality over three datasets (ASQA, QAMPARI, and ELI5) whose questions and corpora draw on Wikipedia, Reddit, and Common Crawl.
- Experiments with models such as ChatGPT, GPT-4, and LLaMA highlight the critical role of retrieval quality and effective prompting strategies.
ALCE: Evaluating LLMs for Text Generation with Citations
The paper "Enabling LLMs to Generate Text with Citations" (2305.14627) introduces ALCE, a benchmark designed for automatically evaluating the ability of LLMs to generate text accompanied by relevant citations. It addresses the critical issue of hallucination in LLM-generated content by requiring models to provide citations for their claims, thereby enhancing factual correctness and verifiability. The authors compile a diverse set of datasets, develop automatic metrics, and conduct extensive experiments to assess the performance of state-of-the-art LLMs in this challenging task.
The ALCE benchmark formulates the task as follows: given a query $q$ and a corpus of text passages $D$, a system must generate an output $S$ comprising $n$ statements $s_1, \dots, s_n$, where each statement $s_i$ cites a list of passages $C_i = \{c_{i,1}, c_{i,2}, \dots\}$ with $c_{i,j} \in D$. The benchmark is built upon three datasets: ASQA, QAMPARI, and ELI5.
- ASQA: A long-form factoid QA dataset built from ambiguous questions that require multiple short answers, using the 2018-12-20 Wikipedia snapshot as its corpus.
- QAMPARI: A factoid QA dataset, also constructed from Wikipedia, where the answer is a list of entities drawn from different passages.
- ELI5: A long-form QA dataset based on the Reddit forum "Explain Like I'm Five," using Sphere, a filtered version of Common Crawl, as the corpus.
Figure 1: The ALCE task setup, in which a system generates text with citations from a retrieval corpus in response to a question.
These datasets span a wide range of question types and corpora, making ALCE a comprehensive benchmark for evaluating LLMs' citation capabilities.
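To make the task formulation concrete, the output structure can be represented as in the minimal Python sketch below; the class and field names are illustrative rather than taken from the ALCE codebase.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """One generated statement s_i together with its citation list C_i."""
    text: str
    citations: list[int] = field(default_factory=list)  # indices c_{i,j} into the corpus D


@dataclass
class ALCEOutput:
    """A system output S = (s_1, ..., s_n) for a query q."""
    query: str
    statements: list[Statement]


# Example: an answer whose single statement cites passages 1 and 3 of the corpus.
output = ALCEOutput(
    query="When was the Eiffel Tower opened?",
    statements=[Statement("The Eiffel Tower opened in 1889 [1][3].", citations=[1, 3])],
)
```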
Automatic Evaluation Metrics
ALCE employs automatic evaluation metrics across three dimensions: fluency, correctness, and citation quality.
- Fluency: Measured with MAUVE, which compares the distribution of model outputs to human-written text to check that generations are fluent and coherent.
- Correctness: Assessed using tailored metrics for each dataset, such as exact match recall for ASQA and claim recall for ELI5, ensuring the answer is accurate and covers all relevant aspects.
- Citation Quality: Evaluated through citation recall and precision, using a natural language inference (NLI) model to verify whether cited passages support the generated statements. Citation recall checks whether each statement is fully supported by its cited passages, while citation precision penalizes citations that are irrelevant to the statement they accompany (a simplified sketch of this check follows below).
Combining these metrics yields a robust evaluation: a system cannot score well by exploiting a shortcut on one dimension (for example, copying retrieved passages verbatim might earn high citation scores but poor fluency and correctness).
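As a rough illustration of the NLI-based citation check, the sketch below scores citation recall and precision for a list of (statement, citation ids) pairs. The NLI model is passed in as a plain callable, the definitions are simplified relative to the paper's exact ones, and names such as `citation_scores` and `nli_entails` are illustrative.

```python
from typing import Callable, Sequence, Tuple


def citation_scores(
    statements: Sequence[Tuple[str, Sequence[int]]],
    passages: Sequence[str],
    nli_entails: Callable[[str, str], bool],
) -> Tuple[float, float]:
    """Simplified citation recall/precision.

    Recall: a statement counts as supported if the concatenation of its cited
    passages entails it. Precision: a citation counts as relevant if it alone
    entails the statement, or if dropping it breaks the entailment by the
    remaining citations. (The paper's exact definitions differ in details.)
    """
    supported = relevant = total_citations = 0
    for text, cited_ids in statements:
        cited = [passages[i] for i in cited_ids]
        fully_supported = bool(cited) and nli_entails(" ".join(cited), text)
        supported += fully_supported
        for i in cited_ids:
            total_citations += 1
            alone = nli_entails(passages[i], text)
            rest = [passages[j] for j in cited_ids if j != i]
            needed = fully_supported and not (rest and nli_entails(" ".join(rest), text))
            relevant += alone or needed
    recall = supported / max(len(statements), 1)
    precision = relevant / max(total_citations, 1)
    return recall, precision
```

In practice `nli_entails` would wrap an off-the-shelf entailment model; for a quick sanity check it can be stubbed with a toy heuristic such as `lambda premise, hypothesis: hypothesis in premise`.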
Modeling Approaches
The paper explores various modeling components for an ALCE system, including retrieval, synthesis, and post-editing.
- Retrieval: Relies on off-the-shelf retrievers, using dense models such as GTR and DPR for the Wikipedia corpus and BM25 for Sphere.
- Synthesis: Focuses on prompting LLMs to synthesize and cite evidence, considering the limited context window of existing LLMs. Strategies include:
- Vanilla prompting, where the model is given the top-k retrieved passages and instructed to cite them accordingly (see the prompt-construction sketch after this list).
- Summarization and Snippet prompting, which use summaries or extracted snippets of passages so that more passages fit within the context window.
- Interactive prompting schemes, such as InlineSearch, which allow the model to call "search" during the generation process.
- Closed-book baselines where the model generates answers without accessing any retrieved documents.
- Post-editing: Employs strategies such as reranking and post-hoc citation to refine the output.
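As a concrete example of the vanilla prompting strategy, the sketch below assembles a prompt from the top-k retrieved passages, numbered so the model can cite them as [1], [2], and so on. The wording is illustrative rather than the paper's exact prompt, and `generate` stands in for whatever LLM API is being called.

```python
def build_vanilla_prompt(question: str, passages: list[str]) -> str:
    """Number the passages and instruct the model to cite them in brackets."""
    lines = [
        "Write an accurate, concise answer to the question using only the "
        "provided documents, and cite every claim with the supporting "
        "document numbers in square brackets, e.g. [1][2].",
        "",
    ]
    for i, passage in enumerate(passages, start=1):
        lines.append(f"Document [{i}]: {passage}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


def answer_with_citations(question, passages, generate, k=5):
    """`generate` is an assumed text-in, text-out wrapper around an LLM call."""
    return generate(build_vanilla_prompt(question, passages[:k]))
```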
Experimental Results
Experiments were conducted using state-of-the-art LLMs, including ChatGPT, GPT-4, and LLaMA, along with various prompting strategies. The key findings include:
- Vanilla prompting achieves strong performance despite its simplicity.
- Using summaries or snippets improves correctness but may compromise citation quality.
- Retrieving text on the fly does not consistently improve performance.
- Reranking boosts citation quality, as validated by human evaluation.
- Closed-book models with post-hoc citation (sketched at the end of this section) deliver strong correctness but poor citation quality.
The results also indicate that GPT-4 yields only limited gains over ChatGPT overall, though it is better at making use of long contexts.
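To make the closed-book plus post-hoc citation baseline concrete, the sketch below takes an answer that was already generated without retrieval, finds the best-matching passage for each sentence, and appends it as a citation. This is a rough sketch; the `retrieve` and `split_sentences` helpers are assumed rather than taken from the ALCE codebase.

```python
def post_hoc_cite(answer: str, retrieve, split_sentences) -> str:
    """Attach citations to a closed-book answer after the fact.

    `split_sentences(text)` returns a list of sentences, and
    `retrieve(query, k)` returns a list of (passage_id, passage) pairs
    ranked by relevance; both are assumed helpers.
    """
    cited = []
    for sentence in split_sentences(answer):
        hits = retrieve(sentence, k=1)  # best-matching passage for this sentence
        if hits:
            passage_id, _ = hits[0]
            sentence = f"{sentence} [{passage_id}]"
        cited.append(sentence)
    return " ".join(cited)
```

This setup helps explain the trade-off noted above: the answer's content never depends on retrieval, so citation quality is only as good as the after-the-fact match between sentences and passages.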
Retrieval Analysis
The paper emphasizes the crucial role of retrieval quality, demonstrating that better retrievers lead to improved correctness and citation quality. However, the analysis reveals that LLMs often struggle to utilize accurate information present in the retrieved passages, indicating a limitation in their ability to synthesize information from multiple sources. In effect, retrieval recall acts as an upper bound on the correctness a system can achieve.
Human Evaluation
To validate the automatic evaluation metrics, the authors conducted human evaluations, revealing a strong correlation between the automatic metrics and human judgments. This confirms the reliability and effectiveness of ALCE as a benchmark for evaluating LLMs' citation capabilities.
Implications and Future Directions
The ALCE benchmark highlights several promising research directions:
- Enhancing retrieval and refining retrieval integration in LLMs.
- Developing LLMs with longer context windows.
- Advancing LLMs' ability to synthesize information from multiple sources.
These directions extend beyond the ALCE setup, with implications for numerous applications of LLMs.
Conclusion
ALCE provides a valuable framework for evaluating and improving the ability of LLMs to generate text with citations, addressing a critical need for factual correctness and verifiability in LLM-generated content. The benchmark's comprehensive evaluation metrics, diverse datasets, and insightful analyses offer a strong foundation for future research in this area.