Enabling LLMs to Generate Text with Citations
Introduction
LLMs are now applied to a wide range of use cases, including information retrieval and text generation. A persistent challenge, however, is their tendency to generate content that is not factually accurate, commonly termed "hallucination". To address this, a new paradigm has emerged that requires LLMs to generate text with citations, which both improves factual correctness and makes the generated output verifiable. Building on this development, the ALCE benchmark (Automatic LLMs' Citation Evaluation) was designed to evaluate how effectively LLMs generate text with citations, marking a substantial advance in the field.
The ALCE Benchmark
ALCE is the first reproducible benchmark for automatically evaluating LLM text generation with an emphasis on citation quality. It introduces a task in which a system receives a natural-language question and a retrieval corpus and must, end to end, retrieve relevant passages, generate a response, and cite the passages that support it. The benchmark defines automatic metrics along three dimensions -- fluency, correctness, and citation quality -- that correlate strongly with human judgment. It covers three datasets -- ASQA (ambiguous factoid questions), QAMPARI (questions whose answers are lists of entities), and ELI5 (long-form how/why/what questions) -- each posing distinct challenges and underscoring the importance of citation in generated text.
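To make the task concrete, a single instance pairs a question with a pool of retrieved passages, and the system's answer interleaves statements with bracketed citations of the passages it relies on. The sketch below is purely illustrative; the field names and exact citation format are assumptions rather than the benchmark's precise schema.

```python
# Illustrative ALCE-style instance (field names are assumptions, not the
# benchmark's exact schema).
instance = {
    "question": "When did the US break away from England?",
    "passages": [
        {"id": 1, "title": "Declaration of Independence",
         "text": "... the Declaration was adopted on July 4, 1776 ..."},
        {"id": 2, "title": "Treaty of Paris (1783)",
         "text": "... the treaty was signed on September 3, 1783 ..."},
    ],
}

# A system response cites the supporting passages inline; "[1]" refers to the
# first passage above.
response = (
    "The US declared independence on July 4, 1776 [1], and the war formally "
    "ended with the Treaty of Paris on September 3, 1783 [2]."
)
```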
Evaluation Metrics
ALCE employs a multifaceted approach to evaluation, measuring:
- Fluency through MAUVE, which compares generated text against human-written references to check that responses are coherent and well-formed.
- Correctness through dataset-specific metrics, such as exact-match (EM) recall of gold short answers for ASQA and claim recall for ELI5 (whether key claims from the reference answer are entailed by the response), assessing the accuracy and coverage of generated outputs.
- Citation Quality through citation recall and citation precision, both judged automatically with a natural language inference (NLI) model (see the sketch after this list).
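Roughly, citation recall asks whether each statement is entailed by the concatenation of the passages it cites, while citation precision penalizes citations that neither support the statement on their own nor are needed alongside the others. The code below is a simplified, assumption-laden sketch of those two scores plus a substring-style EM recall helper; the `entails` placeholder stands in for the NLI model ALCE actually uses, and edge cases in the paper's definitions are glossed over.

```python
from typing import Callable, List

def em_recall(output: str, gold_answers: List[str]) -> float:
    """Fraction of gold short answers appearing verbatim in the output
    (a simplified stand-in for ASQA-style EM recall)."""
    if not gold_answers:
        return 0.0
    return sum(ans.lower() in output.lower() for ans in gold_answers) / len(gold_answers)

def entails(premise: str, hypothesis: str) -> bool:
    """Placeholder entailment check; ALCE uses an NLI model here. The
    substring heuristic only keeps this sketch runnable."""
    return hypothesis.lower() in premise.lower()

def citation_scores(
    statements: List[str],
    citations: List[List[str]],  # cited passage texts for each statement
    nli: Callable[[str, str], bool] = entails,
):
    """Simplified citation recall and precision, loosely following ALCE."""
    supported_statements, relevant_citations, total_citations = 0, 0, 0
    for stmt, cited in zip(statements, citations):
        # Recall: do the cited passages, taken together, entail the statement?
        if cited and nli(" ".join(cited), stmt):
            supported_statements += 1
        for i, passage in enumerate(cited):
            total_citations += 1
            rest = " ".join(cited[:i] + cited[i + 1:])
            # Precision: a citation is irrelevant if it neither entails the
            # statement by itself nor is needed for the remaining citations
            # to entail it.
            irrelevant = (not nli(passage, stmt)) and bool(rest) and nli(rest, stmt)
            relevant_citations += int(not irrelevant)
    recall = supported_statements / max(len(statements), 1)
    precision = relevant_citations / max(total_citations, 1)
    return recall, precision
```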
Together, these metrics are designed so that improvements on the ALCE benchmark reflect real gains in the practical utility and reliability of LLM-generated content.
Modeling Approaches and Future Directions
Various modeling strategies were explored, from dense retrievers such as GTR to prompting strategies aimed at helping the model synthesize and cite retrieved information. All systems generated fluent responses, but a clear gap remained in correctness and citation quality, leaving substantial room for improvement. Notably, using summarization or snippet extraction of retrieved passages as an intermediate step yielded promising gains in correctness, pointing to information synthesis for citation as a worthwhile direction for further research.
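As a concrete illustration of the prompting side, a basic setup numbers the retrieved passages in the context and instructs the model to cite them inline, while a summarization variant first condenses each passage so that more documents fit in the context window. The template below is a hedged sketch in that spirit, not the exact prompt wording used in the paper.

```python
from typing import List, Optional

def build_citation_prompt(
    question: str,
    passages: List[str],
    summaries: Optional[List[str]] = None,
) -> str:
    """Assemble a simple answer-with-citations prompt (illustrative wording,
    not ALCE's exact template)."""
    # If passage summaries are supplied, use them in place of the full text,
    # mimicking the summarization-as-intermediate-step strategy.
    docs = summaries if summaries is not None else passages
    lines = [
        "Instruction: Using only the provided documents, write an accurate and "
        "concise answer to the question. Cite supporting documents after each "
        "sentence using the format [1][2].",
        "",
    ]
    for i, doc in enumerate(docs, start=1):
        lines.append(f"Document [{i}]: {doc}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)
```

In a summarization-style run, each retrieved passage would first be condensed by the LLM (or a relevant snippet extracted) and passed in via `summaries`, trading per-passage detail for the ability to include more passages within the same context budget.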
The experiments also underscored the importance of stronger retrieval and of LLMs that can make effective use of longer contexts, marking both as essential areas for future research. The results suggest that further advances aimed specifically at citing and using retrieved information could significantly enhance LLMs' utility in information-seeking tasks.
Conclusion
The ALCE benchmark represents a significant step toward reliable and verifiable text generation with LLMs. Its evaluations highlight the current strength of LLMs in producing fluent, coherent responses while exposing critical weaknesses in citation quality and factual correctness. This opens new research directions in better integration of retrieval, long-context LLMs, and methods for effectively synthesizing information from multiple sources. As the field advances, ALCE can serve as a valuable tool for measuring progress and encouraging innovation in generating informative and verifiable text with LLMs.