An Analytical Overview of "LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-Context QA"
The paper "LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-Context QA" introduces a framework that enables long-context LLMs to generate fine-grained, sentence-level citations during long-context question answering (QA). The work is motivated by the need to make LLM responses more verifiable and trustworthy, addressing concerns about hallucination and the lack of transparency in AI-generated content.
Methodological Innovations
The authors propose a pipeline named CoF ("Coarse to Fine") for constructing a high-quality, large-scale supervised fine-tuning (SFT) dataset for long-context QA with citations (LQAC). The CoF pipeline automates the generation of QA instances with sentence-level citations in four stages (a code sketch of the pipeline follows the list below):
- QA Instance Generation: Using a Self-Instruct-style approach, the LLM generates a question and a corresponding answer from the long input text.
- Chunk-level Citation Generation: Relevant text chunks are retrieved based on the answer's content, and coarse-grained, chunk-level citations are added to the answer.
- Sentence-level Citation Extraction: Within each cited chunk, the specific supporting sentences are identified, refining citations from chunk level to sentence level.
- Data Filtering: Instances whose answers carry too few citations are discarded to improve dataset quality.
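The following is a minimal sketch of how such a coarse-to-fine pipeline could be orchestrated. It is illustrative rather than the authors' implementation: the `call_llm` and `retrieve` callables, the prompts, the chunking scheme, the `<cite>` marker, and the filtering threshold are all assumptions made for this example.

```python
from typing import Callable


def chunk_document(document: str, chunk_size: int = 128) -> list[str]:
    """Split a document into fixed-size chunks (size in whitespace tokens; illustrative)."""
    tokens = document.split()
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def cof_pipeline(
    document: str,
    call_llm: Callable[[str], str],                        # assumed helper: prompt -> completion
    retrieve: Callable[[str, list[str], int], list[str]],  # assumed helper: (query, chunks, k) -> top-k chunks
    min_citations: int = 2,                                # illustrative filtering threshold
) -> dict | None:
    """Produce one LQAC training instance from a long document, or None if filtered out."""
    # 1. QA instance generation (Self-Instruct style): the LLM writes a question
    #    answerable from the document, then answers it.
    question = call_llm(f"Read the document and propose a question it answers:\n{document}")
    answer = call_llm(f"Document:\n{document}\n\nQuestion: {question}\nAnswer:")

    # 2. Chunk-level citation generation: retrieve chunks relevant to the answer
    #    and ask the LLM to attach coarse, chunk-level citations.
    chunks = chunk_document(document)
    candidates = retrieve(answer, chunks, 8)
    chunk_cited = call_llm(
        "Rewrite the answer, marking each statement with <cite>chunk indices</cite> "
        f"for the chunks that support it.\nChunks: {candidates}\nAnswer: {answer}"
    )

    # 3. Sentence-level citation extraction: within each cited chunk, keep only the
    #    indices of the individual sentences that actually support the statement.
    sentence_cited = call_llm(
        "Replace every chunk-level citation with the indices of the specific "
        f"supporting sentences.\nChunks: {candidates}\nAnswer: {chunk_cited}"
    )

    # 4. Data filtering: discard instances whose answers carry too few citations.
    if sentence_cited.count("<cite>") < min_citations:
        return None
    return {"context": document, "question": question, "answer": sentence_cited}
```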
Construction of LongCite-45k
Applying this pipeline, the authors constructed LongCite-45k, a dataset of 44,600 LQAC instances built from 50,000 documents across multiple domains, demonstrating the framework's applicability to a wide range of text sources and topics.
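To make the notion of an LQAC instance concrete, here is a hypothetical record in the spirit of the dataset. The field names and the inline sentence numbering and `<statement>`/`<cite>` markup are illustrative assumptions, not the dataset's exact schema.

```python
# Hypothetical LQAC instance (field names and citation markup are illustrative).
example_instance = {
    "context": "<C1> The Amazon rainforest spans nine countries. "
               "<C2> It is home to an enormous diversity of species. "
               "<C3> Large portions of it remain sparsely mapped.",
    "question": "How many countries does the Amazon rainforest span?",
    "answer": "<statement>The Amazon rainforest spans nine countries."
              "<cite>[1-1]</cite></statement>",
}
```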
Model Training and Evaluation
Two open-source models, GLM-4-9B and Llama-3.1-8B, were fine-tuned on LongCite-45k. The resulting models, LongCite-9B and LongCite-8B, were evaluated on the LongBench-Cite benchmark, where they achieved state-of-the-art citation quality, outperforming proprietary models such as GPT-4o by notable margins in citation F1.
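Citation F1 combines citation recall (is each statement fully supported by the sentences it cites?) and citation precision (does each cited sentence actually support the statement?). The sketch below shows one way to aggregate per-statement judgments into a single score; in LongBench-Cite the judgments come from an LLM judge, and the exact aggregation shown here is an assumption for illustration.

```python
def citation_f1(per_statement: list[tuple[float, float]]) -> float:
    """Aggregate citation F1 from (recall, precision) judgments for each answer statement.

    Illustrative aggregation: macro-average recall and precision over statements,
    then take their harmonic mean. LongBench-Cite's exact protocol may differ.
    """
    if not per_statement:
        return 0.0
    recall = sum(r for r, _ in per_statement) / len(per_statement)
    precision = sum(p for _, p in per_statement) / len(per_statement)
    if recall + precision == 0.0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Example: three statements judged for (recall, precision), each in [0, 1].
print(citation_f1([(1.0, 1.0), (0.5, 1.0), (1.0, 0.5)]))  # ~0.833
```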
A salient finding of this work is that SFT on LQAC data improves not only citation quality but also response correctness. This dual gain, more accurate answers alongside more precise citations, suggests that the CoF-produced data helps mitigate hallucination and encourages fuller use of the provided context.
Implications and Future Directions
The implications of this research are twofold. Practically, improving the citation capabilities of LLMs could significantly increase their adoption in sensitive domains requiring verifiable content, such as law or academia. Theoretically, this work underscores the importance of citation granularity and context utilization, which could inspire future explorations into even finer-grained citation strategies or new benchmarks assessing LLM trustworthiness.
Future work might focus on refining the CoF pipeline to further reduce data-construction and training costs, or on integrating adaptive retrieval mechanisms that balance retrieval breadth against specificity. Exploring how these advances transfer to multilingual or cross-lingual settings would also be worthwhile.
In conclusion, "LongCite" presents a well-structured approach to improving long-context QA by embedding citation mechanisms that strengthen user trust and model transparency, and it sets a useful precedent for training and evaluating LLMs in extended contexts.