Attribution, Citation, and Quotation: A Survey of Evidence-based Text Generation with Large Language Models (2508.15396v1)

Published 21 Aug 2025 in cs.CL

Abstract: The increasing adoption of LLMs has been accompanied by growing concerns regarding their reliability and trustworthiness. As a result, a growing body of research focuses on evidence-based text generation with LLMs, aiming to link model outputs to supporting evidence to ensure traceability and verifiability. However, the field is fragmented due to inconsistent terminology, isolated evaluation practices, and a lack of unified benchmarks. To bridge this gap, we systematically analyze 134 papers, introduce a unified taxonomy of evidence-based text generation with LLMs, and investigate 300 evaluation metrics across seven key dimensions. Thereby, we focus on approaches that use citations, attribution, or quotations for evidence-based text generation. Building on this, we examine the distinctive characteristics and representative methods in the field. Finally, we highlight open challenges and outline promising directions for future work.

Summary

  • The paper introduces a unified taxonomy categorizing LLM evidence-based text generation into parametric and non-parametric attribution approaches.
  • It evaluates 300 diverse metrics for assessing citation accuracy, linguistic quality, and overall evidence verifiability in model outputs.
  • The study highlights trends like retrieval-augmented generation and calls for hybrid methods to improve trustworthiness and standard evaluation frameworks.

Attribution, Citation, and Quotation: A Survey of Evidence-based Text Generation with LLMs

The paper "Attribution, Citation, and Quotation: A Survey of Evidence-based Text Generation with LLMs" (2508.15396) provides an exhaustive analysis of the landscape surrounding evidence-based text generation with LLMs. It examines the nuanced methodologies behind connecting model outputs to supporting evidence, ensuring trustworthiness and verifiability amidst growing concerns over LLM reliability.

Introduction

Recent progress in LLMs has yielded advances in language understanding and generation capabilities. However, these models remain prone to hallucination, and their knowledge is bounded by their training data. The paper underscores the emergent need for evidence-based text generation, i.e., the production of text that is verifiable through explicit references to supporting documents. Despite rising interest, the area lacks unified terminology and standardized evaluation methodologies, leaving the research landscape fragmented.

Unified Taxonomy and Key Dimensions

The authors propose a comprehensive taxonomy for evidence-based text generation, organizing approaches along key dimensions: attribution approach, LLM integration strategy, citation characteristics, and task type.

Attribution Approaches are divided into parametric and non-parametric strategies. Parametric approaches rely on knowledge embedded within the model, while non-parametric approaches integrate external sources during inference:

  • Parametric: Incorporates internal LLM knowledge, often via model-centric adaptations like architectural adjustments.
  • Non-parametric: Utilizes external retrieval during inference, subdivided further into post-retrieval, in-generation, and in-context methods (Figure 1; a minimal pipeline sketch follows the figure).

    Figure 1: Taxonomy of evidence-based text generation with LLMs.
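
To make the non-parametric branch concrete, the following is a minimal sketch of a retrieve-then-generate pipeline that asks the model to cite retrieved passages by index. The helpers `retrieve` and `generate`, and the function `cited_answer`, are illustrative placeholders under assumed signatures, not methods defined in the survey.

```python
# Minimal sketch of a non-parametric (retrieve-then-generate) citation pipeline.
# `retrieve` and `generate` are hypothetical placeholders for a retriever and an LLM client.
from typing import Callable, List


def cited_answer(
    question: str,
    retrieve: Callable[[str, int], List[str]],  # returns the top-k evidence passages
    generate: Callable[[str], str],             # returns the LLM's text completion
    k: int = 3,
) -> str:
    """Generate an answer whose sentences cite retrieved passages by index."""
    passages = retrieve(question, k)
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using only the passages below. "
        "After each sentence, cite the supporting passage as [n].\n\n"
        f"Passages:\n{numbered}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)
```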

Evaluation Metrics and Frameworks

In addressing the fragmented landscape of evaluation metrics, the survey categorizes 300 evaluation metrics into dimensions such as attribution, correctness, citation, linguistic quality, preservation, relevance, and retrieval. The citation-related dimensions are particularly critical, as they assess whether generated text properly references the correct evidence sources. Evaluation methods span human evaluation as well as inference-based and retrieval-based automatic metrics, enabling assessment across multiple facets (Figure 2; a simplified citation-metric sketch follows the caption).

Figure 2: Evaluation metrics and frameworks for evidence-based text generation. The numbers in parentheses following each metric name indicate the number of studies that used the respective metric for evaluation. Metrics clustered by dashed lines are complementary metrics that should be used together for a comprehensive evaluation.
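
Many citation metrics in this space score whether the cited passages actually support each generated sentence. The following is a simplified sketch of an entailment-based citation recall/precision check, assuming an `entails` predicate backed by an NLI model; it is illustrative and not the survey's exact formulation of any specific metric.

```python
# Simplified entailment-based citation recall and precision for one generated sentence.
# `entails(premise, hypothesis)` is an assumed NLI-style predicate (e.g., backed by an NLI model).
from typing import Callable, Dict, List


def citation_scores(
    sentence: str,
    cited_passages: List[str],
    entails: Callable[[str, str], bool],
) -> Dict[str, float]:
    if not cited_passages:
        return {"recall": 0.0, "precision": 0.0}
    # Recall: the cited passages, taken together, should support the sentence.
    recall = 1.0 if entails(" ".join(cited_passages), sentence) else 0.0
    # Precision: each individual citation should support the sentence on its own.
    supporting = sum(entails(p, sentence) for p in cited_passages)
    precision = supporting / len(cited_passages)
    return {"recall": recall, "precision": precision}
```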

The survey highlights trends such as the increasing reliance on retrieval-augmented generation (RAG) and the dominant role of prompting strategies over model retraining for guiding output attribution and correctness; Figure 3 shows the number of studies per year. It also surfaces challenges in integrating parametric attribution more robustly into LLMs, so that their innate capacities can be leveraged without over-dependence on retrieval systems, which are often limited by their search capabilities.

Figure 3: Number of studies per year.

Implications and Future Directions

The paper identifies several future research directions, including the need for hybrid attribution methods that effectively combine parametric and non-parametric strategies to enhance LLM trustworthiness and output verifiability. It also calls for standardized evaluation frameworks that enable fair comparisons across methods, so that future innovations are measured against consistent benchmarks. Additionally, improved explainability of citation reasoning is seen as essential for bias mitigation and user trust.
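
As one way to read the hybrid direction, the sketch below drafts an answer from parametric knowledge, then either attributes the draft to retrieved evidence or regenerates with retrieved passages in context. All helpers (`generate`, `retrieve`, `supported_by`) are hypothetical placeholders, not methods proposed by the paper.

```python
# Sketch of a hybrid attribution strategy: draft from parametric knowledge first,
# then either attribute the draft post hoc or fall back to retrieval-grounded generation.
# `generate`, `retrieve`, and `supported_by` are hypothetical placeholders.
from typing import Callable, List, Tuple


def hybrid_answer(
    question: str,
    generate: Callable[[str], str],
    retrieve: Callable[[str, int], List[str]],
    supported_by: Callable[[str, List[str]], bool],
    k: int = 3,
) -> Tuple[str, List[str]]:
    """Return an answer together with the evidence passages it is attributed to."""
    draft = generate(f"Answer concisely: {question}")
    evidence = retrieve(draft, k)  # look for passages supporting the parametric draft
    if supported_by(draft, evidence):
        return draft, evidence     # keep the parametric answer, attributed post hoc
    # Otherwise regenerate non-parametrically with retrieved passages in context.
    passages = retrieve(question, k)
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    grounded = generate(
        f"Using only these passages, answer the question and cite them as [n]:\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )
    return grounded, passages
```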

Conclusion

By systematically categorizing evidence-based text generation approaches and evaluation metrics, this paper serves as a cornerstone for unifying and advancing research in the field. The insights and frameworks provided lay a foundation for enhancing the reliability and verifiability of LLM-generated content, emphasizing the critical role of evidence-linked outputs in the ongoing evolution of AI models.
