
Fact-Aware Neural Abstractive Summarization

  • Fact-aware approaches integrate fact-conditioned dual encoders, graph-enhanced architectures, and QA-augmented models to reduce hallucinations in generated summaries.
  • They employ factuality metrics such as FactCC, DAE, and QA-based scores to assess whether generated propositions are supported by the source document.
  • These approaches balance abstractiveness with factual precision, providing scalable strategies for generating concise, faithful, and informative summaries.

Fact-aware neural abstractive summarization refers to neural summarization methods specifically designed to produce abstractive summaries whose information is semantically faithful (“factually consistent”) with respect to the source input. Unlike extractive summarization, which selects and copies sentences or phrases, abstractive summarization synthesizes novel sentences, increasing the risk that the generated output will contain hallucinated or misrepresented facts. Fact-aware methods directly address this by integrating mechanisms at the modeling, training, or decoding stages that steer the system to avoid unsupported or contradicted content while maintaining summary quality and informativeness.

1. Definition and Scope of Factual Consistency

Factual consistency (or factuality) in neural abstractive summarization formalizes the requirement that all atomic propositions and entities expressed in the generated summary are entailed by, or directly supported by, the source document. Formally, for a summary $S$ of document $D$, factual consistency requires $F(S) \subseteq F(D)$, where $F(\cdot)$ extracts the set of facts (relations, entities, propositions) from a text (Huang et al., 2021, Cao, 2022). Violations are typically categorized as:

  • Intrinsic errors (contradiction): Factual claims in $S$ are expressly contradicted by $D$.
  • Extrinsic errors (hallucination): Factual claims in $S$ are not present in, and not entailed by, $D$.
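
This taxonomy can be made concrete with a toy containment check over extracted facts. The following is a minimal sketch, assuming facts are already available as (subject, relation, object) triples (e.g., from an OpenIE system); the heuristic of matching on (subject, relation) to separate intrinsic from extrinsic errors is an illustrative simplification, not a method from the cited papers.

```python
from typing import Dict, Set, Tuple

Fact = Tuple[str, str, str]  # (subject, relation, object)

def classify_summary_facts(summary_facts: Set[Fact],
                           source_facts: Set[Fact]) -> Dict[str, Set[Fact]]:
    """Split summary facts into supported, intrinsic, and extrinsic sets."""
    # Index source facts by (subject, relation) so that a mismatched object
    # can be flagged as a contradiction rather than a hallucination.
    by_key: Dict[Tuple[str, str], Set[str]] = {}
    for s, r, o in source_facts:
        by_key.setdefault((s, r), set()).add(o)

    result: Dict[str, Set[Fact]] = {"supported": set(), "intrinsic": set(), "extrinsic": set()}
    for s, r, o in summary_facts:
        objects = by_key.get((s, r))
        if objects is None:
            result["extrinsic"].add((s, r, o))   # no supporting fact in D at all
        elif o in objects:
            result["supported"].add((s, r, o))   # F(S) subset of F(D) holds here
        else:
            result["intrinsic"].add((s, r, o))   # contradicts what D asserts
    return result

source = {("apple", "acquired", "beats"), ("deal", "valued_at", "$3B")}
summary = {("apple", "acquired", "beats"),      # supported
           ("deal", "valued_at", "$1B"),        # intrinsic: contradicts the source
           ("apple", "hired", "dr. dre")}       # extrinsic: unsupported
print(classify_summary_facts(summary, source))
```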

Automatic assessment of factual consistency involves semantic metrics (e.g., FactCC, DAE, QA-based scores) rather than surface n-gram overlap, since measures like ROUGE are insensitive to factual manipulation (Huang et al., 2021, Cao, 2022).

2. Fact-Aware Summarization Architectures and Strategies

Multiple modeling paradigms for fact-aware neural abstractive summarization have been proposed:

2.1 Dual-Input or Fact-Conditioned Encoders

Early approaches such as FTSum extract fact descriptions from the source via OpenIE and dependency parsing, then use dual encoders and dual attention so the decoder blends sentence-level and fact-level context at each decoding step. A gating mechanism dynamically weights fact vs. text features, resulting in large empirical reductions in “fake summaries” on Gigaword (from 27% to 6%) without loss of informativeness (Cao et al., 2017).
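
A minimal PyTorch sketch of the gating idea follows; the layer sizes and sigmoid gate are assumptions for illustration, and the exact FTSum parameterization differs.

```python
import torch
import torch.nn as nn

class GatedDualContext(nn.Module):
    """Blend text-level and fact-level attention contexts with a learned gate."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # Gate conditioned on the decoder state and both context vectors.
        self.gate = nn.Linear(3 * hidden_size, hidden_size)

    def forward(self, dec_state: torch.Tensor,
                text_ctx: torch.Tensor,
                fact_ctx: torch.Tensor) -> torch.Tensor:
        # All inputs: (batch, hidden_size), one decoding step at a time.
        g = torch.sigmoid(self.gate(torch.cat([dec_state, text_ctx, fact_ctx], dim=-1)))
        return g * text_ctx + (1.0 - g) * fact_ctx  # gated mixture of contexts
```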

2.2 Graph-Enhanced Architectures

Graph-based encodings utilize OpenIE or external knowledge bases (e.g., Wikidata) to derive document-level relational graphs. Models such as FASum (Zhu et al., 2020) and “Mind the Facts” (Gunel et al., 2020) embed these nodes (entities, relations) and fuse them via cross-attention into the summary decoder. Integrating factual graphs via graph attention networks (GATs) demonstrably raises factual consistency (FactCC, RMR scores) and mitigates errors such as entity or number hallucination.
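
A sketch of the fusion step, assuming node embeddings are produced upstream by a graph encoder such as a GAT; the residual cross-attention layout below is illustrative rather than the exact FASum architecture.

```python
import torch
import torch.nn as nn

class GraphCrossAttention(nn.Module):
    """Decoder states cross-attend to knowledge-graph node embeddings."""

    def __init__(self, hidden_size: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden_size)

    def forward(self, dec_states: torch.Tensor,
                node_embs: torch.Tensor) -> torch.Tensor:
        # dec_states: (batch, tgt_len, hidden); node_embs: (batch, n_nodes, hidden)
        graph_ctx, _ = self.attn(query=dec_states, key=node_embs, value=node_embs)
        return self.norm(dec_states + graph_ctx)  # residual fusion of graph context
```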

2.3 QA-Augmented Models

QA-enhanced models such as FES incorporate question answering to strengthen the encoder's understanding of the source: entity-centric QA is used alongside graph attention, and a decoder-level KL divergence aligns QA-derived importance weights with summary attention. Additionally, a margin loss penalizes the decoder for favoring fluency inherited from the pre-trained LM over faithfulness to the source (Chen et al., 2022).
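
A sketch of the attention-alignment term, under the assumption that QA importance is expressed per source token; the KL direction and granularity are illustrative choices, not necessarily those of Chen et al. (2022).

```python
import torch
import torch.nn.functional as F

def qa_attention_alignment_loss(decoder_attn: torch.Tensor,
                                qa_importance: torch.Tensor) -> torch.Tensor:
    """KL term aligning decoder attention with QA-derived token importance.

    decoder_attn:   (batch, src_len) attention mass over source tokens.
    qa_importance:  (batch, src_len) non-negative QA-derived weights.
    """
    p = qa_importance / qa_importance.sum(dim=-1, keepdim=True)  # target distribution
    log_q = torch.log(decoder_attn.clamp_min(1e-9))              # model distribution
    # F.kl_div expects log-probabilities as input and computes KL(target || input).
    return F.kl_div(log_q, p, reduction="batchmean")
```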

2.4 Post-Editing and Correction Modules

Factual correction may be performed in post-processing, either via span-based QA models (SpanFact), which iteratively replace dubious entities by extracting document-supported spans (Dong et al., 2020), or via conditional generative models (BART-based correctors) pre-trained on synthetic errors (entity/number/date/pronoun swaps) (Cao et al., 2020). These systems operate on arbitrary system outputs and are modular but constrained by their span coverage and the diversity of error generation.
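
The synthetic-corruption recipe can be sketched in a few lines. The entity swap below is one of several corruption types (number, date, and pronoun swaps are analogous); the helper assumes entity lists from a NER tagger and is a hypothetical simplification.

```python
import random
from typing import List

def corrupt_with_entity_swap(summary: str,
                             source_entities: List[str],
                             summary_entities: List[str]) -> str:
    """Replace one summary entity with a different source entity.

    The resulting (corrupted, clean) pair trains a corrector to map the
    corrupted summary, conditioned on the source document, back to the
    clean reference.
    """
    swappable = [e for e in summary_entities
                 if any(c != e for c in source_entities)]
    if not swappable:
        return summary  # nothing to corrupt
    target = random.choice(swappable)
    replacement = random.choice([e for e in source_entities if e != target])
    return summary.replace(target, replacement, 1)

clean = "Apple acquired Beats for $3 billion."
corrupted = corrupt_with_entity_swap(clean,
                                     source_entities=["Apple", "Beats", "Spotify"],
                                     summary_entities=["Apple", "Beats"])
print(corrupted)  # e.g., "Spotify acquired Beats for $3 billion."
```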

2.5 Contrastive and Reward-Based Training

Recent methods apply contrastive learning or reward learning to directly optimize for factuality:

  • Contrastive losses (e.g., CO2Sum, EFACTSUM, FactPEGASUS) penalize the model for scoring factually incorrect (perturbed or negative) candidates above factual (reference or FactCC-verified) summaries, via pairwise margin-based objectives (Liu et al., 2021, Dixit et al., 2023, Wan et al., 2022).
  • Contrastive reward learning (CRL) fine-tunes summarizers to rank higher those candidates with better factuality metrics (e.g., FactCC, BARTScore, DAE) through a pairwise contrastive loss over pools generated by diverse beam search, combining the cross-entropy and contrastive terms (Chern et al., 2023).

These methods yield large gains in factual metrics (up to +11 FactCC on CNN/DM) while maintaining or mildly reducing ROUGE (Dixit et al., 2023, Chern et al., 2023).
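
For concreteness, a candidate pool of the kind CRL reranks can be produced with diverse beam search via the Hugging Face generate() API; this is a sketch, and the checkpoint name and generation hyperparameters are illustrative assumptions.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

document = "..."  # the source article to be summarized

inputs = tok(document, return_tensors="pt", truncation=True)
outputs = model.generate(
    **inputs,
    num_beams=8,
    num_beam_groups=4,        # diverse beam search: 4 groups of 2 beams each
    diversity_penalty=1.0,    # push groups toward distinct candidates
    num_return_sequences=8,   # the pool later ranked by factuality metrics
    max_length=128,
)
candidates = tok.batch_decode(outputs, skip_special_tokens=True)
```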

2.6 Faithfulness-Aware Decoding

Faithfulness-aware decoding strategies include beam candidate reranking by factual metrics and lookahead heuristics that anticipate future factual consistency, optionally distilled into a student model for efficiency (Wan et al., 2023). These are orthogonal to architectural modifications and provide substantial faithfulness improvements with or without retraining.
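
A reranking sketch: combine each beam candidate's length-normalized log-probability with an external factuality score. The scorer interface and the linear combination with weight alpha are assumptions; Wan et al. (2023) additionally explore lookahead during decoding.

```python
from typing import Callable, List, Tuple

def rerank_by_factuality(source: str,
                         candidates: List[Tuple[str, float]],
                         factuality_scorer: Callable[[str, str], float],
                         alpha: float = 1.0) -> List[str]:
    """Order beam candidates by log-probability plus weighted factuality.

    candidates:        (summary, length-normalized log-prob) pairs from beam search.
    factuality_scorer: callable mapping (source, summary) to a faithfulness
                       score, e.g., a FactCC or DAE wrapper (assumed).
    """
    scored = [(logp + alpha * factuality_scorer(source, text), text)
              for text, logp in candidates]
    return [text for _, text in sorted(scored, reverse=True)]
```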

3. Factuality Metrics and Evaluation Protocols

The gold standard for factual consistency assessment is human judgment, but large-scale research relies on automatic, reference-free metrics:

  • FactCC: BERT-based binary classifier trained on synthetic entailment/non-entailment pairs (entity, number, negation swaps) (Huang et al., 2021, Wan et al., 2022).
  • DAE (Dependency Arc Entailment): Proportion of summary dependency arcs semantically entailed by the source (Dreyer et al., 2021).
  • QA-based (QAGS, FEQA, QuestEval): Generate Q-A pairs from the summary, answer them over both the summary and the source, and compare answer similarity (Huang et al., 2021, Wan et al., 2022); see the sketch after this list.
  • Relation Matching (RMR): Overlap of OpenIE triples between summary and source (Zhu et al., 2020).
  • BERTScore-Fact: BERT precision focused on factual subsentences (Wan et al., 2023).
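
As a concrete illustration of the QA-based family, the sketch below averages token-level F1 agreement between answers obtained from the summary and from the source; generate_questions and answer stand in for a question-generation model and an extractive QA model, and are hypothetical placeholders.

```python
from collections import Counter
from typing import Callable, List

def token_f1(a: str, b: str) -> float:
    """Standard token-overlap F1 between two answer strings."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    common = sum((Counter(ta) & Counter(tb)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(ta), common / len(tb)
    return 2 * precision * recall / (precision + recall)

def qa_factuality_score(summary: str, source: str,
                        generate_questions: Callable[[str], List[str]],
                        answer: Callable[[str, str], str]) -> float:
    """QAGS-style score: agreement of answers over summary vs. source."""
    questions = generate_questions(summary)  # questions probing summary facts
    if not questions:
        return 0.0
    agreements = [token_f1(answer(q, summary), answer(q, source))
                  for q in questions]
    return sum(agreements) / len(agreements)
```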

Datasets such as CNN/DM and XSUM dominate experimental protocols. Benchmarks like ConstraintsFact and ModelsFact provide systematically annotated factuality at varying degrees of summary abstractiveness for robust comparisons (Dreyer et al., 2021).

4. Mathematical Objectives and Training Regimes

Fact-aware summarization training objectives typically augment the standard cross-entropy loss with additional terms:

  • Contrastive Loss (generalized):

$$\mathcal{L}_{\mathrm{CL}} = \sum_{i=1}^{2m}\sum_{j=i+1}^{2m}\max\left(0,\; f(S_j) - f(S_i) + d_{ij}\right)$$

where $f(S)$ is the normalized log-probability of candidate summary $S$, and $d_{ij}$ encodes a rank-based margin. Selection and ordering of candidates are determined by their factuality (e.g., FactCC) and similarity (e.g., ROUGE-L) (Dixit et al., 2023).

  • Contrastive Reward Learning:

$$L_{\mathrm{ctr}}(\theta) = \sum_{i=1}^{k}\sum_{j=i+1}^{k}\max\left(0,\; f_\theta(S_j) - f_\theta(S_i) + \Delta_{ij}\right)$$

where candidate ranking is determined by factuality metrics rather than by similarity to a gold reference (Chern et al., 2023).

  • Multi-objective Losses: Models often blend cross-entropy, contrastive, and coverage or margin losses, with weights tuned to avoid degenerate trade-offs such as collapse into extractiveness or excessive abstraction (Chen et al., 2022, Wan et al., 2022).

Hyperparameters are tuned to balance faithfulness and quality (e.g., $\lambda = 0.1$ to $100$ for the contrastive term weight; $\delta = 0.001$ for the margin) (Dixit et al., 2023, Chern et al., 2023).
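
A minimal PyTorch sketch of the pairwise margin loss defined above, assuming candidates are pre-sorted by the factuality/similarity criterion so that lower indices should score higher; the linear rank-based margin $d_{ij} = (j - i)\,\delta$ is one common instantiation.

```python
import torch

def contrastive_margin_loss(log_probs: torch.Tensor,
                            delta: float = 0.001) -> torch.Tensor:
    """Pairwise margin loss over ranked candidate summaries.

    log_probs: (n,) length-normalized log-probabilities f(S_i), with
    candidates already ordered so that index 0 is the best-ranked one.
    """
    n = log_probs.size(0)
    loss = log_probs.new_zeros(())
    for i in range(n):
        for j in range(i + 1, n):
            d_ij = (j - i) * delta              # rank-based margin
            loss = loss + torch.clamp(log_probs[j] - log_probs[i] + d_ij, min=0.0)
    return loss

# Multi-objective training combines this with cross-entropy, e.g.:
#   total = ce_loss + lam * contrastive_margin_loss(candidate_log_probs)
# with lam in the ranges quoted above.
```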

5. Abstractiveness–Factuality Trade-off and Systemic Findings

Empirical work reveals a trade-off: increased abstractiveness (novelty) of generated text correlates with higher rates of factual errors, though the precise slope is dataset-dependent (e.g., CNN/DM shows gradual linear decay; XSum, much steeper) (Dreyer et al., 2021). Factuality-improving systems must be benchmarked on both axes to avoid misleading gains via extractiveness. Contemporary systems (e.g., EFACTSUM, FactPEGASUS, FASum) demonstrate that it is possible to substantially raise faithfulness without simply copying content, as validated by abstractiveness-adjusted metrics (e.g., MINT, $\mu$-Fact) (Dixit et al., 2023, Wan et al., 2022, Dreyer et al., 2021).
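
Benchmarking both axes requires an abstractiveness measure alongside the factuality score. A simple proxy is the fraction of summary n-grams absent from the source; MINT aggregates several such overlap statistics, so the single-n version below is a simplification.

```python
def novel_ngram_ratio(summary: str, source: str, n: int = 2) -> float:
    """Fraction of summary n-grams that do not occur in the source."""
    def ngrams(text: str) -> set:
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    summ_ngrams = ngrams(summary)
    if not summ_ngrams:
        return 0.0
    return len(summ_ngrams - ngrams(source)) / len(summ_ngrams)

# 0.0 = fully extractive at the n-gram level; 1.0 = fully novel phrasing.
```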

6. Limitations, Open Challenges, and Future Directions

Despite major advances, key challenges persist:

  • Metric fidelity: Automatic metrics (even the best, e.g., QAGS, FactCC, BERTScore) achieve only moderate correlation ($r \approx 0.4{-}0.5$) with human factuality judgments (Huang et al., 2021, Cao, 2022).
  • Synthetic–real gap: Post-hoc correction models trained on synthetic error patterns generalize poorly to the semantic diversity and subtlety of real system hallucinations (Cao et al., 2020, Dong et al., 2020).
  • Broader scope: Most approaches focus on named entities and relations; consistent modeling of events, quantities, or multi-sentence references, as well as external and commonsense knowledge integration, remains limited (Chen et al., 2022, Gunel et al., 2020).
  • Computation and efficiency: Many state-of-the-art methods (e.g., contrastive learning, faithfulness-aware decoding) add significant computational overhead; strategies for fast inference, such as distillation of faithfulness improvements into student models, are active research areas (Wan et al., 2023).
  • Domain generalization: Faithfulness metrics and correction models are predominantly validated in news domains; extension to scientific, medical, or large-scale multi-document summarization requires new annotation and adaptation.

A plausible implication is that future progress will require more robust factuality metrics, tighter integration of external knowledge, and advances in scalable, low-latency fact-aware training and inference regimes (Dixit et al., 2023, Wan et al., 2022, Chern et al., 2023, Huang et al., 2021).

7. Representative Methods: Comparative Table

| Method | Core Technique | Factuality Gain (vs. Base) | Abstractiveness Impact | Source Models |
|---|---|---|---|---|
| EFACTSUM (Dixit et al., 2023) | Candidate ranking + margin-contrastive loss | +6 FactCC (XSum), +11 FactCC (CNN/DM) | No extractiveness increase | PEGASUS, BART |
| SpanFact (Dong et al., 2020) | Post-editing via QA-based span correction | +2–5 QAGS/FactCC | <1 ROUGE-point loss | Any backbone |
| FASum (Zhu et al., 2020) | Graph attention over OpenIE triples | +0.6 FactCC, +11.2 RMR₁ | Small trade-off | Transformer |
| FactPEGASUS (Wan et al., 2022) | Factuality-enhanced pretraining, corrector, contrastive fine-tuning | +43% FactCC (XSum) | Maintains abstraction | PEGASUS |
| CRL (Chern et al., 2023) | Contrastive reward ranking via factuality metrics | Human-FAC ≈99% (CNN/DM) | Some ROUGE loss | BART, PEGASUS |
| CO2Sum (Liu et al., 2021) | Encoder and decoder contrastive losses | +3 QAGS, +8 OpenIE | −0.8 ROUGE | BART |
| Faithfulness-aware decoding (Wan et al., 2023) | Metric-based beam reranking/lookahead | +7–17 FactCC/DAE | Increased extractiveness (controllable) | BART, PEGASUS |

These results confirm that candidate-based reranking, factual constraints at both training and decoding time, and hybrid losses are synergistic in maximizing summary factuality without incurring significant losses in output relevance or abstraction. Evaluation should report both factual and overlap-based metrics to ensure gains are not due to extractive degeneration (Dixit et al., 2023, Wan et al., 2023, Dreyer et al., 2021).
