
Extractive Evaluation Techniques

Updated 27 January 2026
  • Extractive Evaluation Techniques are a set of methods that quantify summary quality by measuring semantic relevance, redundancy control, and faithfulness through detailed metrics such as Sem-nCG, FAR, SAR, and ExtEval.
  • The approach generalizes classical metrics with semantic gain and facet-aware analyses, achieving improved correlation with human judgments and providing a fine-grained diagnostic of extractive summarization outputs.
  • These techniques not only highlight limitations of lexical overlap metrics but also pave the way for future adaptations in both extractive and abstractive summarization through adaptive weighting and enhanced error detection.

Extractive evaluation techniques are a class of methodologies and metrics developed to assess the quality and properties of extractive summarization systems. These techniques address limitations of standard lexical-overlap metrics by incorporating semantic coverage, redundancy control, information facet coverage, and faithfulness, ensuring a more holistic and fine-grained evaluation. Recent work has produced principled metrics that address challenges such as ranking quality, adaptability to multiple references, redundancy sensitivity, and error-specific faithfulness assessment.

1. Semantically and Gain-based Evaluation: Redundancy-Aware Sem-nCG

The redundancy-aware Sem-nCG metric generalizes normalized cumulative gain (nCG) from information retrieval to semantic evaluation of extractive summarization. Given a document $D$ with sentences $S_1, \dots, S_n$ and a set of reference summaries $R_1, \dots, R_m$, each sentence’s semantic relevance to the references is computed (e.g., with cosine similarity of STSb embeddings), forming an “ideal” sentence ranking for the document.

For a summary $X$ of $k$ extracted sentences, the cumulative gain $\mathrm{CG}@k$ is the sum of relevance scores of its top-$k$ sentences, and $\mathrm{ICG}@k$ is the total semantic gain from the ideal ranking. The normalized gain is

$$\text{Sem-nCG}@k = \frac{\mathrm{CG}@k}{\mathrm{ICG}@k}.$$

To penalize redundancy, a diversity-controlled penalty $\text{Score}_\mathrm{red}(X)$ is applied:

$$\text{Score}_\mathrm{red}(X) = \frac{1}{|X|}\sum_{i=1}^{|X|} \max_{j \neq i} \text{Sim}(x_i, x_j)$$

where $\text{Sim}$ can be any sentence similarity, such as ROUGE, BERTScore, or cosine similarity.

A tunable parameter $\lambda \in [0,1]$ trades off importance and diversity:

$$\text{Final Score}(X) = \lambda \cdot \text{Sem-nCG}@k + (1-\lambda)\,\bigl(1 - \text{Score}_\mathrm{red}(X)\bigr)$$

With $\lambda = 0.5$, the metric demonstrates improved correlation with human judgments of Coherence (+14% Kendall’s $\tau$) and Relevance (+5% $\tau$) over both ROUGE and the original Sem-nCG. The metric supports multi-reference evaluation by averaging relevance across reference summaries, using ensemble similarity or rank-aggregation strategies to derive a unified ideal ranking (Akter et al., 2023).
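As a concrete illustration, the combined score can be sketched as follows, assuming sentence embeddings are already computed; the function names and the single-reference setup are illustrative, not taken from the cited work:

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def sem_ncg_with_redundancy(doc_embs, ref_emb, extracted_idx, k, lam=0.5):
    """Redundancy-aware Sem-nCG (sketch, single-reference case).

    doc_embs: sentence embedding vectors for the document.
    ref_emb: embedding of the reference summary.
    extracted_idx: indices of the extracted sentences.
    """
    # Relevance of every document sentence to the reference.
    rel = [cosine(e, ref_emb) for e in doc_embs]

    # CG@k: gain accumulated by the extracted sentences.
    cg = sum(rel[i] for i in extracted_idx[:k])
    # ICG@k: gain of the ideal (best-possible) top-k ranking.
    icg = sum(sorted(rel, reverse=True)[:k])
    sem_ncg = cg / icg if icg > 0 else 0.0

    # Redundancy penalty: mean max pairwise similarity within the extract.
    embs = [doc_embs[i] for i in extracted_idx[:k]]
    red = np.mean([
        max(cosine(embs[i], embs[j]) for j in range(len(embs)) if j != i)
        for i in range(len(embs))
    ]) if len(embs) > 1 else 0.0

    # Final score: lambda weights importance, (1 - lambda) weights diversity.
    return lam * sem_ncg + (1 - lam) * (1 - red)
```

With two-dimensional toy embeddings, an extract of two near-duplicate sentences scores lower than a more diverse extract of equal relevance, which is exactly the behavior the penalty is designed to produce.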

2. Facet-Aware Coverage and Support Metrics

The facet-aware evaluation framework decomposes a reference summary into information “facets”—one per sentence—and determines which subsets of document sentences (support groups) collectively cover the semantics of each facet. With gold or automatically identified facet-support mappings, evaluation proceeds by checking the extracted summary’s coverage:

  • Facet-Aware Recall (FAR): Fraction of facets covered by any support group fully included in the extract.

$$\text{FAR}(\mathcal{E}) = \frac{1}{R}\sum_{i=1}^{R} \max_{j=1}^{N_i} I(\mathcal{S}_j^i, \mathcal{E})$$

where $R$ is the number of facets, $\mathcal{S}_j^i$ is the $j$-th of the $N_i$ support groups for facet $i$, and $I(\mathcal{S}_j^i, \mathcal{E})$ indicates whether that group is fully contained in the extract $\mathcal{E}$.

  • Support-Aware Recall (SAR): Fraction of all support sentences present in the system extract.

Facet-aware evaluation enables detailed analyses (e.g., distinguishing summary redundancy from true information coverage). FAR demonstrates better correlation with human rankings than ROUGE-1 F1 (ρ ≈ 0.457 versus 0.44) and supports comparative and fine-grained diagnostic analysis (Mao et al., 2019).
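Given gold facet-support mappings, both recall variants reduce to simple set operations. A minimal sketch, assuming each facet is represented as a list of support groups (sets of document sentence indices); the function names are illustrative:

```python
def far(support_groups_per_facet, extract):
    """Facet-Aware Recall (sketch).

    support_groups_per_facet: for each reference facet, a list of support
    groups, where each group is a set of document sentence indices that
    jointly cover the facet.
    extract: set of sentence indices chosen by the system.
    """
    covered = sum(
        1 for groups in support_groups_per_facet
        if any(group <= extract for group in groups)  # group fully included
    )
    return covered / len(support_groups_per_facet)

def sar(support_groups_per_facet, extract):
    """Support-Aware Recall (sketch): fraction of all support sentences
    that appear in the system extract."""
    support = set().union(*(g for groups in support_groups_per_facet
                            for g in groups))
    return len(support & extract) / len(support)
```

Note that FAR requires an entire support group to be present before a facet counts as covered, while SAR gives partial credit per sentence; the gap between the two exposes extracts that scatter support sentences without completing any facet.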

3. Faithfulness Detection: Error-Typology and Automated Metrics

Extractive summaries can exhibit multiple forms of unfaithfulness, despite directly copying from the source. Five fine-grained error types are labeled:

  1. Incorrect Coreference: Anaphors in the summary resolve to the wrong antecedents relative to the original document.
  2. Incomplete Coreference: Anaphors in the summary lack clear antecedents.
  3. Incorrect Discourse: Discourse linkers imply false relations due to extraction boundaries.
  4. Incomplete Discourse: Discourse units or connectives are incomplete, confusing interpretation.
  5. Other Misleading Information: Content selection biases or misleads beyond strict entailment.

The ExtEval metric detects these errors through four sub-detectors, running SpanBERT-based coreference resolution, explicit discourse-marker identification, and RoBERTa-based sentiment-difference calculation. ExtEval is computed as the sum of binary indicators for each error (plus the sentiment gap):

$$\mathrm{ExtEval}(S, D) = f_1 + f_2 + f_3 + f_4$$

where each $f_i$ proxies a specific error subtype. On system-level ranking, ExtEval achieves Pearson $r \approx 0.96$ and Spearman $\rho \approx 0.88$ with human-judged faithfulness, substantially outperforming ROUGE, FactCC, and QuestEval (Zhang et al., 2022).
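The aggregation itself is just a sum over pluggable sub-detectors. The sketch below pairs it with one rough, rule-based stand-in for the incomplete-discourse detector (a hand-written marker list, not the SpanBERT/RoBERTa machinery the actual metric uses):

```python
DISCOURSE_MARKERS = {"however", "therefore", "moreover", "thus",
                     "furthermore", "consequently"}

def dangling_discourse(summary_sents):
    # Heuristic stand-in for one f_i: flag a summary that opens with a
    # discourse connective, since the sentence it links back to was not
    # extracted. (A rough assumption, not the paper's detector.)
    if not summary_sents or not summary_sents[0].split():
        return 0
    first_word = summary_sents[0].split()[0].strip(",").lower()
    return 1 if first_word in DISCOURSE_MARKERS else 0

def ext_eval(summary_sents, detectors):
    # ExtEval is the sum of the sub-detector scores f_1 + ... + f_4.
    return sum(f(summary_sents) for f in detectors)
```

A summary beginning "However, prices fell." would be flagged by this detector, while the same sentence preceded by its context sentence would not; the real sub-detectors apply the analogous logic with learned coreference and discourse models.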

4. Extractive Fragment Coverage and Density

To analyze how extractive a summary is, two canonical techniques are defined:

  • Extractive Fragment Coverage (Coverage): The fraction of summary tokens that appear as contiguous fragments in the article.

$$\text{Coverage}(A, S) = \frac{1}{|S|}\sum_{f \in \mathcal{F}(A,S)} |f|$$

This quantifies literal copying; Coverage ≈ 1 for fully extractive summaries.

  • Extractive Fragment Density (Density): The average squared length of each fragment per summary token.

$$\text{Density}(A, S) = \frac{1}{|S|}\sum_{f \in \mathcal{F}(A,S)} |f|^2$$

Higher Density indicates longer contiguous copied spans. These metrics distinguish between summaries with similar Coverage but differing composition (e.g., many single-word vs. a few long fragment matches). Newsroom analyses show wide variance in both coverage and density across publications and relative to other datasets like CNN/DailyMail (Grusky et al., 2018).

| Metric | Definition | Interprets |
| --- | --- | --- |
| Coverage | Fraction of summary tokens in fragments | Extractiveness |
| Density | Mean squared fragment length per token | Extractive grouping |
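Both statistics depend on the shared fragment set $\mathcal{F}(A,S)$, which Grusky et al. compute with a greedy longest-match procedure. A minimal sketch over whitespace tokens (a simplification of the published algorithm):

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily find the longest article fragment starting at each
    summary position, consuming matched tokens (sketch of the greedy
    matching described by Grusky et al., 2018)."""
    fragments = []
    i = 0
    while i < len(summary_tokens):
        best = 0
        j = 0
        while j < len(article_tokens):
            if summary_tokens[i] == article_tokens[j]:
                # Extend the shared run as far as possible.
                k = 0
                while (i + k < len(summary_tokens)
                       and j + k < len(article_tokens)
                       and summary_tokens[i + k] == article_tokens[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            fragments.append(summary_tokens[i:i + best])
            i += best
        else:
            i += 1  # summary token never appears in the article
    return fragments

def coverage(article_tokens, summary_tokens):
    frags = extractive_fragments(article_tokens, summary_tokens)
    return sum(len(f) for f in frags) / len(summary_tokens)

def density(article_tokens, summary_tokens):
    frags = extractive_fragments(article_tokens, summary_tokens)
    return sum(len(f) ** 2 for f in frags) / len(summary_tokens)
```

For the article "the cat sat on the mat" and summary "the cat sat happily", the single shared fragment is "the cat sat", giving Coverage 3/4 and Density 9/4, which matches the intuition that Density rewards long contiguous copies.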

5. Experimental Protocols, Correlation, and Analysis

Evaluation of extractive metrics follows structured protocols:

  • Correlating single/multi-reference metric scores to expert human ratings on dimensions such as Consistency, Relevance, Coherence, and Fluency.
  • For redundancy-aware Sem-nCG, λ=0.5 yields robust trade-offs; multi-reference strategies (Ensembleₛᵢₘ, Ensembleᵣₑₗ) stabilize ranking compared to per-ref metric averaging (Akter et al., 2023).
  • Facet-aware experiments highlight that systems may achieve high support retrieval (SAR) while exhibiting localized redundancy, underscoring the need for nuanced evaluation (Mao et al., 2019).
  • Faithfulness metrics reveal that even ROUGE-oracle summaries are frequently unfaithful; ExtEval enables targeted reranking or automatic post-editing to repair coreference/discourse errors (Zhang et al., 2022).

6. Strengths, Limitations, and Future Directions

Extractive evaluation techniques now address multiple desiderata: semantic coverage, redundancy control, ranking quality, and faithfulness. Innovations such as redundancy-aware Sem-nCG and facet-aware recall yield better alignment with human preferences and uncover properties invisible to standard overlap-based metrics. Nonetheless, these methods share structural limitations:

  • Dependency on embedding/similarity model calibration; robustness on highly abstractive or paraphrased content is limited.
  • λ or threshold hyperparameters typically require domain-specific tuning.
  • Most approaches currently evaluate only extractive summaries or datasets with clear extractive structure.
  • Annotation and mapping of facets/support groups for evaluation are resource-intensive; automatic methods improve precision but often sacrifice recall.

Potential future directions include learned, instance-adaptive weighting between importance and diversity, extending redundancy penalties and faithfulness detection to generative (abstractive) summaries, and incorporating factuality verification or reference-free evaluation modules (Akter et al., 2023, Mao et al., 2019, Zhang et al., 2022).


In summary, contemporary extractive evaluation combines semantic gain, redundancy control, facet coverage, and error-typology-based faithfulness scoring, supplementing and superseding lexical overlap metrics for systematic, fine-grained, and human-aligned assessment of extractive summarization systems.
