
GUM-SAGE: Graded Entity Salience Model

Updated 3 April 2026
  • GUM-SAGE is a dataset and methodology that quantifies entity prominence on a 0–5 scale by aligning multiple human and LLM-generated summaries.
  • It employs both Stanza coreference and an ensemble logistic regression model to robustly predict graded entity salience without relying on contextual token embeddings.
  • Empirical results show clear gains over position-based and LLM-prompting baselines, with the best Spearman’s correlation and top-3 retrieval F1 for salient entities.

GUM-SAGE is a dataset and methodology for graded entity salience prediction in English documents. Graded entity salience quantifies entity importance on a discrete 0–5 scale, reflecting the relative prominence of entities within a text based on their inclusion in multiple reference summaries. The GUM-SAGE framework combines human and LLM-generated one-sentence summaries, robust entity–summary alignment, and lightweight model architectures to establish new benchmarks in graded salience correlation, substantially surpassing both position-based heuristics and few-shot prompting of LLMs (Lin et al., 15 Apr 2025).

1. Dataset Construction

GUM-SAGE is constructed atop the Universal Dependencies English GUM corpus, encompassing 213 documents from 12 genres: academic, biography, conversation, fiction, interview, news, reddit, speech, textbook, vlog, voyage, and wikihow. The corpus contains approximately 203,781 tokens, 57,360 mentions, and 32,300 distinct entities (both named and non-named), averaging 148 entities per document.

Each document is paired with five aligned summaries:

  1. One “gold” expert-written summary, with 24 test documents receiving a second gold summary.
  2. Four “silver” summaries (training) generated by prompting off-the-shelf LLMs (GPT-4o, Claude 3.5 Sonnet, Llama 3.2 3B Instruct, Qwen 2.5 7B Instruct) for single-sentence, ≤380 character summaries.
  3. Four additional human summaries (development/test) crowdsourced from graduate students under strict guidelines for consistency and factuality.

This yields five summaries per document in every split: the gold summary plus four silver summaries (training) or four human summaries (development/test). Entity salience scores are computed as:

$$S(e) \;=\; \sum_{j=1}^{5} \mathbf{1}\bigl(e \in \text{summary}_j\bigr)$$

where $S(e)\in\{0,1,2,3,4,5\}$. An entity is non-salient if $S(e)=0$, with $S(e)=1$–$5$ interpreted as a graded measure of salience.
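As a minimal sketch of the counting rule above, the snippet below sums per-summary presence for each entity. The entity identifiers and the toy summaries are hypothetical, and each summary is assumed to be already reduced to the set of entities aligned to it.

```python
from typing import Dict, List, Set


def salience_scores(entities: List[str],
                    summary_entities: List[Set[str]]) -> Dict[str, int]:
    """Count, for each entity, how many summaries it appears in."""
    return {e: sum(1 for s in summary_entities if e in s) for e in entities}


# Hypothetical toy document: four entities, five summaries.
entities = ["mayor", "city_council", "budget", "weather"]
summaries = [
    {"mayor", "budget"},
    {"mayor", "city_council", "budget"},
    {"mayor"},
    {"mayor", "budget"},
    {"mayor", "city_council"},
]

scores = salience_scores(entities, summaries)
print(scores)
```

With five summaries the score is bounded by 0 and 5, and an entity absent from every summary ("weather" here) stays at 0, i.e. non-salient.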

The corpus exhibits genre-related variation: on average, 13.8% of entities per document have nonzero salience, rising to 32.9% in wikihow and dropping to 6.3% in academic texts, highlighting strong genre dependence in entity inclusion within summaries.

2. Graded Entity Salience Modeling Approaches

GUM-SAGE evaluates two primary task-specific modeling approaches:

Stanza Coreference Model

Documents concatenated with each summary are processed using the Stanza Coreference system (XLM-RoBERTa-large, trained on CorefUD). For a given summary, any mention sharing a coreference cluster with a document mention is considered “present.” Each entity's summary-level presence is then summed to yield its 0–5 salience score. No additional neural layers or task-specific fine-tuning are used.
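The cluster-based presence test can be sketched without running a coreference system. The code below assumes the predicted clusters are already available as lists of character spans over the concatenated "document + summary" text, and that document entities are matched to cluster mentions by exact span; both the data layout and the function names are illustrative assumptions, not the authors' implementation.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets in "document + summary"


def present_entities(entity_spans: Dict[str, List[Span]],
                     clusters: List[List[Span]],
                     summary_start: int) -> Set[str]:
    """Entities whose document mentions share a cluster with a summary mention."""
    present = set()
    for cluster in clusters:
        # Only clusters that reach into the summary can signal presence.
        if not any(start >= summary_start for start, _ in cluster):
            continue
        doc_side = {span for span in cluster if span[1] <= summary_start}
        for entity, spans in entity_spans.items():
            if doc_side.intersection(spans):
                present.add(entity)
    return present


def coref_salience(entity_spans: Dict[str, List[Span]],
                   runs: List[Tuple[List[List[Span]], int]]) -> Dict[str, int]:
    """Sum presence over one coreference run per summary."""
    scores = {entity: 0 for entity in entity_spans}
    for clusters, summary_start in runs:
        for entity in present_entities(entity_spans, clusters, summary_start):
            scores[entity] += 1
    return scores


# Hypothetical spans: the document occupies offsets 0-100, each summary starts at 101.
entity_spans = {"mayor": [(0, 9), (40, 42)], "budget": [(20, 30)]}
runs = [
    ([[(0, 9), (40, 42), (105, 114)], [(20, 30)]], 101),  # summary 1 mentions the mayor
    ([[(0, 9), (103, 112)], [(20, 30), (120, 130)]], 101),  # summary 2 mentions both
]
coref_scores = coref_salience(entity_spans, runs)
print(coref_scores)
```

The toy input uses two runs for brevity; with one run per summary, five runs give scores on the 0–5 scale.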

Ensemble Logistic-Regression Model

This method constructs a feature vector $\mathbf{x}(e,j)$ for each entity-summary pair, composed of:

  • Three binary alignment-module flags:
    • String match (exact or high-overlap)
    • Stanza coreference
    • GPT-4o prompt-based match
  • Linguistic features from GUM:
    • One-hot entity type (Person, Organization, Abstract, etc.)
    • One-hot genre (12 genres)
    • Normalized document position of first mention, in $[0,1]$
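One way to assemble such a feature vector is sketched below. The feature ordering, the function names, and the entity-type inventory are assumptions: the type list contains only the types named in this article, and the full GUM inventory may differ.

```python
from typing import List

# Assumption: only the entity types named in this article; GUM may define more.
ENTITY_TYPES = ["person", "place", "organization", "animal",
                "plant", "abstract", "event", "time"]
# The 12 GUM genres listed in the dataset description.
GENRES = ["academic", "biography", "conversation", "fiction", "interview", "news",
          "reddit", "speech", "textbook", "vlog", "voyage", "wikihow"]


def one_hot(value: str, vocabulary: List[str]) -> List[float]:
    """Return a one-hot list for value; reject values outside the vocabulary."""
    if value not in vocabulary:
        raise ValueError(f"unknown value: {value!r}")
    return [1.0 if value == v else 0.0 for v in vocabulary]


def build_features(string_match: bool, coref_match: bool, llm_match: bool,
                   entity_type: str, genre: str,
                   first_mention_pos: float) -> List[float]:
    """Concatenate alignment flags, one-hot type and genre, and position."""
    if not 0.0 <= first_mention_pos <= 1.0:
        raise ValueError("first_mention_pos must lie in [0, 1]")
    flags = [float(string_match), float(coref_match), float(llm_match)]
    return (flags + one_hot(entity_type, ENTITY_TYPES)
            + one_hot(genre, GENRES) + [first_mention_pos])


# Hypothetical entity-summary pair: string and LLM modules agree, coref does not.
x = build_features(True, False, True, "person", "news", 0.25)
print(len(x), x)
```

Under these assumptions the vector has 3 + 8 + 12 + 1 = 24 dimensions, small enough for a linear model to remain easy to inspect.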

A logistic regression classifier predicts entity-summary alignment probability:

$$p_{j}(e) = \sigma\bigl(\mathbf{w}^\top \mathbf{x}(e,j) + b\bigr),\quad \sigma(z)=\frac{1}{1+e^{-z}}$$

The model is trained with binary cross-entropy loss on the manually aligned development data. Its outputs are thresholded at 0.5 to determine presence, and the per-summary decisions are summed as before to give the entity's salience score.
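The inference step (sigmoid, threshold, sum) can be illustrated as follows. The weights, bias, and the restriction to the three alignment flags are made-up values for the example; real parameters would come from the trained classifier.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    """Logistic function; adequate for the small inputs used in this sketch."""
    return 1.0 / (1.0 + math.exp(-z))


def alignment_probability(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """p_j(e) = sigmoid(w . x + b) for one entity-summary pair."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def predicted_salience(feature_rows: List[Sequence[float]],
                       w: Sequence[float], b: float,
                       threshold: float = 0.5) -> int:
    """Count the summaries whose alignment probability exceeds the threshold."""
    return sum(1 for x in feature_rows
               if alignment_probability(x, w, b) > threshold)


# Hypothetical weights over the three alignment flags only.
w = [2.0, 1.5, 1.0]
b = -2.0
# One row per summary: (string match, coref match, LLM match).
rows = [
    [1, 1, 1],  # z = 2.5  -> present
    [0, 0, 0],  # z = -2.0 -> absent
    [1, 0, 0],  # z = 0.0  -> p = 0.5, not above the threshold
    [0, 1, 1],  # z = 0.5  -> present
    [0, 0, 1],  # z = -1.0 -> absent
]
score = predicted_salience(rows, w, b)
print(score)
```

Whether a probability of exactly 0.5 counts as present is a convention; the sketch uses a strict comparison.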

The classifier does not take contextual token embeddings as input at any stage; only the alignment flags and the coarse linguistic features listed above inform salience estimation.

3. Training Regimen and Evaluation Protocol

For the Stanza coreference approach, the coreference system is applied as provided, with no domain-specific adaptation. The ensemble method uses off-the-shelf scikit-learn logistic regression (L2 regularization, $C=1.0$), trained solely on manually aligned dev data; no neural architectures or additional training loops are introduced.

Evaluation employs three main criteria:

  • Spearman’s $\rho$: correlation between predicted and gold salience scores.
  • Calibration (RMSE): root mean square error between predicted and actual scores.
  • Top-$k$ Retrieval (precision, recall, F1):
    • Top1: Entities with true $S(e)$ […]
    • Top3: Entities with $S(e)$ […]
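The three criteria can be computed with the standard library alone, as in the sketch below. Spearman's correlation is implemented as the Pearson correlation of tie-averaged ranks. The `min_score` cutoff used for retrieval is a hypothetical parameter: the exact Top1 and Top3 score thresholds are not specified here, so the value in the example is purely illustrative.

```python
import math
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def spearman(pred: Sequence[float], gold: Sequence[float]) -> float:
    return pearson(average_ranks(pred), average_ranks(gold))


def rmse(pred: Sequence[float], gold: Sequence[float]) -> float:
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gold)) / len(gold))


def prf_at_cutoff(pred: Sequence[float], gold: Sequence[float],
                  min_score: float) -> Tuple[float, float, float]:
    """Precision, recall, F1 for entities scoring at or above min_score."""
    pred_pos = {i for i, p in enumerate(pred) if p >= min_score}
    gold_pos = {i for i, g in enumerate(gold) if g >= min_score}
    tp = len(pred_pos & gold_pos)
    precision = tp / len(pred_pos) if pred_pos else 0.0
    recall = tp / len(gold_pos) if gold_pos else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical gold and predicted scores for five entities.
gold = [5, 3, 0, 0, 1]
pred = [4, 3, 1, 0, 0]
print(spearman(pred, gold), rmse(pred, gold), prf_at_cutoff(pred, gold, min_score=3))
```

Because salience scores are heavily tied (most entities score 0), tie-aware ranking matters for the correlation.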

Baselines include a position-only heuristic (first 10% of sentences $\rightarrow$ score 5, etc.), and direct zero- and few-shot prompting of LLMs (e.g., GPT-4o, Llama 3.2-Instruct, Mistral 7B-Instruct), with LLMs tasked to select salient entities and provide a 1–5 salience score.

4. Empirical Results

Key outcomes on the test set are summarized below:

| Model/Baseline | Spearman’s $\rho$ | RMSE | Top1 F1 | Top3 F1 |
|---|---|---|---|---|
| Position baseline | 0.153 | 2.554 | - | - |
| GPT-4o 3-shot | 0.254 | 1.111 | 0.405 | 0.361 |
| Llama 3.2-Instruct | 0.223 | 1.296 | - | - |
| Mistral 7B-Instruct | 0.254 | 1.206 | - | - |
| Stanza coref | 0.384 | 1.031 | 0.321 | 0.448 |
| Ensemble (ours) | 0.540 | 1.067 | 0.367 | 0.527 |

The ensemble’s $\rho$ improvement over GPT-4o 3-shot is statistically significant (Wilcoxon test). The ensemble also achieves the highest recall for top-1 salient entities (0.755), but at some cost to precision, which leaves its Top1 F1 (0.367) below that of GPT-4o 3-shot (0.405). Top-3 F1 performance is strongest for the ensemble (0.527), confirming that combining multiple alignment signals enhances graded salience retrieval.

5. Analysis and Ablations

A series of ablations isolates alignment module contributions:

  • String match only: micro-F1 = 0.56
  • GPT-4o alignment only: micro-F1 = 0.75
  • Stanza coref only: micro-F1 = 0.77
  • All three plus linguistic features (ensemble): micro-F1 = 0.98 (positive-class F1 = 0.90)

This demonstrates that integrating both precision-oriented (string match, GPT-4o) and recall-oriented (Stanza coref) alignment signals with document-level features yields substantial improvements in alignment, which translate to graded salience prediction gains.
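The gap between the ensemble's micro-F1 (0.98) and its positive-class F1 (0.90) is what one would expect when most entity-summary pairs are negative. The sketch below illustrates the effect on made-up labels; it assumes micro-F1 pools counts over both the aligned and not-aligned classes, which may differ from how the reported figure was computed.

```python
from typing import Sequence, Tuple


def class_counts(gold: Sequence[int], pred: Sequence[int],
                 label: int) -> Tuple[int, int, int]:
    """True positives, false positives, false negatives for one class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    return tp, fp, fn


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def positive_f1(gold: Sequence[int], pred: Sequence[int]) -> float:
    return f1_from_counts(*class_counts(gold, pred, label=1))


def micro_f1(gold: Sequence[int], pred: Sequence[int],
             labels: Sequence[int] = (0, 1)) -> float:
    """Pool counts over all classes before computing F1."""
    tp = fp = fn = 0
    for label in labels:
        c_tp, c_fp, c_fn = class_counts(gold, pred, label)
        tp, fp, fn = tp + c_tp, fp + c_fp, fn + c_fn
    return f1_from_counts(tp, fp, fn)


# Hypothetical labels: 4 aligned pairs out of 20, with one miss and one false alarm.
gold_labels = [1, 1, 1, 1] + [0] * 16
pred_labels = [1, 1, 1, 0] + [1] + [0] * 15
print(micro_f1(gold_labels, pred_labels), positive_f1(gold_labels, pred_labels))
```

For single-label binary data, pooled micro-F1 equals accuracy, so it is dominated by the majority (not-aligned) class, while positive-class F1 reflects only the aligned pairs.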

Error analysis surfaces several trends:

  1. Both LLM and alignment-based models over-predict mid-range scores (3/4) for true low-salience entities, reducing calibration at the extremes.
  2. Entity-type breakdown shows that concrete types (animal, plant, organization) exhibit high LLM precision/recall, while abstract/event/time entities remain challenging. Person/place types display high recall but suffer from over-prediction, lowering precision.
  3. False positives are concentrated among early-mentioned entities (“over-salience” due to positional bias), whereas false negatives are more evenly distributed.

Coarse features such as entity type, genre, and entity position stabilize predictions, particularly for entities with limited summary evidence.

6. Context and Implications

GUM-SAGE’s proxy-based, summary-alignment-driven approach offers greater stability than subjective graded scoring and greater expressivity than traditional summarization-based binary salience. The methodology highlights the limitations of zero-/few-shot LLM prompting for fine-grained entity ranking and formalizes multi-summary alignment as a robust, explainable method for graded salience prediction.

A plausible implication is that summary-derived, graded salience signals may generalize to user-facing applications in summarization, browsing, and document interpretation where calibrated entity prominence is necessary. The GUM-SAGE dataset and codebase are publicly released to enable further research on multi-document, multi-reference, and genre-sensitive salience modeling.

References (1)
