Entity Bias Analysis in Neural Summarization

Updated 16 January 2026

Entity Bias Index is a conceptual evaluation framework that uses paired comparisons through entity replacement to reveal bias in summarization outputs.
The methodology employs quantitative measures such as summary similarity and entity mention frequency to detect systematic differences in model behavior.
Empirical findings highlight significant disparities across extractive and abstractive models, underscoring model sensitivities to entity substitutions.

The "Entity Bias Index" (EBI) is not a formally defined or quantified metric in the academic literature on automatic summarization evaluation. However, the conceptual underpinnings associated with measuring entity-specific bias in neural summarization models are exemplified by the entity-based evaluation framework in Zhou and Tan’s study on political bias in automatic summarization (Zhou et al., 2023). Their methodology leverages entity replacement and targeted feature comparison across summary outputs to characterize systematic differences in the portrayal of politicians by extractive and abstractive neural models. The approach operationalizes entity bias through a suite of paired, feature-level quantitative comparisons, rather than through a unified scalar index.

1. Entity-Replacement Experimental Paradigm

The foundational methodology involves an entity-replacement experiment in which all mentions of a source politician (entity $e_1$; e.g., "Trump" or "Biden") in a news article $D_i$ are replaced with a target entity $e_2$ (another president: "Biden," "Obama," "Bush"). This results in a pair of articles $(D_i^{e_1}, D_i^{e_2})$ differing solely by entity name. Each paired document is input to state-of-the-art summarization models, yielding paired summaries $(S_i^{e_1}, S_i^{e_2})$. All subsequent bias measurements are based on comparing these paired summary outputs for systematic differences attributable to the entity identity alone (Zhou et al., 2023).
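The replacement step can be illustrated with a minimal Python sketch. It assumes plain whole-word string matching on the surname only; the function name `replace_entity` and the example sentence are hypothetical and are not taken from the study.

```python
import re

def replace_entity(article: str, source: str, target: str) -> str:
    """Replace whole-word mentions of `source` with `target`.

    Assumption: simple string matching on the surname, with no
    co-reference resolution (pronouns and titles are left untouched).
    """
    pattern = re.compile(rf"\b{re.escape(source)}\b")
    # A function replacement avoids interpreting backslashes in `target`.
    return pattern.sub(lambda match: target, article)

# Hypothetical article text, used only to show the paired-document construction.
article = "Trump said on Monday that Trump's team would respond."
paired_documents = (article, replace_entity(article, "Trump", "Biden"))
```

Both elements of `paired_documents` would then be passed to the same summarization model to obtain the paired summaries.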

2. Feature-Based Quantitative Measures

Zhou and Tan’s framework operationalizes the analysis of entity bias by computing a set of feature-based differences between summary pairs:

Summary similarity: Uses Python's difflib.SequenceMatcher (Gestalt pattern matching) to compute a normalized similarity score between $S_i^{e_1}$ and $S_i^{e_2}$, specifically:

$\mathrm{sim}(S_i^{e_1}, S_i^{e_2}) = \frac{2M}{T}$

where $M$ is the count of matching words, and $T$ is the total number of words across both summaries.

Entity-name mention frequency: Calculates

$f_e(S) = \frac{\# \text{occurrences of entity name in } S}{|S|}$

for both the original and replaced entity.

Normalized frequencies of titles and related terms: For example, frequencies for "Vice President" or "administration" are computed analogously, capturing the contexts or attributions applied to the entity.

A difference feature, $\Delta_i = \phi(S_i^{e_2}) - \phi(S_i^{e_1})$, is calculated for each aspect $\phi$ (similarity, entity mention, title frequency) on the $i$th article-pair.
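The measures above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the study's implementation: the tokenizer, the helper names (`tokenize`, `similarity`, `term_frequency`, `delta`), and the example summaries are all hypothetical, and $|S|$ is taken to be the number of word tokens.

```python
import re
from difflib import SequenceMatcher

def tokenize(text: str) -> list[str]:
    # Assumption: lowercase alphabetic word tokens; the source does not
    # specify how summaries are tokenized.
    return re.findall(r"[a-z]+", text.lower())

def similarity(s1: str, s2: str) -> float:
    # SequenceMatcher.ratio() returns 2*M/T over the two token sequences.
    return SequenceMatcher(None, tokenize(s1), tokenize(s2)).ratio()

def term_frequency(summary: str, term: str) -> float:
    # Occurrences of a (possibly multi-word) term divided by summary length.
    tokens, target = tokenize(summary), tokenize(term)
    if not tokens or not target:
        return 0.0
    n = len(target)
    hits = sum(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))
    return hits / len(tokens)

def delta(s_orig: str, s_repl: str, term: str) -> float:
    # Difference feature: phi(replaced summary) - phi(original summary).
    return term_frequency(s_repl, term) - term_frequency(s_orig, term)

# Hypothetical paired summaries.
s_trump = "Trump said the plan would pass"
s_biden = "Biden said the administration plan would pass"
pair_similarity = similarity(s_trump, s_biden)
admin_delta = delta(s_trump, s_biden, "administration")
```

In this sketch the similarity score is computed directly on the pair, while `delta` applies to per-summary features such as entity-name or title frequency.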

3. Statistical Comparison and Significance Testing

The entity-dependent differences $\Delta_i$ are evaluated across all instances $i = 1, \ldots, N$ via paired $t$-tests to determine whether the mean difference is significantly different from zero. This approach distinguishes systematic model sensitivities to entity substitution, operationalizing bias in the sense of entity-contingent shifts in summary content or framing. Notably, the absence of a unified index means that bias is characterized in terms of individual feature shifts, each supported by statistical confidence intervals (Zhou et al., 2023).
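A paired $t$-test on two matched samples is equivalent to a one-sample $t$-test of their differences against zero, so the statistic can be computed from the $\Delta_i$ values alone. The following standard-library sketch shows that computation; the function name `paired_t_statistic` and the example values are hypothetical and are not results from the study.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(deltas: list[float]) -> tuple[float, int]:
    """Return (t, degrees of freedom) for H0: the mean difference is zero."""
    n = len(deltas)
    if n < 2:
        raise ValueError("need at least two paired differences")
    sd = stdev(deltas)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("t statistic is undefined when all differences are equal")
    standard_error = sd / math.sqrt(n)
    return mean(deltas) / standard_error, n - 1

# Hypothetical per-article differences, for illustration only.
example_deltas = [0.1, 0.2, 0.15, 0.05, 0.0]
t_value, dof = paired_t_statistic(example_deltas)
```

The p-value follows from a $t$ distribution with $N - 1$ degrees of freedom; a library routine such as `scipy.stats.ttest_1samp(example_deltas, 0.0)` would return it directly.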

4. Model Architectures and Dataset Scope

The evaluation encompasses both extractive and abstractive summarization architectures:

Model	Type	Architecture
PreSumm	Extractive	BERT-based
PEGASUS	Abstractive	Transformer
BART	Abstractive	Transformer
ProphetNet	Abstractive	Transformer

Input data are derived from the NOW corpus of news articles (Jan 2020–Dec 2021), filtered to yield articles mentioning only Trump or only Biden (≈158,000 Trump-only; ≈86,000 Biden-only).
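The single-entity filter can be sketched as below. The source does not describe the exact matching rule, so whole-word surname matching is an assumption here, and the function name `single_entity_label` is hypothetical.

```python
import re
from typing import Optional

def single_entity_label(text: str, entities=("Trump", "Biden")) -> Optional[str]:
    """Return the one entity mentioned in `text`, or None otherwise.

    Assumption: an article "mentions" an entity if the surname appears
    as a whole word; the study's actual filtering rule may differ.
    """
    mentioned = [
        name for name in entities
        if re.search(rf"\b{re.escape(name)}\b", text)
    ]
    return mentioned[0] if len(mentioned) == 1 else None

# Hypothetical articles, used only to show how the filter behaves.
articles = [
    "Trump spoke at the rally.",
    "Trump met Biden on Tuesday.",
    "The committee adjourned.",
]
labels = [single_entity_label(a) for a in articles]
```

Articles labeled `None` (both entities or neither) would be excluded from the paired experiment.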

5. Empirical Findings on Model Sensitivity

Feature-specific findings establish model- and entity-dependent disparities:

Summary similarity: ProphetNet exhibits the largest mean drop in similarity when substituting Trump for Biden (or vice versa), indicating heightened sensitivity, whereas PreSumm shows the smallest reaction.
Entity mentions: Abstractive models, including PEGASUS and BART, reference "Trump" significantly less frequently than "Biden," "Obama," or "Bush" in matched summary contexts (paired $t$-test, $p<10^{-20}$).
Vice President term: In 2020-generated summaries, "Vice President" appears markedly more often when Biden is the original or the replacement entity than when Trump is.
Administration framing: Abstractive models more often attach the term "administration" to "Biden" than to "Trump," suggesting an interpretive shift towards individualistic depiction of Trump.

These feature-level differences yield a multidimensional profile of how model generation is systematically sensitive to focal entities.

6. Limitations and Prospects for Unified Indices

The entity-based evaluation framework demonstrates several methodological limitations:

Restriction to four high-profile U.S. presidents and monolingual (English) news data
No co-reference resolution beyond string-matching entity substitution
Use of off-the-shelf decoding hyperparameters
No aggregated "bias index"—only featurewise statistical characterization

A plausible implication is that future research could synthesize these feature shifts into a formal, unified bias index (akin to a quantitative "Entity Bias Index") integrating multiple dimensions of representational asymmetry. This would enable finer-grained tracking of model bias and facilitate cross-model, cross-entity, or multilingual comparisons.

7. Context, Misconceptions, and Theoretical Significance

No formal "Entity Bias Index"—with an explicit formula, bounded value range, or comprehensive scalar summary—exists in Zhou and Tan’s analysis. Rather, the work establishes a suite of interpretable, statistically validated feature comparisons as the empirical basis for bias detection. This feature-driven methodology situates the study within the broader context of social bias auditing in NLP, illustrating the methodological rigor and granularity necessary to expose subtle bias effects in modern neural text generation (Zhou et al., 2023). Misconceptions that a single EBI metric exists are not supported by this research; the current paradigm is fundamentally multidimensional and statistical in nature.


References (1)

Zhou and Tan (2023). Entity-Based Evaluation of Political Bias in Automatic Summarization.
