Coreference Resolution Systems

Updated 11 June 2026

Coreference resolution systems are computational frameworks that identify and group textual expressions referring to the same real-world entity to improve language understanding.
They integrate diverse paradigms including rule-based, pairwise statistical, end-to-end neural span-based, and hybrid reinforcement learning approaches to optimize mention clustering.
Practical challenges such as efficiency, fairness, and scalability drive ongoing advancements, with applications in question answering, summarization, and cross-lingual tasks.

Coreference resolution systems are computational frameworks that identify and cluster textual expressions (mentions) referring to the same real-world entity within or across documents. Accurate coreference resolution is a foundational capability for natural language understanding, with direct impact on tasks such as question answering, information extraction, summarization, and discourse analysis. The field encompasses a wide spectrum of methodologies, ranging from deterministic rule-based pipelines to deep neural architectures that integrate pretrained language models (PLMs), syntactic parsing, and semantic role labeling.

1. Architectural Paradigms of Coreference Resolution

System architectures are typically classified into four main paradigms:

Rule-Based Deterministic Systems: Early systems such as ARKref (O'Connor et al., 2013) rely on deterministic, linguistically motivated rules. They utilize syntactic parses (e.g., constituent or dependency trees), mention type attributes (pronoun, nominal, proper noun), and semantic category tags (e.g., personhood, gender, number) to constrain antecedent selection. Resolution proceeds via sequential filtering—syntactic pattern matches (appositives, predicate-nominatives), binding constraints (e.g., I-within-I), type compatibility checks (gender, personhood, number), and finally salience measures (shortest parse-tree distance). Clustering is achieved via transitive closure over the antecedent links.
Mention-Pair and Cluster-Ranking Models: Traditional statistical systems model coreference as binary classification or ranking over mention pairs, sometimes incorporating entity-level (cluster) features. Learning-to-search frameworks further refine clustering by optimizing local decisions with respect to global metrics using sequential rollouts (Clark et al., 2016). Recent neural models (e.g., feed-forward and recurrent cluster encoders) replace hand-crafted features with distributed representations, but still require candidate span enumeration and pruning.
End-to-End Neural Span-Based Models: The dominant approach for high-accuracy resolution in the past decade is end-to-end modeling built on pre-trained language models (PLMs) such as BERT, RoBERTa, and SpanBERT. Given an input sequence, these systems enumerate all spans up to a maximum length $L$, encode each candidate span by concatenating boundary, head-attention, and surface features, then score possible antecedent links for each span (Pražák et al., 2024, Dong et al., 8 Apr 2025). Span pruning (beam or threshold) is applied to control computational complexity, and output clusters are formed by maximizing global probability or likelihood scores.
Hybrid and Reinforcement Learning Systems: Hybrid architectures combine deterministic rule-based linking with neural policies (actor-critic RL) for non-obvious links (Wang et al., 2022, Wang et al., 2022). The actor-critic framework treats coreference as a sequential decision process over mention-antecedent pairs, jointly optimizing mention detection and clustering with reward signals shaping model exploration.
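The clustering step shared by several of these paradigms, forming entities as the transitive closure of antecedent links, can be sketched with a union-find structure. This is a minimal illustration, not the implementation of any cited system; the function name `cluster_from_links` and the toy mentions are hypothetical, and mentions are assumed to be unique strings listed in document order.

```python
from collections import defaultdict


def cluster_from_links(mentions, antecedent_links):
    """Group mentions into entities via the transitive closure of antecedent links.

    mentions: unique mention identifiers in document order (an assumption).
    antecedent_links: (mention, antecedent) pairs chosen by any resolver.
    """
    parent = {m: m for m in mentions}

    def find(m):
        # Walk up to the representative, halving the path as we go.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for mention, antecedent in antecedent_links:
        root_m, root_a = find(mention), find(antecedent)
        if root_m != root_a:
            parent[root_m] = root_a  # merge the two partial clusters

    order = {m: i for i, m in enumerate(mentions)}
    clusters = defaultdict(list)
    for m in mentions:  # keeps document order inside each cluster
        clusters[find(m)].append(m)
    return sorted(clusters.values(), key=lambda c: order[c[0]])


# Hypothetical toy input: each pronoun links to one earlier mention.
mentions = ["Ada Lovelace", "she", "the engine", "her", "it"]
links = [("she", "Ada Lovelace"), ("her", "she"), ("it", "the engine")]
clusters = cluster_from_links(mentions, links)
print(clusters)
```

Because "her" links to "she" and "she" links to "Ada Lovelace", all three end up in one cluster even though no direct link between "her" and "Ada Lovelace" was predicted.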

2. Integration of Syntax, Semantics, and PLMs

The limitations of pure syntactic or semantic pipelines have motivated architectures explicitly bridging these sources. The framework in "Enhancing Coreference Resolution with Pretrained LLMs" (Liu et al., 8 Apr 2025) exemplifies this synthesis:

Contextual embeddings ( $c_i$ ): Derived from a PLM (e.g., BERT, RoBERTa), encoding deep contextual semantics for each token.
Syntactic embeddings ( $s_i$ ): Encode constituency or dependency parse features (parent, sibling labels), providing structural context.
Semantic role embeddings ( $r_i$ ): Extracted from an SRL module, capturing predicate–argument information for precise referential distinctions.

Fusion of these representations is accomplished via concatenation and linear projection or via a gated sum:

$$e_i = W_{\text{cat}}[c_i; s_i; r_i] + b_{\text{cat}} \quad \text{or} \quad e_i = W_c c_i + W_s s_i + W_r r_i$$

Coreference links are scored using scaled dot-product attention over span representations, with the loss function incorporating both coreference and auxiliary SRL (and optionally parsing) objectives.
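The two fusion variants can be compared in a few lines of dependency-free Python. This is a sketch under stated assumptions: the dimensions and weight values are made up for illustration, the helper names (`matvec`, `fuse_concat`, `fuse_sum`) are hypothetical, and no learned gating is modeled. It shows that the summed form coincides with the concatenation form when $W_{\text{cat}}$ is the column-wise concatenation of $W_c$, $W_s$, $W_r$ and $b_{\text{cat}} = 0$.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def fuse_concat(c, s, r, W_cat, b_cat):
    """e = W_cat [c; s; r] + b_cat."""
    x = c + s + r  # list concatenation plays the role of [c; s; r]
    return [y + b for y, b in zip(matvec(W_cat, x), b_cat)]


def fuse_sum(c, s, r, W_c, W_s, W_r):
    """e = W_c c + W_s s + W_r r."""
    return [a + b + d for a, b, d in
            zip(matvec(W_c, c), matvec(W_s, s), matvec(W_r, r))]


# Illustrative token embeddings (contextual, syntactic, semantic-role).
c, s, r = [1.0, 2.0], [0.5], [-1.0, 3.0]

# Illustrative projection matrices mapping each source to a 2-d fused space.
W_c = [[1.0, 0.0], [0.0, 1.0]]
W_s = [[2.0], [-2.0]]
W_r = [[0.5, 0.5], [1.0, -1.0]]

# Concatenating the three matrices column-wise gives an equivalent W_cat.
W_cat = [rc + rs + rr for rc, rs, rr in zip(W_c, W_s, W_r)]
b_cat = [0.0, 0.0]

e_concat = fuse_concat(c, s, r, W_cat, b_cat)
e_sum = fuse_sum(c, s, r, W_c, W_s, W_r)
print(e_concat, e_sum)
```

In practice the two parameterizations differ only in the bias term and in how the weights are initialized or regularized, so the choice between them is mostly an engineering decision.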

Incremental ablation demonstrates monotonic gains:

| Component Added | F₁ (%) |
|-----------------------|---------|
| Base PLM | 75.8 |
| +Syntax | 79.8 |
| +SRL | 80.8 |
| +Attention | 81.8 |
| +Full Integration | 84.6 |
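The per-component gains implied by the ablation figures can be read off with a short calculation. The numbers below are copied from the ablation table; the variable names are illustrative only.

```python
# F1 scores copied from the ablation table, in the order components were added.
ablation = [
    ("Base PLM", 75.8),
    ("+Syntax", 79.8),
    ("+SRL", 80.8),
    ("+Attention", 81.8),
    ("+Full Integration", 84.6),
]

# Gain of each step over the previous row, rounded to one decimal place.
gains = [
    (name, round(f1 - prev_f1, 1))
    for (_, prev_f1), (name, f1) in zip(ablation, ablation[1:])
]
total_gain = round(ablation[-1][1] - ablation[0][1], 1)

for name, gain in gains:
    print(f"{name}: +{gain}")
print(f"total: +{total_gain}")
```

On these figures, syntax contributes the largest single step (4.0 points) and the final integration step adds a further 2.8, for 8.8 points overall.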

This synergy across linguistic representations yields new state-of-the-art results across diverse domains (legal, scientific, social media, and cross-lingual), demonstrating the critical role of multi-source modeling (Liu et al., 8 Apr 2025).

3. Error Modes, Robustness, and Fairness

Even the most accurate systems contend with persistent failure modes:

Parsing and SRL errors: Inaccurate parses or noisy semantic labels, especially in informal or non-standard text genres (e.g., tweets).
Long-distance pronouns: Attention mechanisms may diffuse over excessive spans.
Complex referential structures: Nested or overlapping references, shared semantic roles, or non-standard pronominal use.

Robustness to such phenomena is further challenged for marginalized populations. Gender inclusivity, in particular, remains a live problem. "Toward Gender-Inclusive Coreference Resolution" (Cao et al., 2019) exposes large F₁ disparities on non-binary chains with off-the-shelf resolvers (e.g., $F_{1,\text{NB}} = 29.4$ vs. $F_{1,\text{M}} = 68.1$; bias score $\text{BS}_{\text{NB-M}} = -38.7$ for the Stanford sieve). Remediation strategies (expanded pronoun inventories, fairness-aware regularizers, and category-weighted losses) can halve such disparities and increase non-binary F₁ by more than 30 points without degrading binary performance.
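As a worked check of the figures quoted above, the bias score is consistent with a simple difference between the two category scores. Treating it as that difference is an assumption inferred from the numbers, not a definition taken from the cited paper; the post-remediation value of 59.7 is the fairness-augmented figure reported in the summary table.

```latex
\mathrm{BS}_{\mathrm{NB\text{-}M}}
  = F_{1,\mathrm{NB}} - F_{1,\mathrm{M}}
  = 29.4 - 68.1
  = -38.7

% Gain in non-binary F1 after remediation:
\Delta F_{1,\mathrm{NB}} = 59.7 - 29.4 = 30.3
```

A gain of 30.3 points matches the "more than 30 points" improvement stated in the text.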

4. Efficiency, Incrementality, and Scalability

The scalability of coreference systems is both an algorithmic and computational challenge. Span-based models incur $O(n^2)$ mention candidates and—without pruning— $O(n^4)$ possible links, which is intractable for long documents or online dialogue (Dong et al., 8 Apr 2025). Several recent methodologies address this:

Word-level coreference (Dobrovolskii, 2021): Predicts head-word links ( $O(n^2)$ ) and subsequently reconstructs spans, eliminating prior $O(n^4)$ complexity.
Sentence-incremental systems (Grenander et al., 2023): Cache representations (e.g., via XLNet's recurrence), use shift-reduce parsing to propose mentions, cluster incrementally via learned entity representations, and update per-sentence—linear in sentence length and entity count ( $O(nm)$ ).
Segmented/overlapping inference (Pražák et al., 2024): For long documents, overlapping windows ensure that cross-segment links are recovered. Cross-segment merging heuristics further mitigate boundary fragmentation.
Model quantization and pruning (Dong et al., 8 Apr 2025): Post-training quantization and head pruning reduce latency by 20–30% with negligible F₁ drop.
Empirically, word-level models (RoBERTa backbone) achieve 81.0 average F₁ on OntoNotes at 3× speed and 2–3× lower GPU memory relative to span-level baselines (Dobrovolskii, 2021).
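The gap between span-level and word-level candidate spaces can be made concrete by counting. The sketch below is illustrative only: the function names are hypothetical, and it counts unordered candidate pairs rather than modeling any particular system's pruning.

```python
def span_candidates(n, max_len=None):
    """Number of contiguous spans in n tokens, optionally capped at max_len tokens."""
    if max_len is None or max_len >= n:
        return n * (n + 1) // 2  # O(n^2) spans without a length cap
    # For each length 1..max_len there are n - length + 1 start positions.
    return sum(n - length + 1 for length in range(1, max_len + 1))


def candidate_pairs(num_candidates):
    """Number of unordered (mention, antecedent) pairs among the candidates."""
    return num_candidates * (num_candidates - 1) // 2


n = 500  # tokens in a hypothetical document
all_spans = span_candidates(n)            # span-level candidates, O(n^2)
span_pairs = candidate_pairs(all_spans)   # span-level links, O(n^4)
word_pairs = candidate_pairs(n)           # word-level links, O(n^2)
capped_spans = span_candidates(n, max_len=30)

print(f"spans: {all_spans:,}  span pairs: {span_pairs:,}")
print(f"word pairs: {word_pairs:,}  spans with max_len=30: {capped_spans:,}")
```

Even for a 500-token document, unpruned span pairs number in the billions while word-level pairs stay near 10^5, which is why length caps, pruning, or head-word linking are needed in practice.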

5. Evaluation Practices, Metrics, and Benchmarks

Standard evaluation leverages cluster-based metrics:

MUC (link-based)
B³ (mention-based, averages per-mention precision/recall)
CEAF (entity alignment-based), multiple φ-similarity choices
CoNLL average: arithmetic mean of MUC, B³, CEAF F₁
Domain/test-specific setups:
- Cross-lingual: CorefUD (17 datasets across 12 languages) (Pražák et al., 2024)
- Dialogue: Speaker/turn-aware extensions (Dong et al., 8 Apr 2025)
- Biomedical: Headword matching, type-specific constraints (Chen et al., 12 Mar 2025)
- Named person coreference: Entity F₁, Pronoun F₁, and "Chains not found" (Agarwal et al., 2018)

Specialized metrics and benchmarks target task- and population-specific requirements, e.g., fairness audits by gender category (Cao et al., 2019), or evaluation on Winograd-style hard pronoun problems (Peng et al., 2019).
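Two of the cluster-based metrics listed in this section, MUC and B³, are simple enough to sketch directly. The code below is a minimal illustration rather than a replacement for an official scorer: it assumes the key (gold) and response (system) partitions cover exactly the same mentions, the function names are hypothetical, and CEAF is omitted because it requires an optimal cluster alignment, so the CoNLL average is not computed here.

```python
def _f1(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0


def b_cubed(key, response):
    """B-cubed (precision, recall, F1), averaged over mentions.

    Assumes both partitions contain exactly the same mentions.
    """
    key_of = {m: set(c) for c in key for m in c}
    resp_of = {m: set(c) for c in response for m in c}
    mentions = list(key_of)
    precision = sum(len(key_of[m] & resp_of[m]) / len(resp_of[m])
                    for m in mentions) / len(mentions)
    recall = sum(len(key_of[m] & resp_of[m]) / len(key_of[m])
                 for m in mentions) / len(mentions)
    return precision, recall, _f1(precision, recall)


def _muc_recall(key, response):
    """Link-based recall: links preserved in each key cluster over links needed."""
    cluster_id = {m: i for i, c in enumerate(response) for m in c}
    preserved = needed = 0
    for cluster in key:
        # Partitions of this key cluster induced by the response;
        # a mention absent from the response forms its own partition.
        parts = {cluster_id.get(m, ("unlinked", m)) for m in cluster}
        preserved += len(cluster) - len(parts)
        needed += len(cluster) - 1
    return preserved / needed if needed else 0.0


def muc(key, response):
    """MUC (precision, recall, F1); precision swaps the roles of key and response."""
    recall = _muc_recall(key, response)
    precision = _muc_recall(response, key)
    return precision, recall, _f1(precision, recall)


# Toy example: the system wrongly moves mention "c" into the second entity.
key = [{"a", "b", "c"}, {"d", "e"}]
response = [{"a", "b"}, {"c", "d", "e"}]

b3_p, b3_r, b3_f1 = b_cubed(key, response)
muc_p, muc_r, muc_f1 = muc(key, response)
print(f"B3  P={b3_p:.3f} R={b3_r:.3f} F1={b3_f1:.3f}")
print(f"MUC P={muc_p:.3f} R={muc_r:.3f} F1={muc_f1:.3f}")
```

On the same single error the two metrics disagree (B³ F₁ is 11/15, MUC F₁ is 2/3), which illustrates why the CoNLL score averages several metrics instead of relying on one.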

6. Advances, Open Questions, and Future Directions

The field is evolving rapidly, with notable advances and persistent open challenges:

Syntax–Semantics–PLM integration raises accuracy ceilings beyond the capabilities of vanilla PLMs, especially for languages or domains with poor out-of-the-box parser performance (Liu et al., 8 Apr 2025).
Hybrid rule–neural and RL-based systems have demonstrated that high-precision deterministic links, combined with reinforcement learning policies for ambiguous cases, yield gains over both pure paradigms (Wang et al., 2022).
Multilingual and cross-lingual adaptation: Zero-shot and multilingual models trained with XLM-RoBERTa plus language-agnostic syntactic/semantic features attain state-of-the-art transfer, although absolute F₁ may still drop by up to 15 points in the full zero-shot setting. Head-only modeling and explicit singleton handling are especially beneficial (Pražák et al., 2024).
Socially-responsible modeling: Persistent effort is required to address bias, inclusivity, and real-world representativeness of coreference outputs, especially in emerging downstream applications.
Integration with world knowledge and discourse relations: Purely sequential or local models struggle with phenomena requiring external factual knowledge or awareness of rhetorical structure. Constrained ILP reasoning and predicate-schemata-based world knowledge are proposed solutions for hard pronoun resolution (Peng et al., 2019).

Promising trends include end-to-end latent parse and SRL induction, more scalable document-level inference, knowledge grounding for hard anaphora, robust domain adaptation, and increasingly multimodal coreference (e.g., image+text narratives (Goel et al., 2022)).

Summary Table: Representative Coreference Resolution Systems and Results

| System / Approach | Key Innovation | Representative F₁ | Reference |
|---|---|---|---|
| Rule-based (ARKref) | Deterministic parsing/rules | B³ F₁=79.5 | (O'Connor et al., 2013) |
| End-to-end span-based | Span pruning + PLM | 79.6 (OntoNotes) | (Pražák et al., 2024) |
| Syntax+Semantics+PLM | PLM+parser+SRL fusion | 84.6 (avg, multi) | (Liu et al., 8 Apr 2025) |
| Hybrid rule+RL | Rules + actor-critic RL | 88.6 (OntoNotes) | (Wang et al., 2022) |
| Word-level coref | Head-word links | 81.0 (OntoNotes) | (Dobrovolskii, 2021) |
| Multilingual (XLM-R) | Cross-lingual pre-training | 74.7 (CorefUD avg) | (Pražák et al., 2024) |
| Fairness-augmented | Gender-penalized loss | F₁_NB=59.7 | (Cao et al., 2019) |

Coreference resolution systems thus represent a confluence of linguistic theory, deep learning, optimization, and fairness-aware evaluation, with state-of-the-art frameworks integrating context-rich PLMs, explicit syntactic/semantic structure, and modular attention mechanisms, all evaluated under rigorous, domain-sensitive protocols.