
Named Entity Recognition (NER)

Updated 17 January 2026
  • Named Entity Recognition (NER) is a foundational NLP task that identifies and classifies text spans into predefined real-world entity types.
  • NER methods have evolved from rule-based and probabilistic models to advanced deep learning architectures like BiLSTM-CRF and transformer-based systems.
  • Modern challenges include domain adaptation, nested entity handling, and interpretability, spurring research in multimodal, retrieval-based, and dynamic NER approaches.

Named Entity Recognition (NER) is a foundational task in information extraction and natural language processing that involves identifying spans of text corresponding to real-world entities (such as people, organizations, locations, dates, and specialized domain-specific types) and classifying each span into predefined categories. Formally, given an input token sequence X = (x_1, x_2, \ldots, x_n), the task is to predict a sequence of labels Y = (y_1, y_2, \ldots, y_n), often under schemas such as BIO/BILOU, where each y_t denotes the entity tag (possibly with boundary markers) for token x_t (Munnangi, 2024). Accurate NER is a prerequisite for a wide range of downstream applications, including semantic search, information retrieval, knowledge graph construction, and question answering (Pakhale, 2023).
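
For concreteness, here is a minimal Python sketch of the BIO scheme and of decoding tags back into typed spans; the sentence, tag set, and helper function are illustrative only, not taken from any cited system.

```python
# A minimal sketch of BIO-encoded NER labels and span decoding.
# Sentence, tags, and decoder are illustrative only.

def bio_to_spans(tokens, tags):
    """Decode BIO tags into (entity_type, start, end) spans, end exclusive."""
    spans, start, ent_type = [], None, None
    for i, tag in enumerate(tags + ["O"]):       # sentinel flushes the last entity
        if tag == "O" or tag.startswith("B-"):
            if ent_type is not None:
                spans.append((ent_type, start, i))
                ent_type = None
            if tag.startswith("B-"):
                start, ent_type = i, tag[2:]
        elif tag.startswith("I-") and ent_type != tag[2:]:
            # Ill-formed I- tag without a matching B-: treat as a new entity.
            if ent_type is not None:
                spans.append((ent_type, start, i))
            start, ent_type = i, tag[2:]
    return spans

tokens = ["Barack", "Obama", "visited", "Paris", "in", "2009"]
tags   = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-DATE"]
print(bio_to_spans(tokens, tags))
# [('PER', 0, 2), ('LOC', 3, 4), ('DATE', 5, 6)]
```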

1. Historical Trajectory and Paradigms

The evolution of NER spans several generations of methodologies (Munnangi, 2024, Roy, 2021, Pakhale, 2023):

Rule-Based and Gazetteer Systems (1996–2000):

Early NER systems, such as those deployed for MUC-6, relied on hand-crafted regular expressions, lexical gazetteers, and syntactic heuristics. These approaches exhibited high precision (≈80–90%) but limited recall (≈50–70%), suffering from brittle coverage and domain dependency (Munnangi, 2024, Thenmalar et al., 2015, Jumani et al., 2019).

Probabilistic Sequence Models (2000–2012):

The focus shifted to statistical models able to exploit annotated corpora, notably Hidden Markov Models (HMMs) and linear-chain Conditional Random Fields (CRFs). In CRFs, the conditional probability of a tag sequence is modeled as

P(Y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{t=1}^{n} \sum_{k} \lambda_k f_k(y_{t-1}, y_t, X, t) \right)

with features f_k and weights \lambda_k optimized on labeled data (Munnangi, 2024, Pakhale, 2023, Roy, 2021). CRFs permitted richer, overlapping feature sets and remained dominant in NER into the deep learning era.
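
The following toy sketch evaluates this probability for a linear-chain CRF in log space, with random emission and transition scores standing in for the weighted feature sums; the partition function Z(X) is computed by the standard forward algorithm. All sizes and values are invented for illustration.

```python
# Toy linear-chain CRF: log P(Y|X) from emission and transition scores,
# which stand in for the weighted feature sums sum_k lambda_k f_k(...).
import numpy as np

def crf_log_prob(emissions, transitions, tags):
    """emissions: (n, T); transitions: (T, T); tags: length-n tag indices."""
    n, T = emissions.shape
    # Unnormalized score of the given tag path.
    score = emissions[0, tags[0]]
    for t in range(1, n):
        score += transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    # log Z(X) via the forward algorithm (log-sum-exp over all paths).
    alpha = emissions[0]                                   # (T,)
    for t in range(1, n):
        alpha = np.logaddexp.reduce(
            alpha[:, None] + transitions + emissions[t][None, :], axis=0)
    return score - np.logaddexp.reduce(alpha)

rng = np.random.default_rng(0)
emissions = rng.normal(size=(5, 3))    # 5 tokens, 3 tags (e.g., O, B-PER, I-PER)
transitions = rng.normal(size=(3, 3))
print(crf_log_prob(emissions, transitions, [1, 2, 0, 0, 1]))  # log-prob < 0
```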

Neural Architectures and Pretrained Models (2013–Present):

Neural NER systems displaced manual feature engineering by employing end-to-end architectures. Central among these is the BiLSTM-CRF model, where context-sensitive representations are produced via forward/backward LSTMs, then scored by a CRF layer for optimal sequence decoding (Munnangi, 2024, Roy, 2021, Abujabal et al., 2018). The adoption of deep contextualized encoders, from the LSTM-based ELMo to transformers such as BERT and XLNet, further improved performance, with token-level encoders producing representations that account for the entire sentence and even document context (Hanh et al., 2021, Pakhale, 2023). These models routinely reach F1 scores of 92–94% on standard English newswire benchmarks (CoNLL-2003) (Munnangi, 2024).

2. Core Methodologies and Architectures

Sequence Labeling and Structured Decoding

Most modern NER systems cast the task as sequence labeling: predict for each token a tag indicating entity type and boundary. The standard decoding layer is a CRF, which allows modeling dependencies between adjacent labels (e.g., B-PER followed by I-PER). BiLSTM-CRF models remain the backbone architecture, with context captured via recurrent layers and emission potentials calculated from their outputs (Roy, 2021, Abujabal et al., 2018). Extensions include using subword units (characters, phonemes, bytes) to address OOV and morphology, with subword-only models achieving competitive performance in large-scale settings (Abujabal et al., 2018).
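
A rough PyTorch sketch of this backbone follows (toy dimensions; training loop, CRF loss, and padding masks omitted): a BiLSTM produces emission potentials, and Viterbi decoding over a learned transition matrix yields the best tag path.

```python
# Minimal BiLSTM-CRF backbone: BiLSTM emissions + Viterbi decoding.
# Dimensions and inputs are toy values; training code is omitted.
import torch
import torch.nn as nn

class BiLSTMCRF(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=64, hidden=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * hidden, num_tags)      # emission potentials
        self.transitions = nn.Parameter(torch.zeros(num_tags, num_tags))

    def emissions(self, token_ids):                      # (B, n) -> (B, n, T)
        h, _ = self.lstm(self.emb(token_ids))
        return self.proj(h)

    @torch.no_grad()
    def viterbi_decode(self, token_ids):
        """Best tag sequence per sentence under emission + transition scores."""
        em = self.emissions(token_ids)                   # (B, n, T)
        B, n, T = em.shape
        score, back = em[:, 0], []                       # (B, T)
        for t in range(1, n):
            cand = score[:, :, None] + self.transitions[None] + em[:, t, None, :]
            score, idx = cand.max(dim=1)                 # best previous tag
            back.append(idx)                             # (B, T)
        best = score.argmax(dim=1)                       # (B,)
        paths = [best]
        for idx in reversed(back):                       # follow backpointers
            best = idx.gather(1, best[:, None]).squeeze(1)
            paths.append(best)
        return torch.stack(paths[::-1], dim=1)           # (B, n)

model = BiLSTMCRF(vocab_size=1000, num_tags=5)
print(model.viterbi_decode(torch.randint(0, 1000, (2, 7))).shape)  # (2, 7)
```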

Transformer-Based and Hybrid Models

Transformer architectures, notably BERT and XLNet, dominate state-of-the-art NER through fine-tuning, where a linear or CRF layer is attached atop transformer outputs. Architectures combining contextual transformer streams (e.g., XLNet) with global syntactic or semantic information, as in graph convolutional networks (GCNs) applied to dependency graphs, yield consistent performance gains (Hanh et al., 2021). Example: concatenating XLNet embeddings with outputs from a two-layer GCN over dependency parses improves CoNLL-2003 F1 by 0.5 points over pure transformer baselines (Hanh et al., 2021).
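
As a hedged sketch of this fine-tuning recipe (using the Hugging Face transformers library; the checkpoint name and tag set are placeholders), a linear token-classification head is attached atop the encoder outputs:

```python
# Hedged sketch: transformer + linear token-classification head for NER.
# Checkpoint and label set are placeholders; training code is omitted.
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]   # illustrative tag set
tok = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased", num_labels=len(labels))

enc = tok("Barack Obama visited Paris", return_tensors="pt")
logits = model(**enc).logits             # (1, seq_len, num_labels), per subword
pred = logits.argmax(dim=-1)[0]
# Note: predictions cover subwords and special tokens; mapping them back to
# words (e.g., taking the first subword's label) is standard post-processing.
print([labels[i] for i in pred])         # untrained head: arbitrary tags
```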

Beyond Token-Level: Span, Query, and MRC Formulations

To address nested/overlapping entities and exploit richer supervision, several paradigm shifts have emerged:

  • Query-Based/MRC-NER: Each entity type becomes a question (MRC paradigm); the model extracts answer spans using a reading-comprehension style architecture (Meng et al., 2019, Wang et al., 2023). This formulation naturally handles nested/overlapping entities and leverages Q&A-pretrained models; a schematic sketch follows this list. Performance improvements are reported across benchmarks: e.g., +4.46 F1 (ACE04) and +6.47 F1 (ACE05) over BERT taggers (Meng et al., 2019).
  • Multi-Task and Self-Attention for Label Dependencies: Multi-NER decomposes NER into subtasks per entity type and connects them with inter-task self-attention, improving results especially for nested and ambiguous mentions (Wang et al., 2023).
  • Neural Reranking: Secondary models rerank k-best outputs from base NER models by globally encoding sentence patterns (LSTM/CNN). Integrating a neural reranker yields F1 improvements over both discrete (CRF-based) and LSTM-CRF baselines (Yang et al., 2017).
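
The sketch below illustrates the general MRC recasting (not the exact query wording of Meng et al., 2019): each entity type is paired with a natural-language query, and gold spans of that type become the answers of a reading-comprehension example. Queries and data are invented.

```python
# Schematic MRC-NER example construction: one (query, context, answers)
# example per entity type. Queries and spans are illustrative only.
TYPE_QUERIES = {
    "PER": "Find all person names in the text.",
    "LOC": "Find all locations in the text.",
}

def to_mrc_examples(tokens, typed_spans):
    """typed_spans: list of (entity_type, start, end), end exclusive,
    e.g., the output of a BIO decoder. Nested/overlapping spans are
    unproblematic because each type is queried independently."""
    context = " ".join(tokens)
    examples = []
    for etype, query in TYPE_QUERIES.items():
        answers = [(s, e, " ".join(tokens[s:e]))
                   for t, s, e in typed_spans if t == etype]
        examples.append({"query": query, "context": context, "answers": answers})
    return examples

spans = [("PER", 0, 2), ("LOC", 3, 4)]
for ex in to_mrc_examples(["Barack", "Obama", "visited", "Paris"], spans):
    print(ex)
```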

Semi-Supervised, Bootstrapping, and Weak Supervision

Low-resource and domain-shifted settings motivate semi-supervised and bootstrapping frameworks. One approach iteratively induces surface/context patterns from small seed sets, scores patterns using Basilisk RlogF, and expands discovery through pattern generalization, achieving mid-80s F1 on major English and Tamil NER corpora with minimal annotation (Thenmalar et al., 2015). Weak supervision—via distant labeling from knowledge bases, or active learning—remains central in scaling NER to new languages and specialized domains (Munnangi, 2024, Pakhale, 2023).
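
As a small illustration of the pattern-scoring step, the following sketch implements the Basilisk-style RlogF score, RlogF(p) = (F_p / N_p) * log2(F_p), where F_p is the number of a pattern's extractions already in the seed lexicon and N_p is its total number of extractions. Only the scoring formula is taken from the literature; the patterns and seeds are made up.

```python
# Basilisk-style RlogF pattern scoring on an invented seed lexicon.
import math

def rlogf(extractions, lexicon):
    in_lex = sum(1 for term in extractions if term in lexicon)
    if in_lex == 0:
        return float("-inf")           # pattern never hits the lexicon
    return (in_lex / len(extractions)) * math.log2(in_lex)

seeds = {"Paris", "London", "Chennai"}
patterns = {
    "flew to <X>":  ["Paris", "London", "Boston", "the moon"],
    "lives in <X>": ["Chennai", "poverty"],
}
for p, ext in patterns.items():
    print(p, round(rlogf(ext, seeds), 3))
# Higher-scoring patterns are kept and used to harvest new candidate
# entities, which feed the lexicon for the next bootstrapping iteration.
```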

Multimodal, Retrieval-Oriented, and Dynamic NER

Recent work extends NER beyond text-only, fixed-label paradigms:

  • Multimodal NER: Incorporating visual features (e.g., images attached to tweets) alongside text features improves NER on noisy social media, with modular architectures outperforming prior text-only systems on the Ritter dataset (Esteves et al., 2017).
  • Retrieval-Integrated and Ultra-Fine NER: The NERetrieve framework formalizes NER as retrieval: given a zero-shot entity type, retrieve all mentions across a corpus. It provides a silver-annotated dataset of 4M paragraphs and 500 entity types. SOTA dense retrieval systems capture less than 40% of relevant mentions, revealing challenges in fine-grained and zero-shot extraction (Katz et al., 2023).
  • Dynamic NER: Dynamic NER (DNER) benchmarks robustness when entity types are document- or event-specific and thus context must override surface memorization. Models that excel on static label sets drop 3–4 F1 under dynamic relabeling, underscoring contextual reasoning bottlenecks (Luiggi et al., 2023).

3. Domain Adaptation and Specialized NER

NER performance degrades substantially when moving into domains (biomedical, finance, legal, historical, clinical) with distinct terminology and entity types:

  • Biomedical: BioBERT, pretrained on PubMed and PMC, boosts NER by 1–3 F1 over generic BERT (Pakhale, 2023).
  • Finance and Legal: Layout-aware models (ViBERTgrid) and NER systems operating on scanned invoices (via OCR integration) or legal filings achieve 88–92% F1, but need to address tabular/text-layout fusion and rare entity classes (Pakhale, 2023, Skylaki et al., 2020).
  • Clinical: Multi-label NER with BiLSTM-n-CRF architectures supports simultaneous span detection, polarity, and modality, achieving span-based NER F1 up to 0.894 (i2b2/VA 2010), demonstrating effective joint modeling of entity attributes (Nath et al., 2022).
  • Historical and Low-Resource: LLM prompting (zero-/few-shot) on historical corpora narrows the gap with fully-supervised systems, with a single in-context example yielding a 12–13 point F1 gain over pure zero-shot for entity annotation (Zhang et al., 25 Aug 2025).

4. Evaluation, Benchmarks, and Analysis

Evaluation Protocols

Standard evaluation relies on precision, recall, and F1 at the exact entity-span and type level:

\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

(Küçük, 2017, Munnangi, 2024, Thenmalar et al., 2015). Strict boundary and type agreement are generally required, but “relaxed” evaluation—accepting overlaps—appears in some settings (e.g., GCN+XLNet NER over CoNLL (Hanh et al., 2021)).
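
Under the strict protocol, evaluation reduces to set operations over (type, start, end) triples, as in this minimal sketch (example spans invented):

```python
# Strict span-level evaluation: a predicted entity counts as a true
# positive only if both its boundaries and its type match exactly.
def span_prf1(gold_spans, pred_spans):
    """gold_spans/pred_spans: sets of (entity_type, start, end) tuples."""
    tp = len(gold_spans & pred_spans)
    fp = len(pred_spans - gold_spans)
    fn = len(gold_spans - pred_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("PER", 0, 2), ("LOC", 3, 4), ("DATE", 5, 6)}
pred = {("PER", 0, 2), ("LOC", 3, 5)}           # boundary error on LOC
print(span_prf1(gold, pred))                    # (0.5, 0.333..., 0.4)
```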

Benchmark Results

CoNLL-2003 remains the primary benchmark for English, with state-of-the-art systems achieving:

  • CRF: F1 ≈ 85–89%
  • BiLSTM-CRF: F1 ≈ 90–92%
  • Fine-tuned BERT variants: F1 ≈ 92–94%
  • Advanced fusion (XLNet+GCN, softmax decoding): F1 ≈ 93.8% (Hanh et al., 2021)
  • Reranked neural systems: F1 ≈ 91.6% (Yang et al., 2017)
  • Query-based MRC: F1 gains of 1–6 points, notably closing the gap for nested and complex datasets (Meng et al., 2019, Wang et al., 2023)

Domain-adapted and low-resource settings report lower absolute F1 (e.g., 48–69% F1 on Turkish tweets; 85% on fine-grained intersectional types; 0.41 F1 for GPT-4 zero-shot ultra-fine NER (Katz et al., 2023)).

5. Challenges, Limitations, and Open Problems

Several enduring technical barriers define current research (Pakhale, 2023, Munnangi, 2024, Katz et al., 2023, Luiggi et al., 2023):

  • Domain and Schema Shift: Models trained on newswire collapse in specialized or low-resource domains; continual, domain-adaptive learning is unsolved.
  • Rare and Fine-Grained Entities: Recognition of ultra-fine, intersectional, or previously unseen types—especially in zero-shot or retrieval settings—remains brittle, with supervised F1 dropping sharply (e.g., from 0.90 to 0.40 along the Animal→Butterfly hierarchy) (Katz et al., 2023).
  • Label Consistency and Contextual Variability: Dynamic NER reveals that even transformer-based models exhibit high inconsistency when types are context-specific, violating the “one-entity-one-type-per-document” ideal by 10–20% (Luiggi et al., 2023).
  • Nested and Overlapping Entities: Classic sequence tagging fails to represent nested structures; MRC/Q&A paradigm and span-based decoders partially address this but at computational cost (Meng et al., 2019, Wang et al., 2023).
  • Interpretability and Trustworthiness: Deep NER systems remain "black boxes"; evidential uncertainty modeling and explainability for regulated domains are still rare (Pakhale, 2023).
  • Multilingual, Multimodal, and Conversational NER: Cross-lingual adaptation, integration of visual, layout, or audio information, and chat-style entity annotation are only partially addressed (Esteves et al., 2017, Chen et al., 2022, Pakhale, 2023).

6. Extensions, Retrieval, and Emerging Directions

NER is increasingly framed as a component in broader NLP systems:

  • NER as Retrieval: The NERetrieve framework unifies recognition and corpus-level retrieval, extending the task to retrieving all mentions of dynamic, possibly unseen, types across large corpora. Current dense retrieval systems achieve recall@|REL| of only 0.22–0.40, far from exhaustive coverage (Katz et al., 2023); this metric is sketched after the list.
  • Integration with Relation/Attribute Extraction: Multi-label and joint decoders support rich attribute and relation extraction (e.g., polarity/modality in clinical NER (Nath et al., 2022)).
  • LLMs and Prompting: Zero-shot and few-shot LLMs close much of the supervised gap in specialized or historical NER, with minimal resource cost and language-agnostic applicability, though supervised models remain superior for strict boundary accuracy (Zhang et al., 25 Aug 2025).
  • Dynamic Schema and Human-in-the-Loop: Flexible, on-the-fly definition of target types, and user-directed retrieval, are priorities for the next generation of NER pipelines, particularly when combined with annotation-efficient modeling strategies (Katz et al., 2023).
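
A worked sketch of the recall@|REL| metric referenced in the first item above (ranking and relevance set invented): the system retrieves as many items as there are relevant mentions and is scored on the fraction of the relevant set recovered.

```python
# recall@|REL|: cut the ranking at k = |REL| and measure recovered fraction.
def recall_at_rel(ranked_ids, relevant_ids):
    cutoff = len(relevant_ids)                 # k = |REL|
    retrieved = set(ranked_ids[:cutoff])
    return len(retrieved & relevant_ids) / len(relevant_ids)

ranked = ["p2", "p9", "p7", "p4", "p3", "p1"]  # system ranking of paragraphs
relevant = {"p2", "p3", "p4", "p9"}            # gold mentions of the type
print(recall_at_rel(ranked, relevant))         # 0.75
```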

7. Summary Table: Selected NER Methodologies and Achievable F1

| System / Paradigm | Domain | F1 (or Accuracy) | Notes |
|---|---|---|---|
| Rule-based, gazetteer-driven (Küçük, 2017) | Turkish Twitter | 50–69 | Strict span/type match |
| BiLSTM-CRF (Munnangi, 2024) | CoNLL-2003 English | 91–92 | State of the art until BERT |
| BERT fine-tuned, token-level (Munnangi, 2024, Pakhale, 2023) | CoNLL-2003 English | 92–94 | SOTA on newswire |
| Query-based MRC (Meng et al., 2019) | ACE04/ACE05/nested | +4 to +6 (delta) | Over standard BERT tagger |
| XLNet+GCN, context+global (Hanh et al., 2021) | CoNLL-2003 | 93.8 | Simple concatenation fusion |
| Multi-NER, MRC+MTL+attention (Wang et al., 2023) | ACE-2004/GENIA/CoNLL | +1 to +1.3 (delta) | Explicit type dependency modeling |
| Neural rerank (Yang et al., 2017) | CoNLL-2003 | 91.6 | On top of BiLSTM-CRF base |
| Pointer-generator seq2seq (Skylaki et al., 2020) | Noisy legal text | 74.5 | Outperforms BiLSTM-CRF and DistilBERT on long documents |
| Multimodal, text+image (Esteves et al., 2017) | Twitter (Ritter) | 0.59 | Decision tree, no CRF/gazetteer |
| Clinical NER + attributes (Nath et al., 2022) | i2b2/VA 2010 | 0.894 | Joint entity/polarity/modality |
| LLM, 1-shot prompting (Zhang et al., 25 Aug 2025) | Historical NER | 0.68 (fuzzy F1) | Gap to SOTA ≈ 0.1–0.2 |
| Retrieval-based, NERetrieve (Katz et al., 2023) | Wikipedia (500 types) | 0.22–0.40 (recall@\|REL\|) | Far from exhaustive corpus-level retrieval |
