Named Entity Recognition and Classification on Historical Documents: A Survey (2109.11406v1)

Published 23 Sep 2021 in cs.CL and cs.LG

Abstract: After decades of massive digitisation, an unprecedented amount of historical documents is available in digital format, along with their machine-readable texts. While this represents a major step forward with respect to preservation and accessibility, it also opens up new opportunities in terms of content mining and the next fundamental challenge is to develop appropriate technologies to efficiently search, retrieve and explore information from this 'big data of the past'. Among semantic indexing opportunities, the recognition and classification of named entities are in great demand among humanities scholars. Yet, named entity recognition (NER) systems are heavily challenged with diverse, historical and noisy inputs. In this survey, we present the array of challenges posed by historical documents to NER, inventory existing resources, describe the main approaches deployed so far, and identify key priorities for future developments.

Authors (5)
  1. Maud Ehrmann
  2. Ahmed Hamdi
  3. Elvys Linhares Pontes
  4. Matteo Romanello
  5. Antoine Doucet
Citations (89)

Summary

This survey (Ehrmann et al., 2021) provides a comprehensive overview of Named Entity Recognition and Classification (NER) research applied to historical documents. It highlights the unique challenges posed by this domain compared to standard NER on contemporary texts and surveys existing resources, approaches, and strategies developed to address these issues. The primary goal is to characterize the landscape of historical NER, identifying key challenges, available resources, effective strategies, and priorities for future development.

The paper defines historical documents as textual materials produced or published up to 1979, encompassing a wide range of types, genres, and languages. Applying NER to this material is crucial for content mining, searching, retrieving, and exploring information from large digitized historical collections, and is in high demand among humanities scholars. NER acts as a fundamental step for semantic indexing and supports downstream tasks like entity linking, relation extraction, biography reconstruction, and event detection.

However, historical NER faces four main challenges:

  1. Historical Variety Space: Documents vary significantly in type (administrative, media, literary, etc.), domain, genre, language, and time period, leading to a wide "variety space" that NLP systems, typically optimized for narrow, stable domains, struggle to generalize over. Humanities research interests often span this entire spectrum.
  2. Noisy Input: Text is often acquired via Optical Character Recognition (OCR) or Handwritten Text Recognition (HTR), which introduce noise due to poor material preservation, scanning quality issues, or diverse typographic conventions over time. Optical Layout Recognition (OLR) errors can also mix text segments or result in unnatural tokenization (e.g., excessive hyphenation in column layouts). This noise leads to a sparse feature space and numerous out-of-vocabulary (OOV) words, severely degrading NER performance; studies show a significant drop in F-score even at moderate noise levels (a short illustration follows this list).
  3. Dynamics of Language: Language evolves over time, resulting in historical spelling variations, changes in naming conventions (e.g., titles, structure of names), and entity/context drift (places, organizations, professions emerging or fading). These historical linguistic differences negatively impact NLP tools trained on modern language. Entity drift, particularly, means NER systems trained on one period may perform poorly on another.
  4. Lack of Resources: Compared to modern NER, there is a severe lack of resources for historical documents.
    • Typologies: Existing typologies (MUC, CoNLL, ACE) are often insufficient or require adaptation for specific historical domains and entity types (e.g., warships). Defining new typologies and annotation guidelines is time-consuming.
    • Annotated Corpora: While some annotated corpora exist (surveyed and summarized in Table 3 of the paper, covering news, literature, and other domains), they are scarce, often small- to medium-sized (most < 30k entities), scattered across languages (mostly monolingual, covering 11 living and 2 dead languages) and time periods (concentrated in the 19th and 20th centuries). This limits supervised training and reliable evaluation.
    • Language Representations: Although large historical text corpora are increasingly available from digitization efforts, their dissemination and use for training language models (LMs) are hampered by disparate formats and copyright restrictions. However, historical word embeddings (static and diachronic) and contextualized LMs (Flair, BERT, ELECTRA) trained on historical data are becoming more available (summarized in Table 4).
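
To make the noise challenge concrete, the toy Python sketch below corrupts clean tokens with a hypothetical character-confusion table (chosen to mirror error types the survey reports, such as the historical "long s" misread as "f") and counts how many corrupted tokens fall out of a modern vocabulary:

```python
# Illustrative only: a hypothetical OCR confusion table and toy vocabulary,
# used to show how moderate character noise inflates the OOV rate.
import random

OCR_CONFUSIONS = {"s": "f", "e": "c", "i": "l", "m": "rn"}  # toy confusion table

def corrupt(token: str, rate: float, rng: random.Random) -> str:
    """Randomly replace characters according to the confusion table."""
    return "".join(
        OCR_CONFUSIONS[ch] if ch in OCR_CONFUSIONS and rng.random() < rate else ch
        for ch in token
    )

vocabulary = {"the", "president", "visited", "hamburg", "in", "september"}
sentence = ["the", "president", "visited", "hamburg", "in", "september"]

rng = random.Random(0)
noisy = [corrupt(tok, rate=0.5, rng=rng) for tok in sentence]
oov = [tok for tok in noisy if tok not in vocabulary]
print(noisy)
print(f"OOV rate: {len(oov)}/{len(noisy)}")
```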

The survey reviews existing approaches to historical NER based on rule-based, traditional ML, and deep learning (DL) methods:

  • Rule-based approaches: Early systems often used environments like GATE, relying on hand-crafted rules and gazetteers adapted for historical contexts (e.g., handling occupation names, spelling variations, abbreviations, period-specific entity types like warships). They require linguistic expertise but no training data and are interpretable. Examples include systems for English court trials, American Civil War newspapers, Swedish literary classics, British parliamentary records, and Medieval Spanish texts. While capable of decent precision, they often suffer from low recall on noisy or variable texts and are labor-intensive to develop.
  • Traditional Machine Learning approaches: The availability of annotated data led to the adoption of ML, particularly Conditional Random Fields (CRFs). Studies applied off-the-shelf modern NER systems (Stanford CRF, OpenNLP) or trained new CRF models on custom historical data. Performance is highly variable depending on the data and training size, generally ranging from 60-70% F-score, significantly lower than on contemporary texts, with recall typically the most affected metric. Stanford CRF is a commonly used tool, and ensembling multiple systems sometimes improved performance (a minimal CRF sketch follows this list).
  • Deep Learning approaches: Recent research is dominated by DL methods, leveraging learned representations (embeddings), BiLSTM-CRF architectures, and Transformers such as BERT, alongside contextualized embeddings like ELMo. The focus is often on transfer learning, adapting models pre-trained on large (modern or historical) corpora to the target historical task.
    • BiLSTM-CRF models combined with various embeddings (character, sub-word, static word, contextualized) show significant improvements over traditional CRFs, especially when sufficient training data or appropriate pre-trained embeddings are used. Character and sub-word embeddings help handle OOV words and spelling variations.
    • Contextualized LM embeddings (Flair, BERT, ELMo) pre-trained on either modern or historical corpora have proven highly effective. Modern LMs transfer reasonably well, but LMs pre-trained on large in-domain or temporally proximate historical corpora often yield the best results. Stacking different types of embeddings (e.g., Flair + BERT) can further boost performance, as in the second sketch after this list.
    • Transformer-based architectures like BERT have achieved state-of-the-art results, particularly when fine-tuned on historical data. Proper pre-processing (like sentence splitting and de-hyphenation) is crucial for BERT to leverage its context window effectively.
    • The CLEF-HIPE-2020 shared task provided a valuable benchmark for comparing DL systems on multilingual historical newspapers, demonstrating that state-of-the-art models can achieve F-scores exceeding 80% on specific tasks and languages, though performance is still lower on noisier or less resourced materials.
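
Before the deep learning era, many of the surveyed systems were feature-based CRFs. The sketch below, using the third-party sklearn-crfsuite package, shows the kind of hand-designed features involved; it is a minimal illustration on a hypothetical example, not the feature set of any particular surveyed system.

```python
# Minimal CRF sketch in the spirit of traditional-ML historical NER.
# Requires: pip install sklearn-crfsuite
import sklearn_crfsuite

def token_features(sent, i):
    word = sent[i]
    feats = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),  # capitalisation is a strong NE cue
        "word.isdigit": word.isdigit(),
        "prefix3": word[:3],
        "suffix3": word[-3:],            # sub-word cues help with OCR noise
    }
    if i > 0:
        feats["prev.lower"] = sent[i - 1].lower()
    else:
        feats["BOS"] = True
    if i < len(sent) - 1:
        feats["next.lower"] = sent[i + 1].lower()
    else:
        feats["EOS"] = True
    return feats

# Toy training data in IOB format (hypothetical sentence).
sentences = [["General", "Lee", "entered", "Richmond"]]
labels = [["O", "B-PER", "O", "B-LOC"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X, labels)
print(crf.predict(X))
```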

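For the deep learning recipe that dominates recent work (a BiLSTM-CRF over stacked contextual embeddings), a sketch with the Flair library follows. It assumes a CoNLL-style corpus on disk at a hypothetical path and uses Flair's historic German character LMs (trained on the Hamburger Anzeiger); model identifiers and API details may vary across Flair versions.

```python
# Sketch: BiLSTM-CRF tagger over stacked historic + modern embeddings (Flair).
from flair.datasets import ColumnCorpus
from flair.embeddings import (FlairEmbeddings, StackedEmbeddings,
                              TransformerWordEmbeddings)
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# Column 0 = token, column 1 = IOB NER tag (hypothetical data folder).
corpus = ColumnCorpus("data/historic_ner", {0: "text", 1: "ner"})
label_dict = corpus.make_label_dictionary(label_type="ner")

# Stack historical character-level LMs with a modern transformer, mirroring
# the "Flair + BERT" stacking the survey reports as beneficial.
embeddings = StackedEmbeddings([
    FlairEmbeddings("de-historic-ha-forward"),
    FlairEmbeddings("de-historic-ha-backward"),
    TransformerWordEmbeddings("bert-base-german-cased"),
])

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=label_dict,
    tag_type="ner",
    use_crf=True,
)
ModelTrainer(tagger, corpus).train("models/historic-ner", max_epochs=10)
```
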
Strategies to deal with the specific challenges include:

  • Noisy Input:
    • Input adaptation: OCR/OLR post-correction (e.g., correcting the "long s," de-hyphenating words, general spelling correction) can be beneficial, but its effectiveness depends on the noise type and level; spurious corrections can degrade performance. Sentence segmentation is crucial for context-aware models (see the sketch after this list).
    • Tool adaptation: Using character- or sub-word-level embeddings (fastText, Flair) or transformer tokenizers (WordPiece) helps models process OOV words and misspellings. Augmenting neural models with extra layers can improve robustness to noise (Ehrmann et al., 2021).
  • Dynamics of Language: Gazetteer lookup with string similarity metrics or historical spelling normalisation (manual rules or automatic methods) can help address spelling variations, as also illustrated in the sketch below. Using word embeddings and LMs trained on temporally relevant historical corpora is key to capturing language shifts and mitigating entity drift (Ehrmann et al., 2021).
  • Lack of Resources: Adapting existing or defining new typologies suitable for historical needs is essential. Transfer learning, especially fine-tuning pre-trained LMs on limited historical data, is a dominant strategy. Active learning and potentially data augmentation are other avenues. Increased resource sharing (typologies, guidelines, annotated corpora, large-scale historical text data, historical LMs) is vital for driving progress and enabling system comparability.
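
Two of these strategies lend themselves to short sketches: de-hyphenation plus "long s" normalisation for input adaptation, and gazetteer lookup with a string-similarity fallback for spelling variation. The gazetteer, threshold, and examples below are illustrative, not drawn from any surveyed system.

```python
# Sketches of input adaptation and fuzzy gazetteer lookup (stdlib only).
import difflib
import re
from typing import Optional

def normalise(text: str) -> str:
    """Repair two common artefacts of historical print and OLR."""
    text = text.replace("\u017f", "s")  # long s (ſ) -> s
    # Rejoin words hyphenated across line breaks (column layouts).
    text = re.sub(r"(\w+)-\s*\n\s*(\w+)", r"\1\2", text)
    return text

GAZETTEER = ["Hamburg", "Habsburg", "Strasbourg"]  # toy place-name list

def gazetteer_match(token: str, cutoff: float = 0.8) -> Optional[str]:
    """Exact lookup first, then closest fuzzy match above the cutoff."""
    if token in GAZETTEER:
        return token
    hits = difflib.get_close_matches(token, GAZETTEER, n=1, cutoff=cutoff)
    return hits[0] if hits else None

print(normalise("Ham-\nburg lies on the Elbe; \u017ftreets were crowded."))
print(gazetteer_match("Hamburgh"))  # historical spelling -> 'Hamburg'
```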

In conclusion, historical NER has made significant strides, largely due to the adoption of deep learning and transfer learning techniques leveraging historical language models. While performance is approaching that on contemporary texts in some settings, challenges remain, particularly regarding robustness to the full spectrum of historical noise and language dynamics, and the scarcity of annotated data for many domains and periods. Key priorities for the future include systematizing transferability experiments across historical settings, improving robustness to diverse noise and linguistic variation, establishing more gold standards and shared tasks for comparability, developing methods for finer-grained historical NER, and promoting resource sharing within the community (Ehrmann et al., 2021). Addressing these challenges requires interdisciplinary collaboration between NLP researchers and humanities scholars.