HANA: A HAndwritten NAme Database for Offline Handwritten Text Recognition (2101.10862v2)

Published 22 Jan 2021 in cs.CV and econ.EM

Abstract: Methods for linking individuals across historical data sets, typically in combination with AI-based transcription models, are developing rapidly. Probably the single most important identifier for linking is personal names. However, personal names are prone to enumeration and transcription errors, and although modern linking methods are designed to handle such challenges, these sources of errors are critical and should be minimized. For this purpose, improved transcription methods and large-scale databases are crucial components. This paper describes and provides documentation for HANA, a newly constructed large-scale database which consists of more than 3.3 million names. The database contains more than 105 thousand unique names with a total of more than 1.1 million images of personal names, which proves useful for transfer learning to other settings. We provide three examples hereof, obtaining significantly improved transcription accuracy on both Danish and US census data. In addition, we present benchmark results for deep learning models automatically transcribing the personal names from the scanned documents. Through making more challenging large-scale databases publicly available we hope to foster more sophisticated, accurate, and robust models for handwritten text recognition.

Summary

The paper introduces the HANA database, a large-scale offline resource featuring over 3.3M names and 1.1M images from historical police registers.
It employs point set registration to effectively extract and segment diverse names from noisy, variable handwritten documents.
Benchmarking reveals word accuracies above 93% and significant transfer learning gains, demonstrating HANA’s impact on improving transcription models.

Overview of HANA: A HAndwritten NAme Database for Offline Handwritten Text Recognition

The paper introduces the HANA database, a large resource for offline handwritten text recognition that focuses on handwritten personal names. As the digitization of historical archives accelerates worldwide, transcribing those archives accurately remains a significant challenge. Personal names are crucial identifiers for linking historical datasets, so transcription errors in names propagate directly into linking errors. With over 3.3 million names and over 1.1 million images, HANA is intended to help reduce such errors.

Database Construction and Features

The HANA database is distinctive in that it offers a large-scale, unbalanced collection of personal names derived from 1,419,491 police register sheets, spanning from 1890 to 1923 in Copenhagen. It mirrors the complexities encountered in many historical documents, such as image noise and varying handwriting styles, making it highly suitable for training robust Handwritten Text Recognition (HTR) models.

A key strength of the database is its breadth: it includes 105,607 unique names. Because name frequencies are highly imbalanced, this diversity poses a practical challenge, but it also offers a rich environment for developing models that generalize well to new, unseen data. The names are acquired and segmented using point set registration, which aligns each semi-structured document so that the name region can be extracted.
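To make the idea of point set registration concrete, the following is a minimal sketch and not the authors' pipeline. It assumes a simplified setting in which a few landmark points on a template sheet have already been paired with the same landmarks on a scan, and it estimates only a 2D rotation and translation in closed form. The landmark coordinates, the function names `fit_rigid_2d` and `apply_rigid_2d`, and the name-box corner are all hypothetical.

```python
import math

def fit_rigid_2d(src, dst):
    """Least-squares rotation + translation mapping src points onto dst points.

    Assumes src[i] corresponds to dst[i] (known correspondences).
    Returns (theta, (tx, ty)) with theta in radians.
    """
    n = len(src)
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    # Sums of dot and cross products of the centered point pairs.
    s_dot = sum((p[0] - sx) * (q[0] - dx) + (p[1] - sy) * (q[1] - dy)
                for p, q in zip(src, dst))
    s_cross = sum((p[0] - sx) * (q[1] - dy) - (p[1] - sy) * (q[0] - dx)
                  for p, q in zip(src, dst))
    theta = math.atan2(s_cross, s_dot)
    c, s = math.cos(theta), math.sin(theta)
    # Translation that maps the rotated source centroid onto the target centroid.
    tx = dx - (c * sx - s * sy)
    ty = dy - (s * sx + c * sy)
    return theta, (tx, ty)

def apply_rigid_2d(theta, t, point):
    """Apply the estimated rotation and translation to a single (x, y) point."""
    c, s = math.cos(theta), math.sin(theta)
    x, y = point
    return (c * x - s * y + t[0], s * x + c * y + t[1])

# Hypothetical template landmarks (e.g. corners of printed form fields).
template = [(0.0, 0.0), (100.0, 0.0), (100.0, 50.0), (0.0, 50.0)]

# Simulate a scan that is rotated by 3 degrees and shifted.
true_theta = math.radians(3.0)
true_t = (12.0, -7.0)
scan = [apply_rigid_2d(true_theta, true_t, p) for p in template]

theta, t = fit_rigid_2d(template, scan)

# Map a hypothetical name-box corner from template to scan coordinates.
name_corner_in_scan = apply_rigid_2d(theta, t, (10.0, 5.0))
```

Once the transform is known, a region defined once on the template can be located on every scan, which is what makes segmentation of semi-structured sheets tractable. Practical registration methods must also handle unknown correspondences and non-rigid distortion, which this sketch leaves out.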

Benchmarking and Model Performance

The researchers benchmarked the performance of several deep learning models using the HANA database. A ResNet-50 architecture was employed to transcribe last names, first and last names, and full names. The models exhibited promising accuracy, with a word accuracy (WACC) of 94.33% for last names and 93.52% for first and last names without matching. These figures emphasize the potential of large-scale datasets like HANA to train models capable of handling the variability present in historical document transcription.
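The summary does not spell out how word accuracy is computed. A minimal sketch, assuming WACC is the fraction of name strings transcribed exactly right, could look like this; the function name `word_accuracy` and the sample names are illustrative only.

```python
def word_accuracy(predictions, targets):
    """Fraction of predictions that equal their target exactly.

    Assumption: one prediction per image, compared as a whole string.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)

# Made-up example: three of four names are transcribed correctly.
preds = ["jensen", "nielsen", "hansen", "petersen"]
truth = ["jensen", "nielsen", "hansen", "pedersen"]
wacc = word_accuracy(preds, truth)
```

Under this definition a single wrong character makes the whole name count as an error, which is why full-name accuracy is typically lower than last-name accuracy.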

To improve performance further, the authors apply matching as a post-processing step: predictions are aligned to valid names, which corrects some transcriptions that would otherwise count as errors.
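One simple way to realize such matching, shown here as a sketch and not necessarily the procedure used in the paper, is to replace each prediction with its closest entry in a list of valid names. The example uses `difflib.get_close_matches` from the Python standard library; the lexicon, the 0.6 cutoff, and the helper name `match_to_lexicon` are assumptions chosen for illustration.

```python
import difflib

def match_to_lexicon(prediction, lexicon, cutoff=0.6):
    """Return the lexicon entry most similar to prediction.

    Falls back to the raw prediction when nothing is similar enough.
    """
    matches = difflib.get_close_matches(prediction, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else prediction

# Hypothetical list of valid last names.
valid_names = ["jensen", "nielsen", "hansen", "pedersen"]

# A raw model output with one wrong character is pulled onto a valid name.
matched = match_to_lexicon("pedersan", valid_names)

# A string unlike any valid name is left unchanged.
unmatched = match_to_lexicon("xyzq", valid_names)
```

The cutoff controls the trade-off: a low value forces every prediction onto some valid name, while a higher value leaves rare or unseen names untouched.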

Transfer Learning Applications

The authors illustrate the utility of the HANA database for transfer learning by applying it to Danish and US census datasets. This strategy yielded a notable improvement in transcription accuracy, especially in scenarios with limited training data. For instance, on the Danish census data, transcription accuracy rose from 77.8% to 92.2% with transfer learning from HANA.
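The size of the Danish census gain is easier to judge as a reduction in error rate. The short calculation below uses only the two accuracy figures quoted above; the helper name `relative_error_reduction` is illustrative.

```python
def relative_error_reduction(acc_before, acc_after):
    """Share of the original errors removed, given accuracies in percent."""
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    return (err_before - err_after) / err_before

# Error rate falls from 22.2% to 7.8% on the Danish census data.
reduction = relative_error_reduction(77.8, 92.2)
```

Here `reduction` comes out at roughly 0.65, meaning that about two thirds of the errors made without transfer learning are eliminated.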

Implications and Future Directions

The creation of the HANA database has substantial implications for both practical applications and theoretical advancements in AI. Practically, it provides a robust resource for developing more accurate and cost-effective transcription systems, which are invaluable for digitizing and linking historical archives. Theoretically, it serves as a testing ground for advancing HTR models, potentially driving research into handling data imbalance, noise, and handwriting variability.

Future developments could focus on expanding the database with more diverse datasets from different historical contexts and further refining models to handle multi-lingual and multi-script scenarios. Moreover, exploring advanced machine learning techniques like self-supervised learning could further enhance transcription accuracy across challenging datasets.

In summary, the HANA database represents a significant resource for advancing handwritten text recognition, offering both a challenging dataset and a benchmark for the community. Its open accessibility ensures that it can facilitate further research and application development aimed at preserving and accessing historical data with precision.
