DuoSearch: A Novel Search Engine for Bulgarian Historical Documents (2305.19392v1)
Published 30 May 2023 in cs.IR
Abstract: Search in collections of digitised historical documents is hindered by a two-pronged problem: orthographic variety and optical character recognition (OCR) errors. We present a new search engine for historical documents, DuoSearch, which uses ElasticSearch and machine learning methods based on deep neural networks to address this problem. It was tested on a collection of historical newspapers in Bulgarian from the mid-19th to the mid-20th century. The system provides an interactive and intuitive interface for end users, allowing them to enter search terms in modern Bulgarian and search across historical spellings. This is the first solution facilitating the use of digitised historical documents in Bulgarian.
- Angel Beshirov
- Suzan Hadzhieva
- Ivan Koychev
- Milena Dobreva