
DuoSearch: A Novel Search Engine for Bulgarian Historical Documents (2305.19392v1)

Published 30 May 2023 in cs.IR

Abstract: Search in collections of digitised historical documents is hindered by a two-pronged problem: orthographic variety and optical character recognition (OCR) mistakes. We present a new search engine for historical documents, DuoSearch, which uses ElasticSearch and machine learning methods based on deep neural networks to offer a solution to this problem. It was tested on a collection of historical newspapers in Bulgarian from the mid-19th to the mid-20th century. The system provides an interactive and intuitive interface for end users, allowing them to enter search terms in modern Bulgarian and search across historical spellings. This is the first solution facilitating the use of digitised historical documents in Bulgarian.

Authors (4)
  1. Angel Beshirov (2 papers)
  2. Suzan Hadzhieva (1 paper)
  3. Ivan Koychev (33 papers)
  4. Milena Dobreva (3 papers)
