Fast Search with Poor OCR (1909.07899v3)
Published 17 Sep 2019 in cs.IR and cs.DL
Abstract: The indexing and searching of historical documents have garnered attention in recent years due to massive digitization efforts of important collections worldwide. Pure textual search in these corpora is a problem since optical character recognition (OCR) is infamous for performing poorly on such historical material, which often suffers from poor preservation. We propose a novel text-based method for searching through noisy text. Our system represents words as vectors, projects queries and candidates obtained from the OCR into a common space, and ranks the candidates using a metric suited to nearest-neighbor search. We demonstrate the practicality of our method on typewritten German documents from the WWII era.
- Taivanbat Badamdorj (2 papers)
- Adiel Ben-Shalom (1 paper)
- Nachum Dershowitz (31 papers)
- Lior Wolf (217 papers)
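
The abstract outlines a three-step pipeline: represent words as vectors, place queries and OCR candidates in a common space, and rank candidates with a nearest-neighbor metric. The abstract does not specify the representation or the metric, so the sketch below substitutes character-bigram count vectors and cosine similarity purely for illustration. It is not the authors' method, and the names `char_ngrams`, `cosine`, and `rank_candidates` are hypothetical.

```python
from collections import Counter
import math


def char_ngrams(word, n=2):
    """Map a word to a sparse vector of character n-gram counts.

    Assumption: bigram counts stand in for the word representation,
    which the abstract does not describe. '#' marks word boundaries.
    """
    padded = f"#{word.lower()}#"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    # Counter returns 0 for missing keys, so unshared n-grams add nothing.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates, top_k=3):
    """Return the top_k OCR candidates most similar to the query."""
    query_vec = char_ngrams(query)
    scored = [(cosine(query_vec, char_ngrams(c)), c) for c in candidates]
    # Highest similarity first: a brute-force nearest-neighbor search.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [candidate for _, candidate in scored[:top_k]]


# Hypothetical OCR output in which "Berlin" was misread as "Ber1in".
ocr_words = ["Ber1in", "Munchen", "Wien"]
best_matches = rank_candidates("Berlin", ocr_words)
```

Because a single misread character changes only a couple of bigrams, the corrupted word stays close to the clean query in this space, which is the property that lets search tolerate OCR noise. A linear scan is enough for a toy list; a large corpus would call for an approximate nearest-neighbor index.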