Fast Search with Poor OCR (1909.07899v3)

Published 17 Sep 2019 in cs.IR and cs.DL

Abstract: The indexing and searching of historical documents have garnered attention in recent years due to massive digitization efforts of important collections worldwide. Pure textual search in these corpora is a problem since optical character recognition (OCR) is infamous for performing poorly on such historical material, which often suffers from poor preservation. We propose a novel text-based method for searching through noisy text. Our system represents words as vectors, projects queries and candidates obtained from the OCR into a common space, and ranks the candidates using a metric suited to nearest-neighbor search. We demonstrate the practicality of our method on typewritten German documents from the WWII era.
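
The pipeline outlined in the abstract (vectorize words, place queries and OCR candidates in one space, rank by a nearest-neighbor metric) can be illustrated with a minimal sketch. The abstract does not specify the word representation, the projection, or the metric, so everything here is an assumption made for illustration: words are represented as character-bigram count vectors, no learned projection is applied, and cosine similarity serves as the ranking metric. The function names `char_ngrams`, `cosine`, and `rank_candidates` are hypothetical and are not taken from the paper.

```python
from collections import Counter
import math


def char_ngrams(word, n=2):
    """Represent a word as a sparse vector of character n-gram counts.

    Assumption: bigram counts stand in for the paper's word vectors.
    '#' marks the word boundaries so prefixes and suffixes are captured.
    """
    padded = f"#{word.lower()}#"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query, candidates, n=2):
    """Return OCR candidates ordered from most to least similar to the query."""
    query_vec = char_ngrams(query, n)
    scored = [(cosine(query_vec, char_ngrams(c, n)), c) for c in candidates]
    # Sort on the score only; ties keep their input order.
    return [c for _, c in sorted(scored, key=lambda pair: pair[0], reverse=True)]


# A clean query is matched against noisy OCR output; "Bcfehl" is a
# made-up misreading of "Befehl" used only for this example.
ranking = rank_candidates("Befehl", ["Berlin", "Bcfehl", "Vogel"])
print(ranking)
```

Because a single misread character disturbs only the few n-grams that contain it, the corrupted form still shares most of its vector with the clean query, which is the intuition behind searching noisy text in a vector space. A real system would replace the linear scan with an approximate nearest-neighbor index.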

Authors (4)
  1. Taivanbat Badamdorj (2 papers)
  2. Adiel Ben-Shalom (1 paper)
  3. Nachum Dershowitz (31 papers)
  4. Lior Wolf (217 papers)
Citations (1)