bbOCR: An Open-source Multi-domain OCR Pipeline for Bengali Documents (2308.10647v2)

Published 21 Aug 2023 in cs.CV

Abstract: Despite the existence of numerous Optical Character Recognition (OCR) tools, the lack of comprehensive open-source systems hampers the progress of document digitization in various low-resource languages, including Bengali. Low-resource languages, especially those with an alphasyllabary writing system, suffer from the lack of large-scale datasets for various document OCR components such as word-level OCR, document layout extraction, and distortion correction, which are available as individual modules in high-resource languages. In this paper, we introduce Bengali.AI-BRACU-OCR (bbOCR): an open-source scalable document OCR system that can reconstruct Bengali documents into a structured searchable digitized format that leverages a novel Bengali text recognition model and two novel synthetic datasets. We present extensive component-level and system-level evaluation: both use a novel diversified evaluation dataset and comprehensive evaluation metrics. Our extensive evaluation suggests that our proposed solution is preferable to the current state-of-the-art Bengali OCR systems. The source code and datasets are available here: https://bengaliai.github.io/bbocr.

Authors (12)
  1. Imam Mohammad Zulkarnain (2 papers)
  2. Shayekh Bin Islam (10 papers)
  3. Md. Zami Al Zunaed Farabe (2 papers)
  4. Md. Mehedi Hasan Shawon (6 papers)
  5. Jawaril Munshad Abedin (2 papers)
  6. Beig Rajibul Hasan (1 paper)
  7. Marsia Haque (1 paper)
  8. Istiak Shihab (1 paper)
  9. Syed Mobassir (1 paper)
  10. MD. Nazmuddoha Ansary (5 papers)
  11. Asif Sushmit (8 papers)
  12. Farig Sadeque (14 papers)
Citations (2)