
Kannada-MNIST: A new handwritten digits dataset for the Kannada language (1908.01242v1)

Published 3 Aug 2019 in cs.CV, cs.LG, and stat.ML

Abstract: In this paper, we disseminate a new handwritten digits-dataset, termed Kannada-MNIST, for the Kannada script, that can potentially serve as a direct drop-in replacement for the original MNIST dataset. In addition to this dataset, we disseminate an additional real world handwritten dataset (with $10k$ images), which we term as the Dig-MNIST dataset that can serve as an out-of-domain test dataset. We also duly open source all the code as well as the raw scanned images along with the scanner settings so that researchers who want to try out different signal processing pipelines can perform end-to-end comparisons. We provide high level morphological comparisons with the MNIST dataset and provide baselines accuracies for the dataset disseminated. The initial baselines obtained using an oft-used CNN architecture ($96.8\%$ for the main test-set and $76.1\%$ for the Dig-MNIST test-set) indicate that these datasets do provide a sterner challenge with regards to generalizability than MNIST or the KMNIST datasets. We also hope this dissemination will spur the creation of similar datasets for all the languages that use different symbols for the numeral digits.

Citations (50)

Summary

  • The paper introduces Kannada-MNIST, a handwritten digits dataset for Kannada numerals on which a standard CNN model reaches 97.13% test accuracy (the abstract quotes 96.8%).
  • It details a rigorous data collection process from 65 volunteers and supplements the main set with an out-of-domain Dig-MNIST for cross-domain evaluation.
  • Open-sourced code and raw image data encourage further research in OCR, domain adaptation, and multilingual digit recognition.

Overview of "Kannada-MNIST: A New Handwritten Digits Dataset for the Kannada Language"

The paper "Kannada-MNIST: A new handwritten digits dataset for the Kannada language" by Vinay Uday Prabhu introduces Kannada-MNIST, a new dataset for Kannada handwritten digit recognition. The dataset is designed as a drop-in replacement for the original MNIST dataset, which has been used extensively in computer vision research on handwritten digit recognition. Alongside it, the paper introduces a secondary dataset, Dig-MNIST, intended as an out-of-domain test set.

Main Contributions

The main contributions of the paper can be summarized in several key points:

  1. Data Collection and Dataset Creation: The Kannada-MNIST dataset was collected from 65 volunteers in Bangalore, India, who contributed handwritten digit samples. The main dataset consists of 60,000 training images and 10,000 test images. The additional Dig-MNIST dataset, comprising 10,240 images, was collected in Redwood City, CA, from volunteers who were unfamiliar with Kannada, which makes it useful as a domain adaptation challenge.
  2. Open Sourcing: All source code, the datasets, the raw scanned images, and the scanner settings have been made publicly available, so that researchers can experiment with different image processing pipelines and compare them end to end.
  3. Morphological Comparison and Baseline Results: The dataset is compared with the original MNIST dataset both visually and quantitatively, through morphological trait analysis and dimensionality reduction with UMAP and PCA. A baseline classifier using a standard CNN architecture reaches 97.13% accuracy on the Kannada-MNIST test set but only 76.2% on Dig-MNIST (the abstract quotes 96.8% and 76.1%, respectively), which shows how much harder the out-of-domain set is for standard models.
  4. Potential in Generalizability Studies: The authors explore the nuanced challenges of classifying Kannada numerals, particularly highlighting intra-class variations and the similarity of certain Kannada digits to the numeral 2 in the modern Hindu-Arabic system.
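One way to look into the digit similarities mentioned in the last point is to count which (true, predicted) label pairs a classifier confuses most often. The following is a minimal sketch in plain Python, not the paper's analysis code. The function names and the toy labels are assumptions made for illustration only.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true_label, predicted_label) pair occurs."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return Counter(zip(y_true, y_pred))


def most_confused_pairs(y_true, y_pred, top_k=3):
    """Return the top_k most frequent misclassifications as ((true, pred), count)."""
    counts = confusion_counts(y_true, y_pred)
    # Keep only off-diagonal entries, i.e. actual errors.
    errors = {pair: n for pair, n in counts.items() if pair[0] != pair[1]}
    # Sort by count (descending), then by pair so the order is deterministic.
    return sorted(errors.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Toy labels for illustration only; these are not results from the paper.
y_true = [0, 1, 3, 3, 7, 7, 7, 6]
y_pred = [0, 1, 7, 3, 3, 7, 3, 6]
print(most_confused_pairs(y_true, y_pred))
```

Run on real predictions, a list like this shows directly which digit classes a model mixes up and whether the confusion is symmetric.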

Implications and Future Directions

The Kannada-MNIST dataset fills a gap in digit classification research, where numerals from non-Latin scripts are underrepresented. On the practical side, it matters most for OCR technologies in regions where Kannada numerals are in use. On the theoretical side, it supports work on domain generalization and augmentation techniques for numeral recognition.

The availability of the dataset together with Dig-MNIST invites work on cross-domain learning and adaptation, since the pair can be used to test how robust an algorithm is to different digit-writing styles and scanner settings. Future work could extend the dataset to other Indic scripts and so broaden multilingual digit recognition research.
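A simple way to quantify this kind of cross-domain robustness is the gap between in-domain and out-of-domain top-1 accuracy. The sketch below is a hedged illustration, not code from the paper. The helper names are assumptions, and the two numbers passed in are the baseline accuracies quoted in the contributions above (97.13% on the Kannada-MNIST test set, 76.2% on Dig-MNIST).

```python
def top1_accuracy(y_true, y_pred):
    """Percentage of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


def generalization_gap(in_domain_acc, out_of_domain_acc):
    """Accuracy drop, in percentage points, from in-domain to out-of-domain data."""
    return round(in_domain_acc - out_of_domain_acc, 2)


# Baseline accuracies as quoted in this summary.
gap = generalization_gap(97.13, 76.2)
print(gap)  # 20.93
```

A drop of roughly 21 percentage points is the margin that any domain adaptation method evaluated on Dig-MNIST would be trying to close.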

Researchers are encouraged to engage with the open challenges presented by the authors, which include drastically improving accuracy on the Dig-MNIST dataset without preprocessing enhancements, investigating catastrophic forgetting in transfer learning scenarios with pre-trained models, and leveraging synthetic data augmentation to generate competitive results.

Overall, "Kannada-MNIST" represents an important step towards inclusivity in digit recognition and a valuable resource for advancing both practical image processing and theoretical machine learning research.
