
Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages (2212.10168v2)

Published 20 Dec 2022 in cs.CL

Abstract: We present Naamapadam, the largest publicly available Named Entity Recognition (NER) dataset for the 11 major Indian languages from two language families. The dataset contains more than 400k sentences annotated with a total of at least 100k entities from three standard entity categories (Person, Location, and Organization) for 9 out of the 11 languages. The training dataset has been automatically created from the Samanantar parallel corpus by projecting automatically tagged entities from an English sentence to the corresponding Indian language translation. We also create manually annotated test sets for 9 languages. We demonstrate the utility of the obtained dataset on the Naamapadam-test dataset. We also release IndicNER, a multilingual IndicBERT model fine-tuned on the Naamapadam training set. IndicNER achieves an F1 score of more than 80 for 7 out of 9 test languages. The dataset and models are available under open-source licences at https://ai4bharat.iitm.ac.in/naamapadam.
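The projection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes BIO-tagged English tokens and a precomputed word alignment given as (source index, target index) pairs. The function names, the alignment format, and the rule of labelling the smallest target span that covers all aligned tokens are illustrative assumptions, not the authors' actual pipeline.

```python
from typing import List, Tuple


def extract_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, label) for each BIO entity span."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def project_tags(
    src_tags: List[str],
    tgt_len: int,
    alignment: List[Tuple[int, int]],
) -> List[str]:
    """Project source-side BIO tags onto a target sentence of tgt_len tokens.

    Assumption: each source entity maps to the smallest contiguous target
    span covering every target token aligned to it.
    """
    tgt_tags = ["O"] * tgt_len
    for start, end, label in extract_spans(src_tags):
        tgt_idx = sorted(t for s, t in alignment if start <= s < end)
        if not tgt_idx:
            continue  # entity has no aligned target tokens; leave untagged
        lo, hi = tgt_idx[0], tgt_idx[-1]
        if any(tag != "O" for tag in tgt_tags[lo:hi + 1]):
            continue  # skip projections that would overlap an earlier entity
        tgt_tags[lo] = "B-" + label
        for j in range(lo + 1, hi + 1):
            tgt_tags[j] = "I-" + label
    return tgt_tags


# Toy example: "Sachin lives in Mumbai" -> "सचिन मुंबई में रहता है"
src_tags = ["B-PER", "O", "O", "B-LOC"]
alignment = [(0, 0), (3, 1), (2, 2), (1, 3)]  # hypothetical alignment
projected = project_tags(src_tags, tgt_len=5, alignment=alignment)
print(projected)
```

In this toy example the Person and Location tags land on the first and second target tokens. A real pipeline would also need an automatic English tagger, an alignment model, and filtering of noisy projections, none of which are shown here.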

Authors (7)
  1. Arnav Mhaske (1 paper)
  2. Harshit Kedia (1 paper)
  3. Sumanth Doddapaneni (16 papers)
  4. Mitesh M. Khapra (79 papers)
  5. Pratyush Kumar (44 papers)
  6. Rudra Murthy V (9 papers)
  7. Anoop Kunchukuttan (45 papers)
Citations (19)