Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
102 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
6 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing (2503.21670v2)

Published 27 Mar 2025 in cs.CL and cs.AI

Abstract: We introduce COMI-LINGUA, the largest manually annotated Hindi-English code-mixed dataset, comprising 125K+ high-quality instances across five core NLP tasks: Matrix Language Identification, Token-level Language Identification, POS Tagging, Named Entity Recognition (NER), and Machine Translation. Each instance is annotated by three bilingual annotators, yielding over 376K expert annotations with strong inter-annotator agreement (Fleiss' Kappa $\geq$ 0.81). The rigorously preprocessed and filtered dataset covers both Devanagari and Roman scripts and spans diverse domains, ensuring real-world linguistic coverage. Evaluation reveals that closed-source LLMs significantly outperform traditional tools and open-source models. Notably, one-shot prompting consistently boosts performance across tasks, especially in structure-sensitive predictions like POS and NER, highlighting the effectiveness of prompt-based adaptation in code-mixed, low-resource settings. COMI-LINGUA is publicly available at: https://github.com/lingo-iitgn/CodeMixing_Project.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (3)
  1. Rajvee Sheth (2 papers)
  2. Himanshu Beniwal (9 papers)
  3. Mayank Singh (92 papers)