
PALI: A Language Identification Benchmark for Perso-Arabic Scripts (2304.01322v1)

Published 3 Apr 2023 in cs.CL

Abstract: The Perso-Arabic scripts are a family of scripts that are widely adopted and used by various linguistic communities around the globe. Identifying various languages using such scripts is crucial to language technologies and challenging in low-resource setups. As such, this paper sheds light on the challenges of detecting languages using Perso-Arabic scripts, especially in bilingual communities where "unconventional" writing is practiced. To address this, we use a set of supervised techniques to classify sentences into their languages. Building on these, we also propose a hierarchical model that targets clusters of languages that are more often confused by the classifiers. Our experimental results indicate the effectiveness of our solutions.

Authors (3)
  1. Sina Ahmadi (23 papers)
  2. Milind Agarwal (5 papers)
  3. Antonios Anastasopoulos (111 papers)
Citations (7)
