
A Clustering Framework for Lexical Normalization of Roman Urdu (2004.00088v1)

Published 31 Mar 2020 in cs.CL and cs.LG

Abstract: Roman Urdu is an informal form of the Urdu language written in Roman script, which is widely used in South Asia for online textual content. It lacks standard spelling and hence poses several normalization challenges during automatic language processing. In this article, we present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora, which includes a phonetic algorithm UrduPhone, a string matching component, a feature-based similarity function, and a clustering algorithm Lex-Var. UrduPhone encodes Roman Urdu strings to their pronunciation-based representations. The string matching component handles character-level variations that occur when writing Urdu using Roman script.
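The pipeline described in the abstract (phonetic encoding, string matching, a combined similarity function, then clustering) can be illustrated with a rough sketch. Everything below is an assumption made for illustration: the consonant groups are a simplified Soundex-style table and not the UrduPhone mapping, the similarity function simply averages a phonetic match with a `difflib` string ratio, and the greedy single-pass grouping is a stand-in for Lex-Var, whose details are not given here. The names `phonetic_key`, `similarity`, and `cluster` are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical Soundex-style consonant groups (NOT the UrduPhone table).
_GROUPS = {
    "1": "bfpv",
    "2": "cgjkqsxz",
    "3": "dt",
    "4": "l",
    "5": "mn",
    "6": "r",
}
_CODE = {ch: digit for digit, chars in _GROUPS.items() for ch in chars}


def phonetic_key(word, length=4):
    """Encode a word as its first letter plus consonant-group digits."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0]
    prev = _CODE.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODE.get(ch, "")
        # Skip uncoded letters (vowels, h, w, y) and repeated group codes.
        if code and code != prev:
            key += code
        prev = code
    # Truncate or zero-pad to a fixed length.
    return key[:length].ljust(length, "0")


def similarity(a, b, weight=0.5):
    """Blend a phonetic match (0 or 1) with a character-level ratio."""
    phonetic = 1.0 if phonetic_key(a) == phonetic_key(b) else 0.0
    string = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return weight * phonetic + (1.0 - weight) * string


def cluster(words, threshold=0.75):
    """Greedy single pass: join the first cluster whose first member is close enough."""
    clusters = []
    for word in words:
        for members in clusters:
            if similarity(word, members[0]) >= threshold:
                members.append(word)
                break
        else:
            clusters.append([word])
    return clusters


if __name__ == "__main__":
    variants = ["zindagi", "kitab", "zindagee", "kitaab", "zindgi"]
    for group in cluster(variants):
        print(group)
```

In this sketch, spelling variants that share a phonetic key and differ by only a few characters end up in the same group. The threshold and weight are arbitrary illustrative values and would need tuning on real data.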

Authors (5)
  1. Abdul Rafae Khan (8 papers)
  2. Asim Karim (7 papers)
  3. Hassan Sajjad (64 papers)
  4. Faisal Kamiran (4 papers)
  5. Jia Xu (87 papers)
Citations (5)
