
PESTO: Switching Point based Dynamic and Relative Positional Encoding for Code-Mixed Languages (2111.06599v1)

Published 12 Nov 2021 in cs.CL, cs.AI, and cs.LG

Abstract: NLP applications for code-mixed (CM) or mix-lingual text have gained significant momentum recently, the main reason being the prevalence of language mixing in social media communications in multilingual societies like India, Mexico, Europe, and parts of the USA. Word embeddings are basic building blocks of any NLP system today, yet word embedding for CM languages is an unexplored territory. The major bottleneck for CM word embeddings is switching points, where the language switches. These locations lack context, and statistical systems fail to model this phenomenon due to high variance in the seen examples. In this paper we present our initial observations on applying switching point based positional encoding techniques for CM language, specifically Hinglish (Hindi-English). Results are only marginally better than SOTA, but it is evident that positional encoding could be an effective way to train position-sensitive language models for CM text.
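To make the idea concrete, below is a minimal Python sketch of one plausible reading of switching-point-relative positions: given per-token language tags, switching points are the indices where the tag changes, and each token's position is counted from the most recent switch before being fed to a standard sinusoidal encoding. The abstract does not give the paper's exact formulation, so the function names, the tagging scheme, and the reset-at-switch behavior are all illustrative assumptions rather than the authors' method.

```python
import math

def switching_points(lang_tags):
    """Indices where the language tag differs from the previous token's tag."""
    return [i for i in range(1, len(lang_tags)) if lang_tags[i] != lang_tags[i - 1]]

def sp_relative_positions(lang_tags):
    """Position of each token relative to the most recent switching point.

    Assumption: the position counter resets to 0 at every switch, so tokens
    near a switch share small positions regardless of absolute location.
    """
    sps = set(switching_points(lang_tags))
    positions, rel = [], 0
    for i in range(len(lang_tags)):
        if i in sps:
            rel = 0  # reset at each language switch
        positions.append(rel)
        rel += 1
    return positions

def sinusoidal_encoding(pos, d_model):
    """Standard Transformer sinusoidal encoding applied to a (relative) position."""
    return [
        math.sin(pos / 10000 ** (2 * (j // 2) / d_model)) if j % 2 == 0
        else math.cos(pos / 10000 ** (2 * (j // 2) / d_model))
        for j in range(d_model)
    ]

# Hinglish example: "main ghar jaa raha hoon because it is late"
tags = ["hi", "hi", "hi", "hi", "hi", "en", "en", "en", "en"]
print(switching_points(tags))        # [5]
print(sp_relative_positions(tags))   # [0, 1, 2, 3, 4, 0, 1, 2, 3]
print(sinusoidal_encoding(3, 8)[:4]) # first 4 dims of the encoding for position 3
```

Under this reading, tokens immediately after a switch always receive the same small relative positions, which is one way a model could learn consistent behavior at the high-variance switching points the abstract describes.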

Authors (5)
  1. Mohsin Ali (8 papers)
  2. Kandukuri Sai Teja (2 papers)
  3. Sumanth Manduru (2 papers)
  4. Parth Patwa (28 papers)
  5. Amitava Das (44 papers)
Citations (2)