A Deep Generative Model for Code-Switched Text (1906.08972v1)

Published 21 Jun 2019 in cs.CL

Abstract: Code-switching, the interleaving of two or more languages within a sentence or discourse, is pervasive in multilingual societies. Accurate language models for code-switched text are critical for NLP tasks. State-of-the-art data-intensive neural language models are difficult to train well from scarce language-labeled code-switched text. A potential solution is to use deep generative models to synthesize large volumes of realistic code-switched text. Although generative adversarial networks and variational autoencoders can synthesize plausible monolingual text from a continuous latent space, they cannot adequately address code-switched text, owing to its informal style and the complex interplay between the constituent languages. We introduce VACS, a novel variational autoencoder architecture specifically tailored to code-switching phenomena. VACS encodes to and decodes from a two-level hierarchical representation, which models syntactic contextual signals in the lower level and language-switching signals in the upper level. Sampling representations from the prior and decoding them produces well-formed, diverse code-switched sentences. Extensive experiments show that using synthetic code-switched text with natural monolingual data results in a significant (33.06%) drop in perplexity.
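
The abstract describes a two-level hierarchical VAE: a lower-level latent for syntactic/contextual signals and an upper-level latent for language-switching signals, with a decoder that emits both words and per-token language tags. The sketch below is a minimal, hypothetical PyTorch rendering of that idea, not the authors' VACS implementation; all layer choices, dimensions, and names (e.g. `HierarchicalCSVAE`, `z_syn`, `z_lang`) are assumptions for illustration.

```python
# Hypothetical sketch of a two-level hierarchical VAE for code-switched text.
# Lower latent z_syn ~ syntactic/contextual signal; upper latent z_lang ~
# language-switching signal (conditioned on z_syn). Not the paper's code.
import torch
import torch.nn as nn


class HierarchicalCSVAE(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=128, hid_dim=256,
                 z_syn_dim=64, z_lang_dim=16, num_langs=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.to_syn = nn.Linear(hid_dim, 2 * z_syn_dim)              # lower level
        self.to_lang = nn.Linear(hid_dim + z_syn_dim, 2 * z_lang_dim)  # upper level
        self.init_dec = nn.Linear(z_syn_dim + z_lang_dim, hid_dim)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.word_out = nn.Linear(hid_dim, vocab_size)  # next-word logits
        self.lang_out = nn.Linear(hid_dim, num_langs)   # per-token language tag

    @staticmethod
    def reparameterize(stats):
        # Split into mean/log-variance and draw a sample (reparameterization trick).
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return z, mu, logvar

    def forward(self, tokens):
        emb = self.embed(tokens)                       # (B, T, emb_dim)
        _, h = self.encoder(emb)                       # (1, B, hid_dim)
        h = h.squeeze(0)
        z_syn, mu_s, lv_s = self.reparameterize(self.to_syn(h))
        z_lang, mu_l, lv_l = self.reparameterize(
            self.to_lang(torch.cat([h, z_syn], dim=-1)))
        h0 = torch.tanh(self.init_dec(torch.cat([z_syn, z_lang], dim=-1)))
        dec_out, _ = self.decoder(emb, h0.unsqueeze(0))  # teacher-forced decoding
        return (self.word_out(dec_out), self.lang_out(dec_out),
                (mu_s, lv_s), (mu_l, lv_l))
```

Training would combine word and language-tag reconstruction losses with KL terms on both latent levels; synthesizing new code-switched sentences then amounts to sampling both latents from the prior and decoding, as the abstract describes.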

Authors (5)
  1. Bidisha Samanta (14 papers)
  2. Sharmila Reddy (1 paper)
  3. Hussain Jagirdar (2 papers)
  4. Niloy Ganguly (95 papers)
  5. Soumen Chakrabarti (52 papers)
Citations (33)