A Deep Generative Model for Code-Switched Text (1906.08972v1)

Published 21 Jun 2019 in cs.CL

Abstract: Code-switching, the interleaving of two or more languages within a sentence or discourse, is pervasive in multilingual societies. Accurate language models for code-switched text are critical for NLP tasks. State-of-the-art data-intensive neural language models are difficult to train well from scarce language-labeled code-switched text. A potential solution is to use deep generative models to synthesize large volumes of realistic code-switched text. Although generative adversarial networks and variational autoencoders can synthesize plausible monolingual text from a continuous latent space, they cannot adequately address code-switched text, owing to its informal style and the complex interplay between the constituent languages. We introduce VACS, a novel variational autoencoder architecture specifically tailored to code-switching phenomena. VACS encodes to and decodes from a two-level hierarchical representation, which models syntactic contextual signals in the lower level and language-switching signals in the upper level. Sampling representations from the prior and decoding them produces well-formed, diverse code-switched sentences. Extensive experiments show that training with synthetic code-switched text alongside natural monolingual data results in a significant (33.06%) drop in perplexity.
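
To make the two-level idea concrete, here is a minimal sketch of a hierarchical VAE in PyTorch, in the spirit of the abstract: an upper latent intended for language-switching signals and a lower latent for syntactic context, both regularized toward a standard normal prior so that sentences can later be sampled from the prior and decoded. All module names, dimensions, and the specific encoder/decoder choices are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative sketch only; not the VACS implementation from the paper.
import torch
import torch.nn as nn

class HierarchicalCSVAE(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256,
                 z_switch_dim=16, z_syntax_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.encoder_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

        # Upper level: latent meant to capture language-switching behaviour.
        self.switch_mu = nn.Linear(hidden_dim, z_switch_dim)
        self.switch_logvar = nn.Linear(hidden_dim, z_switch_dim)

        # Lower level: latent for syntactic/contextual signals, conditioned
        # on the switching latent as well as the encoder state.
        self.syntax_mu = nn.Linear(hidden_dim + z_switch_dim, z_syntax_dim)
        self.syntax_logvar = nn.Linear(hidden_dim + z_switch_dim, z_syntax_dim)

        # Decoder: reconstructs the token sequence from both latents.
        self.latent_to_hidden = nn.Linear(z_switch_dim + z_syntax_dim, hidden_dim)
        self.decoder_rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    @staticmethod
    def reparameterize(mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, tokens):
        emb = self.embed(tokens)              # (batch, seq, embed)
        _, h = self.encoder_rnn(emb)          # h: (1, batch, hidden)
        h = h.squeeze(0)

        mu_s, lv_s = self.switch_mu(h), self.switch_logvar(h)
        z_switch = self.reparameterize(mu_s, lv_s)

        hs = torch.cat([h, z_switch], dim=-1)
        mu_c, lv_c = self.syntax_mu(hs), self.syntax_logvar(hs)
        z_syntax = self.reparameterize(mu_c, lv_c)

        h0 = torch.tanh(self.latent_to_hidden(
            torch.cat([z_switch, z_syntax], dim=-1))).unsqueeze(0)
        dec_out, _ = self.decoder_rnn(emb, h0)  # teacher forcing
        logits = self.out(dec_out)              # (batch, seq, vocab)

        # Standard VAE objective terms: KL for both latent levels;
        # reconstruction loss (cross-entropy on shifted targets) is added
        # by the training loop.
        kl = lambda mu, lv: -0.5 * torch.mean(1 + lv - mu.pow(2) - lv.exp())
        return logits, kl(mu_s, lv_s) + kl(mu_c, lv_c)
```

After training such a model, drawing both latents from a standard normal prior and decoding greedily (or by sampling) would yield synthetic code-switched sentences, mirroring the "sample from the prior and decode" step described in the abstract; the synthetic text can then be mixed with natural data to train a downstream language model.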

Authors (5)
  1. Bidisha Samanta (14 papers)
  2. Sharmila Reddy (1 paper)
  3. Hussain Jagirdar (2 papers)
  4. Niloy Ganguly (95 papers)
  5. Soumen Chakrabarti (52 papers)
Citations (33)
