
Language Modeling for Code-Switching: Evaluation, Integration of Monolingual Data, and Discriminative Training (1810.11895v3)

Published 28 Oct 2018 in cs.CL

Abstract: We focus on the problem of language modeling for code-switched language, in the context of automatic speech recognition (ASR). Language modeling for code-switched language is challenging for (at least) three reasons: (1) lack of available large-scale code-switched data for training; (2) lack of a replicable evaluation setup that is ASR directed yet isolates language modeling performance from the other intricacies of the ASR system; and (3) the reliance on generative modeling. We tackle these three issues: we propose an ASR-motivated evaluation setup which is decoupled from an ASR system and the choice of vocabulary, and provide an evaluation dataset for English-Spanish code-switching. This setup lends itself to a discriminative training approach, which we demonstrate to work better than generative language modeling. Finally, we explore a variety of training protocols and verify the effectiveness of training with large amounts of monolingual data followed by fine-tuning with small amounts of code-switched data, for both the generative and discriminative cases.
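The training protocol the abstract describes, pretraining a language model on large monolingual corpora and then fine-tuning on a small code-switched set, can be illustrated with a minimal sketch. This is not the authors' code: the LSTM architecture, hyperparameters, and the synthetic placeholder token data below are illustrative assumptions only.

```python
# Minimal sketch of the "pretrain on monolingual, fine-tune on code-switched"
# protocol from the abstract. Data is synthetic placeholder token IDs;
# vocabulary size, model dimensions, and learning rates are assumptions.
import torch
import torch.nn as nn

VOCAB, EMB, HID = 1000, 64, 128

class LSTMLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMB)
        self.lstm = nn.LSTM(EMB, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, tokens):
        h, _ = self.lstm(self.embed(tokens))
        return self.out(h)

def train(model, batches, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for seq in batches:
            inputs, targets = seq[:, :-1], seq[:, 1:]
            logits = model(inputs)
            loss = loss_fn(logits.reshape(-1, VOCAB), targets.reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()

model = LSTMLanguageModel()

# Phase 1: pretrain on (placeholder) large monolingual English and Spanish data.
monolingual = [torch.randint(0, VOCAB, (8, 20)) for _ in range(50)]
train(model, monolingual, epochs=1, lr=1e-3)

# Phase 2: fine-tune on (placeholder) small code-switched data,
# typically with a smaller learning rate.
code_switched = [torch.randint(0, VOCAB, (8, 20)) for _ in range(5)]
train(model, code_switched, epochs=3, lr=1e-4)
```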

Authors (2)
  1. Hila Gonen (30 papers)
  2. Yoav Goldberg (142 papers)
Citations (28)