Analysis of Prompting Multilingual LLMs for Code-Mixed Text Generation in SEA Languages
The paper examines the challenges and possibilities of using LLMs to generate code-mixed data, focusing on Southeast Asian (SEA) languages. The research probes the potential and limitations of state-of-the-art multilingual LLMs for generating code-mixed text, evaluating models such as ChatGPT, InstructGPT, BLOOMZ, and Flan-T5-XXL.
The divergence in performance among these models is stark. ChatGPT exhibits the greatest proficiency in code-mixing, although its success varies substantially across language pairs and prompt templates, especially for structurally diverse SEA languages. Notably, English-Tagalog prompts yielded weaker results, possibly owing to linguistic discrepancies between the two languages, such as differences in syntactic order and morphosyntactic alignment.
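To make the idea of a prompt template concrete, the following is a minimal, hypothetical sketch of the kind of zero-shot instruction one might send to an instruction-tuned LLM to elicit code-mixed text; the wording is an illustrative assumption, not a template taken from the paper.

```python
# Illustrative sketch of a zero-shot prompt template for eliciting
# code-mixed text from an instruction-tuned LLM. The exact wording
# is hypothetical, not drawn from the paper under review.

def build_codemix_prompt(lang_a: str, lang_b: str, topic: str) -> str:
    """Compose a prompt asking the model to mix lang_a and lang_b
    within a single sentence on the given topic."""
    return (
        f"Write one natural-sounding sentence about {topic} that "
        f"code-mixes {lang_a} and {lang_b}, switching languages "
        f"within the sentence as a bilingual speaker would."
    )

# Example: the English-Tagalog pair discussed above.
prompt = build_codemix_prompt("English", "Tagalog", "commuting in Manila")
print(prompt)
```

In practice such a string would be sent to the model's chat or completion endpoint; the paper's finding is that output quality varies sharply with both the language pair and the phrasing of this instruction.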
The investigation further extends to Singlish, an English-based creole: ChatGPT and InstructGPT generate Singlish expressions fluently, unlike BLOOMZ and Flan-T5-XXL. The profusion of Singlish in the former models' training data likely explains this gap. However, syntactic and semantic inaccuracies in seemingly fluent outputs warrant caution against using such generated text without rigorous human vetting.
A significant finding is that multilinguality does not inherently endow LLMs with the ability to generate syntactically and semantically sound code-mixed sentences. The models' training data often lack sufficient code-mixed text, and instruction tuning does not explicitly include tasks demanding code-mixed generation. Consequently, while multilingual models like BLOOMZ can process multiple languages, they struggle to integrate those languages into coherent code-mixed output within a single utterance.
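The degree of mixing within a single utterance can be quantified with the Code-Mixing Index (CMI), a standard metric in code-switching research; using it here is an assumption for illustration, not a method described in the paper. A minimal sketch, assuming tokens come pre-tagged with language labels:

```python
# Illustrative Code-Mixing Index (CMI) computation. CMI is a standard
# metric in code-switching research; its use here is an assumption for
# illustration, not part of the reviewed paper's method. Tokens are
# assumed to carry language tags; "univ" marks language-independent
# tokens (punctuation, named entities) excluded from the count.
from collections import Counter

def cmi(tagged_tokens):
    """CMI = 100 * (N - max_lang_count) / N over language-tagged tokens,
    where N excludes language-independent ('univ') tokens.
    Returns 0.0 for monolingual or empty utterances."""
    langs = [tag for _, tag in tagged_tokens if tag != "univ"]
    if not langs:
        return 0.0
    n = len(langs)
    dominant = max(Counter(langs).values())
    return 100.0 * (n - dominant) / n

# A Singlish-flavoured toy example: English matrix with Malay insertions.
utterance = [("I", "en"), ("eat", "en"), ("already", "en"),
             ("lah", "ms"), (",", "univ"), ("makan", "ms")]
print(round(cmi(utterance), 1))  # → 40.0 (higher = more even mixing)
```

A monolingual sentence scores 0, and a perfectly balanced two-language mix scores 50, so the metric captures how evenly an utterance interleaves languages but says nothing about grammaticality, which is exactly why the paper insists on human evaluation.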
The research stresses the importance of advancing language technology to handle code-mixed language as it is actually used by multilingual communities worldwide. Equipping LLMs with code-mixing abilities could make AI-driven conversational agents more inclusive and authentic, enabling more natural dialogue that mirrors the linguistic realities of their users.
While the paper is motivated by generating synthetic code-mixed data to alleviate data scarcity in SEA languages, it highlights critical obstacles to producing such text automatically. The findings emphasize the necessity of human quality checks, especially given the syntactic and semantic irregularities found in LLM-generated outputs, and underscore a pressing need for transparency about model training and data to enable the refinement of multilingual models with more nuanced linguistic understanding and generation.
Overall, the findings present a compelling case for the NLP community to prioritize research into code-mixing capabilities, alongside traditional multilingual language processing, as foundational to developing more robust, culturally aware, and linguistically versatile AI systems.