Analysis of Prompting Multilingual LLMs for Code-Mixed Text Generation in SEA Languages
The paper examines the challenges and possibilities of using LLMs to generate code-mixed data, focusing on Southeast Asian (SEA) languages. The research probes the potential and limitations of state-of-the-art multilingual LLMs for generating code-mixed text, evaluating models such as ChatGPT, InstructGPT, BLOOMZ, and Flan-T5-XXL.
The divergence in performance among these models is stark. ChatGPT exhibits the greatest proficiency in code-mixing, although its success varies substantially across language pairs and prompt templates, especially for structurally diverse SEA languages. Notably, English-Tagalog prompts yielded weaker results, possibly owing to linguistic discrepancies between the two languages, such as differences in syntactic order and morphosyntactic alignment.
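To make the idea of a prompt template concrete, the following is a minimal, hypothetical sketch of the kind of zero-shot instruction one might send to an instruction-tuned LLM to elicit code-mixed text; the wording is an illustrative assumption, not a template taken from the paper.

```python
# Illustrative sketch of a zero-shot prompt template for eliciting
# code-mixed text from an instruction-tuned LLM. The exact wording
# is hypothetical, not drawn from the paper under review.

def build_codemix_prompt(lang_a: str, lang_b: str, topic: str) -> str:
    """Compose a prompt asking the model to mix lang_a and lang_b
    within a single sentence on the given topic."""
    return (
        f"Write one natural-sounding sentence about {topic} that "
        f"code-mixes {lang_a} and {lang_b}, switching languages "
        f"within the sentence as a bilingual speaker would."
    )

# Example: the English-Tagalog pair discussed above.
prompt = build_codemix_prompt("English", "Tagalog", "commuting in Manila")
print(prompt)
```

In practice such a string would be sent to the model's chat or completion endpoint; the paper's finding is that output quality varies sharply with both the language pair and the phrasing of this instruction.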
The investigation further extends to Singlish, an English-based creole: ChatGPT and InstructGPT generate Singlish expressions fluently, unlike BLOOMZ and Flan-T5-XXL. The profusion of Singlish in the former models' training data likely explains this gap. However, syntactic and semantic inaccuracies in seemingly fluent outputs warrant caution against using such generated text without rigorous human vetting.
A significant finding is that multilinguality does not inherently endow LLMs with the ability to generate syntactically and semantically sound code-mixed sentences. The models' training data often lack sufficient code-mixed text, and instruction tuning does not explicitly include tasks demanding code-mixed generation. Consequently, while multilingual models like BLOOMZ can process multiple languages, they struggle to integrate those languages into coherent code-mixed output within a single utterance.
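The degree of mixing within a single utterance can be quantified with the Code-Mixing Index (CMI), a standard metric in code-switching research; using it here is an assumption for illustration, not a method described in the paper. A minimal sketch, assuming tokens come pre-tagged with language labels:

```python
# Illustrative Code-Mixing Index (CMI) computation. CMI is a standard
# metric in code-switching research; its use here is an assumption for
# illustration, not part of the reviewed paper's method. Tokens are
# assumed to carry language tags; "univ" marks language-independent
# tokens (punctuation, named entities) excluded from the count.
from collections import Counter

def cmi(tagged_tokens):
    """CMI = 100 * (N - max_lang_count) / N over language-tagged tokens,
    where N excludes language-independent ('univ') tokens.
    Returns 0.0 for monolingual or empty utterances."""
    langs = [tag for _, tag in tagged_tokens if tag != "univ"]
    if not langs:
        return 0.0
    n = len(langs)
    dominant = max(Counter(langs).values())
    return 100.0 * (n - dominant) / n

# A Singlish-flavoured toy example: English matrix with Malay insertions.
utterance = [("I", "en"), ("eat", "en"), ("already", "en"),
             ("lah", "ms"), (",", "univ"), ("makan", "ms")]
print(round(cmi(utterance), 1))  # → 40.0 (higher = more even mixing)
```

A monolingual sentence scores 0, and a perfectly balanced two-language mix scores 50, so the metric captures how evenly an utterance interleaves languages but says nothing about grammaticality, which is exactly why the paper insists on human evaluation.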
The research stresses the importance of advancing language technology to handle code-mixed language as it is actually used by multilingual communities worldwide. Equipping LLMs with code-mixing abilities could make AI-driven conversational agents more inclusive and authentic, enabling more natural dialogue that mirrors the linguistic realities of their users.
While the paper is motivated by generating synthetic code-mixed data to alleviate data scarcity in SEA languages, it highlights critical obstacles to producing such text automatically. The findings emphasize the necessity of human quality checks, especially given the syntactic and semantic irregularities found in LLM-generated outputs, and underscore a pressing need for transparency about model training and data to enable the refinement of multilingual models with more nuanced linguistic understanding and generation.
Overall, the findings present a compelling case for the NLP community to prioritize research into code-mixing capabilities, alongside traditional multilingual language processing, as foundational to developing more robust, culturally aware, and linguistically versatile AI systems.