Linguistics Theory Meets LLM: Code-Switched Text Generation via Equivalence Constrained Large Language Models (2410.22660v1)
Abstract: Code-switching, the phenomenon of alternating between two or more languages within a single conversation, presents unique challenges for NLP. Most existing research focuses on either syntactic constraints or neural generation, with few efforts to integrate linguistic theory with LLMs for generating natural code-switched text. In this paper, we introduce EZSwitch, a novel framework that combines Equivalence Constraint Theory (ECT) with LLMs to produce linguistically valid and fluent code-switched text. We evaluate our method using both human judgments and automatic metrics, demonstrating a significant improvement in the quality of generated code-switched sentences over baseline LLMs. To address the lack of suitable evaluation metrics, we conduct a comprehensive correlation study of various automatic metrics against human scores, revealing that current metrics often fail to capture the nuanced fluency of code-switched text. Additionally, we create CSPref, a human preference dataset built from human ratings, and analyze model performance across hard and easy examples. Our findings indicate that incorporating linguistic constraints into LLMs leads to more robust and human-aligned generation, paving the way for scalable code-switched text generation across diverse language pairs.