
Code-Mixed Probes Show How Pre-Trained Models Generalise On Code-Switched Text (2403.04872v2)

Published 7 Mar 2024 in cs.CL

Abstract: Code-switching is a prevalent linguistic phenomenon in which multilingual individuals seamlessly alternate between languages. Despite its widespread use online and recent research trends in this area, research in code-switching presents unique challenges, primarily stemming from the scarcity of labelled data and available resources. In this study we investigate how pre-trained language models (PLMs) handle code-switched (CS) text in three dimensions: a) the ability of PLMs to detect code-switched text, b) variations in the structural information that PLMs utilise to capture code-switched text, and c) the consistency of semantic information representation in code-switched text. To conduct a systematic and controlled evaluation of the language models in question, we create a novel dataset of well-formed naturalistic code-switched text along with parallel translations into the source languages. Our findings reveal that pre-trained language models are effective in generalising to code-switched text, shedding light on the abilities of these models to generalise representations to CS corpora. We release all our code and data, including the novel corpus, at https://github.com/francesita/code-mixed-probes.
