- The paper introduces the BiLingua Parser, which uses few-shot prompting with large language models (LLMs) to annotate Universal Dependencies (UD) syntax for complex code-switched text.
- It achieves up to 95.29% Labeled Attachment Score after expert revision, surpassing traditional monolingual baselines.
- This methodology enables efficient syntactic analysis in low-resource language pairs, facilitating advances in multilingual NLP research.
Parsing the Switch: LLM-Based UD Annotation for Complex Code-Switched and Low-Resource Languages
In this paper, the authors address the challenge of syntactic analysis in code-switched and low-resource language settings using LLMs. The work is motivated by code-switching (CSW), a phenomenon prevalent in multilingual communities but underexplored in syntactic parsing due to the scarcity of annotated data. Existing approaches typically rely on monolingual treebanks, which do not generalize well to multilingual or mixed-language input. To bridge this gap, the paper introduces the BiLingua Parser, a novel LLM-based annotation pipeline for generating Universal Dependencies (UD) annotations for code-switched text, focusing on Spanish-English and Spanish-Guaraní data.
Methodology and Key Contributions
The BiLingua Parser departs from traditional supervised parsers by leveraging few-shot prompting with LLMs, supplemented by expert review. This approach minimizes reliance on large amounts of training data, making it particularly advantageous for low-resource language pairs. The researchers released two newly annotated datasets, including the first Spanish-Guaraní UD-parsed corpus. In addition, a linguistic case study examines the syntactic structures most common at code-switch boundaries, providing insights into cross-linguistic switching patterns.
The paper's methodology combines:
- Prompt-based Annotation: GPT-4.1 generates UD annotations from prompts that pair linguistic guidelines with targeted few-shot examples, covering code-switched constructions typical of spontaneous and informal registers, such as ellipses and contractions (a minimal sketch of this step follows this list).
- Evaluation and Review: Annotations are revised by language experts to ensure accuracy, yielding Labeled Attachment Scores (LAS) of up to 95.29% after revision, significantly above baseline models. This process underscores the importance of aligning machine-generated output with human linguistic judgment, particularly for semantically similar labels that strict metrics would otherwise over-penalize.
- Structural and Syntactic Analysis: By analyzing switch points, the paper identifies the syntactic roles most frequently involved in switching, such as determiner positions and object slots (a sketch of such a switch-point tally also appears below). This analysis deepens understanding of syntactic behavior across language pairs, with the Spanish-Guaraní data exhibiting a broader range of switch sites than anticipated.
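To make the prompt-based annotation step concrete, here is a minimal sketch of what few-shot UD annotation with an LLM could look like. The prompt wording, the few-shot example, and the helper name are illustrative assumptions rather than the paper's actual prompt; only the model name (GPT-4.1) and the CoNLL-U output format come from the paper.

```python
# Minimal sketch of prompt-based UD annotation with an LLM (illustrative only).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical few-shot example: a Spanish-English code-switched sentence with
# a gold CoNLL-U annotation, used to anchor the output format.
FEW_SHOT_EXAMPLE = """Sentence: Quiero comprar the tickets mañana.
CoNLL-U:
1\tQuiero\tquerer\tVERB\t_\t_\t0\troot\t_\t_
2\tcomprar\tcomprar\tVERB\t_\t_\t1\txcomp\t_\t_
3\tthe\tthe\tDET\t_\t_\t4\tdet\t_\t_
4\ttickets\tticket\tNOUN\t_\t_\t2\tobj\t_\t_
5\tmañana\tmañana\tADV\t_\t_\t2\tadvmod\t_\t_
6\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_
"""

SYSTEM_PROMPT = (
    "You are a Universal Dependencies annotator for code-switched text. "
    "For each sentence, return one CoNLL-U line per token with the columns "
    "ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC."
)

def annotate(sentence: str) -> str:
    """Ask the model for a CoNLL-U annotation of one code-switched sentence."""
    response = client.chat.completions.create(
        model="gpt-4.1",  # model named in the paper
        temperature=0,    # keep the annotation output as deterministic as possible
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": FEW_SHOT_EXAMPLE + "\nSentence: " + sentence + "\nCoNLL-U:"},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(annotate("No sé si voy a ir to the party."))
```

In the paper's pipeline, such raw model outputs are then passed to bilingual experts for revision rather than used as-is.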
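The switch-point tally from the structural analysis can likewise be sketched. The code below assumes each token's MISC column carries a language tag such as Lang=es or Lang=gn; the encoding actually used in the released corpora may differ, and the file name is hypothetical.

```python
# Sketch of a switch-point analysis over CoNLL-U output.
# Assumes each token's MISC column carries a tag like "Lang=es"; the actual
# field name and values in the released corpora may differ.
from collections import Counter

def read_conllu(path):
    """Yield sentences as lists of token dicts from a CoNLL-U file."""
    sentence = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                if sentence and not line:
                    yield sentence
                    sentence = []
                continue
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:
                continue  # skip multiword-token and empty-node lines
            sentence.append({
                "id": int(cols[0]), "form": cols[1], "head": int(cols[6]),
                "deprel": cols[7], "misc": cols[9],
            })
        if sentence:
            yield sentence

def token_lang(token):
    """Extract the language tag from MISC, e.g. 'Lang=es' -> 'es'."""
    for item in token["misc"].split("|"):
        if item.startswith("Lang="):
            return item.split("=", 1)[1]
    return None

def switch_point_relations(path):
    """Count dependency relations where a token's language differs from its head's."""
    counts = Counter()
    for sentence in read_conllu(path):
        by_id = {tok["id"]: tok for tok in sentence}
        for tok in sentence:
            head = by_id.get(tok["head"])
            if head and token_lang(tok) and token_lang(head) and token_lang(tok) != token_lang(head):
                counts[tok["deprel"]] += 1
    return counts

if __name__ == "__main__":
    # Hypothetical file name; point this at one of the released corpora.
    print(switch_point_relations("spanish_guarani.conllu").most_common(10))
```

Ranking the resulting relation counts is one way to surface which syntactic roles, such as determiners or objects, most often host a switch.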
Results and Implications
The experimental results highlight the practicality of LLMs in bootstrapping linguistic resources under conditions previously deemed prohibitive due to data limitations. By attaining annotation accuracy competitive with, and in some cases exceeding, existing benchmarks, the BiLingua Parser demonstrates the capacity of LLMs to mitigate the challenges posed by typologically diverse and low-resource languages.
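For reference, the LAS figure cited above counts a token as correct only when both its predicted head and its dependency label match the gold annotation. A minimal sketch, assuming two parallel CoNLL-U files with identical tokenization (the file names are hypothetical):

```python
# Minimal LAS computation over two parallel CoNLL-U files (sketch; assumes
# identical tokenization and sentence order in the gold and predicted files).
def load_tokens(path):
    """Return (HEAD, DEPREL) pairs for every word line in a CoNLL-U file."""
    tokens = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue
            cols = line.split("\t")
            if "-" in cols[0] or "." in cols[0]:
                continue  # skip multiword-token and empty-node lines
            tokens.append((cols[6], cols[7]))
    return tokens

def labeled_attachment_score(gold_path, pred_path):
    """LAS = percentage of tokens with both head and deprel correct."""
    gold, pred = load_tokens(gold_path), load_tokens(pred_path)
    assert len(gold) == len(pred), "files must share the same tokenization"
    correct = sum(1 for g, p in zip(gold, pred) if g == p)
    return 100.0 * correct / len(gold)

if __name__ == "__main__":
    # Hypothetical file names for illustration.
    print(f"LAS: {labeled_attachment_score('gold.conllu', 'predicted.conllu'):.2f}%")
```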
The paper's implications are multifaceted:
- Practical: The capability to generate high-quality annotations with minimal supervision is transformative for NLP applications involving low-resource languages. This can catalyze the development of linguistic tools, improved machine translation, and enhanced linguistic research for these languages.
- Theoretical: The findings stress the need for flexible evaluation metrics that accommodate the nuances of bilingual and code-switched text, highlighting the limitations of conventional LAS metrics in capturing the breadth of syntactic variation.
Looking forward, the techniques and insights from this research could inspire further advancements in syntactic parsing models, potentially extending to other underrepresented multilingual contexts. As the reliance on LLMs grows, their integration into complex linguistic tasks may expand the boundaries of NLP, providing unprecedented access to linguistic diversity. This paper serves as both a methodological blueprint and an impetus for ongoing innovations in the intersection of computational linguistics and multilingualism.