EuskañolDS: A Naturally Sourced Corpus for Basque-Spanish Code-Switching (2502.03188v1)

Published 5 Feb 2025 in cs.CL and cs.AI

Abstract: Code-switching (CS) remains a significant challenge in NLP, mainly due to a lack of relevant data. In the context of the contact between the Basque and Spanish languages in the north of the Iberian Peninsula, CS frequently occurs in both formal and informal spontaneous interactions. However, resources to analyse this phenomenon and support the development and evaluation of models capable of understanding and generating code-switched language for this language pair are almost non-existent. We introduce a first approach to develop a naturally sourced corpus for Basque-Spanish code-switching. Our methodology consists of identifying CS texts from previously available corpora using language identification models, which are then manually validated to obtain a reliable subset of CS instances. We present the properties of our corpus and make it available under the name EuskañolDS.

Summary

The paper introduces EuskañolDS, a new naturally sourced corpus designed to address the scarcity of data for Basque-Spanish code-switching in NLP research.
Corpus creation involved identifying potential code-switching instances in existing corpora using language identification models, followed by manual validation to obtain a reliable subset.
The EuskañolDS corpus is publicly available and applicable for training code-switching language models, evaluating machine translation systems, and conducting linguistic analysis of code-switching phenomena.

The "EuskañolDS: A Naturally Sourced Corpus for Basque-Spanish Code-Switching" paper (2502.03188) addresses the scarcity of data for code-switching (CS) in NLP, particularly for the Basque-Spanish language pair, by introducing a new corpus named EuskañolDS. This corpus aims to support the development and evaluation of models capable of understanding and generating code-switched language, which is prevalent in both formal and informal interactions in the Basque region.

Methodology for Corpus Creation

The methodology employed to construct the EuskañolDS corpus involves several key steps, focusing on naturally sourced data to ensure real-world relevance:

Identification of CS Texts: The initial phase involves leveraging existing corpora to identify potential code-switching instances. Language identification models are applied to these corpora to detect texts that exhibit characteristics of both Basque and Spanish. This automated approach serves as an efficient filter for pinpointing relevant data segments.
Manual Validation: Recognizing the limitations of automated language identification, the identified CS texts undergo a rigorous manual validation process. Human annotators review each instance to confirm the presence of genuine code-switching and ensure the reliability of the corpus. This step is crucial for maintaining high data quality and minimizing errors.
Corpus Compilation: The validated code-switching instances are then compiled into the EuskañolDS corpus. This corpus is structured to facilitate its use in NLP tasks, such as training LLMs, evaluating machine translation systems, and analyzing code-switching phenomena.
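
The automatic filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the two word lists are toy stand-ins for a real language identification model, and the names `tag_tokens`, `is_cs_candidate`, and `select_candidates` are hypothetical. A text is kept as a candidate for manual validation when it contains tokens attributed to both languages.

```python
import re

# Toy lexicons standing in for a real language identification model.
# They are illustrative assumptions, not part of EuskañolDS.
BASQUE_WORDS = {"eta", "da", "bat", "ez", "gaur", "etxera", "naiz", "kaixo", "bihar", "dut"}
SPANISH_WORDS = {"y", "es", "un", "no", "hoy", "pero", "estoy", "hola", "porque", "muy"}

TOKEN_RE = re.compile(r"\w+")  # \w is Unicode-aware for str patterns


def tag_tokens(text):
    """Label each token 'eu', 'es', or 'other' by lexicon lookup."""
    tags = []
    for token in TOKEN_RE.findall(text.lower()):
        if token in BASQUE_WORDS:
            tags.append((token, "eu"))
        elif token in SPANISH_WORDS:
            tags.append((token, "es"))
        else:
            tags.append((token, "other"))
    return tags


def is_cs_candidate(text, min_per_language=1):
    """Flag a text when both languages reach the minimum token count."""
    langs = [lang for _, lang in tag_tokens(text)]
    return (langs.count("eu") >= min_per_language
            and langs.count("es") >= min_per_language)


def select_candidates(texts, min_per_language=1):
    """Return the texts that would be passed on to manual validation."""
    return [t for t in texts if is_cs_candidate(t, min_per_language)]


texts = [
    "Kaixo, gaur etxera noa pero estoy muy cansado",  # mixed
    "Hola, hoy no es un buen día",                    # Spanish only
    "Bihar etxera joango naiz",                       # Basque only
]
candidates = select_candidates(texts)
print(candidates)
```

Raising `min_per_language` makes the filter stricter, which reduces false positives caused by shared words or named entities at the cost of missing short switches. Either way, the output is only a candidate list; the manual validation step remains the source of reliability.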

Corpus Properties and Availability

The EuskañolDS corpus possesses specific properties designed to make it a valuable resource for NLP research. The size and composition of the corpus are detailed in the paper, including statistics on the frequency of code-switching, the distribution of Basque and Spanish words, and the types of linguistic structures present. The corpus is made publicly available to encourage its use by researchers and practitioners.

Practical Applications and Implementation

The EuskañolDS corpus can be applied to various practical NLP tasks. Here's how it can be implemented:

Training Code-Switching LLMs

Data Preparation: The corpus needs to be preprocessed into a suitable format for training LLMs. This involves tokenization, lowercasing, and the creation of vocabulary sets for both Basque and Spanish. Subword tokenization techniques like Byte Pair Encoding (BPE) or WordPiece can handle out-of-vocabulary words and morphologically rich words.
Model Selection: Common choices include Transformer-based language models and recurrent neural networks (RNNs) such as LSTMs or GRUs. Transformers are often preferred because they capture long-range dependencies and can be trained in parallel across sequence positions.

Training Procedure: The LLM is trained to predict the next word in a sequence, given the preceding words. The training loss is typically cross-entropy, and optimization is performed using algorithms like Adam. Regularization techniques such as dropout and weight decay are used to prevent overfitting.

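from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Load a pre-trained causal (decoder-only) multilingual model and its tokenizer.
# BERT-style checkpoints are masked LMs and do not fit next-word prediction.
model_name = "bigscience/bloom-560m"  # Assumption: any multilingual causal LM would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Load the corpus. The file names are placeholders; each record is assumed
# to carry a "text" field.
euskanol_dataset = load_dataset(
    "json",
    data_files={"train": "euskanolds_train.jsonl", "test": "euskanolds_test.jsonl"},
)

# Tokenize the text; padding is left to the data collator
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=256)

tokenized_datasets = euskanol_dataset.map(tokenize_function, batched=True)

# Pads each batch and copies input_ids into labels for the causal LM loss
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # called evaluation_strategy in older transformers releases
    learning_rate=2e-5,
    num_train_epochs=3,
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
)

trainer.train()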
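
The subword tokenization mentioned under Data Preparation can be illustrated with a toy Byte Pair Encoding learner. This is a simplified sketch for intuition only: it works on characters rather than bytes, omits end-of-word markers, and the function names are hypothetical. In practice a pre-trained tokenizer or a dedicated tokenizer library would be used.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges, most frequent adjacent pair first."""
    # Each word starts as a tuple of single characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Three inflected forms of Basque "etxe" (house) share a stem.
merges = learn_bpe(["etxea", "etxera", "etxetik"], num_merges=3)
segments = apply_bpe("etxeko", merges)
print(merges)
print(segments)
```

With these three training words the learner merges the shared stem first, so an unseen form such as "etxeko" is split into a known stem plus leftover characters instead of becoming a single out-of-vocabulary token. That behaviour is what makes subword methods attractive for a morphologically rich language like Basque.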

Evaluating Machine Translation Systems

Test Set Creation: A subset of the EuskañolDS corpus can be reserved as a test set for evaluating machine translation systems. This test set should contain code-switched sentences that are representative of real-world usage.
Translation: The machine translation system is used to translate sentences from either Basque to Spanish or Spanish to Basque.
Evaluation Metrics: The quality of the translation is assessed using metrics such as BLEU, METEOR, and TER. These metrics compare the translated output to reference translations and provide a quantitative measure of translation accuracy. Human evaluation can provide a more nuanced assessment of translation quality, particularly in the context of code-switching.
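
To make the BLEU metric concrete, here is a simplified, unsmoothed corpus-level BLEU in plain Python. It is a sketch under several assumptions: one reference per hypothesis, whitespace tokenization, and reference translations being available at all, which this summary does not confirm for EuskañolDS. For reported results, an established implementation such as sacreBLEU should be preferred.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus-level BLEU with a single reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Counter intersection clips each n-gram count by the reference count.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(totals) == 0 or min(matches) == 0:
        return 0.0  # without smoothing, one empty n-gram order zeroes the score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


reference = ["bihar goizean etxera joango naiz lagunekin"]
exact = corpus_bleu(reference, reference)
partial = corpus_bleu(["bihar goizean etxera joango naiz zurekin"], reference)
disjoint = corpus_bleu(["hola hoy no es un día"], reference)
print(exact, partial, disjoint)
```

Because the score is driven by exact n-gram overlap, a system that leaves a switched segment untranslated is penalised only through the tokens that differ from the reference. This is one reason the human evaluation mentioned above is a useful complement for code-switched input.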

Analyzing Code-Switching Phenomena

Linguistic Analysis: The corpus can be used to study linguistic patterns in code-switching, such as the types of words and phrases that are most frequently switched, the syntactic structures that allow code-switching, and the sociolinguistic factors that influence code-switching behavior.
Statistical Analysis: Statistical methods can be used to quantify the frequency and distribution of code-switching patterns. This can provide insights into the dynamics of language contact and the evolution of code-switching practices.
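
Two simple quantities often used for this kind of statistical analysis are the number of switch points in an utterance and the Code-Mixing Index (CMI), defined as 100 × (1 − max_lang / (n − u)), where n is the number of tokens, u the number of language-independent tokens, and max_lang the token count of the dominant language. The sketch below assumes token-level language tags are already available; the tag names `eu`, `es`, and `other` are illustrative, and the summary does not state that the paper reports these particular measures.

```python
from collections import Counter


def switch_points(tags, languages=("eu", "es")):
    """Count language changes between consecutive language-tagged tokens."""
    langs = [tag for tag in tags if tag in languages]
    return sum(1 for prev, cur in zip(langs, langs[1:]) if prev != cur)


def code_mixing_index(tags, languages=("eu", "es")):
    """CMI = 100 * (1 - max_lang / (n - u)); 0 when no token has a language tag."""
    counts = Counter(tag for tag in tags if tag in languages)
    tagged = sum(counts.values())  # this is n - u
    if tagged == 0:
        return 0.0
    return 100.0 * (1 - max(counts.values()) / tagged)


# Hypothetical tags for an utterance that starts in Basque and ends in Spanish.
mixed = ["eu", "eu", "eu", "other", "es", "es", "es", "other"]
monolingual = ["eu", "eu", "other", "eu"]

print(switch_points(mixed), code_mixing_index(mixed))
print(switch_points(monolingual), code_mixing_index(monolingual))
```

A monolingual utterance scores 0 and an evenly balanced two-language utterance scores 50, so averaging the CMI over a corpus gives a rough sense of how heavily mixed its texts are. Switch-point counts complement this by separating a single clause-level switch from frequent word-level alternation.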

Implementation Considerations

Computational Resources: Training LLMs on the EuskañolDS corpus may require significant computational resources, including GPUs and large amounts of memory. The exact requirements will depend on the size of the model and the training data.
Data Preprocessing: Careful data preprocessing is essential for achieving good performance. This includes handling noise in the data, normalizing text, and creating appropriate vocabulary sets.
Ethical Considerations: The use of code-switching data raises ethical considerations related to privacy and representation. It is important to ensure that the data is used in a responsible and ethical manner, and that the voices of code-switchers are accurately represented.
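
The preprocessing and privacy points above can be combined in a small normalization pass. The sketch below is only one possible approach: it assumes the texts may contain URLs and @-style user mentions, which this summary does not confirm, and the placeholder tokens `<URL>` and `<USER>` are arbitrary choices.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
SPACE_RE = re.compile(r"\s+")


def normalize_text(text):
    """Apply Unicode NFC, mask URLs and user mentions, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = URL_RE.sub("<URL>", text)
    text = MENTION_RE.sub("<USER>", text)
    return SPACE_RE.sub(" ", text).strip()


def deduplicate(texts):
    """Keep the first occurrence of each text, compared after normalization."""
    seen = set()
    unique = []
    for text in texts:
        cleaned = normalize_text(text)
        key = cleaned.casefold()
        if key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


raw = [
    "Kaixo  @ane,   mira esto https://example.com/x ",
    "kaixo @jon, mira esto https://example.com/y",
    "Euskan\u0303ol",  # "n" followed by a combining tilde
]
cleaned = deduplicate(raw)
print(cleaned)
```

NFC normalization matters for this language pair because "ñ" can be encoded either as one code point or as "n" plus a combining tilde, and the two forms would otherwise be treated as different strings. Masking mentions removes one obvious kind of personal identifier, but it is not a substitute for a proper privacy review.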

Conclusion

The EuskañolDS corpus represents a valuable contribution to the field of NLP, providing a much-needed resource for studying and modeling code-switching between Basque and Spanish. The methodology used to create the corpus, the properties of the corpus, and its potential applications make it a significant tool for researchers and practitioners working in this area.