- The paper demonstrates that model merging consistently outperforms full fine-tuning and CPT on code-mixed tasks by leveraging unlabeled data more effectively.
- The methodology employs Task Arithmetic and TIES to merge model weights, preserving monolingual competence while integrating code-mixed information.
- Empirical results reveal gains of +2–5 F1 over full fine-tuning and robust transfer performance to low-resource language pairs.
Model Merging for Code-Mixed Multilingual NLP: A Technical Analysis
Introduction
This paper presents a systematic evaluation of model merging as an adaptation strategy for code-mixed NLP tasks, specifically focusing on sentence classification (sentiment and hate speech) in English-Hindi (En-Hi) and English-Spanish (En-Es) language pairs. The paper contrasts model merging with conventional approaches such as full fine-tuning and continued pre-training (CPT), using state-of-the-art multilingual models (XLM-R, Llama-3.2-1B). The central hypothesis is that model merging can more effectively integrate code-mixed knowledge while preserving monolingual capabilities, thereby yielding superior performance in both direct and transfer settings.
Methodological Framework
Model Adaptation Strategies
The paper evaluates three principal adaptation strategies:
- Full Fine-Tuning (FullFT): Direct fine-tuning of a base multilingual model on labeled code-mixed data.
- Continued Pre-Training (CPT->FT): Unsupervised pre-training on unlabeled code-mixed corpora, followed by supervised fine-tuning (see the sketch after this list).
- Model Merging: Integration of model weights using Task Arithmetic and TIES (TrIm, Elect Sign & Merge), combining checkpoints adapted via CPT and/or monolingual fine-tuning.
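As a rough illustration of the CPT step, the sketch below continues masked-language-model training of XLM-R on an unlabeled code-mixed corpus with Huggingface Transformers; the corpus path and hyperparameters are placeholders, not the paper's settings.

```python
# Continued pre-training (CPT) sketch: masked-LM training of XLM-R on unlabeled
# code-mixed text. File path and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# Unlabeled code-mixed corpus, one sentence per line (placeholder path).
raw = load_dataset("text", data_files={"train": "cm_unlabeled.txt"})
tokenized = raw.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128),
                    batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
args = TrainingArguments(output_dir="xlmr-cpt", per_device_train_batch_size=32,
                         num_train_epochs=1, learning_rate=5e-5)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
model.save_pretrained("xlmr-cpt")  # checkpoint later used as a merging ingredient
```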
Model merging is operationalized by computing task vectors (the adapted model's weights minus the base model's weights) and merging them via scaled vector addition (Task Arithmetic) or magnitude- and sign-based selection (TIES). The merged model is then fine-tuned on the target code-mixed task.
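These weight-space operations reduce to simple tensor arithmetic. Below is a minimal, self-contained sketch of Task Arithmetic and a simplified TIES merge over state-dict-like weight mappings; the scaling factor λ and the TIES density are illustrative defaults, not the paper's tuned values.

```python
# Minimal weight-space merging sketch in plain PyTorch. `base` and each adapted
# checkpoint are {name: tensor} mappings (e.g., from model.state_dict()).
import torch

def task_vector(base, adapted):
    """Task vector: adapted weights minus base weights, per parameter tensor."""
    return {k: adapted[k] - base[k] for k in base}

def task_arithmetic_merge(base, vectors, lam=0.5):
    """Task Arithmetic: add the scaled sum of task vectors to the base weights."""
    merged = {k: v.clone() for k, v in base.items()}
    for vec in vectors:
        for k in merged:
            merged[k] = merged[k] + lam * vec[k]
    return merged

def ties_merge(base, vectors, lam=0.5, density=0.2):
    """Simplified TIES: trim small-magnitude entries, elect a sign per entry by
    summed value, then average only the trimmed entries that agree with it."""
    merged = {}
    for k in base:
        stacked = torch.stack([v[k].float() for v in vectors])    # (n_models, ...)
        flat = stacked.reshape(stacked.shape[0], -1).abs()
        kth = max(1, int((1.0 - density) * flat.shape[1]))
        threshold = flat.kthvalue(kth, dim=1).values               # per-model cutoff
        keep = stacked.abs() >= threshold.view(-1, *([1] * (stacked.dim() - 1)))
        trimmed = stacked * keep
        sign = torch.sign(trimmed.sum(dim=0))                      # elected sign per entry
        agree = (torch.sign(trimmed) == sign) & keep
        counts = agree.sum(dim=0).clamp(min=1)
        delta = (trimmed * agree).sum(dim=0) / counts
        merged[k] = base[k] + lam * delta.to(base[k].dtype)
    return merged
```

In practice, `base` and each adapted checkpoint would come from `model.state_dict()`, and the merged dictionary is loaded back with `model.load_state_dict(merged)` before the final fine-tuning step; MergeKit provides production implementations of both merge methods.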
Experimental Setup
- Models: XLM-R, mBERT, Llama-3.2-1B (adaptation), Llama-3.2-3B/8B/70B (inference).
- Datasets: GLUECoS, SentiMix, Prabhu et al. (En-Hi), Patwa et al. (En-Es), SST5 (English), Chakravarthi et al. (En-Ta, En-Ml), Das et al. (unlabeled code-mixed).
- Tasks: Sentiment analysis (3-class), hate speech detection.
- Metrics: Macro F1 score; model selection via early stopping and hyperparameter search on validation data (see the sketch after this list).
- Implementation: PyTorch, Huggingface Transformers, PEFT, MergeKit.
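A compact sketch of this fine-tuning and evaluation loop with Huggingface Transformers is shown below; the toy examples, model choice, and hyperparameters are illustrative placeholders rather than the paper's exact configuration.

```python
# Supervised fine-tuning sketch: 3-class code-mixed sentiment classification,
# scored with macro F1 and using early stopping on the dev metric.
import numpy as np
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, EarlyStoppingCallback,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("xlm-roberta-base", num_labels=3)

# Toy code-mixed examples standing in for the SentiMix-style train/dev splits.
raw = Dataset.from_dict({
    "text": ["yaar this movie was bahut accha", "kya bakwas film, total waste", "it was thik thak"],
    "label": [2, 0, 1],
})
ds = raw.map(lambda b: tokenizer(b["text"], truncation=True, max_length=128), batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"macro_f1": f1_score(labels, np.argmax(logits, axis=-1), average="macro")}

args = TrainingArguments(
    output_dir="cm-sentiment",
    eval_strategy="epoch",            # named evaluation_strategy in older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="macro_f1",
    num_train_epochs=10,
)
trainer = Trainer(model=model, args=args, train_dataset=ds, eval_dataset=ds,
                  data_collator=DataCollatorWithPadding(tokenizer),
                  compute_metrics=compute_metrics,
                  callbacks=[EarlyStoppingCallback(early_stopping_patience=3)])
trainer.train()
```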
Empirical Results
Model merging consistently outperforms both full fine-tuning and CPT->FT across all evaluated models and datasets. Gains of +2–5 F1 over FullFT and +1–2 F1 over CPT->FT are observed, indicating that model merging leverages unlabeled code-mixed data more efficiently than CPT alone. Task Arithmetic generally yields higher scores than TIES, though the difference is model- and task-dependent.
Zero- and few-shot prompting with large LLMs (Llama-3.3-70B) underperforms fine-tuned and merged models, with F1 scores plateauing at higher k-shot levels. This underscores the limitations of in-context learning for code-mixed inputs, even with increased model capacity.
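For reference, a k-shot prompt for this setting can be assembled as in the sketch below; the instruction wording, demonstrations, and label set are assumptions for illustration, not the paper's prompts.

```python
# Illustrative k-shot prompt construction for code-mixed sentiment classification.
# Demonstration examples and wording are placeholders, not the paper's prompts.
def build_prompt(demos, query, k=4):
    """demos: list of (text, label) pairs; query: code-mixed sentence to classify."""
    lines = ["Classify the sentiment of the code-mixed sentence as negative, neutral, or positive.", ""]
    for text, label in demos[:k]:
        lines += [f"Sentence: {text}", f"Sentiment: {label}", ""]
    lines += [f"Sentence: {query}", "Sentiment:"]
    return "\n".join(lines)

demos = [("movie bahut accha tha, loved it", "positive"),
         ("kya bakwas film, total waste", "negative")]
print(build_prompt(demos, "story thik thi but acting was great", k=2))
```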
Data Regime Analysis
- Labeled Data Only: Full fine-tuning remains optimal.
- Labeled + Unlabeled Data: Model merging (CPT checkpoint merged with base model, then fine-tuned) is most effective.
- Transfer-Only (Low-Resource): Merged models trained on code-mixed resources transfer more robustly to new language pairs (e.g., En-Ta, En-Ml), outperforming monolingual-English baselines by 5–13 F1 points.
Combining monolingual and code-mixed resources via merging does not yield substantial additional gains over CPT->FT, suggesting diminishing returns for multi-source merging in this context.
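The regime-to-recipe mapping above can be restated as a small lookup helper; the function and regime names are this summary's shorthand, not an artifact of the paper.

```python
# Compact restatement of the data-regime recipes; names are illustrative only.
def adaptation_recipe(has_labeled_cm: bool, has_unlabeled_cm: bool) -> str:
    if has_labeled_cm and has_unlabeled_cm:
        return "CPT on unlabeled code-mixed data -> merge with base -> fine-tune on labels"
    if has_labeled_cm:
        return "Full fine-tuning on the labeled code-mixed data"
    if has_unlabeled_cm:
        return "Merge a code-mixed-adapted checkpoint and transfer from a related pair"
    return "Zero-/few-shot prompting (weakest option in the paper's experiments)"
```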
Corpus Variability
The choice of corpus for CPT (real vs. synthetic code-mixed data) has only marginal impact on downstream performance, with classification results varying by ~1 F1 point. This suggests that the merging strategy is robust to moderate distributional shifts in the adaptation corpus.
Implementation Considerations
- Computational Efficiency: Model merging is computationally inexpensive, requiring only arithmetic over model weights followed by fine-tuning, with no access to the original training data.
- Scalability: The paper is limited to models ≤1B parameters for adaptation due to resource constraints. Scaling merging methods to larger models (>1B) remains an open challenge.
- Hyperparameter Sensitivity: Optimal scaling factors for task vectors (λ in Task Arithmetic) must be tuned on validation data, introducing additional complexity (see the sketch after this list).
- Task Generalizability: The findings are restricted to sentence classification; extension to other NLP tasks (e.g., sequence labeling, generation) requires further investigation.
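A minimal sketch of the λ tuning loop, assuming the `task_arithmetic_merge` helper from the earlier sketch and a caller-supplied `evaluate_macro_f1` function (a hypothetical name) that fine-tunes the merged weights and scores the validation split:

```python
# Hypothetical grid search over the task-vector scaling factor lambda.
# `task_arithmetic_merge` is the sketch above; `evaluate_macro_f1` is an assumed
# helper that loads the merged weights, fine-tunes, and returns dev macro F1.
def tune_lambda(base, vectors, evaluate_macro_f1, grid=(0.2, 0.4, 0.6, 0.8, 1.0)):
    best_lam, best_f1 = None, float("-inf")
    for lam in grid:
        merged = task_arithmetic_merge(base, vectors, lam=lam)
        f1 = evaluate_macro_f1(merged)   # validation macro F1 after fine-tuning
        if f1 > best_f1:
            best_lam, best_f1 = lam, f1
    return best_lam, best_f1
```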
Theoretical and Practical Implications
The results substantiate the claim that model merging is a viable and often superior alternative to conventional adaptation strategies for code-mixed NLP, especially under resource constraints. The ability to integrate heterogeneous sources (monolingual, code-mixed, unlabeled) into a unified model without catastrophic forgetting or loss of monolingual competence is particularly valuable for low-resource and transfer scenarios.
From a theoretical perspective, the paper reinforces the utility of modular adaptation and weight-space arithmetic for domain and language transfer, aligning with recent advances in model editing and fusion. Practically, the findings provide actionable adaptation recipes for varying data regimes, guiding practitioners in selecting optimal strategies based on resource availability.
Future Directions
- Scaling to Larger Models: Investigate the behavior of model merging for code-mixed tasks in LLMs >1B parameters.
- Task Expansion: Evaluate merging strategies for a broader range of NLP tasks beyond sentence classification.
- Unlabeled-Only Regimes: Explore adaptation when only unlabeled code-mixed data is available, without task-specific labels.
- Optimal Merging Algorithms: Develop more sophisticated merging algorithms to resolve parameter interference and maximize transferability.
Conclusion
This paper demonstrates that model merging, via Task Arithmetic and TIES, is an effective, resource-efficient strategy for adapting multilingual models to code-mixed NLP tasks. Merged models consistently outperform traditional fine-tuning and CPT approaches, particularly in low-resource and cross-lingual transfer settings. The approach is robust to corpus variability and offers practical adaptation recipes for diverse data regimes. Future work should address scalability, task generalizability, and merging algorithm optimization to further advance code-mixed NLP.