Composable Sparse Fine-Tuning for Cross-Lingual Transfer (2110.07560v2)

Published 14 Oct 2021 in cs.CL

Abstract: Fine-tuning the entire set of parameters of a large pretrained model has become the mainstream approach for transfer learning. To increase its efficiency and prevent catastrophic forgetting and interference, techniques like adapters and sparse fine-tuning have been developed. Adapters are modular, as they can be combined to adapt a model towards different facets of knowledge (e.g., dedicated language and/or task adapters). Sparse fine-tuning is expressive, as it controls the behavior of all model components. In this work, we introduce a new fine-tuning method with both these desirable properties. In particular, we learn sparse, real-valued masks based on a simple variant of the Lottery Ticket Hypothesis. Task-specific masks are obtained from annotated data in a source language, and language-specific masks from masked language modeling in a target language. Both these masks can then be composed with the pretrained model. Unlike adapter-based fine-tuning, this method neither increases the number of parameters at inference time nor alters the original model architecture. Most importantly, it outperforms adapters in zero-shot cross-lingual transfer by a large margin in a series of multilingual benchmarks, including Universal Dependencies, MasakhaNER, and AmericasNLI. Based on an in-depth analysis, we additionally find that sparsity is crucial to prevent both 1) interference between the fine-tunings to be composed and 2) overfitting. We release the code and models at https://github.com/cambridgeltl/composable-sft.

Citations (116)

Summary

  • The paper presents LT-SFT, which composes sparse, real-valued masks learned from source-language task data and target-language text to enable zero-shot cross-lingual transfer without adding parameters at inference time.
  • The method utilizes L1 regularization and modular mask composition to maintain the pretrained model's integrity while achieving robust performance on multilingual benchmarks.
  • Experimental results show that LT-SFT surpasses adapter-based methods, with improvements of up to 3.7% in key metrics across tasks like POS tagging, dependency parsing, NER, and NLI.

Composable Sparse Fine-Tuning for Cross-Lingual Transfer

The paper "Composable Sparse Fine-Tuning for Cross-Lingual Transfer" introduces a novel approach to fine-tuning pretrained models that optimizes both modularity and expressivity without increasing parameters during inference or altering model architecture—a notable constraint of prior methods such as adapter-based fine-tuning. The research focuses on leveraging the Lottery Ticket Hypothesis to craft sparse, real-valued masks which can be composed with pretrained models to facilitate zero-shot cross-lingual transfer, aiming for significant improvements in multilingual tasks.

Methodology and Technique

The methodology builds on sparse fine-tuning and integrates it with the modularity of adapters to form the Lottery Ticket Sparse Fine-Tuning (LT-SFT) technique. This involves:

  1. Sparse Mask Learning: Task-specific masks are derived from annotated data in the source language, while language-specific masks are obtained from masked language modeling on target-language text. The pretrained parameter set is left intact; only those parameters that change most during an initial full fine-tuning pass are selected for subsequent sparse fine-tuning.
  2. Regularization and Composition: An L1 regularization term discourages deviation from the pretrained values, and the enforced sparsity minimizes interference when fine-tunings are composed across tasks and languages. The task-specific and language-specific masks, stored as sparse difference vectors, are then summed with the base model parameters to adapt the model, as sketched after this list.
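To make the two phases concrete, here is a minimal sketch of the LT-SFT learning procedure, assuming a PyTorch model and a dataloader yielding (inputs, labels) batches; the helper learn_sft, the call loss_fn(model(inputs), labels), and the placement and strength of the L1 term are illustrative assumptions rather than the paper's exact implementation.

```python
import torch

def learn_sft(model, dataloader, loss_fn, k, steps=1000, lr=2e-5, l1_lambda=0.0):
    """Return a sparse fine-tuning (SFT): differences from the pretrained
    parameters with at most ~k non-zero entries.

    Phase 1: fine-tune all parameters and record how much each one moves.
    Phase 2: rewind to the pretrained values, then fine-tune only the k
    parameters that moved the most, freezing the rest by zeroing their
    gradients. An optional L1 penalty discourages drift from the pretrained
    values.
    """
    theta0 = {n: p.detach().clone() for n, p in model.named_parameters()}

    def run(masks=None):
        # weight_decay=0 so frozen parameters stay exactly at their pretrained values
        opt = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.0)
        batches = iter(dataloader)
        for _ in range(steps):
            try:
                inputs, labels = next(batches)
            except StopIteration:
                batches = iter(dataloader)
                inputs, labels = next(batches)
            loss = loss_fn(model(inputs), labels)
            if l1_lambda > 0:  # keep tuned values close to the pretrained ones
                loss = loss + l1_lambda * sum(
                    (p - theta0[n]).abs().sum() for n, p in model.named_parameters())
            opt.zero_grad()
            loss.backward()
            if masks is not None:  # freeze non-selected parameters
                with torch.no_grad():
                    for n, p in model.named_parameters():
                        if p.grad is not None:
                            p.grad.mul_(masks[n])
            opt.step()

    # Phase 1: unrestricted fine-tuning to find the "winning ticket".
    run()

    # Select the k parameters with the largest absolute change from theta0.
    changes = torch.cat([(p.detach() - theta0[n]).abs().flatten()
                         for n, p in model.named_parameters()])
    threshold = changes.topk(k).values.min()
    masks = {n: ((p.detach() - theta0[n]).abs() >= threshold).float()
             for n, p in model.named_parameters()}

    # Rewind to the pretrained weights, then run Phase 2 with masked gradients.
    with torch.no_grad():
        for n, p in model.named_parameters():
            p.copy_(theta0[n])
    run(masks=masks)

    # The SFT is the masked difference from the pretrained parameters
    # (stored densely here for simplicity; in practice only non-zeros are kept).
    return {n: (p.detach() - theta0[n]) * masks[n]
            for n, p in model.named_parameters()}
```

A task SFT and a language SFT learned this way can then be combined at inference using the addition shown in the earlier composition sketch.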

Experimental Results

The performance of LT-SFT is rigorously evaluated across multilingual benchmarks: Universal Dependencies (POS tagging and dependency parsing), MasakhaNER (named entity recognition), and AmericasNLI (natural language inference). The method demonstrates substantial improvement over the state-of-the-art MAD-X adapter-based approach, with gains of approximately 2.5% accuracy for POS tagging, 2.5% UAS and 3.7% LAS for dependency parsing, 1.8% F1 for NER, and 1.9% accuracy for NLI.

Further analysis shows that LT-SFT is also robust to hyperparameter variation: with a fixed percentage of tunable parameters, it delivers consistent performance across settings, whereas MAD-X is more sensitive to hyperparameter choices.

Theoretical and Practical Implications

Sparsity emerges as a crucial property: it prevents both interference between the composed fine-tunings and overfitting, and it supports modular composition and systematic generalization across tasks and languages in zero-shot setups. The research also extends the applicability of the Lottery Ticket Hypothesis beyond pruning to model adaptation, demonstrating its potential in transfer learning scenarios.

Conclusive Insights

The paper makes an empirical case for LT-SFT as an effective approach to parameter-efficient model adaptation, offering modularity without any added cost at inference. It suggests further exploration of related transfer settings, including domain adaptation and multimodal learning, as natural extensions of the promising direction that sparse composable fine-tuning represents for cross-lingual models.