
On Multilingual Encoder Language Model Compression for Low-Resource Languages (2505.16956v1)

Published 22 May 2025 in cs.CL

Abstract: In this paper, we combine two-step knowledge distillation, structured pruning, truncation, and vocabulary trimming for extremely compressing multilingual encoder-only language models for low-resource languages. Our novel approach systematically combines existing techniques and takes them to the extreme, reducing layer depth, feed-forward hidden size, and intermediate layer embedding size to create significantly smaller monolingual models while retaining essential language-specific knowledge. We achieve compression rates of up to 92% with only a marginal performance drop of 2-10% in four downstream tasks, including sentiment analysis, topic classification, named entity recognition, and part-of-speech tagging, across three low-resource languages. Notably, the performance degradation correlates with the amount of language-specific data in the teacher model, with larger datasets resulting in smaller performance losses. Additionally, we conduct extensive ablation studies to identify best practices for multilingual model compression using these techniques.

Multilingual Encoder Language Model Compression for Low-Resource Languages

The paper "On Multilingual Encoder LLM Compression for Low-Resource Languages" presents a comprehensive approach to compressing multilingual LLMs while maintaining effectiveness across various linguistic tasks. The authors introduce a novel compression methodology that integrates multiple existing techniques to significantly reduce the size of LLMs suited for low-resource languages.

Methodology

The primary aim of the paper is to explore the extreme compression of multilingual encoder-only models, such as mBERT and XLM-R, targeting low-resource languages. The methodology involves:

  1. Knowledge Distillation: The researchers employ a two-step knowledge distillation process. First, the number of transformer layers is halved relative to the teacher model; the student is then trained with a combination of masked language modeling (MLM) and mean squared error (MSE) losses so that it retains critical language-specific knowledge (a loss sketch follows this list).
  2. Structured Pruning: This technique reduces the feed-forward network's intermediate size, thereby minimizing redundant capacity without substantially impacting performance.
  3. Hidden Size Truncation: The hidden dimension is compressed by retaining only the first k dimensions, which preserve the most essential representations. A second round of knowledge distillation is applied after this truncation.
  4. Vocabulary Trimming: The vocabulary is reduced to the most frequent, language-specific tokens, shrinking the embedding matrix and supporting efficient inference (steps 2-4 are illustrated in the second sketch after this list).
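
As a rough illustration of the distillation objective in step 1, the sketch below combines an MLM cross-entropy term with an MSE term between the final hidden states of student and teacher. The function name, loss weighting, and layer mapping are illustrative assumptions, not the paper's exact formulation.

```python
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, labels, mse_weight=1.0):
    """Combined MLM + MSE distillation loss (illustrative sketch).

    student_out / teacher_out are assumed to be Hugging Face model outputs
    produced with output_hidden_states=True; labels holds MLM targets with
    -100 at unmasked positions.
    """
    # Masked language modeling loss on the student's predictions.
    vocab_size = student_out.logits.size(-1)
    mlm_loss = F.cross_entropy(
        student_out.logits.view(-1, vocab_size),
        labels.view(-1),
        ignore_index=-100,
    )

    # MSE between the last hidden states of student and teacher
    # (the paper's layer-to-layer mapping may differ).
    mse_loss = F.mse_loss(
        student_out.hidden_states[-1],
        teacher_out.hidden_states[-1].detach(),
    )

    return mlm_loss + mse_weight * mse_loss
```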
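
Steps 2-4 amount to slicing the corresponding weight matrices and then re-distilling. The minimal sketch below shows the idea for feed-forward pruning and vocabulary trimming on an XLM-R checkpoint using Hugging Face transformers; the keep ratio and kept_ids are placeholder assumptions, and a real implementation would also rank neurons by importance, trim the tokenizer, and run the second distillation round.

```python
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# Step 2: structured pruning of the feed-forward intermediate size.
# Here we naively keep the first half of the intermediate neurons.
keep_ffn = model.config.intermediate_size // 2
for layer in model.roberta.encoder.layer:
    layer.intermediate.dense.weight.data = layer.intermediate.dense.weight.data[:keep_ffn, :]
    layer.intermediate.dense.bias.data = layer.intermediate.dense.bias.data[:keep_ffn]
    layer.output.dense.weight.data = layer.output.dense.weight.data[:, :keep_ffn]
model.config.intermediate_size = keep_ffn

# Step 3 (hidden-size truncation) would analogously keep only the first k
# rows/columns of every weight matrix, embedding, and LayerNorm, followed by
# the second distillation round (omitted here).

# Step 4: vocabulary trimming -- keep only token ids that occur in a
# language-specific corpus (kept_ids below is a placeholder).
kept_ids = torch.arange(0, 30000)  # placeholder; use ids observed in the corpus
emb = model.get_input_embeddings()
emb.weight.data = emb.weight.data[kept_ids]
model.config.vocab_size = len(kept_ids)
```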

Experiments and Findings

The experiments were conducted on three low-resource languages: Maltese, Slovak, and Swahili, across four downstream tasks: sentiment analysis, topic classification, named entity recognition, and part-of-speech tagging. The methodology achieved compression rates of up to 92% with only a marginal performance drop of 2-10%.

Key insights include:

  • Distillation Efficacy: Using a monolingually adapted teacher model during distillation yields better student performance than using the original multilingual teacher.
  • Initialization Strategy: Weight initialization plays a vital role; initializing the student from reused teacher layers outperforms alternatives such as random initialization.
  • Performance Correlation: The extent of degradation in performance was found to correlate with the size of language-specific data available for fine-tuning the teacher model.
  • Task Adapter Capacity: Smaller models showed improved results when task adapter capacity was increased, facilitating better knowledge retention after compression (see the sketch after this list).
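
One way to act on the adapter-capacity finding is to give bottleneck adapters a wider bottleneck via a smaller reduction factor. The sketch below uses the AdapterHub adapters library as an assumed setup; the checkpoint path, adapter name, label count, and reduction factor are illustrative, not the paper's exact configuration.

```python
import adapters
from adapters import SeqBnConfig
from transformers import AutoModelForSequenceClassification

# Hypothetical compressed student checkpoint; substitute the real model path.
model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/compressed-student", num_labels=3
)
adapters.init(model)

# A smaller reduction factor means a larger adapter, which the paper finds
# helps heavily compressed students retain task performance.
config = SeqBnConfig(reduction_factor=4)  # library default is 16
model.add_adapter("sentiment", config=config)
model.train_adapter("sentiment")  # freezes the backbone, trains only the adapter
```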

Implications

From a practical standpoint, this research supports the creation of computationally efficient models that are viable for deployment in low-resource settings, where infrastructure might be limited. The approach optimizes resource utilization while maintaining satisfactory linguistic performance across diverse tasks, promoting inclusivity and accessibility in NLP technologies.

Future Directions

The paper proposes potential enhancements such as exploring more sophisticated distillation methods and strategic neuron pruning that targets language-specific components. Moreover, refining intermediate-layer knowledge transfer could further improve compression outcomes.

This paper contributes to the broader effort to optimize language model deployment for low-resource languages. It opens avenues for developing sustainable AI applications with reduced computational demand, aligning with environmental and economic considerations.

Authors (4)
  1. Daniil Gurgurov (6 papers)
  2. Michal Gregor (11 papers)
  3. Josef van Genabith (43 papers)
  4. Simon Ostermann (26 papers)