Efficient Low-Resource Language Adaptation via Multi-Source Dynamic Logit Fusion

Published 20 Apr 2026 in cs.CL | (2604.18106v1)

Abstract: Adapting LLMs to low-resource languages (LRLs) is constrained by the scarcity of task data and computational resources. Although Proxy Tuning offers a logit-level strategy for introducing scaling effects, it often fails in LRL settings because the large model's weak LRL competence might overwhelm the knowledge of specialized smaller models. We thus propose TriMix, a test-time logit fusion framework that dynamically balances capabilities from three different sources: LRL competence from a continually pretrained small model, task competence from high-resource language instruction tuning, and the scaling benefits of large models. It is data- and compute-efficient, requiring no LRL task annotations, and only continual pretraining on a small model. Experiments across four model families and eight LRLs show that TriMix consistently outperforms single-model baselines and Proxy Tuning. Our analysis reveals that prioritizing the small LRL-specialized model's logits is crucial for success, challenging the prevalent large-model-dominant assumption.

Abstract PDF Upgrade to Chat

Authors (4)

Summary

The paper introduces TriMix, a framework that fuses logits from a continually pretrained LRL model, a large instruction-tuned model, and a small baseline to enhance adaptation.
It employs perplexity and entropy-guided weight selection to dynamically balance language competence, task execution, and scaling effects, achieving notable improvements.
The method eliminates the need for annotated task data, offering a resource-efficient approach for adapting open-source models across diverse low-resource languages.

Multi-Source Logit Fusion for Efficient Low-Resource LLM Adaptation

Motivation and Problem Formulation

The adaptation of LLMs to LRLs is severely constrained by two factors: scarcity of labeled task data and limited computational resources. Conventional approaches either require substantial annotated data for fine-tuning or demand continual pretraining on large-scale models, both of which are infeasible for most LRLs. Proxy Tuning, which fuses small domain-adapted models with larger instruction-tuned models at the logit level, has been shown effective in other domains but exhibits degradation in LRL settings due to the dominant influence of the large model’s weak target-language representations. The presented work introduces TriMix, a test-time logit fusion framework that explicitly integrates language competence, task competence, and scaling benefits, requiring only continual pretraining of small models and no annotated LRL data.

Figure 1: TriMix integrates three sources of benefit for LRL adaptation while minimizing the need for annotating task data and tuning larger models.

TriMix: Tri-Source Dynamic Logit Fusion

TriMix determines an inference-time linear combination of logits from three distinct sources: a small model continually pretrained on the LRL (language competence), a large instruction-tuned model (task and scaling competence), and a baseline small model. The logit fusion is parameterized as:

$L = \alpha L_{\text{large-ins}} + \beta L_{\text{small-cpt}} + (1-\alpha-\beta) L_{\text{small-base}}$

where $\alpha$ and $\beta$ are dynamically selected fusion weights. The decomposition is designed to disentangle and recombine language adaptation, instruction-following behavior, and scaling advantages (capacity gains), enabling systematic integration at each generation step. Efficiency considerations mean that only the small model undergoes CPT on raw LRL text, while the task and scaling effects are exploited through the large instruction-tuned model.

Figure 2: The TriMix framework dynamically fuses the logits of three models for LRL adaptation, balancing language, task, and scaling benefits.

Adaptive Weighting via Perplexity and Entropy

TriMix introduces two unsupervised hyperparameter selection strategies for the fusion coefficients:

Perplexity-guided: $\alpha$ and $\beta$ are chosen to minimize the perplexity of the prompt, allowing dynamic adaptation to the input’s distribution.
Entropy-guided: Hyperparameters are selected to minimize the entropy of the next-token distribution, aligning fusion behavior with model predictive confidence.

Empirically, minimizing perplexity over the entire prompt delivers more robust performance and better approximation to the theoretical upper bound than entropy minimization, which only considers the initial token generation.

Figure 3: Grid search reveals TriMix performance peaks at high $\beta$ , low $\alpha$ regions, indicating the necessity of upweighting LRL competence.

Empirical Evaluation and Contradictory Findings

TriMix is benchmarked across four open-source LLM series (Qwen2.5, Llama2, Llama3.2, Gemma3) and eight LRLs including Tibetan, Uyghur, Kazakh, and Mongolian (using MC $^2$ and MiLiC-Eval), as well as Indian languages (Belebele, SIB-200). Key findings include:

TriMix consistently yields performance improvements vs. the strongest single-model baselines and outperforms Proxy Tuning by substantial margins in every configuration.
In Qwen2.5 experiments, fusing a 1.5B CPT model with a 14B instruction-tuned model achieves a relative improvement of approximately 5% over the 14B baseline.
TriMix (PPL) delivers average relative improvements of +15.2% (Llama2) and +5.3% (Gemma3) over best single models, confirming generalizability.
The theoretically optimal fusion region is in high $\beta$ , low $\alpha$ parameter space; strong weighting of the LRL-adapted CPT model runs counter to the prior "large-model-dominant" fusion assumption (Proxy Tuning).
In contrast to other domains (e.g., code), where Proxy Tuning's equal weighting is justified by equal model divergence from the base, in LRLs instruction-tuned models exhibit large divergence, requiring strong upweighting of CPT for language-specific competence.
Figure 4: TriMix (PPL) achieves consistent performance gains across combinations of small and large model sizes, with scalable adaptation.

Mechanistic and Theoretical Insights

KL divergence analysis demonstrates asymmetric specialization: instruction-tuned models diverge sharply from the base, while LRL CPT models remain close—this discrepancy undermines large-model-dominant fusion, explaining the failure of Proxy Tuning in LRL scenarios. TriMix’s adaptive approach grounds fusion in empirical divergence, producing superior outcomes. Ablation studies confirm the primacy of language modeling in overall performance in LRLs, with scaling effects providing additive gains.

Practical Implications and Applications

TriMix eliminates the need for task data annotation and high-cost large model CPT, making LRL adaptation attainable for resource-constrained environments. The framework's test-time nature and modular logit fusion are compatible with a wide range of open-source LLMs, requiring only access to output token distributions. TriMix is robust to incomplete CPT—partially trained models still benefit from fusion, acting as performance multipliers given limited training budgets.

The method is not directly applicable to closed-source or API-based LLMs lacking logit access; inference-time fusion introduces latency and memory overhead but is mitigable via quantization and compression techniques. Future directions include vocabulary expansion for script-diverse LRLs, more sophisticated fusion routers, and extension to scenarios involving more than three model sources.

Theoretical Implications and Future Directions

TriMix challenges the prevailing assumption that scaling and instruction-following dominate adaptation in LRLs, demonstrating that language-specific competence must be prioritized via logit fusion. Its empirical divergence-based weighting mechanism offers a principled path forward for cross-domain model collaboration, with implications for both parameter-efficient and test-time adaptation. The observed performance saturation and bottlenecks in languages with low tokenization fertility suggest future research into tokenizer alignment and vocabulary augmentation mechanisms.

The approach advocates for broader investigation into logit-level fusion for multilingual adaptation, opening lines of inquiry into learned fusion weights, reinforcement-based strategy selection, and context-sensitive fusion across disparate language domains.

Conclusion

TriMix advances efficient LRL adaptation by dynamically balancing language, task, and scaling signals at the logit level, requiring only small-model CPT and off-the-shelf instruction-tuned models. It empirically refutes large-model-dominant fusion for LRLs, establishing language-specialized CPT as the primary driver of adaptation. TriMix’s scalable, flexible, and resource-efficient nature opens new horizons for equitable language technology deployment across the linguistic spectrum.

Markdown Report Issue