
Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch (2311.03099v3)

Published 6 Nov 2023 in cs.CL and cs.LG

Abstract: In this paper, we unveil that LLMs (LMs) can acquire new capabilities by assimilating parameters from homologous models without retraining or GPUs. We first introduce DARE to set most delta parameters (i.e., the disparity between fine-tuned and pre-trained parameters) to zeros without affecting the abilities of Supervised Fine-Tuning (SFT) LMs, which randomly Drops delta parameters with a ratio $p$ And REscales the remaining ones by $1 / (1 - p)$ to approximate the original embeddings. Then, we use DARE as a versatile plug-in to sparsify delta parameters of multiple SFT homologous models for mitigating parameter interference and merge them into a single model by parameter fusing. We experiment with encoder- and decoder-based LMs, showing that: (1) SFT delta parameter value ranges are typically small (within 0.002) with extreme redundancy, and DARE can effortlessly eliminate 90% or even 99% of them; (2) DARE can merge multiple task-specific LMs into one LM with diverse capabilities. Notably, this phenomenon is more pronounced in large-scale LMs, where the merged LM reveals the potential to surpass the performance of any source LM, providing a new discovery. We also utilize DARE to create a merged LM that ranks first among models with 7 billion parameters on the Open LLM Leaderboard.

Citations (169)

Summary

  • The paper demonstrates that DARE resets up to 99% of delta parameters without degrading the performance of SFT language models.
  • It employs a simple rescaling mechanism to compensate for the removed parameters, confirming that the modifications learned during SFT are extremely redundant.
  • The study reveals that merging homologous models via DARE enhances multiple functionalities, notably improving zero-shot math task accuracy from 2.2 to 66.3.

Overview of Delta Parameter Redundancy in LMs

This paper presents a novel method, named DARE (Drop And REscale), that demonstrates the redundancy of delta parameters in Supervised Fine-Tuned (SFT) LMs, covering both encoder-based and decoder-based models such as those built on the Llama 2 architecture. It builds on the insight that while LMs can gain exceptional new capabilities via SFT, the process also introduces highly redundant "delta parameters" – the differences between the fine-tuned and pre-trained parameters that encode the newly learned abilities.

Delta Parameter Reduction with DARE

The core mechanism of DARE is simple yet powerful. First, a proportion $p$ of the delta parameters is randomly reset to zero. Then, the remaining delta parameters are rescaled by $1 / (1 - p)$ to compensate for the reduction. An intriguing finding is that even when up to 99% of delta parameters are discarded, the performance of LMs remains essentially unchanged. This confirms that the parameter changes introduced by SFT are extremely redundant: the vast majority of them are not essential for the new skills the LMs have acquired.
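The drop-and-rescale step can be written in a few lines. The sketch below is illustrative only: it assumes plain PyTorch state dicts keyed by parameter name, and the function name `dare_sparsify` and the toy tensors are hypothetical, not taken from the paper's released code.

```python
import torch

def dare_sparsify(pretrained: dict, finetuned: dict, p: float = 0.9) -> dict:
    """DARE sketch: randomly drop a fraction p of each delta parameter
    (fine-tuned minus pre-trained), rescale the survivors by 1 / (1 - p),
    and add the sparsified delta back onto the pre-trained weights."""
    merged = {}
    for name, base in pretrained.items():
        delta = finetuned[name] - base                            # delta parameters
        keep = torch.bernoulli(torch.full_like(delta, 1.0 - p))   # 1 = keep, 0 = drop
        merged[name] = base + delta * keep / (1.0 - p)            # drop and rescale
    return merged

# Toy usage: random tensors stand in for real model weights.
base = {"w": torch.randn(8, 8)}
sft = {"w": base["w"] + 0.001 * torch.randn(8, 8)}
sparse = dare_sparsify(base, sft, p=0.99)
```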

Enhancing LMs through Model Merging

Building on DARE, merging multiple homologous LMs – models fine-tuned from the same backbone – becomes viable without performance degradation. The experiments demonstrate this by merging models trained for distinct functionalities, such as instruction following and mathematical reasoning. Interestingly, the merged LM not only retains its instruction-following capability but also surpasses the original models on mathematical tasks, jumping from an accuracy of 2.2 to 66.3 in the zero-shot setting.
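One simple way to fuse the DARE-sparsified deltas is to average them and add the result back onto the shared backbone; the paper also pairs DARE with other established merging schemes (e.g., Task Arithmetic), so the averaging rule and helper below are only an illustrative sketch with hypothetical names.

```python
import torch

def merge_homologous(pretrained: dict, finetuned_models: list,
                     p: float = 0.9, scale: float = 1.0) -> dict:
    """Illustrative merge: DARE-sparsify each model's delta, average the
    sparsified deltas, and add the (optionally scaled) average back onto
    the shared pre-trained backbone."""
    merged = {}
    for name, base in pretrained.items():
        deltas = []
        for ft in finetuned_models:
            delta = ft[name] - base
            keep = torch.bernoulli(torch.full_like(delta, 1.0 - p))
            deltas.append(delta * keep / (1.0 - p))
        merged[name] = base + scale * torch.stack(deltas).mean(dim=0)
    return merged

# Toy usage with two hypothetical SFT variants of the same backbone.
base = {"w": torch.randn(8, 8)}
math_lm = {"w": base["w"] + 0.001 * torch.randn(8, 8)}
chat_lm = {"w": base["w"] + 0.001 * torch.randn(8, 8)}
fused = merge_homologous(base, [math_lm, chat_lm], p=0.9)
```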

Insights on Applicability and Limits of DARE

The effectiveness of DARE depends on the value range of the delta parameters: it works best when the ranges are small (typically within 0.002, as noted in the abstract). If the parameters change substantially – for example, after extended continual pre-training – DARE becomes infeasible. Furthermore, when drop-and-rescale is applied to the LM's final fine-tuned parameters rather than to the delta parameters, performance degrades severely, marking an important distinction in how DARE should be applied.
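This last point is easy to get wrong in practice. The toy snippet below (hypothetical names, random tensors standing in for weights) contrasts the two ways of applying drop-and-rescale: to the delta, as DARE prescribes, versus directly to the fine-tuned weights, which the paper reports is destructive.

```python
import torch

def drop_and_rescale(t: torch.Tensor, p: float) -> torch.Tensor:
    """Randomly zero a fraction p of entries and rescale the rest by 1 / (1 - p)."""
    keep = torch.bernoulli(torch.full_like(t, 1.0 - p))
    return t * keep / (1.0 - p)

p = 0.9
base = torch.randn(8, 8)                  # pre-trained weights
sft = base + 0.001 * torch.randn(8, 8)    # fine-tuned weights

# DARE: sparsify only the delta, then add it back to the base weights.
dare_weights = base + drop_and_rescale(sft - base, p)

# Misuse: sparsifying the fine-tuned weights themselves also zeroes out most
# of the pre-trained knowledge, which is why performance collapses.
naive_weights = drop_and_rescale(sft, p)
```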

In summary, this paper uncovers a significant level of redundancy in the delta parameters of SFT LMs and proposes a pragmatic way to exploit it, one that conserves computational resources and combines the strengths of multiple models without additional training. The accompanying open-source code repository enables further exploration and application of these findings. The implications of such delta-parameter sparsification and model-merging strategies are far-reaching, offering both a blueprint for more capable merged models and a better understanding of the parameter dynamics of LMs.
