- The paper demonstrates that DARE resets up to 99% of delta parameters without degrading the performance of SFT language models.
- It employs a simple yet effective scaling mechanism to compensate for removed parameters, affirming the low-rank nature of learned modifications.
- The study reveals that merging homologous models via DARE enhances multiple functionalities, notably improving zero-shot math task accuracy from 2.2 to 66.3.
Overview of Delta Parameter Redundancy in LMs
This paper presents a method named DARE (Drop And REscale) that demonstrates the redundancy of delta parameters in supervised fine-tuned (SFT) LMs, covering both encoder- and decoder-based models such as those built on the Llama 2 architecture. The key insight is that while SFT endows LMs with valuable new capabilities, it also introduces highly redundant "delta parameters" – the differences between the fine-tuned and pre-trained weights that encode the newly learned abilities.
Delta Parameter Reduction with DARE
The core mechanism of DARE is simple yet powerful. First, a randomly chosen proportion p of the delta parameters is reset to zero; then the surviving delta parameters are rescaled by 1 / (1 - p) to compensate for the removal. Strikingly, even when up to 99% of the delta parameters are discarded, the performance of the SFT LMs remains essentially unchanged. This implicitly confirms that LMs tend to learn low-rank, highly redundant modifications: the vast majority of parameter changes are not essential to the new skills the LMs have acquired.
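For concreteness, a minimal PyTorch-style sketch of this drop-and-rescale step is shown below. It assumes the backbone and the SFT model expose matching floating-point `state_dict` entries; the function name `dare_drop_and_rescale` and its arguments are illustrative, not the paper's reference implementation.

```python
import torch

def dare_drop_and_rescale(base_state, sft_state, drop_rate=0.99):
    """Sketch of DARE: randomly zero out delta parameters, then rescale the
    survivors by 1 / (1 - drop_rate) so the expected delta is preserved.
    Assumes both state_dicts hold matching floating-point weight tensors."""
    merged = {}
    for name, base_w in base_state.items():
        delta = sft_state[name] - base_w                        # delta parameters
        keep_mask = torch.bernoulli(torch.full_like(delta, 1.0 - drop_rate))
        delta = delta * keep_mask / (1.0 - drop_rate)           # drop, then rescale
        merged[name] = base_w + delta                           # re-attach to the backbone
    return merged
```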
Enhancing LMs through Model Merging
Building on DARE, merging multiple homologous LMs – models fine-tuned from the same backbone – becomes viable without performance degradation. Experiments demonstrate this by merging models fine-tuned for distinct capabilities, such as instruction following and mathematical reasoning. Notably, the merged LM retains its instruction-following ability while its zero-shot accuracy on math tasks jumps from 2.2 to 66.3, far beyond what the instruction-tuned model achieves on its own.
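One simple way to realize such a merge, sketched below under the same assumptions as before, is to sparsify each model's deltas with DARE and add the weighted deltas back onto the shared backbone, in the spirit of task arithmetic; the equal default weighting is an illustrative choice, not the paper's exact recipe.

```python
def dare_merge(base_state, sft_states, drop_rate=0.9, weights=None):
    """Sketch: merge homologous SFT models (same backbone) by summing their
    DARE-sparsified deltas onto the shared pre-trained weights."""
    if weights is None:
        weights = [1.0 / len(sft_states)] * len(sft_states)       # equal weighting by default
    merged = {name: w.clone() for name, w in base_state.items()}
    for weight, sft_state in zip(weights, sft_states):
        sparsified = dare_drop_and_rescale(base_state, sft_state, drop_rate)
        for name, base_w in base_state.items():
            merged[name] += weight * (sparsified[name] - base_w)  # add the surviving delta
    return merged
```

Here `sft_states` could hold, for example, an instruction-following model and a math-reasoning model fine-tuned from the same Llama 2 backbone.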
Insights on Delta Parameter Ranges and DARE's Application Boundaries
The effectiveness of DARE depends on the magnitude of the delta parameters: it works best when their absolute values are small (roughly below 0.005). When the parameters are heavily modified – as after continued pre-training, which shifts the weights substantially – DARE is no longer feasible. Furthermore, applying the same drop-and-rescale operation to the final fine-tuned parameters rather than to the delta parameters severely degrades performance, an important distinction in how DARE should be applied.
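A lightweight pre-check along these lines can help decide whether DARE is likely to apply safely; the helper below is a hypothetical utility, with the 0.005 threshold taken from the observation above.

```python
def delta_within_dare_range(base_state, sft_state, threshold=5e-3):
    """Illustrative heuristic: check that the largest absolute delta parameter
    stays within the small range (roughly below 0.005) where DARE is reported
    to work; large deltas, e.g. after continued pre-training, suggest skipping DARE."""
    max_delta = max((sft_state[name] - base_w).abs().max().item()
                    for name, base_w in base_state.items())
    return max_delta <= threshold, max_delta
```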
In summary, this paper uncovers a significant level of redundancy in the delta parameters of SFT LMs and offers a pragmatic way to exploit it: dropping and rescaling delta parameters conserves resources and allows the strengths of multiple models to be combined without any additional training. The accompanying open-source code repository enables further exploration and application of these findings. The implications of such parameter-efficient fine-tuning and model-merging strategies are far-reaching, offering both a blueprint for building more capable models and a better understanding of the parameter dynamics of LMs.