- The paper introduces a novel post-training layer scaling approach that prevents catastrophic forgetting while improving model merging performance.
- LiNeS employs a linear increase in scaling factors with layer depth to retain general pre-trained features in shallow layers and refine deeper layers for task specificity.
- Empirical evaluations show enhanced out-of-distribution generalization, improved merging for vision and language models, and better performance on diverse tasks.
LiNeS: Post-training Layer Scaling for Improved Model Performance
The paper introduces LiNeS, a novel post-training technique designed to tackle catastrophic forgetting in large pre-trained models while enhancing their generalization and performance in model merging tasks. The method exploits layer-increasing network scaling (from which the name derives) to selectively scale parameter updates after fine-tuning, preserving broadly generalizing features in shallow layers while keeping task-specific adaptations in deeper layers.
Background and Motivation
Pre-trained models have reshaped the landscape of machine learning, providing the foundation for numerous applications. However, catastrophic forgetting—where models lose previously acquired knowledge after fine-tuning on new tasks—remains a significant issue. Existing approaches, such as regularized fine-tuning or model merging techniques, often require complex modifications or suffer from task interference that degrades performance across tasks. LiNeS addresses these challenges with a simple, effective post-training adjustment that incurs minimal computational overhead.
Methodology
LiNeS applies a layer-wise linear scaling to the task vector, the difference between the fine-tuned and pre-trained weights. Specifically, the scaling factor increases linearly with layer depth, providing a systematic way to retain general pre-trained features in shallow layers while preserving task-specific refinements in deeper layers. The method applies both to single-task scenarios, where it mitigates forgetting, and to multi-task model merging, where it reduces task interference.
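The layer-wise scaling described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the weights are represented as plain per-layer arrays ordered from shallow to deep, and `alpha` and `beta` are hypothetical hyperparameters controlling the scaling at the first and last layers (the paper's exact schedule and parameterization may differ).

```python
import numpy as np

def lines_scale(pretrained, finetuned, alpha=0.0, beta=1.0):
    """Sketch of LiNeS-style layer-wise linear scaling of a task vector.

    pretrained, finetuned: lists of per-layer weight arrays (shallow -> deep).
    The scaling factor grows linearly from `alpha` at the first layer to
    `alpha + beta` at the last, so shallow layers stay close to the
    pre-trained weights while deep layers retain their task-specific update.
    """
    num_layers = len(pretrained)
    edited = []
    for layer, (w_pre, w_ft) in enumerate(zip(pretrained, finetuned)):
        tau = w_ft - w_pre  # task vector for this layer
        lam = alpha + beta * layer / max(num_layers - 1, 1)
        edited.append(w_pre + lam * tau)  # scaled residual update
    return edited
```

With `alpha=0` and `beta=1`, the first layer reverts exactly to the pre-trained weights and the last layer keeps the full fine-tuned update, with intermediate layers interpolated linearly in between.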
Empirical Evaluations
The paper presents comprehensive experiments demonstrating the efficacy of LiNeS across various settings:
- Robust Fine-tuning for OOD Generalization: Applying LiNeS to robust fine-tuning methods like WiSE-FT shows improved performance on out-of-distribution datasets, consistently outperforming non-scaled baselines across several distribution shifts.
- Multi-task Model Merging: In vision tasks with ViT models and NLP tasks with T5 models, LiNeS significantly enhances existing model merging methods by improving both in-distribution and generalization performance across multiple tasks and architectures.
- Single-task Model Merging (Model Soups): LiNeS improves accuracy on ImageNet when merging multiple checkpoints fine-tuned with different hyperparameter configurations, in both the uniform and greedy model soup variants.
- Rewarded LLM Policy Merging: LiNeS further demonstrates its versatility by improving the merging of LLM policies fine-tuned on different rewards, yielding Pareto improvements across the preference space.
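To make the multi-task merging use case concrete, the following sketch combines several fine-tuned models by averaging their task vectors (simple task arithmetic, used here as an illustrative baseline) and then applies the same layer-increasing scaling to the merged update. As before, the per-layer list representation and the `alpha`/`beta` parameters are illustrative assumptions, not the paper's exact merging recipe.

```python
import numpy as np

def lines_merge(pretrained, finetuned_models, alpha=0.5, beta=0.5):
    """Sketch of multi-task merging with layer-wise linear scaling.

    pretrained: list of per-layer weight arrays (shallow -> deep).
    finetuned_models: list of models, each a list of per-layer arrays.
    The averaged multi-task vector is rescaled so shallow layers revert
    toward the pre-trained weights while deep layers keep more of the
    merged task-specific update, reducing cross-task interference.
    """
    num_layers = len(pretrained)
    merged = []
    for layer in range(num_layers):
        # Mean task vector across models at this layer (task arithmetic).
        tau = np.mean(
            [model[layer] - pretrained[layer] for model in finetuned_models],
            axis=0,
        )
        lam = alpha + beta * layer / max(num_layers - 1, 1)
        merged.append(pretrained[layer] + lam * tau)
    return merged
```

The same function structure covers the model-soup setting as well: with checkpoints fine-tuned from one pre-trained model, averaging their task vectors and adding the result back to the pre-trained weights is equivalent to a uniform soup, which the layer-wise scaling then post-processes.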
Implications and Future Directions
LiNeS offers an effective, straightforward post-training solution to mitigate the complexities associated with catastrophic forgetting and task interference. Its scalability and ease of integration with existing methods make it particularly attractive in various applications, from computer vision to LLMs. Future research could explore the extension of LiNeS to other architectures and domains, as well as the integration with advanced learning paradigms such as continual learning or federated learning.
The linear scaling approach, leveraging the inherent properties of neural networks, provides a promising direction for retaining and improving model generalization without substantial computational demands. LiNeS stands out by aligning closely with both theoretical insights into neural network behavior and practical needs for efficient AI deployment.