- The paper introduces a novel post-training layer scaling approach that prevents catastrophic forgetting while improving model merging performance.
- LiNeS employs a linear increase in scaling factors with layer depth to retain general pre-trained features in shallow layers and refine deeper layers for task specificity.
- Empirical evaluations show enhanced out-of-distribution generalization, improved merging for vision and language models, and better performance on diverse tasks.
LiNeS: Post-training Layer Scaling for Improved Model Performance
The paper introduces LiNeS, a novel post-training technique designed to tackle catastrophic forgetting in large pre-trained models while enhancing their generalization and performance in model merging tasks. The method exploits layer-increasing network scaling (from which the name derives) to selectively scale parameter updates after fine-tuning, preserving broadly generalizing features in shallow layers while keeping task-specific adaptations in deeper layers.
Background and Motivation
Pre-trained models have reshaped the landscape of machine learning, providing the foundation for numerous applications. However, catastrophic forgetting—where models lose previously acquired knowledge after fine-tuning on new tasks—remains a significant issue. Existing approaches, such as regularized fine-tuning or model merging techniques, often require complex modifications or suffer from task interference that degrades performance across tasks. LiNeS addresses these challenges with a simple, effective post-training adjustment that incurs minimal computational overhead.
Methodology
LiNeS applies a layer-wise linear scaling to the task vector, the difference between the fine-tuned and pre-trained weights. Specifically, the scaling factor increases linearly with layer depth, providing a systematic way to retain general pre-trained features in shallow layers while preserving task-specific refinements in deeper layers. The method applies both to single-task scenarios, where it mitigates forgetting, and to multi-task model merging, where it reduces task interference.
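The layer-wise scaling described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the weights are represented as plain per-layer arrays ordered from shallow to deep, and `alpha` and `beta` are hypothetical hyperparameters controlling the scaling at the first and last layers (the paper's exact schedule and parameterization may differ).

```python
import numpy as np

def lines_scale(pretrained, finetuned, alpha=0.0, beta=1.0):
    """Sketch of LiNeS-style layer-wise linear scaling of a task vector.

    pretrained, finetuned: lists of per-layer weight arrays (shallow -> deep).
    The scaling factor grows linearly from `alpha` at the first layer to
    `alpha + beta` at the last, so shallow layers stay close to the
    pre-trained weights while deep layers retain their task-specific update.
    """
    num_layers = len(pretrained)
    edited = []
    for layer, (w_pre, w_ft) in enumerate(zip(pretrained, finetuned)):
        tau = w_ft - w_pre  # task vector for this layer
        lam = alpha + beta * layer / max(num_layers - 1, 1)
        edited.append(w_pre + lam * tau)  # scaled residual update
    return edited
```

With `alpha=0` and `beta=1`, the first layer reverts exactly to the pre-trained weights and the last layer keeps the full fine-tuned update, with intermediate layers interpolated linearly in between.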
Empirical Evaluations
The paper presents comprehensive experiments demonstrating the efficacy of LiNeS across various settings:
- Robust Fine-tuning for OOD Generalization: Applying LiNeS to robust fine-tuning methods like WiSE-FT shows improved performance on out-of-distribution datasets, consistently outperforming non-scaled baselines across several distribution shifts.
- Multi-task Model Merging: In vision tasks with ViT models and NLP tasks with T5 models, LiNeS significantly enhances existing model merging methods by improving both in-distribution and generalization performance across multiple tasks and architectures.
- Single-task Model Merging (Model Soups): LiNeS improves accuracy on ImageNet when merging multiple checkpoints fine-tuned with different hyperparameter configurations, in both the uniform and greedy model soup variants.
- Rewarded LLM Policy Merging: LiNeS further demonstrates its versatility by improving the merging of LLM policies fine-tuned on different rewards, yielding Pareto improvements across the preference space.
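To make the multi-task merging use case concrete, the following sketch combines several fine-tuned models by averaging their task vectors (simple task arithmetic, used here as an illustrative baseline) and then applies the same layer-increasing scaling to the merged update. As before, the per-layer list representation and the `alpha`/`beta` parameters are illustrative assumptions, not the paper's exact merging recipe.

```python
import numpy as np

def lines_merge(pretrained, finetuned_models, alpha=0.5, beta=0.5):
    """Sketch of multi-task merging with layer-wise linear scaling.

    pretrained: list of per-layer weight arrays (shallow -> deep).
    finetuned_models: list of models, each a list of per-layer arrays.
    The averaged multi-task vector is rescaled so shallow layers revert
    toward the pre-trained weights while deep layers keep more of the
    merged task-specific update, reducing cross-task interference.
    """
    num_layers = len(pretrained)
    merged = []
    for layer in range(num_layers):
        # Mean task vector across models at this layer (task arithmetic).
        tau = np.mean(
            [model[layer] - pretrained[layer] for model in finetuned_models],
            axis=0,
        )
        lam = alpha + beta * layer / max(num_layers - 1, 1)
        merged.append(pretrained[layer] + lam * tau)
    return merged
```

The same function structure covers the model-soup setting as well: with checkpoints fine-tuned from one pre-trained model, averaging their task vectors and adding the result back to the pre-trained weights is equivalent to a uniform soup, which the layer-wise scaling then post-processes.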
Implications and Future Directions
LiNeS offers an effective, straightforward post-training solution to mitigate the complexities associated with catastrophic forgetting and task interference. Its scalability and ease of integration with existing methods make it particularly attractive in various applications, from computer vision to LLMs. Future research could explore the extension of LiNeS to other architectures and domains, as well as the integration with advanced learning paradigms such as continual learning or federated learning.
The linear scaling approach, leveraging the inherent properties of neural networks, provides a promising direction for retaining and improving model generalization without substantial computational demands. LiNeS stands out by aligning closely with both theoretical insights into neural network behavior and practical needs for efficient AI deployment.