Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs (2406.16797v2)

Published 24 Jun 2024 in cs.CL and cs.AI

Abstract: Existing methods for adapting LLMs to new tasks are not suited to multi-task adaptation because they modify all the model weights -- causing destructive interference between tasks. The resulting effects, such as catastrophic forgetting of earlier tasks, make it challenging to obtain good performance on multiple tasks at the same time. To mitigate this, we propose Lottery Ticket Adaptation (LoTA), a sparse adaptation method that identifies and optimizes only a sparse subnetwork of the model. We evaluate LoTA on a wide range of challenging tasks such as instruction following, reasoning, math, and summarization. LoTA obtains better performance than full fine-tuning and low-rank adaptation (LoRA), and maintains good performance even after training on other tasks -- thus, avoiding catastrophic forgetting. By extracting and fine-tuning over lottery tickets (or sparse task vectors), LoTA also enables model merging over highly dissimilar tasks. Our code is made publicly available at https://github.com/kiddyboots216/lottery-ticket-adaptation.

Citations (5)

Summary

  • The paper introduces Lottery Ticket Adaptation to isolate sparse subnetworks for task-specific fine-tuning, reducing destructive interference.
  • It employs a three-phase process—mask calibration, mask extraction, and sparse adaptation—to mitigate catastrophic forgetting in multi-task learning.
  • Experimental results show that LoTA outperforms full fine-tuning and LoRA, ensuring robust performance across diverse tasks.

Insights into Lottery Ticket Adaptation: Addressing Destructive Interference in LLMs

The paper, "Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs," presents a novel approach to adapting LLMs for multi-task scenarios without the prevalent issue of destructive interference. This method, termed Lottery Ticket Adaptation (LoTA), leverages the concept of sparse adaptation by identifying a minimally sized subnetwork of the model for fine-tuning, thereby reducing interference and catastrophic forgetting typically seen in conventional adaptation methods.

Context and Motivation

LLMs like GPT-4 and variants in the LLaMA series have demonstrated remarkable capabilities across a diverse range of tasks, from natural language understanding to complex reasoning and coding. However, adapting these models to several tasks simultaneously remains challenging. Traditional methods that adjust all model weights suffer from long-standing problems such as catastrophic forgetting, where adapting to a new task degrades performance on previously learned tasks. Existing solutions like Low-Rank Adaptation (LoRA) attempt to mitigate this but often fall short of full fine-tuning in performance, especially on complex tasks.

The motivation behind LoTA is to capitalize on the observations from the Lottery Ticket Hypothesis, which suggests that within large models, there exist sparse subnetworks capable of achieving comparable performance to the full network when trained in isolation. Adapting this hypothesis to fine-tuning LLMs, the authors propose a technique that not only efficiently tunes models for multiple tasks but also provides robustness against forgetting previously learned tasks.

Methodology

Lottery Ticket Adaptation (LoTA) is a three-phase process:

  1. Mask Calibration: The model first undergoes a brief period of standard fine-tuning on the task at hand to identify the subnetwork crucial for that task. This step produces the task vector: the elementwise difference between the fine-tuned and the pre-trained weights.
  2. Mask Extraction: A sparsity mask is derived from the task vector by retaining its largest-magnitude entries, yielding a sparse subnetwork representation for the task.
  3. Sparse Adaptation: Finally, the model is reset to its pre-trained weights, and the sparse subnetwork is fine-tuned for the task, with the rest of the parameters kept frozen.

This process ensures that the adapted model weights are specific to the task, reducing destructive interference and maintaining task performance even when switching contexts or tasks.
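
To make the three phases concrete, below is a minimal PyTorch sketch on a toy model. The helper names (`extract_mask`, `train_steps`), the SGD settings, and the 90% sparsity level are illustrative assumptions for this summary, not the authors' released implementation (see the repository linked in the abstract for that).

```python
import copy

import torch
import torch.nn as nn


def extract_mask(pretrained, calibrated, sparsity=0.9):
    """Phase 2: keep the top (1 - sparsity) fraction of task-vector entries by magnitude."""
    masks = {}
    for name, p0 in pretrained.items():
        delta = (calibrated[name] - p0).abs()  # task vector magnitude
        k = max(1, int((1 - sparsity) * delta.numel()))
        threshold = torch.topk(delta.flatten(), k).values.min()
        masks[name] = (delta >= threshold).float()
    return masks


def train_steps(model, batches, steps, masks=None, lr=1e-2):
    """Standard fine-tuning loop; with `masks`, gradients outside the subnetwork are zeroed (Phase 3)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for step in range(steps):
        x, y = batches[step % len(batches)]
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        if masks is not None:
            for name, p in model.named_parameters():
                p.grad.mul_(masks[name])  # zero out gradients outside the lottery ticket
        opt.step()


# Toy stand-ins for a pre-trained LLM and a task dataset.
torch.manual_seed(0)
model = nn.Linear(16, 1)
pretrained = copy.deepcopy(model.state_dict())  # deepcopy: state_dict aliases the live weights
batches = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(20)]

train_steps(model, batches, steps=50)                               # Phase 1: mask calibration
masks = extract_mask(pretrained, model.state_dict(), sparsity=0.9)  # Phase 2: mask extraction
model.load_state_dict(pretrained)                                   # reset to pre-trained weights
train_steps(model, batches, steps=200, masks=masks)                 # Phase 3: sparse adaptation
```

Note that because plain SGD without weight decay only moves parameters with nonzero gradients, zeroing gradients outside the mask keeps the frozen weights exactly at their pre-trained values.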

Applications and Results

Experimentally, LoTA demonstrates superior performance across various tasks, including instruction following, reasoning, and math, showing consistently better results than both full fine-tuning and LoRA. Its resistance to catastrophic forgetting was also evident: models adapted with LoTA retained performance on earlier tasks better than models adapted with other methods.

Furthermore, the paper discusses applications in model merging, where models adapted for different tasks can be combined into a single, multitask-capable model more effectively than with previous methods that rely on post-hoc sparsification.
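
As a rough illustration of why sparsity helps merging: when the sparse task vectors of two dissimilar tasks touch (mostly) disjoint coordinates, simply adding both deltas onto the pre-trained weights composes the tasks without overwriting each other. The sketch below uses plain addition as a stand-in for whatever merging rule one prefers; it is not the paper's exact merging procedure.

```python
import copy

import torch
import torch.nn as nn


def merge_sparse_task_vectors(pretrained, adapted_states):
    """Sum each (mostly zero) sparse task vector onto the pre-trained weights."""
    merged = {name: p.clone() for name, p in pretrained.items()}
    for adapted in adapted_states:
        for name, p0 in pretrained.items():
            merged[name] += adapted[name] - p0  # one task's sparse delta
    return merged


# Toy demo: two copies of a base model, each carrying a disjoint sparse update.
base = nn.Linear(4, 2)
pretrained = copy.deepcopy(base.state_dict())
task_a, task_b = copy.deepcopy(base), copy.deepcopy(base)
with torch.no_grad():
    task_a.weight[0, 0] += 0.5  # stand-in for a LoTA update from task A
    task_b.weight[1, 2] -= 0.3  # stand-in for a LoTA update from task B

merged = merge_sparse_task_vectors(pretrained, [task_a.state_dict(), task_b.state_dict()])
base.load_state_dict(merged)  # base now carries both edits, with no overlap to resolve
```

Dense task vectors, by contrast, are nonzero almost everywhere and collide when summed, which is why earlier merging methods had to sparsify them after the fact.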

Implications and Future Directions

The implications of LoTA extend beyond effective multi-task adaptation; it offers a path to efficient model operation where computational and memory resources are constrained. LoTA's sparse training methodology also promises improvements in areas like federated learning, where communication efficiency is paramount, and in model safety, particularly in preserving alignment across updates.

Looking forward, refining the efficiency of the mask calibration phase and further reducing the computational overhead remain essential for scaling LoTA. Additionally, exploring hybrid models that integrate LoTA with other adaptation paradigms may unlock even greater efficiencies and performance.

In conclusion, LoTA presents a compelling case for reevaluating sparse adaptation methodologies in LLMs, promising robust and efficient multi-task learning paradigms set to influence future developments in model training and adaptation strategies.
