
Layerwise Importance Sampled AdamW (LISA)

Updated 1 December 2025
  • The paper introduces LISA, a novel fine-tuning method that selects important layers via importance sampling to achieve on-par or superior performance compared to full-parameter AdamW.
  • It employs selective freezing of layers to dramatically reduce GPU memory usage by updating only a dynamically chosen subset during backpropagation.
  • Empirical evaluations show that LISA converges faster and boosts performance metrics in tasks like MT-Bench, GSM8K, and PubMedQA across various LLMs.

Layerwise Importance Sampled AdamW (LISA) is a memory-efficient fine-tuning strategy for LLMs that leverages the empirical skewness of weight-norm changes across model layers during adaptation. Unlike Low-Rank Adaptation (LoRA), which inserts learnable low-rank adapters at each layer, LISA applies importance sampling to select a small, dynamically changing subset of layers for AdamW optimization, randomly freezing the remainder. This approach maintains or exceeds the fine-tuning performance of both LoRA and full-parameter AdamW, while matching or reducing memory requirements associated with optimizer state and parameter updates (Pan et al., 2024).

1. Background and Motivation

Full-parameter AdamW fine-tuning of LLMs requires substantial GPU memory, as it necessitates storing all gradients, first and second moment buffers per parameter, and activations. For example, a 7B parameter model typically needs at least 60 GB of GPU memory, posing a barrier for researchers without access to large hardware resources.

LoRA significantly reduces trainable parameters by introducing low-rank adapters into each linear layer. However, in many large-scale settings, such as continual pre-training and instruction tuning, LoRA can underperform relative to full fine-tuning, as its parameter search is confined to a low-rank subspace. Detailed examination reveals that LoRA’s layerwise weight-norm changes are highly skewed: only the embedding and head layers exhibit substantial updates, while intermediate self-attention blocks receive minimal changes. By contrast, full-parameter tuning yields more uniform layer updates.

This consistent skewness motivates a strategy that prioritizes “important” layers—those experiencing larger updates—by allocating computational resources preferentially to them while freezing less critical layers. The LISA algorithm operationalizes this insight through stochastic, importance-driven parameter updates.

2. The LISA Algorithm: Core Mechanism

LISA (Layerwise Importance Sampled AdamW) is defined by layerwise importance sampling and selective freezing within the AdamW optimization framework. The algorithm proceeds in the following steps:

  • For a model with $L$ transformer layers (including embedding and head), total iterations $T$, sampling interval $K$, and importance-sampling probabilities $p_1, \dots, p_L$, initialize the parameters and the AdamW moment buffers.
  • Every $K$ steps, sample an “active set” $\mathcal{A}$ of layers according to the fixed distribution $p_\ell$, always including the embedding and head layers. The complement set $\mathcal{M}$ is frozen.
  • Run the forward pass through all layers to compute activations, but during backpropagation set the gradients of frozen layers to zero: $\nabla_{\theta_\ell}\mathcal{L} = 0$ for all $\ell \in \mathcal{M}$.
  • Apply AdamW updates only to active layers. For each $\ell \in \mathcal{A}$:

$$g_\ell^{(t)} \leftarrow \nabla_{\theta_\ell} \mathcal{L}(\theta^{(t)})$$

$$m_\ell^{(t+1)} \leftarrow \beta_1 m_\ell^{(t)} + (1 - \beta_1)\, g_\ell^{(t)}$$

$$v_\ell^{(t+1)} \leftarrow \beta_2 v_\ell^{(t)} + (1 - \beta_2)\, (g_\ell^{(t)})^2$$

$$\hat{g}_\ell^{(t)} \leftarrow m_\ell^{(t+1)} / (1 - \beta_1^{t+1})$$

$$\hat{v}_\ell^{(t)} \leftarrow v_\ell^{(t+1)} / (1 - \beta_2^{t+1})$$

$$\theta_\ell^{(t+1)} \leftarrow \theta_\ell^{(t)} - \eta_t \frac{\hat{g}_\ell^{(t)}}{\sqrt{\hat{v}_\ell^{(t)}} + \epsilon} - \eta_t \lambda\, \theta_\ell^{(t)}$$

  • Frozen layers maintain their current parameter values and moment estimates.

Formally, with active set $\mathcal{A}$ and frozen set $\mathcal{M}$:

$$g_\ell^{(t)} = \begin{cases} \nabla_{\theta_\ell} \mathcal{L}(\theta^{(t)}) & \text{if } \ell \in \mathcal{A} \\ 0 & \text{if } \ell \in \mathcal{M} \end{cases}$$

$$\theta_\ell^{(t+1)} = \begin{cases} \theta_\ell^{(t)} - \eta_t\, \mathrm{AdamW}(g_\ell^{(t)}) & \ell \in \mathcal{A} \\ \theta_\ell^{(t)} & \ell \in \mathcal{M} \end{cases}$$
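The sampling-and-update loop above can be sketched in plain Python. This is a toy illustration, not the authors' implementation: each “layer” is a single scalar parameter, `grad_fn` is a user-supplied gradient oracle, and layers 0 and $L-1$ stand in for the always-active embedding and head.

```python
import math
import random

def adamw_step(theta, g, m, v, t, lr=5e-5, b1=0.9, b2=0.999, eps=1e-6, wd=0.1):
    """One AdamW update for a single scalar parameter (per the equations above)."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** (t + 1))   # bias-corrected first moment
    v_hat = v / (1 - b2 ** (t + 1))   # bias-corrected second moment
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * wd * theta
    return theta, m, v

def lisa_loop(theta, grad_fn, T, K, gamma, seed=0):
    """Toy LISA: re-sample gamma middle layers every K steps; layers 0 and
    L-1 play the role of embedding/head and are always active."""
    rng = random.Random(seed)
    L = len(theta)
    m, v = [0.0] * L, [0.0] * L
    active = set()
    for t in range(T):
        if t % K == 0:  # re-draw the active set every K steps
            active = {0, L - 1, *rng.sample(range(1, L - 1), gamma)}
        for l in range(L):
            if l in active:  # frozen layers keep params AND moments unchanged
                theta[l], m[l], v[l] = adamw_step(theta[l], grad_fn(theta, l),
                                                 m[l], v[l], t)
    return theta
```

For example, minimizing $\sum_\ell \theta_\ell^2$ with `grad_fn = lambda th, l: 2 * th[l]` moves every sampled layer toward zero while untouched layers stay exactly where they were.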

3. Importance Sampling for Layer Selection

A critical component of LISA is the construction of the importance sampling distribution over layers. The theoretical importance $I_\ell$ can be formulated as either the L2 norm of the layer parameters, $I_\ell = \|\theta_\ell\|_2$, or the expected L2 norm of the gradient, $I_\ell = \mathbb{E}[\|\nabla_{\theta_\ell} \mathcal{L}\|_2]$. These are normalized to derive sampling probabilities:

$$p_\ell = \frac{I_\ell}{\sum_{k=1}^L I_k}$$

Empirical investigation reveals that the embedding and head layer norms dominate, so their probabilities are fixed at 1.0. In practice, the remaining probability mass is spread uniformly over the intermediate layers: exactly $\gamma$ non-embedding, non-head layers are selected every $K$ steps, while the embedding and head layers are always active.
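A minimal sketch of both pieces, under the practical rule just described (`layer_probs` and `sample_active_layers` are illustrative names, not from the paper):

```python
import random

def layer_probs(importances):
    """Theoretical distribution: p_l = I_l / sum_k I_k."""
    total = sum(importances)
    return [i / total for i in importances]

def sample_active_layers(num_layers, gamma, rng=random):
    """Practical LISA rule: embedding (index 0) and head (index num_layers-1)
    are always active; exactly gamma intermediate layers are drawn uniformly."""
    middle = rng.sample(range(1, num_layers - 1), gamma)
    return {0, num_layers - 1, *middle}
```

For a 10-layer model with $\gamma = 2$, `sample_active_layers(10, 2)` always returns a 4-element set containing layers 0 and 9 plus two uniformly chosen intermediate layers.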

4. Memory Complexity and Efficiency

Let $D$ denote the number of scalar parameters, $L$ the layer count, and $r$ the LoRA rank per layer. The memory usage patterns of the competing approaches are:

| Method | Parameter Size | Optimizer State | Activations | Adapter Overhead |
|---|---|---|---|---|
| AdamW (full tuning) | $D$ | $2D$ | proportional to $D$ | None |
| LoRA (rank $r$) | $D + 2Dr$ | $2Dr$ (adapter states) | proportional to $D$ | $2Dr$ |
| LISA ($\gamma$ active) | $D$ | $2(\gamma/L)D$ (active layers only) | proportional to $D$ | None |

For $\gamma \ll L$, LISA’s optimizer-state memory is reduced to $2(\gamma/L)D$, significantly less than the $2D$ required by full AdamW and generally smaller than LoRA’s adapter overhead. Empirical results indicate that, with typical settings ($r = 128$–$256$, $\gamma = 2$–$4$), LISA’s peak memory is within 5–10% of LoRA’s (see Table 1 of (Pan et al., 2024)).
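Plugging in illustrative numbers makes the optimizer-state savings concrete. The values below are assumptions for the sketch, not figures from the paper: a 7B-parameter model with 32 transformer layers and fp32 moment buffers (4 bytes per scalar).

```python
def adamw_state_bytes(D, bytes_per_scalar=4):
    """Full AdamW keeps two moment buffers per parameter: 2*D scalars."""
    return 2 * D * bytes_per_scalar

def lisa_state_bytes(D, L, gamma, bytes_per_scalar=4):
    """LISA keeps moments only for the ~(gamma/L) fraction of active layers."""
    return int(2 * (gamma / L) * D * bytes_per_scalar)

# Assumed example: D = 7e9 parameters, L = 32 layers, gamma = 2 active layers.
full = adamw_state_bytes(7_000_000_000)        # 56 GB of moment buffers
lisa = lisa_state_bytes(7_000_000_000, 32, 2)  # 3.5 GB, a 16x reduction
```

The 16x factor here is just $L / \gamma = 32 / 2$; activations and the frozen parameters themselves still occupy their usual memory.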

5. Empirical Evaluation

Experiments evaluate LISA, LoRA, and full-tuning across models including GPT2-Small, TinyLlama (1.1B), Phi-2 (2.7B), Mistral-7B, LLaMA-2-7B, and LLaMA-2-70B. Tasks encompass instruction following (Alpaca GPT-4 finetuning, measured by MT-Bench), mathematics (GSM8K), and medical QA (PubMedQA). Key findings include:

  • MT-Bench (LLaMA-2-7B): LISA ($\gamma=2$, $K=3$) achieves 5.42, surpassing full-tuning (5.18) and LoRA ($r=128$, 4.86), an improvement of +11% over LoRA and +4.6% over full-tuning.
  • MT-Bench (LLaMA-2-70B): LISA ($\gamma=4$, $K=50$) yields 7.05, exceeding full-tuning (6.66) and LoRA (6.52).
  • GSM8K (LLaMA-2-70B): Accuracy improves from 59.4% (LoRA) to 61.1% (LISA).
  • PubMedQA (LLaMA-2-70B): Accuracy increases from 90.8% (LoRA) to 91.6% (LISA).
  • Gains are larger on smaller models: TinyLlama MT-Bench average rises from 2.03 (LoRA) to 2.78 (LISA, +37%); Mistral-7B from 4.71 (LoRA) to 5.23 (LISA, +11%).
  • LISA converges faster, as shown in training loss trajectories, and can outperform full-tuning in aspects sensitive to alignment, such as writing and humanities.

6. Ablation and Sensitivity Analysis

Performance trade-offs are analyzed with respect to the number of active layers ($\gamma$) and the sampling interval ($K$):

  • Increasing $\gamma$ leads to improved MT-Bench scores but higher memory consumption.
  • Reducing $K$ (more frequent re-sampling) accelerates convergence up to an optimal point.
  • Sensitivity to sampling randomness is minimal: across three random seeds, MT-Bench score variance is ≤0.13.

7. Implementation and Practical Considerations

LISA is compatible with any PyTorch-style training loop, either by zeroing the gradients of frozen layers or by toggling .requires_grad on their parameters. Recommended hyperparameters:

  • Learning rate: $5 \times 10^{-5}$ for LISA (and LoRA) on 1B–7B models; $5 \times 10^{-6}$ for full-tuning.
  • Number of active layers: $\gamma = 2$ (7B), $\gamma = 4$ (70B).
  • Sampling interval: $K$ between 3 and 10 is effective; up to $K = 50$ for large-scale runs.
  • AdamW settings: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 1 \times 10^{-6}$, weight decay $= 0.1$.
  • Fixed probabilities for embedding/head layers; uniform allocation for remaining layers.
  • Can be combined with DeepSpeed ZeRO-Offload or quantization techniques (e.g., QLoRA-style 4-bit base-model quantization) for additional memory savings.
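The .requires_grad toggle mentioned above can be sketched as follows. This is a hypothetical helper, not the authors’ code; `SimpleNamespace` stands in for `torch.nn.Parameter` so the sketch stays self-contained, and a real loop would iterate `model.named_parameters()` grouped by layer.

```python
from types import SimpleNamespace

def apply_lisa_freezing(layer_params, active):
    """Enable gradients only for layers in the active set.

    layer_params: list indexed by layer, each entry a list of parameter
    objects exposing a .requires_grad flag (as torch.nn.Parameter does).
    """
    for l, params in enumerate(layer_params):
        for p in params:
            p.requires_grad = (l in active)

# Stand-in parameters for a 6-layer model; layers 0 and 5 mimic embedding/head.
layers = [[SimpleNamespace(requires_grad=True)] for _ in range(6)]
apply_lisa_freezing(layers, active={0, 2, 5})
```

Calling this helper once per sampling interval (every $K$ steps, after drawing a fresh active set) is all the bookkeeping LISA adds to an ordinary training loop.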

A plausible implication is that LISA constitutes a tractable alternative to adapter-based or full-parameter fine-tuning frameworks for LLMs, particularly in GPU-limited environments. By exploiting skewed utility across transformer layers, it achieves improved or on-par downstream performance with materially reduced memory footprint (Pan et al., 2024).

References

  • Pan, R., Liu, X., Diao, S., Pi, R., Zhang, J., Han, C., Zhang, T. (2024). LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning. arXiv:2403.17919.
