DeltaLLM: Delta-Based Tuning & Compression

Updated 2 February 2026
  • DeltaLLM is a framework that employs delta-based parameterization and temporal sparsity to enable efficient fine-tuning, compression, and inference of large language models.
  • It achieves memory and parameter savings through methods like Delta-LoRA, layer sharing with low-rank delta modules, and mixed-precision delta quantization.
  • Empirical results show that DeltaLLM methods reduce parameter counts by up to 25%, compress fine-tuning deltas to roughly 1/16 of their original storage, and increase inference sparsity to 57–60% with minimal to no loss in model accuracy.

DeltaLLM is a family of methodologies and frameworks centered on delta-based parameterization and temporal sparsity to enable efficient fine-tuning, compression, and inference of LLMs. Approaches under the DeltaLLM umbrella span parameter-efficient adaptation schemes, post-training architecture restructuring, mixed-precision delta compression, and inference-time temporal sparsification. These techniques leverage the notion of representing, updating, or storing only the "delta"—the difference between weight matrices, layers, or attention states—relative to anchor values or prior timesteps. DeltaLLM methods have demonstrated reductions in storage and memory consumption while preserving, and in some cases enhancing, performance relative to standard baselines.

1. DeltaLLM as Parameter-Efficient Fine-Tuning: Delta-LoRA

Delta-LoRA (Zi et al., 2023) generalizes low-rank adaptation methods by updating pre-trained model weights using the delta between consecutive products of trainable low-rank matrices. Standard LoRA-style adaptation fixes the pre-trained weight $W$ and learns matrices $A \in \mathbb{R}^{c \times r}$, $B \in \mathbb{R}^{r \times d}$ such that $\Delta W = AB$, with $r \ll \min(c, d)$, introducing parameter sparsity and memory savings. However, the restricted update subspace may limit performance.

Delta-LoRA introduces the following update mechanism at optimization step $t$:

$$\Delta^{(t)} = A^{(t+1)}B^{(t+1)} - A^{(t)}B^{(t)},$$

$$W^{(t+1)} = W^{(t)} + \lambda \cdot \left( \frac{\alpha}{r} \right) \cdot \Delta^{(t)},$$

where $\lambda$ is a scaling coefficient and $\alpha$ is the LoRA scaling factor. Only $A$ and $B$ require gradient computation; $W$ is updated in a forward-only manner with negligible additional memory and computation beyond LoRA.
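The update rule can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's exact training setup: the dimensions, learning rate, and the plain-SGD update applied to $A$ and $B$ are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
c, d, r = 8, 8, 2
alpha, lam, lr = 16.0, 0.5, 1e-2

W = rng.normal(size=(c, d))          # frozen pre-trained weight (no gradient state)
A = rng.normal(size=(c, r)) * 0.01   # trainable low-rank factors
B = np.zeros((r, d))                 # LoRA-style zero init so AB starts at 0

def delta_lora_step(W, A, B, grad_A, grad_B):
    """One Delta-LoRA step: update A and B by SGD, then feed the delta of
    consecutive AB products back into W in a forward-only fashion."""
    AB_old = A @ B
    A_new = A - lr * grad_A
    B_new = B - lr * grad_B
    delta = A_new @ B_new - AB_old          # Delta^(t)
    W_new = W + lam * (alpha / r) * delta   # W^(t+1)
    return W_new, A_new, B_new
```

Note that `W` never appears in the optimizer state: it only receives the already-computed product delta, which is why the memory footprint matches plain LoRA.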

Empirical results show Delta-LoRA outperforms LoRA, DyLoRA, and AdaLoRA across benchmarks such as GLUE (RoBERTa) and data-to-text generation (GPT-2), achieving improvements (BLEU +1.24, F1 +0.4–1.3) with identical memory requirements. Performance approaches full fine-tuning, especially when dropout is absent in the low-rank branch, confirming that $\partial L / \partial W = \partial L / \partial (AB)$ holds and underpins the method's correctness (Zi et al., 2023).

2. Post-Training Compression by Low-Rank Deltas: Layer Sharing and Delta Modules

DeltaLLM, as introduced in (Mikaelyan et al., 30 Jan 2025), restructures transformer architectures by parameter sharing and low-rank deltas. Rather than maintaining unique parameters per block, selected "anchor" layers are shared across multiple blocks, with each non-anchor block parameterized as:

$$W^{(\ell+i)} = W_{\text{shared}} + \Delta W^{(\ell+i)}, \quad \Delta W^{(\ell+i)} = U^{(\ell+i)} V^{(\ell+i)T},$$

where $U \in \mathbb{R}^{d_{\text{out}} \times r}$, $V \in \mathbb{R}^{d_{\text{in}} \times r}$, and $r \ll \min(d_{\text{in}}, d_{\text{out}})$.
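The parameter accounting behind this scheme is easy to sketch; the sizes and class names below are hypothetical, chosen only to show why each extra block costs roughly $r(d_{\text{in}} + d_{\text{out}})$ parameters instead of $d_{\text{in}} d_{\text{out}}$:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 16, 2

W_shared = rng.normal(size=(d_out, d_in))  # anchor layer, shared across blocks

class DeltaBlock:
    """Non-anchor block: shared anchor weight plus a rank-r delta U V^T."""
    def __init__(self):
        self.U = rng.normal(size=(d_out, r)) * 0.01
        self.V = rng.normal(size=(d_in, r)) * 0.01

    def weight(self):
        return W_shared + self.U @ self.V.T

    def n_delta_params(self):
        return self.U.size + self.V.size

block = DeltaBlock()
full = d_out * d_in   # parameters of one unique (non-shared) layer
print(block.n_delta_params(), full)  # 64 delta params vs 256 for a full layer
```

Only `U` and `V` are trained; the anchor weight is reused by every block that points at it, which is where the 12–25% parameter reduction comes from at realistic model scales.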

Training employs progressive module replacement (PMR): only the delta modules $(U, V)$ are trained using a schedule that stochastically replaces anchor blocks during training, with the rest of the network frozen. Small training data volumes (30M–40M tokens) suffice, and distillation loss terms (cross-entropy, KL) are optimized solely for the delta modules.

This approach achieves 12–25% parameter reduction on released models (DeltaPhi, DeltaLlama), retaining ≥90% of original model performance on common benchmarks (MMLU-Pro, WinoGrande, ARC-Challenge, HellaSwag, PIQA). In comparative analyses, DeltaLLM outperforms or matches competing compression techniques for equivalent parameter budgets (Mikaelyan et al., 30 Jan 2025).

3. Delta-Compression and Mixed-Precision Quantization

DeltaLLM-style delta-compression techniques, such as Delta-CoMe (Ping et al., 2024), target multi-tenant scenarios by storing and transmitting only the delta between base and fine-tuned model weights. For a base $\theta_b$ and fine-tuned $\theta_a$, only $\Delta = \theta_a - \theta_b$ is materialized. For $N$ fine-tuned models derived from a base of size $M$, storage cost is reduced from $NM$ to $(1+\alpha N)M$, where $\alpha \ll 1$ is the per-model delta compression ratio.
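A quick back-of-envelope for the $(1+\alpha N)M$ cost, with purely illustrative numbers (the 14 GB base size, eight tenants, and $\alpha = 1/16$ are assumptions):

```python
def storage_cost(M, N, alpha):
    """Total storage for N fine-tuned models sharing one base of size M,
    keeping only compressed deltas of relative size alpha each."""
    naive = N * M                 # store every fine-tuned model in full
    shared = (1 + alpha * N) * M  # one base plus N compressed deltas
    return naive, shared

naive, shared = storage_cost(M=14.0, N=8, alpha=1 / 16)
print(naive, shared)  # naive 112.0 GB vs delta-based 21.0 GB
```

The gap widens linearly in $N$: every additional tenant costs a full $M$ under naive storage but only $\alpha M$ under delta storage.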

Delta-CoMe observes a long-tailed spectrum in the singular values of $\Delta W$:

$$\Delta W = U \Sigma V^T, \quad \Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r), \quad \sigma_1 \gg \cdots \gg \sigma_r.$$

A mixed-precision quantization is applied, allocating higher bit-width (e.g., 8 bits) to large singular components and lower bit-width (2–3 bits) to the tail, by partitioning singular vectors into groups based on singular value magnitude:

  • Group 1: $i < 2$, 8 bits
  • Group 2: $2 \leq i < 34$, 3 bits
  • Group 3: $i \geq 34$, 2 bits
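The grouping above can be sketched as follows. The quantizer here is a simple symmetric uniform scheme chosen for illustration, not necessarily the paper's exact quantizer, and the group boundaries mirror the bullet list:

```python
import numpy as np

def quantize(x, bits):
    """Symmetric uniform quantization of an array to the given bit-width."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / levels
    return np.round(x / scale) * scale

def mixed_precision_delta(dW, groups=((0, 2, 8), (2, 34, 3), (34, None, 2))):
    """Quantize the SVD of a weight delta, spending more bits on the
    singular directions with the largest singular values."""
    U, s, Vt = np.linalg.svd(dW, full_matrices=False)
    out = np.zeros_like(dW)
    for lo, hi, bits in groups:
        hi = len(s) if hi is None else min(hi, len(s))
        if lo >= hi:
            continue
        Uq = quantize(U[:, lo:hi], bits)    # quantized left singular vectors
        Vq = quantize(Vt[lo:hi, :], bits)   # quantized right singular vectors
        out += (Uq * s[lo:hi]) @ Vq         # rank-(hi-lo) contribution
    return out

dW = np.random.default_rng(0).normal(size=(64, 64)) * 0.01
approx = mixed_precision_delta(dW)
err = np.linalg.norm(dW - approx) / np.linalg.norm(dW)
```

Unlike plain low-rank truncation, the tail components are kept (at 2 bits) rather than dropped, which is what protects the tasks that depend on the long tail of the spectrum.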

This tailored scheme preserves critical task performance (e.g., math/code tasks), where low-rank truncation or uniform quantization degrades accuracy. Delta-CoMe achieves nearly lossless task performance ($<1\%$ drop at $1/16$ storage cost), outperforming BitDelta and low-rank baselines on Llama-2, Llama-3, and Mistral backbones (Ping et al., 2024).

4. Temporal Sparsity and Efficient Inference on Edge Devices

DeltaLLM also refers to inference-time frameworks for edge deployment that leverage temporal sparsity in attention patterns (Qi et al., 25 Jul 2025). The key insight is that token-wise key vectors $k(t)$ in attention modules change only incrementally between steps; many deltas $\Delta k(t) = k(t) - \hat{k}(t-1)$ are near-zero.

DeltaLLM constructs a sparse $\Delta K$, storing only "large" changes (those with $\|k(t) - \hat{k}(t-1)\|_\infty > \theta$). A context-aware hybrid attention mechanism is employed:

  • Full attention is applied to a local window of recent tokens to preserve local coherence.
  • For older tokens, attention is approximated by accumulating contributions from nonzero $\Delta k$ entries only.
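The thresholded delta construction can be sketched as below; the drift pattern, dimensions, and threshold are illustrative, and the real kernel operates inside the attention computation rather than on a Python list:

```python
import numpy as np

def sparsify_keys(keys, theta=0.25):
    """Keep a running reference k-hat; materialize a key update only when
    the change since the reference exceeds theta in the infinity norm."""
    ref = np.zeros_like(keys[0])
    stored, skipped = [], 0
    for t, k in enumerate(keys):
        delta = k - ref
        if np.abs(delta).max() > theta:
            stored.append((t, delta))  # large change: store the sparse delta
            ref = k                    # reference tracks the last stored key
        else:
            skipped += 1               # small change: treat delta k(t) as zero
    return stored, skipped

keys = [np.full(4, 0.1 * t) for t in range(10)]  # slowly drifting toy keys
stored, skipped = sparsify_keys(keys)
sparsity = skipped / len(keys)  # fraction of steps with no stored update
```

Raising $\theta$ trades accuracy for sparsity: more steps fall under the threshold and contribute nothing to the accumulated attention approximation.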

Effective attention sparsity $S_c$ is increased from 0% to 57–60% with negligible or slightly positive accuracy impact on SQuAD-v2 and other benchmarks for BitNet and Llama3.2-1B. The framework is training-free, requiring only a modification of the attention computation kernel at inference. Gradual threshold tuning or more sophisticated sparsification may further extend its applicability to longer contexts and efficient KV-cache management (Qi et al., 25 Jul 2025).

5. Empirical Evaluations and Benchmark Comparisons

The various DeltaLLM formulations have been quantitatively assessed:

  • Parameter-efficient fine-tuning (Delta-LoRA): Outperforms LoRA, AdaLoRA, DyLoRA on GLUE, E2E NLG, WebNLG, XSum, with up to 1–2 BLEU/F1 gains and no additional runtime memory impact (Zi et al., 2023).
  • Layer-wise delta compression: Achieves 12–25% parameter reduction (e.g., DeltaPhi) with maintained accuracy. Outperforms JointDrop, LaCo, ShortGPT, SliceGPT for equivalent parameter budgets; e.g., DeltaPhi 2.9B (24% smaller) matches the accuracy of recovery fine-tuned SlicedPhi 3.3B with a 12% reduction (Mikaelyan et al., 30 Jan 2025).
  • Mixed-precision delta quantization (Delta-CoMe): Matches full-precision models on math, code, chat, and vision-language tasks, closing the gap between low-rank and one-bit quantized baselines while operating at $1/16$ the storage footprint (Ping et al., 2024).
  • Temporal sparsity for edge: Achieves up to 57% sparsity with $<1.5\%$ average accuracy drop, and sometimes F1 score improvement, in both prefilling and decoding, compatible with BitNet and Llama models (Qi et al., 25 Jul 2025).

6. Limitations, Integration, and Future Directions

DeltaLLM methods may require hyperparameter tuning (e.g., threshold $\theta$ for temporal sparsity, $\lambda$ scaling for LoRA-feedback) for specific backbone architectures and tasks. In most cases, delta modules necessitate a brief post-training phase, although some formulations (Delta-CoMe, edge inference) are entirely training-free. For block sharing plus deltas, inference-time speedup is not automatic; further kernel fusion or hardware-specific techniques may be required. Combining deltas with orthogonal techniques—quantization, pruning, structured sparsity—can yield further memory or compute savings.

Future work includes dynamic thresholding, adaptive bit-allocation for delta quantization, per-layer tuning of feedback or anchor positions, integration with quantized attention, and extension to vision or multimodal transformers. Hardware accelerators (ASIC/FPGA) are candidates to exploit the fine-grained sparsity induced by DeltaLLM at inference-time.

DeltaLLM approaches draw upon and generalize low-rank adaptation, singular value decomposition, quantization, layer-sharing, and biological inspirations of delta sparsity. Key differentiators are the explicit exploitation of temporal and inter-layer redundancy, principled memory-aware delta construction, and targeted feedback of lightweight updates into full model or attention computations. Across fine-tuning, compression, storage-efficient serving, and edge inference, DeltaLLM provides foundational mechanisms for efficient LLM deployment in diverse resource constraints with competitive accuracy (Zi et al., 2023, Mikaelyan et al., 30 Jan 2025, Ping et al., 2024, Qi et al., 25 Jul 2025).
