
Quantized Parameter-Efficient Fine-Tuning (QPEFT)

Updated 9 February 2026
  • Quantized Parameter-Efficient Fine-Tuning (QPEFT) is an adaptation technique that combines low-bit quantization with selective fine-tuning to compress and update large language models efficiently.
  • It employs methods such as QLoRA, GSQ-Tuning, and PEQA to update only a fraction of the parameters, drastically reducing memory, compute, and power requirements.
  • QPEFT enables the deployment of massive models on commodity GPUs, NPUs, and CPUs by leveraging group-wise quantization and specialized adapter architectures to maintain near full-precision accuracy.

Quantized Parameter-Efficient Fine-Tuning (QPEFT) is a class of adaptation techniques for LLMs that fuse quantization—compressing weights and/or activations to low precision—with parameter-efficient fine-tuning, wherein only a small fraction of the model’s parameters are updated for downstream tasks. QPEFT aims to enable adaptation of models with tens to hundreds of billions of parameters on resource-constrained hardware (commodity GPUs, edge NPUs, or CPUs) by minimizing the memory, bandwidth, and computational requirements during both training and inference, without significantly sacrificing task accuracy. Major algorithmic frameworks include quantized low-rank adaptation (e.g. QLoRA), group-shared exponent integerization, scale-only adaptation of quantized weights, and structured weak-column editing, among others. QPEFT methods are characterized by aggressive low-bit quantization (to 8/4/3/2 bits), adapter parameter sharing, explicit outlier/channel management, and compatibility with integer or mixed-precision compute pipelines.

1. Mathematical Foundations of QPEFT

The core principle of QPEFT is to minimize train-time model state and update complexity by separating the model weights into a large, frozen, low-precision backbone and a small trainable component. The standard LoRA scheme, for a weight matrix $W_0 \in \mathbb{R}^{d_\mathrm{out} \times d_\mathrm{in}}$, factors the update $\Delta W$ as $BA$ with $A \in \mathbb{R}^{r \times d_\mathrm{in}}$, $B \in \mathbb{R}^{d_\mathrm{out} \times r}$, and $r \ll \min(d_\mathrm{out}, d_\mathrm{in})$, yielding

$$W = W_0 + \frac{\alpha}{r} BA.$$

QPEFT generalizes this by first quantizing the backbone weights, activations, or both, with a group-wise (or per-channel) function $Q(\cdot\,;\cdot)$, e.g.

$$W_q = Q(W_0) \in \{0, \ldots, 2^b - 1\}^{d_\mathrm{out} \times d_\mathrm{in}},$$

stored with a scale $\Delta$ and zero-point $z$ per group. Adapter structures, typically stored in FP16/BF16 or low-bit integer, are composed additively or in specialized spectral bases, and only these are updated. Several frameworks extend this model:

  • GSQ-Tuning employs a Group-Shared Exponent Integer (GSEI) format, partitioning weights into blocks of size $N$ and representing each weight with an integer mantissa under a shared 5-bit block exponent, enabling all-integer forward and backward passes (Zhou et al., 18 Feb 2025).
  • PEQA restricts updates to the per-channel quantization scales $\Delta$, freezing the integer weights $\bar{W}_0$, so that $\hat{W} = (S_0 + \Delta) \cdot \bar{W}_0$, drastically reducing optimizer and storage costs (Kim et al., 2023).
  • QWHA embeds adapters in a Walsh-Hadamard spectral basis, with adaptive parameter allocation to minimize quantization-induced error under a parameter budget (Jeon et al., 22 Sep 2025).
  • QEFT applies structured weak-column updates to quantized blocks, training only a Hessian-selected subset of FP16 columns per layer for maximized loss reduction under a memory constraint (Lee et al., 2024).
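
The quantized-backbone-plus-adapter pattern shared by these frameworks can be sketched as follows. The asymmetric group-wise quantizer and the LoRA initialization ($B = 0$, so the adapter starts as a no-op) are illustrative choices, not any single paper's exact recipe:

```python
import numpy as np

def groupwise_quantize(w, bits=4, group_size=32):
    """Asymmetric group-wise quantization: one scale and zero-point per group."""
    flat = w.reshape(-1, group_size)
    lo = flat.min(axis=1, keepdims=True)
    hi = flat.max(axis=1, keepdims=True)
    scale = np.maximum((hi - lo) / (2**bits - 1), 1e-8)
    zero = np.round(-lo / scale)
    q = np.clip(np.round(flat / scale) + zero, 0, 2**bits - 1)
    return q.astype(np.uint8), scale, zero

def dequantize(q, scale, zero, shape):
    return ((q.astype(np.float32) - zero) * scale).reshape(shape)

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 8, 16.0
W0 = rng.standard_normal((d_out, d_in)).astype(np.float32)

# Frozen 4-bit backbone; only A and B would receive gradients during training.
Wq, scale, zero = groupwise_quantize(W0)
A = 0.01 * rng.standard_normal((r, d_in)).astype(np.float32)
B = np.zeros((d_out, r), dtype=np.float32)   # B = 0: adapter starts as a no-op

def qpeft_forward(x):
    # y = dequant(W_q) x + (alpha / r) * B A x
    return dequantize(Wq, scale, zero, W0.shape) @ x + (alpha / r) * (B @ (A @ x))
```

Only `A` and `B` (here $2 r d$ values versus $d^2$ backbone weights) would be touched by the optimizer; the uint8 codes, scales, and zero-points stay fixed throughout fine-tuning.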

2. Quantization Schemes and Format Engineering

QPEFT implementations utilize diverse quantization strategies to shrink the backbone model’s memory and hardware footprint:

  • Group-Shared Exponent Integerization (GSEI): partitions weight tensors into blocks (e.g., $N = 32$); for each group $g$, computes a maximum exponent $e_g$ and stores mantissas $w_{q,i} = \mathrm{round}(w_i / 2^{e_g})$, yielding $N \times m + 5$ bits per group, where $m$ is the mantissa bitwidth (Zhou et al., 18 Feb 2025).
  • Per-Channel Linear Quantization: maps $W_0$ to $\bar{W}_0 = \mathrm{clamp}(\lfloor W_0 / S_0 \rceil + Z_0,\ 0,\ 2^b - 1) - Z_0$; updates only $S_0$ during fine-tuning, keeping $\bar{W}_0$ fixed (Kim et al., 2023).
  • NF4 and Double Quantization (QLoRA): Uses NormalFloat4 4-bit format with a two-level quantization to minimize storage of scaling factors (Abdullah et al., 14 Oct 2025).
  • Walsh-Hadamard and Fourier Spectral Quantization: Represents adaptation coefficients in WHT bases, enabling full-rank coverage of quantization error with minimal updates (Jeon et al., 22 Sep 2025).
  • Block-structural Hybrid (QEFT): after group-wise quantization, selects $k$ weak columns per layer for FP16 adaptation while the rest are quantized to 4 bits, facilitating efficient kernel mapping and near-rank-$k$ update capability (Lee et al., 2024).

Quantizer hyperparameters are generally selected via grid search, calibration, and, where used, fine-tuning of scale/zero-point for optimal fidelity.
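
A minimal sketch of the shared-exponent idea follows; the exact exponent selection and mantissa scaling in GSQ-Tuning may differ, so treat the scaling convention here as an assumption:

```python
import numpy as np

def gsei_quantize(w, m=6, group_size=32):
    """Shared-exponent integer sketch: one exponent per block of `group_size`
    values, one signed m-bit mantissa per value (N*m + 5 bits per group)."""
    flat = w.reshape(-1, group_size)
    amax = np.abs(flat).max(axis=1, keepdims=True)
    # Shared exponent: smallest power of two covering the block's max magnitude.
    e = np.ceil(np.log2(np.maximum(amax, 1e-12))).astype(np.int32)
    # Scale mantissas into the signed m-bit integer range.
    step = 2.0 ** (e - (m - 1))
    lim = 2**(m - 1) - 1
    q = np.clip(np.round(flat / step), -lim, lim).astype(np.int32)
    return q, e

def gsei_dequantize(q, e, shape, m=6):
    return (q * 2.0 ** (e - (m - 1))).astype(np.float32).reshape(shape)
```

Because every value in a block shares one exponent, multiply-accumulate within a block reduces to pure integer arithmetic on the mantissas, with the exponent applied once per block.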

3. Adapter Architectures and Update Mechanisms

Adapters in QPEFT encompass:

  • LoRA-style Adapters: low-rank matrices $A, B$ updated in FP16/BF16 and added to the quantized, frozen $W_q$ at each forward pass. In GSQ-Tuning, the LoRA adapters themselves are integerized and operated on in GSEI format for full-pipeline integer compatibility (Zhou et al., 18 Feb 2025). QLoRA keeps the adapter weights in full precision, updating only these small matrices on top of the frozen quantized backbone (Abdullah et al., 14 Oct 2025).
  • Scale-Only Adaptation: as in PEQA, only the per-channel quantizer scales ($\Delta$) are updated, resulting in extreme parameter and optimizer-state efficiency (Kim et al., 2023).
  • Weak-Column Editing: QEFT selects and adapts a small block of high-impact columns ("weak columns") in FP16, the rest remain quantized and frozen (Lee et al., 2024).
  • Spectral Adapters (QWHA): Adapters as sparse vectors in a Walsh-Hadamard basis, with support allocated adaptively to recover quantization error across high-energy channels (Jeon et al., 22 Sep 2025).

Optimizers typically operate only on the small number of adapter/scalar parameters (AdamW in low precision); STE or custom backward rules handle quantization-aware updates.
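
As a concrete illustration of scale-only adaptation, the sketch below freezes symmetric per-channel integer weights and exposes only a per-channel scale offset as trainable state. This is an illustrative reading of the PEQA update $\hat{W} = (S_0 + \Delta) \cdot \bar{W}_0$, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, bits = 64, 64, 4
W0 = rng.standard_normal((d_out, d_in)).astype(np.float32)

# Symmetric per-output-channel quantization: W0 ~ S0 * Wbar, Wbar frozen.
lim = 2**(bits - 1) - 1
S0 = np.abs(W0).max(axis=1, keepdims=True) / lim
Wbar = np.clip(np.round(W0 / S0), -lim, lim).astype(np.int8)

# The ONLY trainable tensor: one scale offset per output channel.
delta = np.zeros_like(S0)

def forward(x, delta):
    # W_hat = (S0 + delta) * Wbar  -- the integer weights never change.
    return ((S0 + delta) * Wbar) @ x
```

The optimizer state then covers `d_out` scalars instead of `d_out * d_in` weights, which is where the extreme memory savings of scale-only methods come from.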

4. Hardware Implementation and Integer Compute Pipelines

QPEFT is engineered for hardware efficiency:

  • All-Integer Workflows: GSQ-Tuning demonstrates fully integerized GEMMs, gradients, and parameter updates using GSEI, dequantizing to BF16 only for loss computation or occasional downstream layers. This design enables deployment on integer-only NPUs and yields chip-area and power reductions: e.g., GSE-INT6 engines are 11× smaller and 5× less power-hungry than FP8 engines at similar accuracy (Zhou et al., 18 Feb 2025).
  • Memory and Throughput: QPEFT reduces fine-tuning memory to a fraction of FP16/32 PEFT; e.g., 6–8 bit GSQ-Tuning achieves 45–55% memory savings, and Quaff reduces peak memory by 30% with 1.73× latency reduction on consumer GPUs (Huang et al., 20 May 2025).
  • Specialized Kernels: QEFT's contiguous block weak-column layout and quantized GEMM mapping allow for highly optimized inference/fine-tuning on GPU/TPU/CPU hardware (Lee et al., 2024).

In all cases, non-linear operations (LayerNorm, Softmax) are typically retained in FP16/BF16 as they contribute negligible compute/memory overhead.
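
To make the memory claim concrete, here is a back-of-envelope comparison for a hypothetical 7B-parameter model, counting weights and Adam state only (activations and backbone gradients omitted; all figures illustrative, not measured):

```python
# Rough fine-tuning memory budget, in bytes (illustrative only).
P = 7e9                      # parameters

# Full FP16 fine-tuning: FP16 weights + FP16 grads + FP32 Adam m and v.
full_ft = P * 2 + P * 2 + P * 8

# QPEFT: 4-bit frozen backbone (no grads/optimizer state) + ~0.2% FP16 adapters,
# each adapter parameter carrying its own grad and Adam state.
adapter = 0.002 * P
qpeft = P * 0.5 + adapter * (2 + 2 + 8)

print(f"full FT: {full_ft / 2**30:.0f} GiB, QPEFT: {qpeft / 2**30:.1f} GiB")
```

Under these assumptions the dominant QPEFT cost is the static 4-bit backbone itself; the trainable state shrinks by roughly the adapter fraction, which is why a 7B model fits on a single consumer GPU.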

5. Performance Benchmarks and Comparative Evaluation

QPEFT methods generally match or nearly approach full-precision or classic PEFT accuracy:

| Method | Precision / Format | Accuracy (CSQA, MMLU, etc.) | Memory/Compute Savings |
|---|---|---|---|
| LLaMA-2-7B LoRA | BF16/FP16 | 65.69% (CSQA suite) | — |
| GSQ-Tuning 8/8/8 | All-integer (GSE-INT8) | 65.60% (Δ −0.09) | 45% memory, 5× power ↓ |
| PEQA | 4-bit INT4 + scale updates | matches LoRA, <0.2–0.4 PPL gap @65B | 4× memory ↓ |
| QLoRA | 4-bit NF4 + LoRA | ≤1% drop on Vicuna/MMLU vs FP16 | 2–3× memory ↓ |
| QEFT (k=128) | 4-bit + FP16 weak columns | 60.9% (few-shot avg, 13B) | 75% memory ↓, 2–4× speed |
| Quaff | Outlier-stable INT8 | Δ +0.6% over FP32 | 30% memory, 1.7× speed |
| QWHA | 2–4-bit + WHT adapter | 2–3 ppt gain vs LoRA/DHT at 2–4 bits | 3–6× speed over FT adapters |

Quantization-aware PEFT generally outperforms naïve quantized adaptation (post-PEFT QAT) and maintains accuracy even under very aggressive quantization, especially when advanced adapter architectures and outlier/channel management are applied (Zhou et al., 18 Feb 2025, Jeon et al., 22 Sep 2025, Jeon et al., 2024).

6. Limitations, Best Practices, and Future Research

QPEFT presents a unique set of trade-offs and open questions:

  • Limitations: Most QPEFT schemes target moderate bitwidth (INT8–4). Ultra-low bit settings (e.g., INT2) and >30B parameter models demand further algorithmic and hardware innovations (Huang et al., 20 May 2025, Zhou et al., 18 Feb 2025). Dynamic or task-adaptive quantization (e.g., per-layer/group) remains underexplored.
  • Best Practices: careful group-size selection ($g \approx 32$–$128$), adapter rank ($r \approx 4$–$16$), and weak-column fraction ($k/I \approx 1\%$–$5\%$) optimize the accuracy–efficiency trade-off. Warming up with LoRA-only training before quantizing, explicit outlier treatment, and direct integer-kernel mapping are recommended (Lee et al., 2024, Zhou et al., 18 Feb 2025).
  • Extensions: Recent directions include spectral parameter allocation, joint quantization–PEFT–pruning, quantization-aware regularization, and application to multimodal or MoE LLMs (Abdullah et al., 14 Oct 2025, Jeon et al., 22 Sep 2025).
  • Open Problems: Universal applicability of outlier stability hypotheses across domains, hardware-specific kernel design, and full-integer optimization of non-linear operations are leading directions (Huang et al., 20 May 2025). Quantization-induced privacy leakage and robustness in continual/multi-task QPEFT are also active areas.
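
The hyperparameter recommendations above can be summarized as a hypothetical starting configuration; every key name here is illustrative and does not correspond to any specific library's API:

```python
# Illustrative defaults drawn from the best-practice ranges above.
qpeft_config = {
    "backbone_bits": 4,            # INT8-INT4 is the well-supported regime
    "group_size": 64,              # g in roughly 32-128
    "adapter_rank": 8,             # r in roughly 4-16
    "lora_alpha": 16,
    "weak_column_fraction": 0.02,  # k/I in roughly 1%-5% (QEFT-style editing)
    "outlier_handling": "per_channel_scale",
    "warmup": "lora_only",         # adapt in floating point before quantizing
    "kernels": "integer_gemm",     # map to integer kernels where available
}
```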

QPEFT has emerged as a critical tool for democratizing high-quality LLM adaptation, making it feasible to train and deploy personalized, high-performing models on commodity and edge devices with minimal resources, while preserving close-to full-precision accuracy (Zhou et al., 18 Feb 2025, Abdullah et al., 14 Oct 2025, Jeon et al., 2024, Kim et al., 2023, Huang et al., 20 May 2025, Jeon et al., 22 Sep 2025, Lee et al., 2024).
