Residual Task Vector Quantization (RTVQ)
- RTVQ is a memory-efficient multi-task learning method that decomposes task vectors into a shared base and per-task residuals for precise quantization.
- It employs asymmetric affine quantization to compress the narrow-range differences in task vectors, significantly reducing storage while controlling error.
- Empirical results demonstrate that RTVQ achieves up to 92% storage reduction with negligible or even improved performance across ViT- and ResNet-based benchmarks.
Residual Task Vector Quantization (RTVQ) is a method designed for highly memory-efficient model merging in multi-task learning frameworks. It addresses the scalability limits imposed by storing multiple full fine-tuned checkpoints: task vectors are decomposed and the resulting components are quantized at low bitwidths with high per-bit precision. The technique leverages the statistically narrow range of task vectors to sustain or improve downstream performance while offering substantial reductions in storage requirements (Kim et al., 10 Mar 2025).
1. Formal Definition of Task Vectors and Narrow Range
Given a pre-trained model with parameters $\theta_{\text{pre}}$ and a collection of fine-tuned checkpoints $\{\theta_t\}_{t=1}^{T}$ for tasks $t = 1, \dots, T$, the task vector for task $t$ is defined as

$$\tau_t = \theta_t - \theta_{\text{pre}}.$$

Empirical analysis demonstrates that the dynamic range of $\tau_t$ is approximately an order of magnitude narrower than the range of $\theta_t$ itself (Figure 1, (Kim et al., 10 Mar 2025)). In the standard asymmetric quantization protocol, this property bounds the per-element rounding error

$$\bigl|\tau - Q_b(\tau)\bigr| \le \frac{\max(\tau) - \min(\tau)}{2\,(2^b - 1)},$$

where the smaller range of $\tau_t$ ensures significantly reduced quantization noise at any bitwidth $b$. This holds for all model types explored, including vision transformers (ViT-B/32, ViT-L/14) and convolutional nets (ResNet-50).
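To make the range argument concrete, the following minimal sketch (assuming a round-to-nearest asymmetric affine quantizer and synthetic Gaussian weights, not the paper's code) shows that a tensor with a roughly 10x narrower range incurs proportionally smaller rounding error at the same bitwidth:

```python
import numpy as np

def quantize_asymmetric(x, bits):
    """b-bit asymmetric affine quantization: round-to-nearest on a
    uniform grid spanning [min(x), max(x)], returned dequantized."""
    scale = (x.max() - x.min()) / (2 ** bits - 1)
    return np.round((x - x.min()) / scale) * scale + x.min()

rng = np.random.default_rng(0)
theta = rng.normal(0.0, 1e-1, 10_000)  # stand-in for pre-trained weights
tau = rng.normal(0.0, 1e-2, 10_000)    # task vector: ~10x narrower range

err_theta = np.abs(theta - quantize_asymmetric(theta, 4)).max()
err_tau = np.abs(tau - quantize_asymmetric(tau, 4)).max()
assert err_tau < err_theta  # narrower range -> smaller step -> less error
```

At a fixed bitwidth the maximum error is half the step size, so shrinking the range by an order of magnitude shrinks the worst-case error by the same factor.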
2. RTVQ Algorithmic Workflow
RTVQ quantizes each task vector in two parts: (1) a shared base vector, and (2) a per-task residual ("offset" vector). The process is:
- Compute task-average weight: $\bar{\theta} = \frac{1}{T} \sum_{t=1}^{T} \theta_t$
- Form base vector: $\tau_{\text{base}} = \bar{\theta} - \theta_{\text{pre}}$
- Quantize base vector (to $b_{\text{base}}$ bits): $\hat{\tau}_{\text{base}} = Q_{b_{\text{base}}}(\tau_{\text{base}})$
- Error correction: compute all subsequent quantities from the quantized base $\hat{\tau}_{\text{base}}$, so that its quantization error is absorbed by the offsets
- Per-task offset vector: $\delta_t = \tau_t - \hat{\tau}_{\text{base}}$
- Quantize offsets (to $b_{\text{offset}}$ bits): $\hat{\delta}_t = Q_{b_{\text{offset}}}(\delta_t)$
- Storage and reconstruction: store only $\hat{\tau}_{\text{base}}$ and $\{\hat{\delta}_t\}_{t=1}^{T}$, and reconstruct $\hat{\tau}_t = \hat{\tau}_{\text{base}} + \hat{\delta}_t$ at merge time.
Here, $Q_b(\cdot)$ denotes $b$-bit asymmetric affine quantization.
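The workflow above can be sketched end to end. This is an illustrative NumPy implementation under stated assumptions (synthetic checkpoints, a simple round-to-nearest quantizer, hypothetical function names such as `rtvq_encode`), not the authors' released code:

```python
import numpy as np

def quantize_asymmetric(x, bits):
    # b-bit asymmetric affine quantizer (round-to-nearest over [min, max])
    scale = (x.max() - x.min()) / (2 ** bits - 1)
    return np.round((x - x.min()) / scale) * scale + x.min()

def rtvq_encode(theta_pre, checkpoints, b_base=3, b_offset=2):
    # Task vectors relative to the shared pre-trained backbone
    taus = [theta - theta_pre for theta in checkpoints]
    # Base = task-average weight minus the backbone
    tau_base = np.mean(taus, axis=0)
    tau_base_q = quantize_asymmetric(tau_base, b_base)
    # Error correction: offsets are taken against the *quantized* base,
    # so base quantization error is folded into the offsets
    offsets_q = [quantize_asymmetric(tau - tau_base_q, b_offset) for tau in taus]
    return tau_base_q, offsets_q

def rtvq_reconstruct(tau_base_q, offset_q):
    # Merge-time reconstruction of a single task vector
    return tau_base_q + offset_q

# Demo: 4 synthetic checkpoints sharing a common base direction
rng = np.random.default_rng(1)
theta_pre = rng.normal(0.0, 0.1, 5_000)
base_dir = rng.normal(0.0, 0.01, 5_000)
checkpoints = [theta_pre + base_dir + rng.normal(0.0, 0.001, 5_000)
               for _ in range(4)]

tau_base_q, offsets_q = rtvq_encode(theta_pre, checkpoints)
rtvq_err = np.mean([np.abs((c - theta_pre) - rtvq_reconstruct(tau_base_q, o)).mean()
                    for c, o in zip(checkpoints, offsets_q)])
tvq_err = np.mean([np.abs((c - theta_pre) -
                          quantize_asymmetric(c - theta_pre, 2)).mean()
                   for c in checkpoints])
```

On data of this shape, `rtvq_err` comes out below `tvq_err`, since the offsets occupy a much narrower range than the full task vectors.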
3. Quantization Principles and Mathematical Formulation
The quantization function is given as follows for any tensor $x$:

$$Q_b(x) = s \cdot \operatorname{round}\!\left(\frac{x - \min(x)}{s}\right) + \min(x), \qquad s = \frac{\max(x) - \min(x)}{2^b - 1}.$$

Applied to RTVQ, the base is quantized to $b_{\text{base}}$ bits and each offset $\delta_t$ to $b_{\text{offset}}$ bits. The reconstructed quantized task vector is

$$\hat{\tau}_t = \hat{\tau}_{\text{base}} + \hat{\delta}_t = Q_{b_{\text{base}}}(\tau_{\text{base}}) + Q_{b_{\text{offset}}}(\tau_t - \hat{\tau}_{\text{base}}).$$

The per-tensor scale $s$ and zero-point require negligible additional storage. The algorithm exploits the base's high quantization sensitivity by allocating it more bits $b_{\text{base}}$, while the offsets, whose range is far more compressed, can be quantized with as few as 2 bits with minimal impact.
4. Bitwidth Allocation and Memory Budget Adaptation
The total bit requirement per parameter across $T$ tasks is

$$b_{\text{base}} + T \cdot b_{\text{offset}},$$

i.e. an amortized $b_{\text{base}}/T + b_{\text{offset}}$ bits per task. Given a per-task memory budget $B$, the bit allocation satisfies $b_{\text{base}}/T + b_{\text{offset}} \le B$. Practical selection involves sweeping over $b_{\text{base}}$ and $b_{\text{offset}}$ to balance quantization error and resource constraints, as measured by downstream performance. The design principle is that the higher-variance base encodes global features influencing all tasks, while the low-variance per-task offsets can be compressed more aggressively.
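A back-of-the-envelope helper (hypothetical, not from the paper) makes the amortization explicit; with a 3-bit base and 2-bit offsets over 20 tasks, the per-task cost is 3/20 + 2 = 2.15 bits:

```python
def bits_per_task(b_base, b_offset, num_tasks):
    # One shared b_base-bit base amortized over all tasks,
    # plus one dedicated b_offset-bit offset vector per task
    return b_base / num_tasks + b_offset

budget = 2.2  # hypothetical per-task bit budget B
alloc = bits_per_task(3, 2, 20)  # 3-bit base, 2-bit offsets, 20 tasks
assert alloc <= budget  # 2.15 bits/task fits the budget
```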
5. Quantization Error Analysis
Under affine quantization, the maximum componentwise error for any vector $x$ is

$$\varepsilon_{\max} = \frac{s}{2} = \frac{\max(x) - \min(x)}{2\,(2^b - 1)},$$

and the mean squared ($\ell_2$) error scales with $s^2 / 12$. For RTVQ, the reconstruction error reduces to

$$\tau_t - \hat{\tau}_t = \delta_t - Q_{b_{\text{offset}}}(\delta_t)$$

by linearity, so it is governed by the narrow range of the offsets rather than that of the full task vectors.
by linearity. Empirical results (Fig. 3, (Kim et al., 10 Mar 2025)) show RTVQ reduces overall quantization error per bit compared to direct single-stage Task Vector Quantization (TVQ) at ultra-low bitwidths (e.g., 2 bits). RTVQ’s error reduction becomes more pronounced as memory constraints intensify.
6. Empirical Performance and Storage Reduction
RTVQ demonstrates performance that matches or surpasses full-precision (FP32) and standard TVQ baselines while drastically reducing memory. Key results:
- ViT-B/32, 8 tasks (classification)
- FP32: 9.1 GB (69.2% accuracy)
- TVQ (4 bits): 1.1 GB (69.1%)
- TVQ (2 bits): 62% accuracy
- RTVQ: 0.7 GB (70.2% accuracy, +1.0 points over FP32)
- Scaling to 14/20 tasks (ViT-B/32, ViT-L/14)
- TVQ's degradation shrinks as the bitwidth grows; RTVQ maintains accuracy within 1% of FP32 at only 2.2 bits/task.
- ResNet-50 NYUv2 (dense prediction)
- 4-bit TVQ: Segmentation (mIoU), Depth (RelErr), and Normal (AngErr) within 0.1–0.5% of FP32.
- 2-bit TVQ: Significant drop (normal angular error rises from 30.6° to 36°).
- RTVQ (2+2 bits): Within 2° of FP32.
- Storage Scaling (ViT-L/14, 20 tasks)
- FP32: 22.8 GB
- 4-bit TVQ: 2.9 GB
- 2-bit TVQ: 1.4 GB
- RTVQ (3+2 bits): 1.7 GB (7.5% of FP32)
These results support that memory can be compressed to less than 8% of the original footprint with negligible or even improved merging performance (Kim et al., 10 Mar 2025).
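The roughly 7.5% figure can be sanity-checked with simple arithmetic (a hedged sketch; the small gap between the ideal bit ratio and the reported 7.5% is plausibly metadata such as per-tensor scales and zero-points):

```python
def storage_fraction(b_base, b_offset, num_tasks, fp_bits=32):
    # Ideal fraction of the FP32 multi-checkpoint footprint kept by RTVQ,
    # ignoring per-tensor scale/zero-point metadata
    return (b_base / num_tasks + b_offset) / fp_bits

frac = storage_fraction(3, 2, 20)  # 20-task ViT-L/14 setting: 3+2 bits
# ~6.7% ideal vs. the reported 7.5% (1.7 GB / 22.8 GB) including metadata
```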
7. Practical Implementation Considerations and Hyperparameters
RTVQ requires strict alignment of the pre-trained backbone at both training and merge time; only the task vectors are quantized. The quantization protocol employs per-tensor asymmetric scales and zero-points. The error-correction step, which computes offsets against the quantized base so that the base's quantization error is folded into the offsets, is crucial when $b_{\text{base}}$ is low, mitigating drift in the reconstructions.
Recommended defaults follow the settings reported above (e.g., $b_{\text{base}} = 3$, $b_{\text{offset}} = 2$ in the 20-task ViT-L/14 configuration), with sweeps over $b_{\text{base}}$ and $b_{\text{offset}}$ for specific accuracy/memory trade-offs. The merging frameworks (Task Arithmetic, Ties, EMR, AdaMerging) operate unchanged, substituting quantized task vectors for their full-precision counterparts. For sensitivity tuning, the $\ell_2$ norm of the quantization error, averaged across layers or tasks, is the metric of choice.
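As a final illustration, substituting quantized task vectors into a merging rule such as Task Arithmetic is a one-line change; the sketch below (hypothetical `merge_task_arithmetic` helper, scaling coefficient `lam` assumed) shows the shape of the operation:

```python
import numpy as np

def merge_task_arithmetic(theta_pre, quantized_taus, lam=0.3):
    # Task Arithmetic: add the scaled sum of the (reconstructed, quantized)
    # task vectors back onto the frozen pre-trained backbone
    return theta_pre + lam * np.sum(quantized_taus, axis=0)

# Tiny demo with two toy reconstructed task vectors
theta_pre = np.zeros(4)
taus = [np.ones(4), 2.0 * np.ones(4)]
merged = merge_task_arithmetic(theta_pre, taus, lam=0.5)
```

Other merging frameworks (Ties, EMR, AdaMerging) slot in the same way, since they only consume the reconstructed task vectors.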
RTVQ capitalizes on the inherent statistical structure of task vector spaces, delivering storage reductions of up to 92% without accuracy loss on both classification and dense-prediction benchmarks, and establishes a new benchmark for scalable, memory-efficient model merging (Kim et al., 10 Mar 2025).