Residual Task Vector Quantization (RTVQ)
- RTVQ is a memory-efficient multi-task learning method that decomposes task vectors into a shared base and per-task residuals for precise quantization.
- It employs asymmetric affine quantization to compress the narrow-range differences in task vectors, significantly reducing storage while controlling error.
- Empirical results demonstrate that RTVQ achieves up to 92% storage reduction with negligible or even improved performance across ViT- and ResNet-based benchmarks.
Residual Task Vector Quantization (RTVQ) is a method designed for highly memory-efficient model merging in multi-task learning frameworks. It addresses the scalability limits imposed by storing multiple full fine-tuned checkpoints: task vectors are decomposed and the resulting components are quantized at low bitwidths with high per-bit precision. The technique leverages the statistically narrow range of task vectors to sustain or improve downstream performance while offering substantial reductions in storage requirements (Kim et al., 10 Mar 2025).
1. Formal Definition of Task Vectors and Narrow Range
Given a pre-trained model with parameters $\theta_{\text{pre}}$ and a collection of fine-tuned checkpoints $\{\theta_t\}_{t=1}^{T}$ for tasks $t = 1, \dots, T$, the task vector for task $t$ is defined as

$$\tau_t = \theta_t - \theta_{\text{pre}}.$$

Empirical analysis demonstrates that the dynamic range of $\tau_t$ is approximately an order of magnitude narrower than the range of $\theta_t$ itself (Figure 1, (Kim et al., 10 Mar 2025)). In the standard asymmetric quantization protocol, this property bounds the per-element rounding error

$$\bigl|\tau - Q_b(\tau)\bigr| \le \frac{\max(\tau) - \min(\tau)}{2\,(2^b - 1)},$$

where the smaller range of $\tau_t$ ensures significantly reduced quantization noise at any bitwidth $b$. This holds for all model types explored, including vision transformers (ViT-B/32, ViT-L/14) and convolutional nets (ResNet-50).
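To make the range argument concrete, the following minimal sketch (assuming a round-to-nearest asymmetric affine quantizer and synthetic Gaussian weights, not the paper's code) shows that a tensor with a roughly 10x narrower range incurs proportionally smaller rounding error at the same bitwidth:

```python
import numpy as np

def quantize_asymmetric(x, bits):
    """b-bit asymmetric affine quantization: round-to-nearest on a
    uniform grid spanning [min(x), max(x)], returned dequantized."""
    scale = (x.max() - x.min()) / (2 ** bits - 1)
    return np.round((x - x.min()) / scale) * scale + x.min()

rng = np.random.default_rng(0)
theta = rng.normal(0.0, 1e-1, 10_000)  # stand-in for pre-trained weights
tau = rng.normal(0.0, 1e-2, 10_000)    # task vector: ~10x narrower range

err_theta = np.abs(theta - quantize_asymmetric(theta, 4)).max()
err_tau = np.abs(tau - quantize_asymmetric(tau, 4)).max()
assert err_tau < err_theta  # narrower range -> smaller step -> less error
```

At a fixed bitwidth the maximum error is half the step size, so shrinking the range by an order of magnitude shrinks the worst-case error by the same factor.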
2. RTVQ Algorithmic Workflow
RTVQ quantizes each task vector in two parts: (1) a shared base vector, and (2) a per-task residual ("offset" vector). The process is:
- Compute task-average weight: $\bar{\theta} = \frac{1}{T} \sum_{t=1}^{T} \theta_t$
- Form base vector: $\tau_{\text{base}} = \bar{\theta} - \theta_{\text{pre}}$
- Quantize base vector (to $b_{\text{base}}$ bits): $\hat{\tau}_{\text{base}} = Q_{b_{\text{base}}}(\tau_{\text{base}})$
- Error correction: compute all subsequent quantities from the quantized base $\hat{\tau}_{\text{base}}$, so that its quantization error is absorbed by the offsets
- Per-task offset vector: $\delta_t = \tau_t - \hat{\tau}_{\text{base}}$
- Quantize offsets (to $b_{\text{offset}}$ bits): $\hat{\delta}_t = Q_{b_{\text{offset}}}(\delta_t)$
- Storage and reconstruction: store only $\hat{\tau}_{\text{base}}$ and $\{\hat{\delta}_t\}_{t=1}^{T}$, and reconstruct $\hat{\tau}_t = \hat{\tau}_{\text{base}} + \hat{\delta}_t$ at merge time.
Here, $Q_b(\cdot)$ denotes $b$-bit asymmetric affine quantization.
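The workflow above can be sketched end to end. This is an illustrative NumPy implementation under stated assumptions (synthetic checkpoints, a simple round-to-nearest quantizer, hypothetical function names such as `rtvq_encode`), not the authors' released code:

```python
import numpy as np

def quantize_asymmetric(x, bits):
    # b-bit asymmetric affine quantizer (round-to-nearest over [min, max])
    scale = (x.max() - x.min()) / (2 ** bits - 1)
    return np.round((x - x.min()) / scale) * scale + x.min()

def rtvq_encode(theta_pre, checkpoints, b_base=3, b_offset=2):
    # Task vectors relative to the shared pre-trained backbone
    taus = [theta - theta_pre for theta in checkpoints]
    # Base = task-average weight minus the backbone
    tau_base = np.mean(taus, axis=0)
    tau_base_q = quantize_asymmetric(tau_base, b_base)
    # Error correction: offsets are taken against the *quantized* base,
    # so base quantization error is folded into the offsets
    offsets_q = [quantize_asymmetric(tau - tau_base_q, b_offset) for tau in taus]
    return tau_base_q, offsets_q

def rtvq_reconstruct(tau_base_q, offset_q):
    # Merge-time reconstruction of a single task vector
    return tau_base_q + offset_q

# Demo: 4 synthetic checkpoints sharing a common base direction
rng = np.random.default_rng(1)
theta_pre = rng.normal(0.0, 0.1, 5_000)
base_dir = rng.normal(0.0, 0.01, 5_000)
checkpoints = [theta_pre + base_dir + rng.normal(0.0, 0.001, 5_000)
               for _ in range(4)]

tau_base_q, offsets_q = rtvq_encode(theta_pre, checkpoints)
rtvq_err = np.mean([np.abs((c - theta_pre) - rtvq_reconstruct(tau_base_q, o)).mean()
                    for c, o in zip(checkpoints, offsets_q)])
tvq_err = np.mean([np.abs((c - theta_pre) -
                          quantize_asymmetric(c - theta_pre, 2)).mean()
                   for c in checkpoints])
```

On data of this shape, `rtvq_err` comes out below `tvq_err`, since the offsets occupy a much narrower range than the full task vectors.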
3. Quantization Principles and Mathematical Formulation
The quantization function is given as follows for any tensor $x$:

$$Q_b(x) = s \cdot \operatorname{round}\!\left(\frac{x - \min(x)}{s}\right) + \min(x), \qquad s = \frac{\max(x) - \min(x)}{2^b - 1}.$$

Applied to RTVQ, the base is quantized to $b_{\text{base}}$ bits and each offset $\delta_t$ to $b_{\text{offset}}$ bits. The reconstructed quantized task vector is

$$\hat{\tau}_t = \hat{\tau}_{\text{base}} + \hat{\delta}_t = Q_{b_{\text{base}}}(\tau_{\text{base}}) + Q_{b_{\text{offset}}}(\tau_t - \hat{\tau}_{\text{base}}).$$

The per-tensor scale $s$ and zero-point require negligible additional storage. The algorithm exploits the base's high quantization sensitivity by allocating it more bits $b_{\text{base}}$, while the offsets, whose range is far more compressed, can be quantized with as few as 2 bits with minimal impact.
4. Bitwidth Allocation and Memory Budget Adaptation
The total bit requirement per parameter across $T$ tasks is

$$b_{\text{base}} + T \cdot b_{\text{offset}},$$

i.e. an amortized $b_{\text{base}}/T + b_{\text{offset}}$ bits per task. Given a per-task memory budget $B$, the bit allocation satisfies $b_{\text{base}}/T + b_{\text{offset}} \le B$. Practical selection involves sweeping over $b_{\text{base}}$ and $b_{\text{offset}}$ to balance quantization error and resource constraints, as measured by downstream performance. The design principle is that the higher-variance base encodes global features influencing all tasks, while the low-variance per-task offsets can be compressed more aggressively.
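A back-of-the-envelope helper (hypothetical, not from the paper) makes the amortization explicit; with a 3-bit base and 2-bit offsets over 20 tasks, the per-task cost is 3/20 + 2 = 2.15 bits:

```python
def bits_per_task(b_base, b_offset, num_tasks):
    # One shared b_base-bit base amortized over all tasks,
    # plus one dedicated b_offset-bit offset vector per task
    return b_base / num_tasks + b_offset

budget = 2.2  # hypothetical per-task bit budget B
alloc = bits_per_task(3, 2, 20)  # 3-bit base, 2-bit offsets, 20 tasks
assert alloc <= budget  # 2.15 bits/task fits the budget
```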
5. Quantization Error Analysis
Under affine quantization, the maximum componentwise error for any vector $x$ is

$$\varepsilon_{\max} = \frac{s}{2} = \frac{\max(x) - \min(x)}{2\,(2^b - 1)},$$

and the mean squared ($\ell_2$) error scales with $s^2 / 12$. For RTVQ, the reconstruction error reduces to

$$\tau_t - \hat{\tau}_t = \delta_t - Q_{b_{\text{offset}}}(\delta_t)$$

by linearity, so it is governed by the narrow range of the offsets rather than that of the full task vectors.
by linearity. Empirical results (Fig. 3, (Kim et al., 10 Mar 2025)) show RTVQ reduces overall quantization error per bit compared to direct single-stage Task Vector Quantization (TVQ) at ultra-low bitwidths (e.g., 2 bits). RTVQ’s error reduction becomes more pronounced as memory constraints intensify.
6. Empirical Performance and Storage Reduction
RTVQ demonstrates performance that matches or surpasses full-precision (FP32) and standard TVQ baselines while drastically reducing memory. Key results:
- ViT-B/32, 8 tasks (classification)
- FP32: 9.1 GB (69.2% accuracy)
- TVQ (4 bits): 1.1 GB (69.1%)
- TVQ (2 bits): 62% accuracy
- RTVQ: 0.7 GB (70.2% accuracy, +1.0 points over FP32)
- Scaling to 14/20 tasks (ViT-B/32, ViT-L/14)
- TVQ's degradation shrinks as the bitwidth grows; RTVQ maintains accuracy within 1% of FP32 at only 2.2 bits/task.
- ResNet-50 NYUv2 (dense prediction)
- 4-bit TVQ: Segmentation (mIoU), Depth (RelErr), and Normal (AngErr) within 0.1–0.5% of FP32.
- 2-bit TVQ: Significant drop (normal angular error rises from 30.6° to 36°).
- RTVQ (2+2 bits): Within 2° of FP32.
- Storage Scaling (ViT-L/14, 20 tasks)
- FP32: 22.8 GB
- 4-bit TVQ: 2.9 GB
- 2-bit TVQ: 1.4 GB
- RTVQ (3+2 bits): 1.7 GB (7.5% of FP32)
These results support that memory can be compressed to less than 8% of the original footprint with negligible or even improved merging performance (Kim et al., 10 Mar 2025).
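The roughly 7.5% figure can be sanity-checked with simple arithmetic (a hedged sketch; the small gap between the ideal bit ratio and the reported 7.5% is plausibly metadata such as per-tensor scales and zero-points):

```python
def storage_fraction(b_base, b_offset, num_tasks, fp_bits=32):
    # Ideal fraction of the FP32 multi-checkpoint footprint kept by RTVQ,
    # ignoring per-tensor scale/zero-point metadata
    return (b_base / num_tasks + b_offset) / fp_bits

frac = storage_fraction(3, 2, 20)  # 20-task ViT-L/14 setting: 3+2 bits
# ~6.7% ideal vs. the reported 7.5% (1.7 GB / 22.8 GB) including metadata
```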
7. Practical Implementation Considerations and Hyperparameters
RTVQ requires strict alignment of the pre-trained backbone at both training and merge time; only the task vectors are quantized. The quantization protocol employs per-tensor asymmetric scales and zero-points. The error-correction step, which computes offsets against the quantized base so that the base's quantization error is folded into the offsets, is crucial when $b_{\text{base}}$ is low, mitigating drift in the reconstructions.
Recommended defaults follow the settings reported above (e.g., $b_{\text{base}} = 3$, $b_{\text{offset}} = 2$ in the 20-task ViT-L/14 configuration), with sweeps over $b_{\text{base}}$ and $b_{\text{offset}}$ for specific accuracy/memory trade-offs. The merging frameworks (Task Arithmetic, Ties, EMR, AdaMerging) operate unchanged, substituting quantized task vectors for their full-precision counterparts. For sensitivity tuning, the $\ell_2$ norm of the quantization error, averaged across layers or tasks, is the metric of choice.
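As a final illustration, substituting quantized task vectors into a merging rule such as Task Arithmetic is a one-line change; the sketch below (hypothetical `merge_task_arithmetic` helper, scaling coefficient `lam` assumed) shows the shape of the operation:

```python
import numpy as np

def merge_task_arithmetic(theta_pre, quantized_taus, lam=0.3):
    # Task Arithmetic: add the scaled sum of the (reconstructed, quantized)
    # task vectors back onto the frozen pre-trained backbone
    return theta_pre + lam * np.sum(quantized_taus, axis=0)

# Tiny demo with two toy reconstructed task vectors
theta_pre = np.zeros(4)
taus = [np.ones(4), 2.0 * np.ones(4)]
merged = merge_task_arithmetic(theta_pre, taus, lam=0.5)
```

Other merging frameworks (Ties, EMR, AdaMerging) slot in the same way, since they only consume the reconstructed task vectors.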
RTVQ capitalizes on the inherent statistical structure of task vector spaces, delivering storage reductions of up to 92% without accuracy loss on both classification and dense-prediction benchmarks, and establishes a new benchmark for scalable, memory-efficient model merging (Kim et al., 10 Mar 2025).