DisTaC: Task Vector Conditioning

Updated 5 August 2025
  • The paper introduces DisTaC, a distillation-based pre-conditioning step that aligns task vector norms and boosts source-model confidence before merging.
  • The methodology combines task vector rescaling with knowledge distillation to safely integrate fine-tuned parameters even under irregular training regimes.
  • Empirical results show that DisTaC largely recovers post-merge accuracy losses of up to 14-24% in multi-task merging, with minimal computational overhead.

DisTaC (Distillation for Task vector Conditioning) is a methodology for enhancing the robustness and generality of model merging in multi-task learning through targeted pre-conditioning of task vectors via knowledge distillation. It is specifically designed to address critical failure modes—namely, disparities in task vector norms and low source-model confidence—that frequently compromise the effectiveness of task-vector-based model merging. By introducing a distillation-based pre-conditioning step, DisTaC ensures that task vectors can be safely integrated even under challenging or misaligned fine-tuning regimes, leading to significant improvements in downstream task accuracy and model stability (Yoshida et al., 2 Aug 2025).

1. Model Merging and the Role of Task Vectors

Model merging involves synthesizing a multi-task model by combining the parameter differences, termed "task vectors," obtained from fine-tuning a shared pretrained model on different tasks. Formally, for pretrained parameters \theta_{pre} and task-specific fine-tuned parameters \theta_t, the task vector is \tau_t = \theta_t - \theta_{pre}. Merging proceeds by aggregating several such vectors (typically via summation or weighted combination), yielding a model that encapsulates the behavior of multiple specialized models within a single set of parameters.
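This arithmetic is simple to express in code. The following is a minimal PyTorch sketch of task-vector extraction and additive merging; the toy model, the uniform scaling coefficient alpha, and the helper names are illustrative assumptions rather than the paper's implementation.

```python
# Minimal sketch of task-vector extraction and additive merging.
# Assumes float-only state_dicts with identical keys/shapes; `alpha` is an
# illustrative uniform merging coefficient, not a value from the paper.
import torch


def task_vector(pretrained: dict, finetuned: dict) -> dict:
    """tau_t = theta_t - theta_pre, computed per parameter tensor."""
    return {k: finetuned[k] - pretrained[k] for k in pretrained}


def merge(pretrained: dict, task_vectors: list, alpha: float = 0.3) -> dict:
    """theta_merged = theta_pre + alpha * sum_t tau_t (simple task arithmetic)."""
    merged = {k: v.clone() for k, v in pretrained.items()}
    for tau in task_vectors:
        for k in merged:
            merged[k] += alpha * tau[k]
    return merged


if __name__ == "__main__":
    # Toy two-layer model standing in for a CLIP backbone.
    base = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 4))
    theta_pre = {k: v.detach().clone() for k, v in base.state_dict().items()}
    # Pretend these came from fine-tuning on two different tasks.
    theta_a = {k: v + 0.01 * torch.randn_like(v) for k, v in theta_pre.items()}
    theta_b = {k: v + 0.05 * torch.randn_like(v) for k, v in theta_pre.items()}
    taus = [task_vector(theta_pre, theta_a), task_vector(theta_pre, theta_b)]
    base.load_state_dict(merge(theta_pre, taus))
```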

This paradigm has emerged as a flexible and efficient alternative to re-training a monolithic model from scratch on all tasks or maintaining a large ensemble of task-specific models. It also allows practitioners to incorporate new tasks asynchronously, without revisiting past task data or incurring substantial additional compute. However, despite promising results on favorable benchmarks, merging performance in realistic settings is highly sensitive to inconsistencies and pathologies in the constituent task vectors.

2. Failure Modes in Model Merging: Norm Disparity and Low Confidence

Empirical analysis reveals two key vulnerabilities that undermine model merging via task vectors:

  • Task Vector Norm Disparity: Hyperparameters used during task-specific adaptation, such as the learning rate, number of fine-tuning steps, and weight decay, yield task vectors with substantially different \ell_2 norms. For instance, increasing the learning rate from 10^{-5} to 10^{-4} can inflate a task vector norm by 5-7×. When merged additively, the higher-norm task vectors dominate, reducing the influence of tasks with shorter vectors or effectively erasing their contribution.
  • Low Source-Model Confidence: Regularization techniques such as label smoothing, Mixup, or alternative loss functions (e.g., focal loss) are commonly used to improve calibration, but they produce high-entropy (low-confidence) output distributions. When such models are merged, the elevated entropy propagates, causing substantial post-merge degradation in task performance (sometimes over 20%). Both pathologies can be checked before merging, as in the diagnostic sketch after this list.
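The sketch below is an illustrative diagnostic, not code from the paper: it computes the global \ell_2 norm of each task vector and the mean prediction entropy of each source model. A large norm ratio across tasks, or entropy close to the log of the class count, would suggest that conditioning is needed; the thresholds are assumptions.

```python
# Diagnostic sketch (illustrative): flag the two failure modes before merging
# by (a) comparing l2 norms of task vectors and (b) measuring the mean
# prediction entropy of each source model on a held-out batch.
import torch


def task_vector_norm(tau: dict) -> float:
    """Global l2 norm of a task vector across all parameter tensors."""
    return torch.sqrt(sum((v.float() ** 2).sum() for v in tau.values())).item()


def mean_prediction_entropy(model: torch.nn.Module, x: torch.Tensor) -> float:
    """Average entropy (in nats) of the model's softmax outputs on a batch."""
    with torch.no_grad():
        p = torch.softmax(model(x), dim=-1)
        return -(p * p.clamp_min(1e-12).log()).sum(-1).mean().item()


# Assumed rule of thumb: a >3x norm ratio between tasks, or entropy near
# log(num_classes), indicates norm or confidence conditioning is warranted.
```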

In both scenarios, conventional merging (whether via simple addition or more advanced spectral/consensus schemes) fails to maintain high performance across all tasks and does not recover from these model pathologies (Yoshida et al., 2 Aug 2025).

3. DisTaC Methodology: Distillation-Based Task Vector Conditioning

DisTaC addresses these limitations by integrating a two-component pre-conditioning process based on knowledge distillation, applied to each problematic task vector before merging:

A. Task Vector Norm Conditioning

  • Rescaling: Prior to merging, the task vector \tau_t is rescaled by a factor \kappa_t to align with a target norm (e.g., the mean norm or a fixed value shared across all vectors).
  • Distillation for Recovery: Simply scaling \tau_t degrades individual task performance. DisTaC therefore initializes a student model at \theta_0 = \theta_{pre} + \kappa_t \tau_t and employs knowledge distillation (KD) from the original fine-tuned "teacher" (a sketch of this step follows the list). The distillation process:
    • Uses only unlabeled data (no task labels needed).
    • Minimizes the divergence between the teacher's and student's predictive distributions (KL or temperature-softened KL divergence).
    • Includes an \ell_2 penalty keeping the student parameters close to the rescaled initialization.
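A minimal sketch of this norm-conditioning step is given below, assuming a PyTorch classifier whose forward pass returns logits. The function name condition_norm, the optimizer, the temperature, the \ell_2 weight, and the learning rate are illustrative choices, not the paper's exact recipe.

```python
# Sketch of norm conditioning: rescale the task vector, then distill from the
# original fine-tuned teacher with an l2 proximity penalty. Hyperparameters
# and helper names are assumptions for illustration.
import copy
import torch
import torch.nn.functional as F


def condition_norm(pretrained, finetuned, unlabeled_loader, target_norm: float,
                   steps: int = 500, temp: float = 2.0,
                   l2_weight: float = 1e-3, lr: float = 1e-5):
    theta_pre = {k: v.detach().clone() for k, v in pretrained.state_dict().items()}
    theta_ft = {k: v.detach().clone() for k, v in finetuned.state_dict().items()}

    # Rescale the task vector to the target norm: kappa_t = target / ||tau_t||.
    tau = {k: theta_ft[k] - theta_pre[k] for k in theta_pre}
    tau_norm = torch.sqrt(sum((v.float() ** 2).sum() for v in tau.values()))
    kappa = target_norm / tau_norm.clamp_min(1e-12)

    # Student starts at theta_0 = theta_pre + kappa_t * tau_t; the teacher is
    # the original fine-tuned model.
    student = copy.deepcopy(pretrained)
    init = {k: theta_pre[k] + kappa * tau[k] for k in theta_pre}
    student.load_state_dict(init)
    teacher = finetuned.eval()

    opt = torch.optim.AdamW(student.parameters(), lr=lr)
    batches = iter(unlabeled_loader)
    for _ in range(steps):
        try:
            x = next(batches)
        except StopIteration:
            batches = iter(unlabeled_loader)
            x = next(batches)
        with torch.no_grad():
            t_logits = teacher(x)
        s_logits = student(x)
        # Soft-target KD: KL between teacher and student distributions.
        kd = F.kl_div(F.log_softmax(s_logits / temp, dim=-1),
                      F.softmax(t_logits / temp, dim=-1),
                      reduction="batchmean") * temp * temp
        # l2 penalty keeps the student close to the rescaled initialization.
        prox = sum(((p - init[n]) ** 2).sum()
                   for n, p in student.named_parameters())
        loss = kd + l2_weight * prox
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student  # its task vector now has approximately the target norm
```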

B. Confidence Conditioning

  • Temperature Adjustment: The KD loss uses asymmetric temperature scaling: the teacher's logits are divided by T_{tcr} and the student's by a higher T_{stu}. This pushes the student network toward low-entropy (higher-confidence) outputs.
  • Loss Formulation: The full distillation loss is

    \mathcal{L}_{KD} = (1 - \zeta)\,\mathcal{L}_{CE}(z_{stu}, y) + \zeta \, T_{tcr} T_{stu} \, \mathrm{KL}\big(\sigma(z_{tcr}/T_{tcr}) \,\|\, \sigma(z_{stu}/T_{stu})\big)

    with \zeta = 1, i.e., standard KD using only soft targets (soft logit matching without task labels). A sketch of this loss appears below.
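The soft-target term with asymmetric temperatures can be written compactly. The sketch below assumes \zeta = 1 and illustrative default temperatures with T_{stu} > T_{tcr}; it is not presented as the paper's reference implementation.

```python
# Sketch of the asymmetric-temperature distillation loss with zeta = 1
# (soft targets only). Default temperatures are assumptions for illustration;
# T_stu > T_tcr pushes the student toward lower-entropy (more confident) logits.
import torch
import torch.nn.functional as F


def distac_kd_loss(z_tcr: torch.Tensor, z_stu: torch.Tensor,
                   T_tcr: float = 1.0, T_stu: float = 4.0) -> torch.Tensor:
    """T_tcr * T_stu * KL( softmax(z_tcr / T_tcr) || softmax(z_stu / T_stu) )."""
    teacher_probs = F.softmax(z_tcr / T_tcr, dim=-1)
    student_log_probs = F.log_softmax(z_stu / T_stu, dim=-1)
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    return T_tcr * T_stu * kl


if __name__ == "__main__":
    z_t = torch.randn(16, 10)                      # teacher logits (frozen)
    z_s = torch.randn(16, 10, requires_grad=True)  # student logits
    loss = distac_kd_loss(z_t, z_s)
    loss.backward()                                # gradients flow to the student only
```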

This composite loss pre-conditions each task vector to satisfy both norm alignment and high predictive confidence before merging (Yoshida et al., 2 Aug 2025). The process requires only a few hundred gradient steps per task (e.g., \sim500 steps), making it computationally lightweight.

4. Empirical Performance: Robustness and Recovery

The efficacy of DisTaC is validated on an 8-task vision benchmark using CLIP-based architectures. Three experimental regimes are considered: (1) All tasks fine-tuned under identical settings, (2) Norm-mismatch (one or more tasks fine-tuned with higher learning rates), and (3) Low-confidence (fine-tuning with label smoothing). Key observations include:

  • Norm-mismatch: Standard merging methods incur up to 14% accuracy loss due to norm disparities; DisTaC pre-conditioning restores accuracy to near-original levels.
  • Low-confidence: Fine-tuning with label smoothing causes up to a 24% performance drop post-merge; after DisTaC pre-conditioning, accuracy returns to roughly 92% of the original level (for TSVM merging), recovering most of the lost performance.
  • Necessity of Distillation: Ablation demonstrates that norm rescaling without the subsequent KD fails to repair accuracy, highlighting the importance of distillation.
  • Operational Characteristics: Training curves show DisTaC recovers both norm and confidence quickly without adverse side effects; shrinking long task vectors (rather than stretching short ones) is preferable, preserving the pretrained model’s representation locality.

5. Methodological Implications and Extensions

DisTaC demonstrates that pre-conditioning task vectors is essential for robust model merging—rectifying harmful disparities before integration. This sets a new procedural recommendation for any model merging pipeline utilizing task arithmetic, consensus, or spectral merging techniques.

Further, the KD-based pre-conditioning is model-agnostic and does not require access to task labels or ancillary training data, facilitating seamless integration with existing multi-task learning and federated learning systems.

Additional insights include:

  • Calibration trade-off: DisTaC improves source-model confidence (potentially at the cost of perfect calibration), but merged models can be post-calibrated for application-specific requirements.
  • Applicability to Non-Ideal Regimes: The method is especially beneficial when models are fine-tuned separately, possibly under irregular or adversarial conditions, and are only available as task vectors at merge time.
  • Minimal Overhead: The distillation step incurs a negligible computational footprint compared to full model retraining.

6. Relation to Broader Task Vector Conditioning Literature

DisTaC’s approach to task vector conditioning via knowledge distillation extends the paradigm of task arithmetic and task vector manipulation by ensuring that vector properties material for merging—magnitude and entropy—are aligned prior to aggregation. Several related works provide context:

  • Task arithmetic in audio: Separate domain-specific task vectors for speech and music can be interpolated post-distillation, but require matched initialization and careful balancing of representations (Ritter-Gutierrez et al., 19 May 2025).
  • Feature-space distillation in continual learning: Projected latent distillation and other double-distillation techniques can align representations but do not directly address the vector norm or confidence challenges found in task vector merging (Carta et al., 2023).
  • Spectral approaches: Orthogonal projections (as in V_kD) and explicit cross-task regularization provide generalization across tasks but do not directly mitigate norm/entropy issues in vector arithmetic (Miles et al., 10 Mar 2024, Auty et al., 21 Mar 2024).

DisTaC is unique in targeting merge-specific pathologies, cementing its role as an augmentative step for real-world multi-task and federated learning deployments where source models may be unconstrained or poorly aligned.

7. Future Directions

Potential avenues for advancing DisTaC include:

  • Automated norm and confidence target selection: Determining optimal scaling and entropy targets for diverse collections of tasks or in dynamically evolving systems.
  • Extension to other modalities and architectures: Applying similar pre-conditioning to non-vision domains or to models with richer task vector structures.
  • Calibration-aware merging: Integrating DisTaC with post-hoc calibration or uncertainty quantification modules for applications where predictive confidence is mission-critical.
  • Hybrid approaches: Combining DisTaC pre-conditioning with ensemble or consensus merging frameworks, as well as with continual or federated learning protocols accommodating asynchronous task arrival or loss of training provenance.

DisTaC provides a principled mechanism for robust, scalable, and accurate model merging via explicit task vector pre-conditioning, setting a new standard for multi-task distillation and integration in heterogeneous machine learning ecosystems.