
Selective Task Arithmetic

Updated 26 June 2026
  • Selective Task Arithmetic (STA) is a suite of methods that leverages weight-space selectivity to merge or edit task-specific networks while preserving task orthogonality.
  • STA employs submodule-wise, parameter-wise, and attention-only strategies to minimize destructive interference and enable training-free multi-task composition.
  • Empirical results show that STA improves accuracy and selective forgetting in vision and language models using techniques like Taylor-based importance and layer-aware reweighting.

Selective Task Arithmetic (STA) refers to a suite of methodologies for merging, editing, or disentangling multiple task-specific networks via weight-space operations that exploit structural, statistical, or semantic selectivity, as opposed to naïve whole-network arithmetic. STA methods aim to enable training-free, scalable, and interference-minimized composition of neural model capabilities by leveraging layer-wise, submodule-wise, or importance-masked approaches. The guiding principle is that careful selection (of submodules, parameters, or compositions) can preserve task orthogonality, mitigate destructive interference, and yield state-of-the-art multi-task and forgetting performance in both vision models and LLMs. Key families of STA include parameter-selective fusion via Taylor-based importance metrics, submodule-linear merges, attention-only fine-tuning, and layer-aware reweighting.

1. Formal Task Arithmetic and Selectivity Principles

Task arithmetic originated as direct additive operations in the parameter space of neural networks. Given a pre-trained checkpoint $\theta_0$ and fine-tuned weights $\theta_t$ on tasks $t = 1, \ldots, T$, define task vectors $\tau_t = \theta_t - \theta_0$. The canonical arithmetic constructs multi-task models as $\theta = \theta_0 + \sum_t \alpha_t \tau_t$, and expects that on input from domain $D_t$, $F(x;\theta)$ reflects only $\tau_t$, not other tasks, a property known as weight disentanglement (Ilharco et al., 2022, Jin et al., 2024).
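The canonical construction can be sketched in a few lines of NumPy. This is a minimal illustration, assuming checkpoints are dictionaries mapping tensor names to arrays; the helper names `task_vector` and `merge` and the toy values are hypothetical and are not taken from the cited works.

```python
import numpy as np

def task_vector(theta_t, theta_0):
    """Task vector tau_t = theta_t - theta_0, computed per named tensor."""
    return {name: theta_t[name] - theta_0[name] for name in theta_0}

def merge(theta_0, task_vectors, alphas):
    """Canonical task arithmetic: theta = theta_0 + sum_t alpha_t * tau_t."""
    merged = {name: w.copy() for name, w in theta_0.items()}
    for tau, alpha in zip(task_vectors, alphas):
        for name in merged:
            merged[name] += alpha * tau[name]
    return merged

# Toy checkpoints (hypothetical values, two named tensors each).
theta_0 = {"attn.w": np.zeros((2, 2)), "mlp.w": np.ones(3)}
theta_a = {"attn.w": np.array([[1.0, 0.0], [0.0, 0.0]]), "mlp.w": np.ones(3)}
theta_b = {"attn.w": np.zeros((2, 2)), "mlp.w": np.array([1.0, 3.0, 1.0])}

taus = [task_vector(theta_a, theta_0), task_vector(theta_b, theta_0)]
theta = merge(theta_0, taus, alphas=[1.0, 0.5])
```

In this toy case the two task vectors touch disjoint tensors, so the merged model carries each update without overlap; real task vectors generally overlap, which is where selectivity becomes relevant.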

STA extends this paradigm by introducing selectivity, i.e., only fusing or editing certain parts of the network or specific parameter subsets. Standard variants include:

  • Submodule-wise selectivity: Add or merge task vectors at the level of layers, attention blocks, or MLPs, with per-submodule weights (e.g., $\theta^i_{\mathrm{merge}} = \theta^i_0 + \sum_t \alpha_t^i (\theta^i_t - \theta^i_0)$) (Dai et al., 15 Apr 2025).
  • Parameter-wise selectivity: Fuse only parameters with demonstrated loss-sensitive importance as determined by first-order Taylor approximations of loss change (Bowen et al., 2024).
  • Attention-module selectivity: Fine-tune solely the attention projection matrices (Q, K, V, O) and freeze the remainder to achieve quasi-linear and orthogonal task updates (Jin et al., 2024).
  • Layer-aware selectivity: Assign weights to layerwise task vector components according to their alignment with instruction-following vs. task-specific knowledge (Chen et al., 27 Feb 2025).
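As a rough sketch of the submodule-wise variant, the per-submodule formula can be applied with one coefficient per task and per named submodule. The function name `merge_submodulewise`, the dictionary layout, and the convention that a missing coefficient means "leave this submodule out" are assumptions made for illustration only.

```python
import numpy as np

def merge_submodulewise(theta_0, finetuned, alphas):
    """theta_merge^i = theta_0^i + sum_t alphas[t][i] * (theta_t^i - theta_0^i)."""
    merged = {}
    for name, base in theta_0.items():
        update = np.zeros_like(base)
        for theta_t, alpha_t in zip(finetuned, alphas):
            # A missing coefficient defaults to 0.0, so that submodule is not merged.
            update += alpha_t.get(name, 0.0) * (theta_t[name] - base)
        merged[name] = base + update
    return merged

theta_0 = {"attn": np.zeros(2), "mlp": np.zeros(2)}
theta_1 = {"attn": np.array([1.0, 1.0]), "mlp": np.array([4.0, 4.0])}
theta_2 = {"attn": np.array([2.0, 0.0]), "mlp": np.array([0.0, 2.0])}

# Task 1 contributes only its attention update; task 2 contributes both,
# with a smaller weight on the MLP.
alphas = [{"attn": 1.0}, {"attn": 0.5, "mlp": 0.25}]
merged = merge_submodulewise(theta_0, [theta_1, theta_2], alphas)
```

Setting every coefficient outside the attention projections to zero recovers an attention-only merge as a special case of the same routine.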

Selectivity is operationalized through metrics such as cosine similarity between parameter vectors, projection distances in feature space, and importance-thresholded masking.
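One of these metrics, cosine similarity between parameter vectors, is easy to compute per layer. The sketch below assumes task vectors stored as name-to-array dictionaries; `layerwise_cosine` and the example vectors are hypothetical, and how a given method turns these similarities into layer weights is method-specific and not shown here.

```python
import numpy as np

def layerwise_cosine(tau_a, tau_b, eps=1e-12):
    """Cosine similarity between two task vectors, computed per named layer."""
    sims = {}
    for name in tau_a:
        a, b = tau_a[name].ravel(), tau_b[name].ravel()
        sims[name] = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
    return sims

# Hypothetical instruction-tuning and task-tuning updates for two layers.
tau_instr = {"layer0": np.array([1.0, 0.0]), "layer1": np.array([0.0, 2.0])}
tau_task = {"layer0": np.array([3.0, 0.0]), "layer1": np.array([1.0, 0.0])}

sims = layerwise_cosine(tau_instr, tau_task)
```

Here `layer0` is fully aligned between the two updates while `layer1` is orthogonal, the kind of contrast a layer-aware scheme could use to decide which layers to amplify or attenuate.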

2. Submodule and Parameter Importance Metrics

Beyond global weight arithmetic, STA methods employ importance metrics to identify and mask parameters or submodules that are critical for specific tasks:

  • Loss-sensitive importance: For each parameter $\theta_i$, compute a first-order Taylor estimate $s_i$ of the loss change attributable to that parameter and aggregate it across validation samples (Bowen et al., 2024). A high $s_i$ indicates importance for that task.
  • Non-linearity and projection distance: Submodules are assessed for linearity by comparing interpolation in weight space to output space behavior. A low non-linearity score or projection distance signals suitability for independent merging (Dai et al., 15 Apr 2025).
  • Layerwise instruction alignment: In models with mixed instruction-following and task-specific adaptation (e.g., LLMs), cosine similarity between layerwise updates induced by instruction tuning and further task tuning guides selective amplification or attenuation (Chen et al., 27 Feb 2025).

Sparsification via importance masking (e.g., keeping only the top quantile of parameters ranked by importance score) reduces noise and interference upon merging.
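A minimal sketch of importance masking follows. It assumes the score is the sample-averaged magnitude of the gradient times the task-vector entry, which is one common first-order Taylor form; the exact score used by Bowen et al. may differ. The gradients here are hypothetical numbers rather than the output of a real model.

```python
import numpy as np

def taylor_importance(per_sample_grads, tau):
    """Mean over samples of |g_i * tau_i|: a first-order estimate of the loss
    change caused by dropping coordinate i of the task vector (assumed form)."""
    return np.abs(per_sample_grads * tau).mean(axis=0)

def top_fraction_mask(scores, keep):
    """Boolean mask keeping the `keep` fraction of entries with the highest
    score. Ties at the threshold are all kept."""
    k = max(1, int(round(keep * scores.size)))
    threshold = np.sort(scores.ravel())[-k]
    return scores >= threshold

# Hypothetical per-sample gradients: 3 validation samples, 4 parameters.
grads = np.array([[0.1, -2.0, 0.0, 0.5],
                  [0.2, -1.0, 0.0, 0.5],
                  [0.0, -3.0, 0.1, 0.5]])
tau = np.array([1.0, 0.5, 2.0, -1.0])

scores = taylor_importance(grads, tau)
mask = top_fraction_mask(scores, keep=0.5)
sparse_tau = np.where(mask, tau, 0.0)
```

Note that the third coordinate has the largest update in `tau` but a near-zero gradient, so it is masked out: magnitude alone would have kept it, while a loss-sensitive score does not.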

3. Selective Merging Algorithms and Closed-Form Solutions

STA frameworks provide efficient, training-free merging:

  • Submodule-level closed-form: For each submodule $i$, optimal merging weights $\alpha_t^i$ are computed by solving a least-squares linear regression in feature space derived from forward passes over task data (Dai et al., 15 Apr 2025). This approach leverages the observed near-linearity of submodule transformations to merge without retraining.
  • Parameter-wise sparsity: Task vectors $\tau_t$ are masked to obtain sparse vectors $\hat{\tau}_t$, with only parameters surpassing the per-layer, per-task importance threshold retained. Fused models are then $\theta = \theta_0 + \sum_t \hat{\tau}_t$, eliminating the need for coefficients (Bowen et al., 2024).
  • Greedy and weighted composition: Classical STA also includes iterative greedy selection by validation gain or orthogonality for general task vector selection (Ilharco et al., 2022).

Pseudocode and matrix formulae are detailed in the respective works for replication and adaptation.
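To make the closed-form idea concrete, the sketch below solves a least-squares problem for the merging weights of a single linear submodule $f(x; W) = xW$. The objective (match each task's fine-tuned features on that task's own inputs) is an assumption chosen for illustration and is not claimed to be the exact formulation of Dai et al.; `closed_form_alphas` and the random data are hypothetical.

```python
import numpy as np

def closed_form_alphas(inputs, taus):
    """Solve min_alpha sum_t || X_t (sum_s alpha_s tau_s) - X_t tau_t ||^2
    for one linear submodule, using ordinary least squares."""
    rows, targets = [], []
    for X_t, tau_t in zip(inputs, taus):
        # Column s holds the flattened feature shift tau_s causes on task t's data.
        rows.append(np.stack([(X_t @ tau_s).ravel() for tau_s in taus], axis=1))
        # Target: the feature shift of task t's own fine-tuned update.
        targets.append((X_t @ tau_t).ravel())
    A = np.concatenate(rows, axis=0)
    b = np.concatenate(targets, axis=0)
    alphas, *_ = np.linalg.lstsq(A, b, rcond=None)
    return alphas

rng = np.random.default_rng(0)
d_in, d_out, n = 6, 4, 30  # 30 samples per task, a small illustrative budget
W0 = rng.normal(size=(d_in, d_out))
taus = [0.1 * rng.normal(size=(d_in, d_out)) for _ in range(2)]
inputs = [rng.normal(size=(n, d_in)) for _ in range(2)]

alphas = closed_form_alphas(inputs, taus)
W_merged = W0 + sum(a * tau for a, tau in zip(alphas, taus))
```

Because the problem is linear in the coefficients, no gradient steps are needed; only forward passes to collect features. If each task's inputs excited only its own task vector, the solution would be a weight of one per task, which is the fully disentangled case.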

4. Experimental Evaluations and Empirical Gains

STA methods consistently improve over global task arithmetic and non-selective baselines:

  • Attention-only STA: In ViT models (CLIP ViT-B/32, B/16, L/14), fine-tuning only Q/K/V/O yields multi-task accuracy of 78.4% (absolute) and normalized addition accuracy of 87.4%—respectively 2.4% and 8.4% higher than NTK linearization or classic TA (Jin et al., 2024).
  • Submodule-linear STA (LLMs): Layer- or Attn/MLP-wise merges produce gains up to +2.8–3.3 points (e.g., Llama-2-13B: STA at attn/MLP level, 51.05% vs 48.2% for global TA) and relative gains up to 15% as the number of tasks increases (Dai et al., 15 Apr 2025).
  • Importance-based sparse STA: On ViT-B/16, STA using loss-preservation achieves 82.84% average accuracy in six-task fusion, improving by +0.45 percentage points over previous SOTA PCB-Merging. For forgetting, control accuracy on unrelated tasks increases by +3.62pp compared to naïve subtraction (Bowen et al., 2024).
  • Layer-aware STA (LATA): Layer-wise reweighted merges, e.g. in Gemma-2-9B or Llama-3-8B, yield the highest per-task accuracy and utility (GSM8K accuracy improved by +0.019 over TA, general perplexity maintained), and enable selective forgetting with minimal collateral damage (Chen et al., 27 Feb 2025).

Ablation studies corroborate the importance of selective module choice (e.g., biases, MLPs, or additional linear layers can degrade disentanglement in ViT attention-only STA) and the effectiveness of data-efficient estimation (robust performance with 30 samples per task in submodule analysis).

5. Disentanglement, Interference, and Theoretical Insights

The central motivation for STA is to reduce cross-task interference, that is, to improve weight disentanglement. Analytical results and visualizations confirm:

  • Kernel-like regime: Selective attention-only fine-tuning, as well as layer-wise or submodule-wise merging, keeps parameter changes in a locally linear (“NTK regime”) space, so task vectors are nearly orthogonal and outputs disentangle per domain (Jin et al., 2024, Dai et al., 15 Apr 2025).
  • Representation vs. task head: Tuning the representation module (e.g., attention Q/K/V/O) enhances disentanglement, while adjusting classification heads introduces non-linear, inseparable effects (Jin et al., 2024).
  • Submodule linearity: Empirically, submodules (heads, layers, MLPs) have 2–10$\times$ higher linearity than the total model. Global merges are more likely to violate linear assumptions and induce feature mixing (Dai et al., 15 Apr 2025).
  • Importance-based masking: High-importance parameter selection targets truly task-specific adaptations, preventing the transfer of spurious or irrelevant changes across tasks (Bowen et al., 2024).
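The notion of linearity used in these arguments can be probed numerically. The sketch below compares a module's output at interpolated weights with the interpolation of its endpoint outputs; the score definition, the name `nonlinearity_score`, and the toy layers are assumptions for illustration and may not match the metric used in the cited work.

```python
import numpy as np

def nonlinearity_score(f, theta_0, tau, x, alpha=0.5):
    """Relative gap between the output at interpolated weights and the
    interpolation of the endpoint outputs. Zero means the module is linear
    in its weights along direction tau (at this alpha)."""
    out_interp_weights = f(x, theta_0 + alpha * tau)
    interp_outputs = (1 - alpha) * f(x, theta_0) + alpha * f(x, theta_0 + tau)
    gap = np.linalg.norm(out_interp_weights - interp_outputs)
    scale = np.linalg.norm(f(x, theta_0 + tau) - f(x, theta_0)) + 1e-12
    return gap / scale

def linear_layer(x, W):
    return x @ W

def tanh_layer(x, W):
    return np.tanh(x @ W)

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 5))
W0 = rng.normal(size=(5, 3))
tau = rng.normal(size=(5, 3))

score_linear = nonlinearity_score(linear_layer, W0, tau, x)
score_tanh = nonlinearity_score(tanh_layer, W0, tau, x)
```

The purely linear layer scores zero up to floating-point error, while the `tanh` layer does not; running the same probe on individual submodules versus a whole network is one way to compare their linearity.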

6. Selective Forgetting and Robust Generalization

STA also supports “selective forgetting”—removing task capability while preserving unrelated skills:

  • Negation with masking: Subtracting sparse, importance-masked task vectors enables task forgetting with minimal side effects on other functionalities. For example, in ViT-CLIP, average accuracy drop on unrelated controls was much smaller than with full-vector negation (Bowen et al., 2024).
  • Layer-aware subtraction: LATA reweights only those layers least aligned with instruction-following, allowing subtraction that is strongly task-specific and robust across multiple output languages (Chen et al., 27 Feb 2025).

The masking and reweighting strategies both support reliable multi-task fusion and selective forgetting.
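Masked negation can be sketched as follows. The function `forget`, the `scale` parameter, and the magnitude-based stand-in mask are hypothetical; in the methods discussed above the mask would come from an importance metric, not from raw magnitude.

```python
import numpy as np

def forget(theta_0, tau, mask=None, scale=1.0):
    """theta = theta_0 - scale * (mask * tau).
    With mask=None this reduces to full-vector negation."""
    if mask is None:
        mask = np.ones_like(tau, dtype=bool)
    return theta_0 - scale * np.where(mask, tau, 0.0)

theta_0 = np.array([1.0, 1.0, 1.0, 1.0])
tau = np.array([0.5, 0.01, -0.8, 0.02])

# Stand-in for an importance mask; magnitude is used here only for brevity.
mask = np.abs(tau) > 0.1

theta_full = forget(theta_0, tau)
theta_sparse = forget(theta_0, tau, mask)
```

The sparse variant leaves the two low-importance coordinates at their pre-trained values, which is the mechanism by which masked subtraction limits side effects on unrelated capabilities.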

7. Limitations and Future Research

STA methodologies present several open challenges:

  • Coefficient selection: Most methods employ either shared or per-submodule coefficients; fully data-adaptive or input-conditional strategies remain largely unexplored.
  • Model scale and heterogeneity: All reviewed methods require fixed architectures and shared pre-training seed; merging across heterogeneous initialization or architectures demands new solutions (Ilharco et al., 2022).
  • Bias terms and compression: Effects of tuning/merging bias parameters or low-rank compressions are not fully understood—preliminary results suggest that including biases may degrade disentanglement (Jin et al., 2024).
  • Gradient and memory costs: Importance-metric-based methods require gradient computations and parameter storage, which could be limiting for extremely large models or numerous tasks (Bowen et al., 2024).
  • Analogical and cross-architectural merging: While basic arithmetic supports analogical composition, robust analogical merging under deeper or more heterogeneous task relationships is not yet solved.

A plausible implication is that further theoretical work on the geometry of fine-tuning and task vector compositionality, together with scalable, prediction-time selectivity mechanisms, would be productive avenues.
