Task Arithmetic in Neural Models
- Task arithmetic is a computational paradigm that encodes learned behaviors as local linear directions in the parameter space, enabling precise model editing.
- It applies linear operations on task vectors—differences between fine-tuned and pre-trained weights—to merge, subtract, or transfer specific skills.
- Advanced methods like layer selection, closed-form coefficient tuning, and trust-region constraints mitigate interference for robust multitask performance.
Task arithmetic is a computational paradigm wherein trained models—especially large neural networks—are edited in parameter space by linear combinations of weight differences (“task vectors”) corresponding to diverse learned capabilities. Initially popularized as a method for modular model editing, multitask merging, and knowledge transfer in foundation models, task arithmetic now spans technical approaches in vision, language, speech, molecular design, and in-context learning settings. Central to task arithmetic is the view that many functional behaviors acquired via fine-tuning are encoded as local linear directions in parameter space, enabling addition, subtraction, and even analogical recombination of task vectors to synthesize models with new, precise behaviors.
1. Mathematical Foundations of Task Arithmetic
Task arithmetic operates on the concept of a task vector: a vector in the parameter space defined as the difference between the weights of a model post-fine-tuning and its pre-trained parameters. For a model pre-trained with parameters $\theta_{\text{pre}}$ and then fine-tuned on task $t$ to weights $\theta_t^{\text{ft}}$, the task vector is

$$\tau_t = \theta_t^{\text{ft}} - \theta_{\text{pre}}.$$
Model editing and merging are performed by shifting $\theta_{\text{pre}}$ along weighted combinations of task vectors:

$$\theta_{\text{new}} = \theta_{\text{pre}} + \sum_{t} \lambda_t \tau_t,$$

where the $\lambda_t$ are scaling coefficients. Negating a task vector (using $\lambda_t < 0$) can reverse the acquisition of a skill or remove an undesired behavior ("task forgetting"), while summing multiple vectors can equip a model with several new capabilities or blend task expertise (Ilharco et al., 2022, Zhou et al., 17 Jun 2024).
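In code, these operations amount to element-wise arithmetic over checkpoint state dictionaries. The following is a minimal PyTorch-style sketch under that assumption; the helper names and the checkpoints in the usage comments are hypothetical, and this is an illustration rather than the implementation of any cited paper.

```python
import torch

def task_vector(theta_pre: dict, theta_ft: dict) -> dict:
    """tau_t = theta_t^ft - theta_pre, computed per parameter tensor."""
    return {k: theta_ft[k] - theta_pre[k] for k in theta_pre}

def apply_task_vectors(theta_pre: dict, task_vectors: list, lambdas: list) -> dict:
    """theta_new = theta_pre + sum_t lambda_t * tau_t (floating-point tensors only)."""
    merged = {}
    for k, v in theta_pre.items():
        if v.is_floating_point():
            merged[k] = v + sum(lam * tau[k] for lam, tau in zip(lambdas, task_vectors))
        else:
            merged[k] = v.clone()  # leave integer buffers (e.g. position ids) untouched
    return merged

# Hypothetical usage: merge two skills and forget a third.
# theta_pre = base_model.state_dict()
# taus = [task_vector(theta_pre, ckpt) for ckpt in (code_ckpt, math_ckpt, toxic_ckpt)]
# base_model.load_state_dict(apply_task_vectors(theta_pre, taus, lambdas=[0.4, 0.4, -0.3]))
```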
The linear structure of task arithmetic enables a range of operations:
| Operation | Formula | Functionality |
|---|---|---|
| Single task edit | $\theta = \theta_{\text{pre}} + \lambda \tau_t$ | Apply or amplify a skill |
| Task forgetting | $\theta = \theta_{\text{pre}} - \lambda \tau_t$ | Remove a skill |
| Task merging | $\theta = \theta_{\text{pre}} + \sum_t \lambda_t \tau_t$ | Multi-task composition |
| Analogical transfer | $\tau_D \approx \tau_C + (\tau_B - \tau_A)$ | Solve the analogy "$A$ is to $B$ as $C$ is to $D$" |
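Of the rows above, analogical transfer is the least self-evident: given task vectors for tasks A, B, and C, a vector for the analogous task D is synthesized as $\tau_C + (\tau_B - \tau_A)$ and applied to the pre-trained weights. A hedged sketch in the same state-dict style; the language-pair checkpoints named in the comments are hypothetical.

```python
import torch

def analogical_task_vector(tau_a: dict, tau_b: dict, tau_c: dict) -> dict:
    """Synthesize tau_D ~ tau_C + (tau_B - tau_A) for the analogy A : B :: C : D."""
    return {k: tau_c[k] + (tau_b[k] - tau_a[k]) for k in tau_c}

def single_edit(theta_pre: dict, tau: dict, lam: float = 1.0) -> dict:
    """First row of the table: theta = theta_pre + lam * tau."""
    return {k: v + lam * tau[k] if v.is_floating_point() else v
            for k, v in theta_pre.items()}

# Hypothetical speech-translation example: with task vectors for En->De (A),
# En->Fr (B), and Es->De (C), a vector for Es->Fr (D) is synthesized without
# any Es->Fr training data:
# tau_es_fr = analogical_task_vector(tau_en_de, tau_en_fr, tau_es_de)
# model.load_state_dict(single_edit(theta_pre, tau_es_fr, lam=0.8))
```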
This framework also applies to submodules (layers, attn/MLP blocks) (Dai et al., 15 Apr 2025), sparse subregions (He et al., 24 Aug 2024), and low-rank adaptation (LoRA) weights (Chitale et al., 2023, Cheng et al., 17 Sep 2024).
2. Linearity, Disentanglement, and Theoretical Guarantees
The effectiveness of task arithmetic is governed by the degree of linearity and disentanglement in the function-parameter mapping. In practice, local linearity holds to an excellent approximation near initialization in large models, supported theoretically by Neural Tangent Kernel (NTK) analyses and observed empirically in submodules where functional changes induced by weight shifts are nearly linear (Ortiz-Jimenez et al., 2023, Dai et al., 15 Apr 2025).
Disentanglement refers to the orthogonality of task vectors: when $\langle \tau_i, \tau_j \rangle \approx 0$ for $i \neq j$, simple addition yields non-interfering composite behaviors. Pre-training typically encourages this property, while fine-tuning in the linearized tangent space further amplifies it. The tangent-space linearization yields

$$f\big(x;\, \theta_{\text{pre}} + \textstyle\sum_t \lambda_t \tau_t\big) \approx f(x;\, \theta_{\text{pre}}) + \sum_t \lambda_t\, \nabla_\theta f(x;\, \theta_{\text{pre}})^{\top} \tau_t,$$

and editing in this regime is mathematically equivalent to kernel regression with the NTK (Ortiz-Jimenez et al., 2023).
Localized NTK eigenfunctions guarantee that adding task vectors with disjoint data/support domains yields predictable, non-interfering functional edits. When task vectors overlap, knowledge conflicts arise, and linear disentanglement becomes crucial (Sun et al., 25 Jan 2025).
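To make the tangent-space view concrete, the linearized prediction above can be evaluated with a single Jacobian-vector product. The sketch below assumes a PyTorch module whose parameters are supplied as a dict of floating-point tensors; it illustrates the linearization itself, not the tangent-space fine-tuning procedure of Ortiz-Jimenez et al.

```python
import torch
from torch.func import functional_call, jvp

def linearized_prediction(model, theta_pre, task_vectors, lambdas, x):
    """Evaluate f_lin(x) = f(x; theta_pre) + sum_t lambda_t * <grad_theta f(x; theta_pre), tau_t>
    with a single Jacobian-vector product along the combined displacement."""
    # theta_pre: dict of floating-point parameters, e.g. dict(model.named_parameters()).
    tangent = {k: sum(lam * tau[k] for lam, tau in zip(lambdas, task_vectors))
               for k in theta_pre}

    def f(params):
        return functional_call(model, params, (x,))

    f_pre, delta = jvp(f, (theta_pre,), (tangent,))
    return f_pre + delta

# If the network is nearly linear around theta_pre, this matches evaluating the merged
# model directly, i.e. functional_call(model, theta_merged, (x,)) with
# theta_merged = {k: theta_pre[k] + tangent[k]}.
```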
3. Methods for Improving and Controlling Task Arithmetic
Multiple strategies have been proposed to enhance the reliability and expressivity of task arithmetic:
- Module/Layer selection: Editing only key linear layers (particularly within attention modules) increases weight disentanglement and multi-task performance without sacrificing single-task accuracy (Jin et al., 9 Jul 2024). Layer-aware weighting schemes can modulate task vector contributions per layer, e.g. amplifying task-specific layers while attenuating instruction-following components (Chen et al., 27 Feb 2025).
- Closed-form coefficient selection: Rather than heuristic or grid search for the weights $\lambda_t$, approaches such as MetaGPT (Zhou et al., 17 Jun 2024) solve for optimal scaling coefficients in closed form, leveraging orthogonality and vector norms:

$$\lambda_t = \frac{\lVert \tau_t \rVert_2^2}{\sum_{k=1}^{T} \lVert \tau_k \rVert_2^2},$$

ensuring minimal average loss difference from the individual task-specialized models (a numerical sketch follows this list).
- Sparse and localized merging: Identifying and stitching only the minimal parameter subset responsible for each task (often only a small fraction of the weights) reduces interference and storage while preserving pre-trained knowledge. Localize-and-Stitch uses learned binary masks to extract these regions (He et al., 24 Aug 2024).
- Selective masking by importance metrics: Selective Task Arithmetic (STA) introduces a loss-sensitive, Taylor-based parameter importance metric for each task, of the form

$$I_{t,i} = \left| \frac{\partial \mathcal{L}_t}{\partial \theta_i}\,\tau_{t,i} \right|,$$

which approximates the loss change incurred by discarding the $i$-th coordinate of the task update, and uses quantile-based masking to filter out unimportant updates, enabling both robust multi-task fusion and targeted task forgetting (Bowen et al., 25 Nov 2024).
- Federated learning analogies: By recasting task arithmetic as one-shot Federated Averaging (FedAvg), theoretical bounds and practical improvements from federated learning (FedNova normalization, coordinate-wise median, clipping) can be applied to mitigate the effects of data and training heterogeneity (Tao et al., 27 Nov 2024).
- Trust region constraints: Trust-region-aware merging restricts edits to parameter directions that do not cause large cross-task loss changes, alleviating conflicts in parameter space (Sun et al., 25 Jan 2025).
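Two of the ingredients above admit short numerical sketches: a normalized-squared-norm rule for the closed-form coefficients (my reading of the MetaGPT formula under the orthogonality assumption) and a quantile-based mask built from a first-order, gradient-times-update importance score in the spirit of STA. The function names are illustrative, and both are approximations rather than faithful reproductions of the cited methods.

```python
import torch

def flatten(tensors: dict) -> torch.Tensor:
    """Concatenate a dict of tensors into one flat vector."""
    return torch.cat([v.reshape(-1) for v in tensors.values()])

def closed_form_coefficients(task_vectors: list) -> list:
    """Closed-form scaling lambda_t = ||tau_t||^2 / sum_k ||tau_k||^2,
    valid when the task vectors are approximately orthogonal."""
    sq_norms = [flatten(tau).pow(2).sum() for tau in task_vectors]
    total = sum(sq_norms)
    return [float(s / total) for s in sq_norms]

def importance_mask(tau: dict, grads: dict, keep: float = 0.1) -> dict:
    """Quantile-based mask: keep the `keep` fraction of entries with the largest
    first-order importance |dL/dtheta_i * tau_i|; zero out the rest.
    (For very large models, estimate the threshold from a random subsample.)"""
    importance = {k: (grads[k] * tau[k]).abs() for k in tau}
    threshold = torch.quantile(flatten(importance), 1.0 - keep)
    return {k: tau[k] * (importance[k] >= threshold) for k in tau}
```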
4. Applications Across Modalities and Domains
Task arithmetic is deployed in a variety of real-world and research settings:
- Multitask LLMs: Merging specialized LLMs (e.g., code, math, general reasoning) via task vector addition yields a single model with joint capability, achieving near-individual task performance and enabling knowledge transfer without retraining or sharing data (Zhou et al., 17 Jun 2024, Dai et al., 15 Apr 2025).
- Zero-shot and domain adaptation: Adding domain-specific task vectors to general IR models dramatically improves retrieval accuracy in underrepresented or shifted domains without fine-tuning (up to 18% NDCG@10 improvement) (Braga et al., 1 May 2025).
- Model editing and fairness: Editing task vectors (including subgroup-specific vectors) can modulate both utility and fairness metrics, such as improving Demographic Parity and Equalized Odds in hate speech detection (Naganuma et al., 30 May 2025).
- Speech translation and language expansion: By merging task vectors of single-pair ST models (and controlling for language confusion via a dedicated Language Control task vector), one-to-many speech translation is achieved without retraining on all data. Task analogies enable synthesis for previously unattainable language pairs (Cheng et al., 17 Sep 2024).
- Molecule design under label scarcity: Molecular task arithmetic learns property directions from negative examples, inverts them, and generates positive molecules—outperforming supervised finetuning in diversity and hit rate for de novo design tasks (Özçelik et al., 23 Jul 2025).
- Visual explainability transfer: Explainability capabilities (“explainability vectors”) learned via explanation supervision are transferred across domains using task arithmetic, providing explanation quality rivaling Kernel SHAP but at orders of magnitude lower inference cost (Yoshikawa et al., 6 Jul 2025).
- Continual learning: Combining LoRA-based adapters and task arithmetic enables highly efficient continual vision learning with full catastrophic forgetting avoidance (Chitale et al., 2023).
5. Performance, Limitations, and Theoretical Connections
Empirical results consistently demonstrate that task arithmetic, with appropriate coefficient selection, sparse or staged application, or module-level awareness, preserves over 90% of specialized accuracy on joint tasks and often rivals both ensembling and full multitask fine-tuning (Ilharco et al., 2022, Zhou et al., 17 Jun 2024, He et al., 24 Aug 2024). With auxiliary fine-tuning on small memory reservoirs, performance approaches that of full-set fine-tuning at a fraction of the computational cost (Chitale et al., 2023).
However, limitations and challenges persist:
- The approach mandates architectural compatibility: task vectors are only directly composable between models with identical architectures and (typically) a shared pre-trained initialization (Ilharco et al., 2022).
- Knowledge conflicts can degrade individual task accuracy if vectors are not sufficiently disentangled. Trust-region constraints, sparse localization, and selective masking mitigate but do not entirely eliminate this issue.
- The success of analogy-based task arithmetic is sensitive to the relatedness (measured via vector similarity) of source and target domains (Yoshikawa et al., 6 Jul 2025).
- Optimal coefficient scaling ($\lambda_t$) may benefit from small validation sets, particularly when merging more than two tasks or in highly heterogeneous domains (Braga et al., 1 May 2025, Tao et al., 27 Nov 2024).
Task arithmetic is mathematically connected to local linearization (NTK theory) and one-shot federated model averaging, providing both theoretical justification for its linear regime and practical routes for improvement.
6. Conceptual Extensions and Future Prospects
Current research extends task arithmetic in several directions:
- Layer-wise and submodule merging: Exploits strongly linear submodules for efficient, closed-form modular merging, further enhancing accuracy and robustness (Dai et al., 15 Apr 2025, Chen et al., 27 Feb 2025).
- Active task vector synthesis: Analogies, interpolation/extrapolation, and “vector arithmetic” can generate novel skills and capabilities beyond direct finetuning, and can synthesize language pairs or capabilities missing from explicit data (Ilharco et al., 2022, Cheng et al., 17 Sep 2024, Yoshikawa et al., 6 Jul 2025).
- Efficient continual learning: Parameter-efficient approaches (such as LoRA, “mask and difference” storage) combined with task arithmetic enable rapid continual skill composition with minimal memory (Chitale et al., 2023, He et al., 24 Aug 2024).
- Fairness and responsible editing: Combining demographic/group-specific vectors in a controllable manner opens avenues for balancing utility and social fairness in practical deployments (Naganuma et al., 30 May 2025).
- Theoretical understanding of in-context learning: Provable frameworks show that transformers exploit vector arithmetic over latent task vectors, generalizing and composing knowledge via high-level, linear mechanisms that static word embeddings cannot match (Bu et al., 13 Aug 2025).
Ongoing research seeks to further improve weight disentanglement, scale these techniques to multimodal and heterogeneous architectures, automate optimal mask or coefficient selection, and probe the boundaries of locality and linearity in deep neural models.
Task arithmetic formalizes and exploits local linear structure in parameter space to edit, merge, augment, or forget behaviors in neural networks efficiently and scalably. Its applications, theoretical links, and algorithmic variants constitute a rapidly expanding subfield at the intersection of model editing, transfer learning, and modularity in large-scale deep networks.