Word-Level Task Vector Arithmetic
- Word-level task vector arithmetic is a set of techniques that exploit additive linear structures in embeddings and parameter spaces to model morphological transformations, semantic analogies, and task adaptations.
- It leverages explicit manipulation of embedding spaces, achieving high-fidelity analogies (e.g., 80% top-1 accuracy in semantic subspaces) and effective vocabulary compression (up to 10% reduction in entries).
- Task vectors in model parameters enable the fusion of rare-word capabilities, yielding improvements such as a +5 BLEU score boost and robust adaptation across multiple tasks.
Word-level task vector arithmetic refers to a collection of methodologies and analyses that reveal, exploit, or reconfigure the additive linear structure present in word-level representations and parameter deltas in modern language and speech models. These approaches leverage the observation that certain word transformations, analogies, and even downstream capabilities can be succinctly modeled using additive vector operations. Techniques span explicit manipulation of embedding spaces, parameter-difference vectors encoding capabilities, and structured vocabulary designs, with rigorous evidence emerging across both classic word embeddings and large transformer-based language and speech models.
1. Foundations of Vector Arithmetic in Word Representations
The foundational insight driving word-level vector arithmetic is that relations between words—such as morphology (walk→walked), semantic analogy (king–man+woman→queen), or orthographic changes—are often encoded as approximately linear directions in learned embedding spaces. In static word embedding models, such as word2vec trained with skip-gram with negative sampling (SGNS), this manifests as the empirical regularity that vector operations like $\vec{w}_{\text{king}} - \vec{w}_{\text{man}} + \vec{w}_{\text{woman}} \approx \vec{w}_{\text{queen}}$ reliably solve a range of analogical tasks (Ethayarajh et al., 2018).
The csPMI Theorem provides a formal underpinning: word pairs related by the same linear offset share a constant co-occurrence shifted PMI, i.e., in a noiseless SGNS space, $\vec{a} - \vec{b} = \vec{x} - \vec{y}$ holds if and only if $\mathrm{csPMI}(a, b) = \mathrm{csPMI}(x, y)$ for all pairs $(a, b), (x, y)$ in the relation, where $\mathrm{csPMI}(x, y) = \mathrm{PMI}(x, y) + \log p(x, y)$ (Ethayarajh et al., 2018). This property produces principled schemes for analogy solving and motivates the extension to compositional operators for new relations (e.g., weighted sums for semantic composition).
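As a concrete illustration of this regularity, the following minimal sketch performs 3CosAdd-style analogy solving over pretrained word vectors; the `vectors` dictionary (word → array), its loading mechanism, and the example words are assumptions made for illustration, not artifacts of the cited work.

```python
# Minimal sketch of 3CosAdd analogy solving over SGNS-style word embeddings.
# Assumptions: `vectors` maps words to same-dimensional numpy arrays, e.g.
# loaded from a pretrained word2vec/GloVe file; example words are illustrative.
import numpy as np

def solve_analogy(vectors, a, b, c, topk=1):
    """Return the word(s) d maximizing cos(d, b - a + c), excluding a, b, c."""
    words = list(vectors)
    E = np.stack([vectors[w] for w in words]).astype(np.float64)
    E /= np.linalg.norm(E, axis=1, keepdims=True)      # unit-normalize rows
    idx = {w: i for i, w in enumerate(words)}
    query = E[idx[b]] - E[idx[a]] + E[idx[c]]           # e.g. king - man + woman
    query /= np.linalg.norm(query)
    scores = E @ query                                  # cosine similarities
    for w in (a, b, c):                                 # standard exclusion of inputs
        scores[idx[w]] = -np.inf
    return [words[i] for i in np.argsort(-scores)[:topk]]

# Usage (assuming these words are in the loaded vocabulary):
# solve_analogy(vectors, "man", "king", "woman")  # expected: ["queen"]
```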
In transformer LMs, similar mechanisms are observed: during prediction, residual-stream representations are manipulated by feedforward subblocks that add near-constant task vectors to perform relational tasks, with a sharp dissociation between abstractive (memory recall) and extractive (context-copying) tasks (Merullo et al., 2023).
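The additive nature of these feedforward updates can be probed logit-lens style. The sketch below is an assumption-laden illustration rather than a reproduction of the cited experiments: the model (GPT-2 small via Hugging Face transformers), the layer index, and the prompt are arbitrary choices, and the probe simply compares next-token candidates decoded from the residual stream before and after one MLP sublayer's additive contribution at the final token.

```python
# Hedged sketch: inspect how an MLP sublayer's additive update shifts the
# decoded prediction at the last token. Model choice (GPT-2 small), layer
# index, and prompt are illustrative assumptions, not from the cited paper.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 9                                   # assumed upper-layer block
cache = {}

def grab_residual(module, inputs):          # pre-hook on ln_2: residual stream
    cache["resid"] = inputs[0][0, -1, :].detach()

def grab_mlp_out(module, inputs, output):   # hook on mlp: its additive update
    cache["mlp"] = output[0, -1, :].detach()

h1 = model.transformer.h[LAYER].ln_2.register_forward_pre_hook(grab_residual)
h2 = model.transformer.h[LAYER].mlp.register_forward_hook(grab_mlp_out)
with torch.no_grad():
    ids = tok("The capital of Poland is", return_tensors="pt").input_ids
    model(ids)
h1.remove(); h2.remove()

def decode_top(vec, k=3):
    """Logit-lens decode: final layer norm + unembedding of a residual vector."""
    logits = model.lm_head(model.transformer.ln_f(vec))
    return [tok.decode(int(i)) for i in logits.topk(k).indices]

print("before MLP update:", decode_top(cache["resid"]))
print("after  MLP update:", decode_top(cache["resid"] + cache["mlp"]))
```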
2. Explicit Morphological and Compositional Vocabulary Design
“Vocab Diet” demonstrates a methodology for reengineering LLM vocabularies to exploit discovered vector arithmetic at the word level (Reif et al., 19 Oct 2025). The key principle is to move from assigning separate tokens for each inflectional or derivational surface form toward a compositional scheme: each surface form is represented as a sum of its lemma’s embedding and one or more learned or extracted transformation offsets,
$$\vec{e}(w) = \vec{e}(\ell) + \sum_i \vec{t}_i,$$
where $\vec{e}(\ell)$ is the embedding of the base (lemma) $\ell$ and each $\vec{t}_i$ is a transformation vector for a morphosyntactic function (e.g., past tense, plural).
Transformation vectors are computed as mean offsets over pairs of vocabulary tokens that differ by a single known transformation,
$$\vec{t} = \frac{1}{|P|} \sum_{(w,\, w') \in P} \big( \vec{e}(w') - \vec{e}(w) \big),$$
where $P$ is the set of (base, transformed) token pairs exhibiting that transformation, with analogous estimation in the output embedding space.
By removing surface-form tokens from the model’s vocabulary and composing them at runtime, this approach eliminates up to 10% of vocabulary entries, reallocates capacity to more diverse words, and enables OOV handling with minimal impact on performance—a difference of 0.8–2.4 points in average downstream metrics on tasks such as TinyMMLU, XNLI, and SQuAD across multiple LLMs and five languages (Reif et al., 19 Oct 2025).
Probing reveals high accuracy for reconstructing regular inflections (e.g., plural: 92–96%) and robust performance even on OOV forms. This compositionality, extracted without retraining the core model, leverages and exposes the latent arithmetic structure present in both input and output embedding matrices.
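A minimal sketch of this extraction-and-composition scheme is given below, assuming `emb` maps vocabulary strings to rows of a model's input embedding matrix; the pair list, lemma, and helper names are illustrative and not taken from the paper.

```python
# Sketch of extracting a transformation vector as a mean offset over vocabulary
# pairs and composing an unseen surface form. `emb` (word -> embedding row),
# the pair list, and the lemma are illustrative assumptions.
import numpy as np

def transformation_vector(emb, pairs):
    """t = mean over (base, transformed) pairs of e(transformed) - e(base)."""
    return np.mean([emb[w2] - emb[w1] for w1, w2 in pairs], axis=0)

def compose(emb, lemma, transforms):
    """Represent a surface form as the lemma embedding plus transformation offsets."""
    return emb[lemma] + sum(transforms)

def nearest_word(emb, vec, exclude=()):
    """Nearest vocabulary entry by cosine similarity (for probing reconstructions)."""
    words = [w for w in emb if w not in exclude]
    E = np.stack([emb[w] for w in words])
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    return words[int(np.argmax(E @ (vec / np.linalg.norm(vec))))]

# past = transformation_vector(emb, [("walk", "walked"), ("jump", "jumped")])
# nearest_word(emb, compose(emb, "climb", [past]), exclude={"climb"})  # ideally "climbed"
```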
3. Task Vector Arithmetic in Model Parameters
Beyond word embeddings, task vectors defined over the parameter space of neural models enable modular composition and fast adaptation of capabilities. In speech-to-text, the “rare word” bottleneck is addressed by defining word-level task vectors as difference vectors between model weights fine-tuned on synthetic data for each rare word $w$ and the base weights,
$$\tau_w = \theta^{\mathrm{ft}}_w - \theta_{\mathrm{base}}.$$
Arithmetic operations on these task vectors (e.g., simple addition, TIES-style trim-and-elect-sign fusion, or dropout-based DARE) allow composition of multiple rare-word capabilities into a single model without further fine-tuning. Application involves creating fused weight sets
$$\theta_{\mathrm{fused}} = \theta_{\mathrm{base}} + \lambda \sum_w \tau_w,$$
which are then directly used for inference. Experimental results match or surpass dedicated fine-tuned models on target-word BLEU and mitigate catastrophic forgetting, with general test-set BLEU remaining comparable to both raw and fine-tuned baselines (Jing et al., 26 Dec 2025).
As the number $N$ of composed rare-word task vectors grows, performance degrades gracefully but remains robust for up to 4–5 words; interference increases for larger $N$, with performance converging toward naive averaging approaches. Limitations include heuristic hyperparameter tuning and cross-task interference for large $N$.
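For concreteness, the simple-sum variant of this fusion can be sketched over PyTorch state dictionaries as follows; TIES- or DARE-style merging would additionally trim, elect signs, or randomly drop delta entries before summation. Checkpoint paths, the scaling factor `lam`, and the helper names are assumptions.

```python
# Hedged sketch of word-level task-vector extraction and additive fusion over
# model weights (simple-sum variant). Assumptions: all checkpoints share one
# architecture; `lam` is a tuned scaling factor; paths/names are illustrative.
import torch

def task_vector(base_state, finetuned_state):
    """tau_w = theta_w^ft - theta_base, parameter by parameter (float tensors only)."""
    return {k: finetuned_state[k] - v
            for k, v in base_state.items() if v.is_floating_point()}

def fuse(base_state, task_vectors, lam=1.0):
    """theta_fused = theta_base + lam * sum_w tau_w."""
    fused = {k: v.clone() for k, v in base_state.items()}
    for tau in task_vectors:
        for k, delta in tau.items():
            fused[k] += lam * delta
    return fused

# base = torch.load("base.pt"); fts = [torch.load(p) for p in rare_word_ckpts]
# model.load_state_dict(fuse(base, [task_vector(base, ft) for ft in fts], lam=0.3))
```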
4. Subspace Methods and Layerwise Vector Arithmetic
Attention-head analysis in transformer LMs reveals two distinct subspaces supporting vector arithmetic: a semantic “concept” subspace and a “token” (surface-form) subspace (Feucht et al., 22 Nov 2025). Using attribution and attention patterns, heads are ranked for semantic or literal-copying behavior, and subspace projectors are constructed by aggregating the OV matrix products $W^h_V W^h_O$ of the top-ranked heads into a low-rank projection.
Projecting layerwise hidden states into these subspaces enables high-fidelity vector arithmetic:
- Concept subspace: enables semantic analogies, e.g., "Athens" − "Greece" + "China" ≈ "Beijing", with 80% top-1 accuracy versus 47% for raw hidden states.
- Token subspace: enables morphological surface operations, e.g., mapping “code” → “coding” by vector addition, reaching ~75% accuracy.
This subspace analysis demonstrates that different layers and heads encode distinct compositional functions; results are robust to the choice of the number k of selected heads and to the rank of the low-rank approximation. Limitations include the current focus on a single model (Llama-2-7B) and language, as well as the handling of non-parallelogram or cyclic relations (Feucht et al., 22 Nov 2025).
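The projection step can be sketched as follows, assuming `ov_mats` holds the d×d OV products of the selected heads and `h` maps words to hidden-state vectors from a chosen layer; aggregation by summation, the rank, and the example words are illustrative assumptions rather than the paper's exact recipe.

```python
# Sketch: build a rank-r projector for the "concept" subspace from selected
# heads' OV products and use it for analogy arithmetic on hidden states.
# Aggregation-by-sum, the rank, and the example words are assumptions.
import numpy as np

def subspace_projector(ov_mats, rank=64):
    agg = np.sum(ov_mats, axis=0)          # aggregate d x d OV matrix products
    U, _, _ = np.linalg.svd(agg)
    B = U[:, :rank]                        # orthonormal basis for the subspace
    return B @ B.T                         # d x d projection matrix

def nearest(h, vec, exclude=()):
    """Nearest entry of h by cosine similarity, excluding the query words."""
    words = [w for w in h if w not in exclude]
    E = np.stack([h[w] for w in words])
    E /= np.linalg.norm(E, axis=1, keepdims=True)
    return words[int(np.argmax(E @ (vec / np.linalg.norm(vec))))]

# P = subspace_projector(ov_mats)
# query = P @ (h["Athens"] - h["Greece"] + h["China"])
# nearest({w: P @ v for w, v in h.items()}, query,
#         exclude={"Athens", "Greece", "China"})   # ideally "Beijing"
```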
5. In-Context Learning, Task Vector Recovery, and Theoretical Guarantees
In-context learning (ICL) in transformers—where the model infers a demonstrated mapping from prompt pairs and applies it to new queries—also leverages an internal, additive task vector (Bu et al., 13 Aug 2025). Empirical and theoretical analysis shows that the residual stream encodes a “task vector” $\vec{\theta}_{\mathrm{task}}$ that, when added to the prompt token’s representation $\vec{x}$, enables answer prediction, with the answer decoded from $\vec{x} + \vec{\theta}_{\mathrm{task}}$.
A theoretical framework demonstrates that training a residual transformer with cross-entropy on QA data lets the model reliably recover the correct task vector and apply the arithmetic operation, achieving arbitrarily low 0–1 test error. This holds even under dictionary or distribution shifts. Mixed-task settings result in the residual encoding a convex combination of multiple task vectors, supporting compositional generalization. Simulations corroborate that QA-based training is necessary for clean vector arithmetic; pure ICL training yields hybrid representations and stagnating error.
These results provide the first theoretical explanation for the emergence and robustness of word-level task vector arithmetic in nonlinear deep transformers (Bu et al., 13 Aug 2025).
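A purely conceptual sketch of this mechanism follows; `residual` and `decode` are assumed stand-ins for model internals (last-token residual-stream readout at a chosen layer and unembedding of a residual vector), and reading the task vector off the end of the demonstration context is a simplification of the construction analyzed in the paper.

```python
# Conceptual sketch of applying an ICL task vector by addition. `residual` and
# `decode` are assumed stand-ins for model internals: residual(prompt, layer)
# returns the last-token residual-stream vector, decode(vec) unembeds it.
def icl_with_task_vector(demos, query, layer, residual, decode):
    # Read the task vector off the residual stream at the end of the
    # demonstration context (a simplified, assumed extraction heuristic).
    t = residual(demos, layer)
    # Apply the inferred mapping to a zero-shot query by plain vector addition.
    return decode(residual(query, layer) + t)

# e.g. demos = "France -> Paris\nJapan -> Tokyo\n", query = "Poland ->"
```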
6. Broader Implications, Limitations, and Future Directions
Word-level vector arithmetic reveals the surprising linearity accessible within highly nonlinear models, offering interpretable and modular handles for lexical transformation, task adaptation, and vocabulary design. In both input-output embedding and parameter spaces, additive offsets serve as a basis for relation modeling, vocabulary compression, and dynamic capability infusion. Extensions include domain adaptation, robust OOV extrapolation, compositional paraphrasing, and cross-lingual alignment.
There remain limitations in interference for large composite updates, optimal operator design for complex relations, and the automatic discovery and weighting of arithmetic components. Open problems include scaling to large multi-modal systems, layerwise or modulewise fusions, and extending theory to capture more general forms of analogy (beyond parallelogram and surface-level correspondences) (Reif et al., 19 Oct 2025, Jing et al., 26 Dec 2025, Bu et al., 13 Aug 2025).
In sum, the explicit modeling, extraction, and compositional use of transformation vectors provide a principled and empirically supported foundation for interpretable, efficient, and extensible word-level manipulation in modern language and speech models.