Subtraction in LLMs
- Subtraction in LLMs is defined as computing a - b, a task complicated by non-commutativity and by the need to output the negative sign correctly when a < b.
- Analysis shows pretrained models often omit the negative sign for a-b when a < b, despite internal states accurately encoding the sign.
- Instruction tuning with explicit negative-result examples significantly boosts sign accuracy, aligning subtraction performance closer to addition.
Subtraction in LLMs refers to both the explicit task of arithmetic subtraction (computing $a - b$ for numbers $a$ and $b$) and a range of related operations involving the use of subtraction on model parameters or internal representations for model editing, ability isolation, or knowledge removal. Subtraction occupies a foundational role in evaluating and improving LLMs’ mathematical reasoning, compositionality, model modularity, and knowledge management capabilities. Unlike addition, subtraction is non-commutative and requires attention to directionality and sign, factors which have exposed both computational and representational limitations in contemporary LLMs.
1. Arithmetic Subtraction Capabilities and Error Patterns
The arithmetic subtraction task evaluates an LLM's ability to produce correct outputs for expressions of the form $a - b$, with special interest in cases where $a < b$ (requiring a negative result). Comprehensive benchmarking across open-source LLMs (Gemma-2, Qwen3, OLMo-2, Llama-3; 8–70B parameters) demonstrates a persistent and substantial gap between addition and subtraction (Jobanputra et al., 4 Nov 2025). While addition accuracies are typically near perfect (≥99%) for single-token numbers, subtraction accuracies drop precipitously, often by 30–50 percentage points, with pretrained models averaging only ~56% accuracy (see the table below).
| LLM | Addition | Subtraction (a < b, w/ sign) | Subtraction (a < b, magnitude only) |
|---|---|---|---|
| Qwen3-8B | ~100% | 4.44% | 36.89% |
| Llama-3-8B | ~99% | 8.13% | 33.51% |
| OLMo-2-32B | ~99% | 13.65% | 96.55% |
The dominant error mode in subtraction is omission of the negative sign in cases where $a < b$. In these instances, models nearly always produce the correct magnitude but fail to emit the leading "-"; e.g., for $2-5$, outputting "3" in place of "-3". Accuracy for $a - b$ with $a \geq b$ is typically near addition-level, revealing an asymmetry tied to sign handling rather than numerical computation per se. This behavior is robust to prompt formulation, operand reordering, and number tokenization granularity, and persists for multi-token numbers.
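A minimal evaluation sketch of this with-sign vs. magnitude-only scoring is shown below; the model name, prompt template, operand ranges, and greedy decoding settings are illustrative assumptions rather than the exact protocol of (Jobanputra et al., 4 Nov 2025).

```python
# Sketch: score subtraction completions with the exact signed answer ("w (-)")
# and with magnitude only ("w/o (-)"). Model name and template are assumptions.
import re
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B"   # hypothetical choice; any causal LM works
TEMPLATE = "{a} - {b} = "                   # one illustrative prompt template

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

def first_int(text):
    """Return the first (possibly signed) integer in the continuation, or None."""
    m = re.search(r"-?\d+", text)
    return int(m.group()) if m else None

def evaluate(pairs):
    sign_hits, magnitude_hits = 0, 0
    for a, b in pairs:
        inputs = tokenizer(TEMPLATE.format(a=a, b=b), return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
        continuation = tokenizer.decode(
            out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        pred, gold = first_int(continuation), a - b
        if pred is not None:
            sign_hits += int(pred == gold)                 # "w (-)": exact signed result
            magnitude_hits += int(abs(pred) == abs(gold))  # "w/o (-)": magnitude only
    return sign_hits / len(pairs), magnitude_hits / len(pairs)

# a < b cases only, i.e., every gold answer is negative.
pairs = [(random.randint(0, 99), random.randint(100, 999)) for _ in range(50)]
print(evaluate(pairs))
```

A single template and greedy decoding keep the sketch short; the reported results aggregate over several templates and operand regimes.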
2. Probing and Representation of Negative Results
Linear probing analyses expose a crucial disconnect: although the output layer of LLMs frequently omits the negative sign, internal hidden states almost perfectly encode whether a given instance should produce a negative result. Probes trained on model states (Qwen3-8B, Gemma-2-9B, Llama-3-8B) achieve over 99% accuracy in classifying the necessary sign, both for single-token and multi-token numbers (Jobanputra et al., 4 Nov 2025). This indicates that the LLM’s numeracy “knows” when the output should be negative, but this information is not reflected in generation. The translation of scalar or symbolic attributes (such as sign) from internal computation to language output thus forms a key failure point for subtraction in LLMs (a minimal probing sketch follows the table below).
| LLM | Single-token (%) | Multi-token (%) | Combined (%) |
|---|---|---|---|
| Gemma-2 9B | 100.00 | 99.60 | 99.77 ± 0.08 |
| Llama-3.1 8B | 99.66 | 99.10 | 99.37 ± 0.09 |
| Qwen3 8B | 100.00 | 99.94 | 99.95 ± 0.07 |
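A linear sign probe of the kind summarized above can be sketched as follows; the probed layer, prompt template, dataset size, and probe hyperparameters are assumptions for illustration, not the paper's exact setup. The probe is a logistic regression over the last-token hidden state, labeled by whether a - b is negative.

```python
# Sketch: logistic-regression sign probe on last-token hidden states.
# Layer choice, template, and data sizes are illustrative assumptions.
import random
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen3-8B"  # hypothetical choice
LAYER = -1                    # assumption: probe the final layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto", output_hidden_states=True
)

@torch.no_grad()
def last_token_state(a, b):
    """Hidden state of the final prompt token for 'a - b = '."""
    inputs = tokenizer(f"{a} - {b} = ", return_tensors="pt").to(model.device)
    hidden = model(**inputs).hidden_states[LAYER]   # shape: (1, seq_len, d_model)
    return hidden[0, -1].float().cpu().numpy()

# Label 1 iff a - b is negative (a < b); a == b is excluded, mirroring the protocol.
pairs = [(random.randint(0, 999), random.randint(0, 999)) for _ in range(2000)]
pairs = [(a, b) for a, b in pairs if a != b]
X = [last_token_state(a, b) for a, b in pairs]
y = [int(a < b) for a, b in pairs]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
```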
3. Effectiveness of Prompting and Instruction Tuning
Interventions to improve subtraction performance were tested at scale:
- Few-shot Prompting: Incorporating 3, 5, or 10 in-context subtraction examples yields variable and generally modest improvement in negative-result accuracy. For example, Llama-3-8B performance rises from 8.13% (zero-shot w/ sign) to 31.46% (5-shot), but such improvements are inconsistent across models. Magnitude accuracy (ignoring the sign) climbs above 90% with few-shot methods, indicating that numeric calculation is robust while sign output remains a bottleneck (see the prompt-construction sketch at the end of this section).
- Instruction-Tuning: Instruction-tuned models, which have been exposed to subtraction tasks (including $a < b$ cases) in their instruction-tuning corpora (e.g., MATH, GSM8k, Tulu 3), display near-perfect negative-sign accuracy: frequently 100%, and not below 88% for any model/language pair. This holds for both single- and multi-token numbers. The operational implication is that exposure to curated subtraction examples, especially those producing negative answers, is a sufficient condition to overcome sign omission.
| LLM | Pretrained sign accuracy (a < b) | Instruction-tuned sign accuracy (a < b) |
|---|---|---|
| Gemma-2-9B | 1.33% | 100.00% |
| Llama-3-8B | 8.13% | 91.42% |
| OLMo-2-13B | 3.70% | 99.54% |
| Qwen3-8B | 4.44% | 100.00% |
Prompting and few-shot learning are insufficient for robust sign generation. Only instruction tuning, with explicit negative-result subtraction samples, reliably induces the correct output behavior (Jobanputra et al., 4 Nov 2025).
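As a concrete illustration of the few-shot condition, the sketch below builds a k-shot prompt whose demonstrations all carry explicit negative answers; the exemplar format and the all-negative demonstration policy are assumptions for illustration, not the paper's templates.

```python
# Sketch: k-shot prompt construction with explicit negative-result demonstrations.
import random

def few_shot_prompt(a, b, k=5, seed=0):
    """Prepend k solved subtraction exemplars (all with negative answers) to the query."""
    rng = random.Random(seed)
    lines = []
    for _ in range(k):
        x, y = rng.randint(0, 999), rng.randint(0, 999)
        if x >= y:
            # Force x < y so every demonstration shows a signed negative answer.
            x, y = min(x, y), max(x, y) + 1
        lines.append(f"{x} - {y} = {x - y}")
    lines.append(f"{a} - {b} = ")
    return "\n".join(lines)

print(few_shot_prompt(2, 5, k=3))
# Prints 3 solved exemplars with signed answers, then the unsolved query "2 - 5 = ".
```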
4. Subtraction as a Diagnostic and Research Frontier
The consistently poor performance of pretrained LLMs on negative-result subtraction, traced to the negative sign omission and the disconnect between internal representation and output, suggests several research imperatives:
- Subtraction (especially requiring negative results) should be included as a core sub-task in numeracy benchmarks, as it provides a more discriminating assessment of arithmetic capability than addition, which is structurally commutative.
- The limitation is not numeric facility, but output mapping: bridging the gap between scalar attribute tracking and text generation remains an unresolved modeling challenge.
- Instruction-tuning pipelines, and by extension pretraining data curation, must introduce systematically diverse subtraction problems in sufficient volume to “debug” this bottleneck without side effects.
- These findings generalize to multi-token (i.e., larger) numbers, indicating that the underlying phenomena are not artifactually restricted to simple tokenization regimes.
5. Mathematical and Evaluation Protocols
The subtraction task is precisely $a - b$, with experimental emphasis on $a < b$, i.e., $a - b < 0$. Key error metric: models output $|a - b|$ in place of $a - b$ (the magnitude without the sign). Test distributions are balanced over $a < b$ and $a > b$; $a = b$ is excluded. Five prompt templates are used for robustness, and for each subtraction structure, corresponding addition problems are evaluated as gold-standard baselines.
Evaluation is conducted for both with-sign (“w (-)”) and sign-agnostic (“w/o (-)”) accuracy to isolate the locus of error. Probing accuracy is assessed using simple linear models on hidden states, establishing the separability of sign as an internal feature. Post-instruction tuning, accuracy for $a < b$ cases (negative answer) is evaluated and compared with that for $a > b$ (positive answer) and addition.
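The sampling and scoring protocol can be sketched as follows; the operand ranges and template strings are placeholders rather than the exact templates used in the paper.

```python
# Sketch: balanced operand sampling (a < b vs. a > b, a == b excluded)
# and the two accuracy metrics ("w (-)" and "w/o (-)"). Templates are placeholders.
import random

TEMPLATES = [
    "{a} - {b} = ",
    "What is {a} minus {b}? Answer: ",
    "Compute {a} - {b}. Result: ",
    "{a}-{b}=",
    "The difference {a} - {b} equals ",
]  # five placeholder templates, mirroring the robustness protocol

def render(pair, template):
    """Render one operand pair under a given prompt template."""
    a, b = pair
    return template.format(a=a, b=b)

def balanced_pairs(n, lo=0, hi=999, seed=0):
    """Return n pairs, alternating a < b and a > b; a == b is excluded."""
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < n:
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        if a == b:
            continue
        want_negative = len(pairs) % 2 == 0
        if (a < b) != want_negative:
            a, b = b, a
        pairs.append((a, b))
    return pairs

def score(preds, pairs):
    """With-sign and sign-agnostic accuracy from integer predictions (None = unparsable)."""
    w_sign = sum(p is not None and p == a - b for p, (a, b) in zip(preds, pairs))
    wo_sign = sum(p is not None and abs(p) == abs(a - b) for p, (a, b) in zip(preds, pairs))
    return w_sign / len(pairs), wo_sign / len(pairs)
```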
6. Synthesis and Outlook
Subtraction in LLMs exposes sharp limitations in how pretrained models surface arithmetic results: specifically, systematic omission of the negative sign for $a - b$ when $a < b$, despite robust internal encoding of the required sign. The primary determinant of subtraction reliability is exposure to explicit negative-result subtraction tasks during instruction tuning, not model scale or architecture. Few-shot prompting and prompt engineering provide only marginal, unstable gains.
This establishes subtraction as a critical metric in the assessment of numeracy in LLMs, uncovers an architectural interface between internal computation and generation that is uniquely taxed by sign-handling, and underscores the necessity of targeted instruction tuning for reliable numerical reasoning.
| Diagnostic Step | Effect |
|---|---|
| Arithmetic pretraining only | Subtraction lags, negative answers often missed |
| Few-shot prompting | Modest/inconsistent sign gains, strong for magnitude only |
| Instruction tuning (with negatives) | Robust sign accuracy, matches addition |
| Probing internal states | Nearly perfect sign discrimination, decoding gap |
Addressing the output-decoding disconnect, particularly for symbolic properties like sign, remains a current grand challenge for making LLMs mathematically reliable beyond additive and positive domains. This is central for deployment in mathematical, scientific, and decision-critical NLP contexts where sign inversion is catastrophic.
References to Key Results and Datasets
- LLM subtraction vs. addition accuracy: Tables 1–2, Figures 1 and 3 (Jobanputra et al., 4 Nov 2025).
- Probing results for sign representation: key-findings table (Jobanputra et al., 4 Nov 2025).
- Instruction-tuning evaluation: main-results table (Jobanputra et al., 4 Nov 2025).
- Test setup: operand sampling, prompt structures, and balance statistics (Jobanputra et al., 4 Nov 2025).
Summary Table: Subtraction in LLMs
| Capability | Pretrained | Few-shot | Instruction-tuned |
|---|---|---|---|
| Sign accuracy (a < b) | 2–14% | up to ~31% (model-dependent) | ≥91%, often 100% |
| Magnitude accuracy (a < b, w/o sign) | ≥33%, up to ~96% | >90% | ≈addition perf. |
| Internal sign representation | ~100% (all setups) | ~100% | ~100% |
In summary, subtraction reveals surface-level generative and interface weaknesses in LLMs, points to the critical role of instruction tuning for sign robustness, and provides both a diagnostic and a benchmark for future advances in model architecture and training regimes.