Subtraction in LLMs
- Subtraction in LLMs is defined as computing a - b, a task complicated by non-commutativity and by the need to output the negative sign correctly when a < b.
- Analysis shows pretrained models often omit the negative sign for a-b when a < b, despite internal states accurately encoding the sign.
- Instruction tuning with explicit negative-result examples significantly boosts sign accuracy, aligning subtraction performance closer to addition.
Subtraction in LLMs refers to both the explicit task of arithmetic subtraction (computing $a - b$ for numbers $a$ and $b$) and a range of related operations involving the use of subtraction on model parameters or internal representations for model editing, ability isolation, or knowledge removal. Subtraction occupies a foundational role in evaluating and improving LLMs’ mathematical reasoning, compositionality, model modularity, and knowledge management capabilities. Unlike addition, subtraction is non-commutative and requires attention to directionality and sign, factors which have exposed both computational and representational limitations in contemporary LLMs.
1. Arithmetic Subtraction Capabilities and Error Patterns
The arithmetic subtraction task evaluates an LLM's ability to produce correct outputs for expressions of the form $a - b$, with special interest in cases where $a < b$ (requiring a negative result). Comprehensive benchmarking across open-source LLMs (Gemma-2, Qwen3, OLMo-2, Llama-3; 8–70B parameters) demonstrates a persistent and substantial gap between addition and subtraction (Jobanputra et al., 4 Nov 2025). While addition accuracies are typically near perfect (≥99%) for single-token numbers, subtraction accuracies drop precipitously, often by 30–50 percentage points, with pretrained models averaging only ~56% accuracy (see the table below).
| LLM | Addition | Subtraction (a < b, w/ sign) | Subtraction (a < b, magnitude only) |
|---|---|---|---|
| Qwen3-8B | ~100% | 4.44% | 36.89% |
| Llama-3-8B | ~99% | 8.13% | 33.51% |
| OLMo-2-32B | ~99% | 13.65% | 96.55% |
The dominant error mode in subtraction is omission of the negative sign in cases where $a < b$. In these instances, models nearly always produce the correct magnitude but fail to emit the leading "-"; e.g., for $2-5$, outputting "3" in place of "-3". Accuracy for $a - b$ with $a \geq b$ is typically near addition-level, revealing an asymmetry tied to sign handling rather than numerical computation per se. This behavior is robust to prompt formulation, operand reordering, and number tokenization granularity, and persists for multi-token numbers.
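A minimal evaluation sketch of this with-sign vs. magnitude-only scoring is shown below; the model name, prompt template, operand ranges, and greedy decoding settings are illustrative assumptions rather than the exact protocol of (Jobanputra et al., 4 Nov 2025).

```python
# Sketch: score subtraction completions with the exact signed answer ("w (-)")
# and with magnitude only ("w/o (-)"). Model name and template are assumptions.
import re
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B"   # hypothetical choice; any causal LM works
TEMPLATE = "{a} - {b} = "                   # one illustrative prompt template

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

def first_int(text):
    """Return the first (possibly signed) integer in the continuation, or None."""
    m = re.search(r"-?\d+", text)
    return int(m.group()) if m else None

def evaluate(pairs):
    sign_hits, magnitude_hits = 0, 0
    for a, b in pairs:
        inputs = tokenizer(TEMPLATE.format(a=a, b=b), return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
        continuation = tokenizer.decode(
            out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
        )
        pred, gold = first_int(continuation), a - b
        if pred is not None:
            sign_hits += int(pred == gold)                 # "w (-)": exact signed result
            magnitude_hits += int(abs(pred) == abs(gold))  # "w/o (-)": magnitude only
    return sign_hits / len(pairs), magnitude_hits / len(pairs)

# a < b cases only, i.e., every gold answer is negative.
pairs = [(random.randint(0, 99), random.randint(100, 999)) for _ in range(50)]
print(evaluate(pairs))
```

A single template and greedy decoding keep the sketch short; the reported results aggregate over several templates and operand regimes.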
2. Probing and Representation of Negative Results
Linear probing analyses expose a crucial disconnect: although the output layer of LLMs frequently omits the negative sign, internal hidden states almost perfectly encode whether a given instance should produce a negative result. Probes trained on model states (Qwen3-8B, Gemma-2-9B, Llama-3-8B) achieve over 99% accuracy in classifying the necessary sign, both for single-token and multi-token numbers (Jobanputra et al., 4 Nov 2025). This indicates that the LLM’s numeracy “knows” when the output should be negative, but this information is not reflected in generation. The translation of scalar or symbolic attributes (such as sign) from internal computation to language output thus forms a key failure point for subtraction in LLMs (a minimal probing sketch follows the table below).
| LLM | Single-token (%) | Multi-token (%) | Combined (%) |
|---|---|---|---|
| Gemma-2 9B | 100.00 | 99.60 | 99.77 ± 0.08 |
| Llama-3.1 8B | 99.66 | 99.10 | 99.37 ± 0.09 |
| Qwen3 8B | 100.00 | 99.94 | 99.95 ± 0.07 |
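A linear sign probe of the kind summarized above can be sketched as follows; the probed layer, prompt template, dataset size, and probe hyperparameters are assumptions for illustration, not the paper's exact setup. The probe is a logistic regression over the last-token hidden state, labeled by whether a - b is negative.

```python
# Sketch: logistic-regression sign probe on last-token hidden states.
# Layer choice, template, and data sizes are illustrative assumptions.
import random
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen3-8B"  # hypothetical choice
LAYER = -1                    # assumption: probe the final layer

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto", output_hidden_states=True
)

@torch.no_grad()
def last_token_state(a, b):
    """Hidden state of the final prompt token for 'a - b = '."""
    inputs = tokenizer(f"{a} - {b} = ", return_tensors="pt").to(model.device)
    hidden = model(**inputs).hidden_states[LAYER]   # shape: (1, seq_len, d_model)
    return hidden[0, -1].float().cpu().numpy()

# Label 1 iff a - b is negative (a < b); a == b is excluded, mirroring the protocol.
pairs = [(random.randint(0, 999), random.randint(0, 999)) for _ in range(2000)]
pairs = [(a, b) for a, b in pairs if a != b]
X = [last_token_state(a, b) for a, b in pairs]
y = [int(a < b) for a, b in pairs]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))
```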
3. Effectiveness of Prompting and Instruction Tuning
Interventions to improve subtraction performance were tested at scale:
- Few-shot Prompting: Incorporating 3, 5, or 10 in-context subtraction examples yields variable and generally modest improvement in negative-result accuracy. For example, Llama-3-8B performance rises from 8.13% (zero-shot w/ sign) to 31.46% (5-shot), but such improvements are inconsistent across models. Magnitude accuracy (ignoring the sign) climbs above 90% with few-shot methods, indicating that numeric calculation is robust while sign output remains a bottleneck (see the prompt-construction sketch at the end of this section).
- Instruction-Tuning: Instruction-tuned models, which have been exposed to subtraction tasks (including $a < b$ cases) in their instruction-tuning corpora (e.g., MATH, GSM8k, Tulu 3), display near-perfect negative-sign accuracy: frequently 100%, and not below 88% for any model/language pair. This holds for both single- and multi-token numbers. The operational implication is that exposure to curated subtraction examples, especially those producing negative answers, is a sufficient condition to overcome sign omission.
| LLM | Pretrained sign accuracy (a < b) | Instruction-tuned sign accuracy (a < b) |
|---|---|---|
| Gemma-2-9B | 1.33% | 100.00% |
| Llama-3-8B | 8.13% | 91.42% |
| OLMo-2-13B | 3.70% | 99.54% |
| Qwen3-8B | 4.44% | 100.00% |
Prompting and few-shot learning are insufficient for robust sign generation. Only instruction tuning, with explicit negative-result subtraction samples, reliably induces the correct output behavior (Jobanputra et al., 4 Nov 2025).
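As a concrete illustration of the few-shot condition, the sketch below builds a k-shot prompt whose demonstrations all carry explicit negative answers; the exemplar format and the all-negative demonstration policy are assumptions for illustration, not the paper's templates.

```python
# Sketch: k-shot prompt construction with explicit negative-result demonstrations.
import random

def few_shot_prompt(a, b, k=5, seed=0):
    """Prepend k solved subtraction exemplars (all with negative answers) to the query."""
    rng = random.Random(seed)
    lines = []
    for _ in range(k):
        x, y = rng.randint(0, 999), rng.randint(0, 999)
        if x >= y:
            # Force x < y so every demonstration shows a signed negative answer.
            x, y = min(x, y), max(x, y) + 1
        lines.append(f"{x} - {y} = {x - y}")
    lines.append(f"{a} - {b} = ")
    return "\n".join(lines)

print(few_shot_prompt(2, 5, k=3))
# Prints 3 solved exemplars with signed answers, then the unsolved query "2 - 5 = ".
```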
4. Subtraction as a Diagnostic and Research Frontier
The consistently poor performance of pretrained LLMs on negative-result subtraction, traced to the negative sign omission and the disconnect between internal representation and output, suggests several research imperatives:
- Subtraction (especially requiring negative results) should be included as a core sub-task in numeracy benchmarks, as it provides a more discriminating assessment of arithmetic capability than addition, which is structurally commutative.
- The limitation is not numeric facility, but output mapping: bridging the gap between scalar attribute tracking and text generation remains an unresolved modeling challenge.
- Instruction-tuning pipelines, and by extension pretraining data curation, must introduce systematically diverse subtraction problems in sufficient volume to “debug” this bottleneck without side effects.
- These findings generalize to multi-token (i.e., larger) numbers, indicating that the underlying phenomena are not artifactually restricted to simple tokenization regimes.
5. Mathematical and Evaluation Protocols
The subtraction task is precisely $a - b$, with experimental emphasis on $a < b$, i.e., $a - b < 0$. Key error metric: models output $|a - b|$ in place of $a - b$ (the magnitude without the sign). Test distributions are balanced over $a < b$ and $a > b$; $a = b$ is excluded. Five prompt templates are used for robustness, and for each subtraction structure, corresponding addition problems are evaluated as gold-standard baselines.
Evaluation is conducted for both with-sign (“w (-)”) and sign-agnostic (“w/o (-)”) accuracy to isolate the locus of error. Probing accuracy is assessed using simple linear models on hidden states, establishing the separability of sign as an internal feature. Post-instruction tuning, accuracy for $a < b$ cases (negative answer) is evaluated and compared with that for $a > b$ (positive answer) and addition.
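The sampling and scoring protocol can be sketched as follows; the operand ranges and template strings are placeholders rather than the exact templates used in the paper.

```python
# Sketch: balanced operand sampling (a < b vs. a > b, a == b excluded)
# and the two accuracy metrics ("w (-)" and "w/o (-)"). Templates are placeholders.
import random

TEMPLATES = [
    "{a} - {b} = ",
    "What is {a} minus {b}? Answer: ",
    "Compute {a} - {b}. Result: ",
    "{a}-{b}=",
    "The difference {a} - {b} equals ",
]  # five placeholder templates, mirroring the robustness protocol

def render(pair, template):
    """Render one operand pair under a given prompt template."""
    a, b = pair
    return template.format(a=a, b=b)

def balanced_pairs(n, lo=0, hi=999, seed=0):
    """Return n pairs, alternating a < b and a > b; a == b is excluded."""
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < n:
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        if a == b:
            continue
        want_negative = len(pairs) % 2 == 0
        if (a < b) != want_negative:
            a, b = b, a
        pairs.append((a, b))
    return pairs

def score(preds, pairs):
    """With-sign and sign-agnostic accuracy from integer predictions (None = unparsable)."""
    w_sign = sum(p is not None and p == a - b for p, (a, b) in zip(preds, pairs))
    wo_sign = sum(p is not None and abs(p) == abs(a - b) for p, (a, b) in zip(preds, pairs))
    return w_sign / len(pairs), wo_sign / len(pairs)
```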
6. Synthesis and Outlook
Subtraction in LLMs exposes sharp limitations in how pretrained models surface arithmetic results: specifically, systematic omission of the negative sign for $a - b$ when $a < b$, despite robust internal encoding of the required sign. The primary determinant of subtraction reliability is exposure to explicit negative-result subtraction tasks during instruction tuning, not model scale or architecture. Few-shot prompting and prompt engineering provide only marginal, unstable gains.
This establishes subtraction as a critical metric in the assessment of numeracy in LLMs, uncovers an architectural interface between internal computation and generation that is uniquely taxed by sign-handling, and underscores the necessity of targeted instruction tuning for reliable numerical reasoning.
| Diagnostic Step | Effect |
|---|---|
| Arithmetic pretraining only | Subtraction lags, negative answers often missed |
| Few-shot prompting | Modest/inconsistent sign gains, strong for magnitude only |
| Instruction tuning (with negatives) | Robust sign accuracy, matches addition |
| Probing internal states | Nearly perfect sign discrimination, decoding gap |
Addressing the output-decoding disconnect, particularly for symbolic properties like sign, remains a current grand challenge for making LLMs mathematically reliable beyond additive and positive domains. This is central for deployment in mathematical, scientific, and decision-critical NLP contexts where sign inversion is catastrophic.
References to Key Results and Datasets
- LLM subtraction vs. addition accuracy: Tables 1–2, Figures 1 and 3 (Jobanputra et al., 4 Nov 2025).
- Probing results for sign representation: key-findings table (Jobanputra et al., 4 Nov 2025).
- Instruction-tuning evaluation: main-results table (Jobanputra et al., 4 Nov 2025).
- Test setup: operand sampling, prompt structures, and balance statistics (Jobanputra et al., 4 Nov 2025).
Summary Table: Subtraction in LLMs
| Capability | Pretrained | Few-shot | Instruction-tuned |
|---|---|---|---|
| Sign accuracy (a < b) | 2–14% | up to ~31% (model-dependent) | ≥91%, often 100% |
| Magnitude accuracy (a < b, w/o sign) | ≥33%, up to ~96% | >90% | ≈addition perf. |
| Internal sign representation | ~100% (all setups) | ~100% | ~100% |
In summary, subtraction reveals surface-level generative and interface weaknesses in LLMs, points to the critical role of instruction tuning for sign robustness, and provides both a diagnostic and a benchmark for future advances in model architecture and training regimes.