Can LLMs subtract numbers? (2511.02795v1)
Abstract: We present a systematic study of subtraction in LLMs. While prior benchmarks emphasize addition and multiplication, subtraction has received comparatively little attention despite being structurally distinct as a non-commutative operation. We evaluate eight pretrained LLMs spanning four families on addition and subtraction problems. Our experiments reveal that subtraction accuracy lags behind addition by a wide margin. We find that the errors for ($a-b$) are concentrated in cases where ($a<b$). In such cases, LLMs frequently produce the correct magnitude but omit the negative sign. Probing analyses show that LLMs internally encode whether results should be negative, yet this information is often not reflected in generated outputs. We further test well-known techniques such as few-shot learning and instruction-tuning to see if they can improve the LLMs' performance. Our results suggest that while few-shot prompting yields modest gains, the instruction-tuned models achieve near-perfect accuracies in generating the negative sign. Together, these findings provide a clearer characterization of the limitations and recoverability of LLMs' arithmetic capabilities in subtraction.
Explain it Like I'm 14
Overview
This paper asks a simple-sounding question: Can LLMs—the AI systems that write and read text—correctly do subtraction? People have studied how well LLMs add and multiply, but subtraction hasn’t been checked as carefully. The authors find that subtraction is surprisingly hard for these models, especially when the answer should be negative (below zero).
What questions did the researchers try to answer?
The paper focuses on clear, basic questions:
- Do LLMs solve subtraction as well as they solve addition?
- Does the order of the numbers matter (for example, $a-b$ versus $b-a$)?
- Where do models make mistakes—are they getting the size of the answer wrong, or just missing the minus sign?
- Can simple tricks like giving examples (few-shot prompting) or using instruction-tuned models fix these problems?
How did they test this?
The team tested eight different LLMs from four families (Gemma-2, Qwen3, OLMo-2, Llama-3). They created many subtraction and addition problems with controlled settings so the tests were fair.
To make this easy to understand, here’s what they did:
- Tokens: LLMs read text in pieces called “tokens.” Some numbers (like 7 or 42) may be a single token; long numbers can be multiple tokens. The main tests used single-token numbers to keep things simple.
- Prompts: They asked the models to solve problems using five different styles of instructions (from short equations to more wordy directions) to make sure the results weren’t just due to wording.
- Zero-shot vs few-shot:
- Zero-shot means the model is asked the question with no examples.
- Few-shot means the model is shown a few solved examples first.
- Instruction-tuned models: These are LLMs that were trained to follow instructions better (like “explain your answer” or “give the final number”). This extra training often includes math problems.
- Probing: Think of this like “reading the model’s mind.” The researchers looked at the model’s internal signals to see if it “knew” the answer should be negative, even if it didn’t write the minus sign. They did this with simple classifiers called linear probes (a small sketch of the idea follows below).
They also checked the same patterns with multi-token numbers (longer numbers) and a related expression, $-b+a$, to see if the problem was specifically subtraction or just producing negative numbers.
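To make the probing idea concrete, here is a minimal sketch of how such a probe could be trained, assuming a Hugging Face causal LM with accessible hidden states. The model name, prompt format, and operand range are illustrative assumptions, not the paper's exact setup:

```python
# Hypothetical linear "sign" probe; illustrative, not the paper's exact setup.
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

name = "allenai/OLMo-2-1124-7B"  # any of the studied families would do
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
model.eval()

feats, labels = [], []
for _ in range(500):
    a, b = random.randint(0, 999), random.randint(0, 999)
    with torch.no_grad():
        out = model(**tok(f"{a}-{b}=", return_tensors="pt"),
                    output_hidden_states=True)
    # Final-layer activation at the last prompt position.
    feats.append(out.hidden_states[-1][0, -1].float().numpy())
    labels.append(int(a < b))  # 1 iff the true result is negative

# If a simple linear classifier separates the classes, the sign is
# linearly decodable from the model's internal state.
probe = LogisticRegression(max_iter=1000).fit(feats[:400], labels[:400])
print("held-out probe accuracy:", probe.score(feats[400:], labels[400:]))
```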
What did they find, and why is it important?
Here are the main findings, explained simply:
- Subtraction is harder than addition. Many models got almost perfect scores on addition but did much worse on subtraction—often 30–50 percentage points lower.
- The biggest trouble is with negative answers. When $a<b$, the result of $a-b$ should be negative. In these cases, LLMs often calculated the right size (like “13”) but forgot to put the minus sign (“-13”).
- The problem isn’t just subtraction: it’s the minus sign. Even with expressions like $-b+a$ (which should also be negative in the same cases), models showed the same mistake: right magnitude, missing minus.
- Inside, the models “know” the answer should be negative. Probing showed the models’ internal states almost perfectly predicted whether the result should be negative, but the minus sign often didn’t appear in the final output. So, they represent the idea correctly but fail to express it.
- Few-shot examples help a bit, but not consistently. Showing models several examples sometimes raised accuracy, but results were uneven and depended on the model.
- Instruction-tuned models fix it. Models trained to follow instructions did much better—often near-perfect—at including the negative sign and solving subtraction correctly, even if their base versions struggled.
This matters because it shows a specific weakness in LLMs: they can “think” the right thing but fail to write it correctly, especially the negative sign. That’s risky if you rely on LLMs for calculations.
What does this mean for the future?
- Better testing: Subtraction (and negative numbers) should be a standard part of evaluating LLMs’ math skills, not just addition and multiplication.
- Training improvements: Instruction tuning—especially with math datasets that include negative answers—can make a big difference. Future training should focus on helping models correctly output signs, not just compute magnitudes.
- Safer use: If you use LLMs for math or coding, be careful with negative results. Automatic checks (like verifying the sign) may be needed.
- Research direction: There’s a gap between what models represent internally and what they output. Bridging that gap (for example, with better decoding strategies or alignment techniques) could improve reliability.
In short, the paper shows that subtraction reveals a special weakness of LLMs: they often drop the minus sign when the answer is negative. The good news is that instruction-tuned models can fix this, which points to practical ways to improve AI’s basic math skills.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
The following list summarizes what remains missing, uncertain, or unexplored in the paper, with concrete directions for future work:
- Cross-model comparability is confounded by tokenizer ranges and sample sizes (e.g., Qwen3/Gemma-2: single-digit tokens vs Llama/OLMo: up to 999). Normalize operand ranges (via multi-token inputs or standardized numeric serialization) to assess subtraction capability fairly across families.
- Operands are restricted to positive integers. Evaluate cases with negative operands (e.g., $(-a)-b$, $a-(-b)$), decimals, fractions, scientific notation, and other numeric formats to see whether sign errors persist beyond integer subtraction.
- Magnitude scope is limited (up to 999 for some models; multi-token up to 3 tokens). Test longer numbers (more digits), length generalization, and borrowing over many positions to quantify scalability and robustness.
- Only synthetic, equation-style prompts are used. Assess subtraction in realistic contexts (word problems, narratives, tables), and under formatting variations (spaces, parentheses, mixed operators, order-of-operations).
- Answer formatting may bias outputs (e.g., boxed answers). Systematically vary the required output format (plain number, JSON field, strict regex) to measure the impact on negative sign generation and parsing reliability.
- Decoding settings are narrow (greedy for pretrained; heterogeneous defaults for instruction-tuned). Compare standardized decoding modes (greedy, top-k/top-p, beam, constrained decoding) and logit controls (e.g., a bias for the “-” token) to isolate generation effects on the sign.
- Few-shot design is under-specified (example order, coverage, and ratio of negative-result cases). Probe systematically: vary shot counts, place negative examples first/last, skew the distribution, and include explicit reminders (“include the sign”) to quantify in-context learning effects on sign correctness.
- The linear probe is only applied at the final layer. Conduct layer-wise probing, representation similarity analysis, and causal interventions (e.g., activation/attention patching) to pinpoint where sign information fails to propagate to the output.
- Error analysis focuses on missing negative signs; magnitude errors are not deeply characterized. Quantify and categorize magnitude mistakes (borrow failures, digit-level misalignments, off-by-one) for $a-b$ and $-b+a$ to identify non-sign arithmetic weaknesses.
- Chain-of-thought and scratchpad prompting are not directly evaluated in pretrained models. Test whether explicit reasoning steps or external tools (calculator calls) reliably correct negative sign omissions and magnitude errors.
- Instruction-tuned vs pretrained comparisons are confounded by differing system prompts and sampling setups. Re-run instruction-tuned models under matched decoding and prompts to isolate the true effect of instruction tuning.
- The source of instruction-tuning gains is speculative and only examined for OLMo-2. Audit fine-tuning corpora across families (where possible) to correlate negative-number exposure with performance; run ablations (remove subtraction data) to test causality.
- Closed-source LLMs are not evaluated. Test widely-used proprietary models (where available) to establish whether the negative-result failure mode generalizes and whether instruction tuning consistently resolves it.
- Tokenization of negative numbers is not analyzed. Examine how “-” and negative numerals are tokenized across models, their frequency in pretraining corpora, and token-level probabilities for “-” at the first output position to diagnose sign suppression.
- Unicode symbol variants and locale effects are ignored. Evaluate the true minus sign (U+2212) vs the hyphen-minus, spacing conventions, and multilingual prompts to understand cross-script sign generation reliability.
- No targeted interventions are explored to bridge the representation-to-generation mismatch. Try logit steering, output verification-and-rewrite (use hidden states to decide the sign), or constrained decoding that enforces the sign when $a<b$.
- Multi-step expressions mixing addition and subtraction, parentheses, and precedence are not tested. Assess whether sign omissions persist in more complex arithmetic sequences and whether error compounding occurs.
- Statistical robustness is limited (three seeds; no confidence intervals). Expand seeds, report confidence intervals, and perform significance tests to quantify variance and ensure conclusions are stable.
- Dataset balancing (equal numbers of $a>b$ and $a<b$ cases) may not reflect real-world distributions. Study performance under skewed or natural distributions and check for distributional shift effects.
- The “robust parsing mechanism” is not specified or validated. Release the parser and evaluate sensitivity (e.g., whitespace, extraneous tokens, different minus signs) to rule out evaluation artifacts that might penalize correct negative outputs (a sketch of such a parser follows this list).
- Potential alignment/RLHF effects on the “-” token are unexamined. Investigate whether post-training discourages leading “-” outputs (e.g., formatting norms) and whether de-aligned or base models exhibit fewer sign omissions.
- The relationship between subtraction difficulty and known arithmetic circuits (e.g., Fourier features, positional encodings) is not probed. Test architectural/training choices (embeddings, positional encodings, curriculum) that might specifically improve sign handling in subtraction.
- Minimal supervised fixes are not attempted. Evaluate small, targeted fine-tuning or reinforcement learning to correct sign generation and test generalization across prompts, magnitudes, and numeric formats.
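As a concrete reference point for the parsing concern above, here is a hypothetical signed-number parser of the kind such an evaluation needs. The authors' actual mechanism is not specified, so every rule here (minus-variant normalization, boxed-answer preference, last-number fallback) is an assumption:

```python
import re

# Hypothetical answer parser; the paper's actual mechanism is unspecified.
MINUS_VARIANTS = ("\u2212", "\u2013")  # true minus sign, en dash

def parse_final_answer(text: str):
    """Return the last signed integer in a model's output, or None."""
    for ch in MINUS_VARIANTS:          # normalize Unicode minus look-alikes
        text = text.replace(ch, "-")
    text = text.replace(",", "")       # strip thousands separators
    boxed = re.findall(r"\\boxed\{\s*(-?\d+)\s*\}", text)
    if boxed:                          # prefer an explicit \boxed{...} answer
        return int(boxed[-1])
    nums = re.findall(r"-?\d+", text)
    return int(nums[-1]) if nums else None

assert parse_final_answer("The answer is \u221213.") == -13
assert parse_final_answer(r"So, 42 - 55 = \boxed{-13}") == -13
```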
Glossary
- Borrowing: The digit-wise regrouping process in subtraction, used when a digit in the minuend is smaller than the corresponding digit in the subtrahend. "subtraction accentuates better positional representations to facilitate accurate borrowing."
- Chain-of-thought (CoT) prompting: A prompting technique that elicits step-by-step reasoning in LLMs before producing an answer. "LLMs show strong performance on arithmetic tasks, particularly when aided by chain-of-thought (CoT) prompting and its variants \cite{wei2022chain, kojima2022large, imani-etal-2023-mathprompter, li-etal-2025-exposing}."
- Decoding stage: The generation phase in an LLM where hidden representations are converted into output tokens. "this knowledge is not faithfully transferred to the decoding stage."
- Few-shot prompting: Providing a small number of solved examples in the prompt to improve task performance. "few-shot prompting yields modest gains"
- Greedy sampling: A decoding method that selects the highest-probability token at each step without randomness. "For pretrained LLMs, we use greedy sampling with a temperature of 0"
- H100 GPUs: NVIDIA’s high-performance accelerators used for training and serving large models. "We run all experiments on 4x H100 GPUs using vLLM~\cite{kwon2023efficient} without any quantization."
- In-context learning: An LLM’s ability to learn task behavior from examples provided directly in the prompt. "allowing us to probe in-context learning abilities of LLMs."
- Instruction fine-tuning: A post-training phase where models are optimized on instruction-following datasets to improve alignment and task performance. "We speculate that these gains come from the instruction fine-tuning stage."
- Instruction tuning: Training that teaches models to follow natural language instructions, often improving reasoning and formatting. "Instruction tuning improves subtraction performance to levels comparable with addition in a zero-shot setting."
- Linear probes: Simple linear classifiers trained on model activations to test whether specific information is encoded. "we trained linear probes on the final layer activations of three representative LLMs"
- Multi-token numbers: Numeric values that are split across multiple tokens by a model’s tokenizer. "However, we also analyze multi-token numbers in Appendix \ref{sec:oov_apdx} that show similar results."
- Non-commutative operation: An operation where swapping operands changes the result. "Subtraction differs from addition in an important structural way as it is a non-commutative operation ($a-b \neq b-a$)."
- Positional representations: Internal encodings that capture the position of tokens or digits, crucial for operations like subtraction. "subtraction accentuates better positional representations to facilitate accurate borrowing."
- Probing analyses: Methods that inspect model internals (e.g., activations) to determine whether certain information is represented. "Probing analyses show that LLMs internally encode whether results should be negative, yet this information is often not reflected in generated outputs."
- Quantization: Reducing model precision (e.g., weights/activations) to save memory and compute, often at some accuracy cost. "without any quantization."
- Robust parsing mechanism: A reliable procedure for extracting structured outputs (like numbers) from free-form model text. "We extract the final numerical answer from the LLMs' generated text using a robust parsing mechanism."
- System prompt: A default or initial instruction provided to an instruction-tuned model to guide its behavior. "Additionally, we use the default system prompt for each instruction-tuned LLM and sample up to 500 new tokens."
- Temperature: A sampling parameter that controls randomness in token selection during decoding. "with a temperature of 0"
- Tokenizer Range: The contiguous set of numerical values that a tokenizer encodes as single tokens. "Tokenizer Range indicates the continuous range of numeric values that are single tokens for each LLM."
- vLLM: A system for efficient LLM serving that manages memory and attention to speed inference. "using vLLM~\cite{kwon2023efficient} without any quantization."
- Zero-shot: Performing a task without any in-prompt examples, relying only on model priors and the given instruction. "Zero-shot prompts contained only the query equation formatted under one of the five variants."
Practical Applications
Immediate Applications
Below are deployable applications that leverage the paper’s findings on LLMs’ systematic omission of negative signs in subtraction (especially when $a<b$), the demonstrated efficacy of instruction tuning, and the observed internal encoding of sign despite generation errors.
- Sign-safe arithmetic middleware for AI products
- Sectors: software, finance, ecommerce, customer support, education
- What: Wrap LLM outputs with a “calculator-first” or “sign-check” middleware that: (1) extracts operands, (2) deterministically computes the result (including sign), and (3) reconciles with the LLM’s text before returning an answer (a minimal sketch follows this item).
- Tools/products/workflows:
- “Calculator-first” tool use in agent frameworks (function-calling to Python/JS eval)
- Output validators that reject/substitute answers if sign disagrees with deterministic compute
- UI nudges that show sign and magnitude separately, e.g., “Sign: NEG, Magnitude: 123 → -123”
- Assumptions/dependencies: Access to function-calling/tool use; reliable operand parsing; works best when tasks contain explicit numeric expressions.
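A minimal sketch of such a middleware, assuming the task exposes an explicit “a - b” expression; the function name and parsing rules are illustrative, not a reference implementation:

```python
import re

def reconcile_subtraction(expr: str, llm_answer: str) -> tuple[int, bool]:
    """Deterministically compute a-b and flag sign-dropped LLM answers."""
    a, b = map(int, re.fullmatch(r"\s*(-?\d+)\s*-\s*(-?\d+)\s*", expr).groups())
    truth = a - b                              # ground truth, sign included
    nums = re.findall(r"-?\d+", llm_answer.replace("\u2212", "-"))
    reported = int(nums[-1]) if nums else None
    # Classic failure mode: correct magnitude, missing minus sign.
    sign_dropped = reported is not None and truth < 0 and reported == abs(truth)
    return truth, sign_dropped                 # always return the verified value

value, dropped = reconcile_subtraction("42 - 55", "The answer is 13.")
print(value, dropped)  # -13 True
```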
- Numeracy QA gates in MLOps and model selection
- Sectors: software (model evaluation), finance/accounting (risk), education (edtech)
- What: Add a “subtraction stress test” focusing on $a<b$ and “-b+a” templates in evaluation harnesses to gate model promotion and provider choice (one such gate is sketched below).
- Tools/products/workflows:
- Test suites with balanced $a>b$ vs $a<b$ cases and prompt variants from the paper
- “Magnitude-only” vs “magnitude-plus-sign” metrics to localize sign failures
- Assumptions/dependencies: Benchmarking infra; reproducible prompting (greedy decoding, zero temp).
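One way such a gate might look, with the two metrics kept separate. Here `query_model` is a placeholder for whatever call the harness makes (assumed to return an int), and the operand range only loosely mirrors the paper's single-token regime:

```python
import random

def subtraction_gate(query_model, n: int = 200, seed: int = 0) -> dict:
    """Balanced a<b / a>=b stress test with magnitude-only and signed metrics."""
    rng = random.Random(seed)
    mag_ok = sign_ok = 0
    for i in range(n):
        a, b = rng.randint(0, 999), rng.randint(0, 999)
        if i % 2 == 0:                 # force half the cases toward a <= b
            a, b = min(a, b), max(a, b)
        else:
            a, b = max(a, b), min(a, b)
        truth = a - b
        pred = query_model(f"{a}-{b}=")         # expected to return an int
        mag_ok += int(abs(pred) == abs(truth))  # "magnitude-only" metric
        sign_ok += int(pred == truth)           # "magnitude-plus-sign" metric
    return {"magnitude_only": mag_ok / n, "with_sign": sign_ok / n}

# Promotion gate: require signed accuracy, not just magnitude accuracy.
# assert subtraction_gate(my_model)["with_sign"] >= 0.95
```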
- Prefer instruction-tuned models for arithmetic-facing user flows
- Sectors: finance, productivity apps, assistants, customer support
- What: Route math-prone tasks to instruction-tuned variants (which the paper shows achieve near-perfect negative-sign handling).
- Tools/products/workflows:
- Policy-based router in orchestration layer (if “math-like” → use IT model; a toy router is sketched below)
- Prompt styles that require short final numeric answers (e.g., boxed or strict schemas)
- Assumptions/dependencies: Availability of instruction-tuned models; cost/latency budgets.
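A toy version of the routing policy, assuming a simple regex heuristic for “math-like” prompts; the model identifiers are placeholders:

```python
import re

MATH_PATTERN = re.compile(r"\d+\s*[-+*/]\s*\d+")  # crude "math-like" heuristic

def route(prompt: str) -> str:
    """Send arithmetic-bearing prompts to the instruction-tuned variant."""
    return "instruct-model" if MATH_PATTERN.search(prompt) else "base-model"

print(route("What is 42 - 55?"))    # -> instruct-model
print(route("Summarize this doc"))  # -> base-model
```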
- Prompt patches to mitigate sign omissions (stopgap)
- Sectors: chat assistants, education, SMB tooling
- What: Use minimal few-shot or structured prompts requiring explicit sign fields: “Sign (POS/NEG): _; Magnitude: _; Final: __” (a template is sketched below)
- Tools/products/workflows:
- Templates that require sign token first, then value (schema-constrained decoding)
- Assumptions/dependencies: Few-shot effects are modest/inconsistent per the paper; treat as mitigation, not a fix.
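A sketch of the sign-first pattern; the exact wording and field names are a stopgap convention of ours, not something prescribed by the paper:

```python
import re

# Illustrative sign-first prompt and parser; wording is a stopgap convention.
TEMPLATE = (
    "Compute {a} - {b}. Answer in exactly this format:\n"
    "Sign (POS/NEG): <POS or NEG>\n"
    "Magnitude: <non-negative integer>\n"
    "Final: <signed integer>"
)

def parse_sign_first(reply: str):
    """Recombine the explicit sign and magnitude fields into a signed int."""
    sign = re.search(r"Sign \(POS/NEG\):\s*(POS|NEG)", reply)
    mag = re.search(r"Magnitude:\s*(\d+)", reply)
    if not (sign and mag):
        return None
    value = int(mag.group(1))
    return -value if sign.group(1) == "NEG" else value

print(parse_sign_first("Sign (POS/NEG): NEG\nMagnitude: 13\nFinal: -13"))  # -13
```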
- Accounting-/finance-safe output formats and checks
- Sectors: finance, accounting, fintech
- What: Standardize outputs to display negative values redundantly (e.g., “-123 (negative)”), or in parentheses per accounting style “(123)”, then reconcile with a deterministic calculator.
- Tools/products/workflows:
- Double-entry validations; sign-consistency checks across subtotals and totals
- Assumptions/dependencies: Domain formatting conventions; availability of deterministic recompute.
- Data curation recipes for post-training
- Sectors: academia, model training orgs, open-source labs
- What: Augment instruction-tuning data with targeted negative-result subtraction, including diverse phrasings (-b+a, word problems with debts/temperatures/losses).
- Tools/products/workflows:
- Simple generators mirroring the paper’s balanced design and prompt variants (one such generator is sketched below)
- Assumptions/dependencies: Access to post-training; dataset licensing/quality control.
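A minimal generator in this spirit. The phrasings (temperature framing, the -b+a template) are illustrative, and the 50/50 negative-result balance only loosely mirrors the paper's design:

```python
import json
import random

PHRASINGS = [  # illustrative templates, including the -b+a form
    "What is {a} - {b}?",
    "Compute -{b} + {a}.",
    "The temperature was {a} degrees and fell by {b}. What is it now?",
]

def make_examples(n: int, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a, b = rng.randint(0, 999), rng.randint(0, 999)
        if rng.random() < 0.5:   # roughly half the answers should be negative
            a, b = min(a, b), max(a, b)
        prompt = rng.choice(PHRASINGS).format(a=a, b=b)
        rows.append({"prompt": prompt, "answer": str(a - b)})
    return rows

with open("subtraction_sft.jsonl", "w") as f:
    for row in make_examples(10_000):
        f.write(json.dumps(row) + "\n")
```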
- “Sign-aware” evaluation widgets for educators and learners
- Sectors: education
- What: Classroom tools that highlight the non-commutativity of subtraction and the borrowing/sign distinction; auto-detect when an LLM got the magnitude right but sign wrong.
- Tools/products/workflows:
- Edtech widgets that parse LLM answers and annotate errors as “sign-only” vs “magnitude”
- Assumptions/dependencies: Simple integration with calculators; clear pedagogy.
- Safety and compliance guidance for procurement and deployment
- Sectors: policy, compliance, regulated industries (finance, healthcare billing)
- What: Add “negative-result arithmetic” to AI procurement checklists and acceptance criteria; require tool use for numeric outputs.
- Tools/products/workflows:
- Model accept/reject criteria; audit logs for when tool use corrected a sign
- Assumptions/dependencies: Organizational buy-in; mapping to existing risk frameworks.
- Developer hygiene in code assistants: test synthesis emphasizing negative outcomes
- Sectors: software engineering
- What: Code assistants auto-generate unit tests that emphasize negative-return paths and subtraction cases to catch downstream logic bugs.
- Tools/products/workflows:
- Test generators that bias toward negative outcomes; mutation tests for sign handling (see the sketch below)
- Assumptions/dependencies: CI integration; basic LLM prompt control.
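A sketch of what such synthesized tests might look like; `balance_after_refund` is a hypothetical function under test, and the 80% bias toward negative outcomes is an arbitrary illustrative choice:

```python
import random

def balance_after_refund(balance: int, refund: int) -> int:
    return balance - refund  # hypothetical function under test

def synthesize_sign_cases(n: int = 50, seed: int = 1) -> list[tuple[int, int]]:
    """Generate test inputs biased toward negative-return paths."""
    rng = random.Random(seed)
    cases = []
    for _ in range(n):
        balance = rng.randint(0, 100)
        if rng.random() < 0.8:  # bias toward refund > balance (negative result)
            refund = rng.randint(balance + 1, balance + 100)
        else:
            refund = rng.randint(0, balance)
        cases.append((balance, refund))
    return cases

for balance, refund in synthesize_sign_cases():
    result = balance_after_refund(balance, refund)
    if refund > balance:
        assert result < 0, "negative path must preserve the sign"
```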
- Consumer guidance: math tasks routed to calculators by default
- Sectors: daily life, productivity
- What: Configure personal assistants and note-taking apps to hand off arithmetic to calculators and return signed results verbatim.
- Tools/products/workflows:
- Shortcut/automation rules: “If prompt contains arithmetic → call calculator”
- Assumptions/dependencies: App supports function calls; user consent for tool use.
Long-Term Applications
These require further research, model access changes, scaling, or productization before widespread deployment.
- Representation-to-decoding alignment methods (probe-guided decoding)
- Sectors: software, AI platforms
- What: Since models internally encode the correct sign, research methods that use internal probes to steer logits during decoding so the sign surfaces in output.
- Tools/products/workflows:
- Logit-steering hooks; activation conditioning; “representation-aligned decoding” APIs (a toy sketch follows)
- Assumptions/dependencies: Access to hidden states/weights; provider cooperation.
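A toy version of the idea, assuming white-box access to hidden states and a pre-trained linear sign probe (`probe_w`, `probe_b`); the names, bias value, and hook shape are all assumptions, not an established API:

```python
import torch

def steer_sign_logits(logits: torch.Tensor, hidden: torch.Tensor,
                      probe_w: torch.Tensor, probe_b: float,
                      minus_token_id: int, bias: float = 5.0) -> torch.Tensor:
    """If the internal sign probe predicts a negative result, boost '-'."""
    p_negative = torch.sigmoid(hidden @ probe_w + probe_b)
    if p_negative > 0.5:
        logits = logits.clone()
        logits[minus_token_id] += bias  # nudge decoding toward emitting "-"
    return logits

# Intended to run at the first answer position, e.g., from a
# logits-processor hook in a generation loop.
```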
- Architecture and tokenizer design for robust signed numbers
- Sectors: AI model development
- What: Explore tokenizers/architectures where negative numbers are unified tokens or where sign is a structured feature, improving sign retention.
- Tools/products/workflows:
- Tokenizer redesign; embedding features that disentangle sign and magnitude
- Assumptions/dependencies: Pretraining costs; backward compatibility.
- Standardized numeracy certification and regulatory tests
- Sectors: policy, compliance
- What: Establish industry standards for “numeracy reliability,” with mandatory subtraction/negative-result test suites for LLMs used in finance, billing, and auditing.
- Tools/products/workflows:
- Certification programs; third-party audits; reporting of “magnitude vs sign” error rates
- Assumptions/dependencies: Consensus on metrics; regulator adoption.
- Curriculum and training strategies for non-commutative operations and borrowing
- Sectors: academia, AI training organizations
- What: Incorporate targeted curricula that emphasize non-commutativity and borrowing into pretraining/post-training to generalize beyond subtraction (e.g., signed deltas, gradients).
- Tools/products/workflows:
- Synthetic curricula; adversarial data that forces sign fidelity under varied phrasing
- Assumptions/dependencies: Demonstrated transfer to broader reasoning tasks.
- Universal “math tool-first” agent patterns for enterprise
- Sectors: finance, operations, supply chain, analytics
- What: Enterprise agents that always call verified tools for any numeric transformations (subtractions, deltas, balance updates) with audit trails for sign correctness.
- Tools/products/workflows:
- Enterprise orchestration patterns; tool-call policies; verifiers (e.g., MathPrompter-style aggregation)
- Assumptions/dependencies: Tool coverage; latency budgets; system integration.
- Robust numeric understanding in long contexts and multi-token ranges
- Sectors: analytics, legal/contract analysis, procurement
- What: Extend reliability to multi-token numbers and long documents (the paper shows similar trends, but performance drops), ensuring sign fidelity in large-context operations (e.g., balance sheets, invoices).
- Tools/products/workflows:
- Long-context numeric resolvers; structured extraction plus verified arithmetic passes
- Assumptions/dependencies: Long-context models; scalable parsers.
- UI/UX standards for “sign-critical” decisions
- Sectors: fintech, ecommerce, logistics
- What: Design patterns that require explicit confirmation when a negative value drives an action (refunds, write-downs, penalties), with sign-aware highlighting and cross-checks.
- Tools/products/workflows:
- Dual-channel display (sign + magnitude); confirmatory dialogs; risk-tiered flows
- Assumptions/dependencies: Product changes; user training.
- Adversarial testing and security hardening against sign exploitation
- Sectors: security, marketplaces, dynamic pricing
- What: Red-team suites that probe prompt manipulations causing sign flips or omissions, preventing mispricing or fraud in LLM-mediated workflows.
- Tools/products/workflows:
- Adversarial benchmarks targeting $a<b$ cases and negative deltas; anomaly detection on signed outputs
- Assumptions/dependencies: Access to logs; feedback loops.
- Domain-adapted instruction tuning for specialized numeracy
- Sectors: healthcare billing, energy metering, accounting standards
- What: Fine-tune models with domain-specific subtraction cases (e.g., net energy consumption, dosage adjustments, write-offs) and strict output schemas.
- Tools/products/workflows:
- Domain corpora with abundant negative results; schema-constrained decoding
- Assumptions/dependencies: High-quality labeled data; governance for sensitive data.
- Research on decoding constraints and grammar-guided generation for signed numbers
- Sectors: AI platforms
- What: Grammar-based decoding that enforces the presence/absence of a minus sign based on upstream deterministic checks or structured intermediate representations.
- Tools/products/workflows:
- Constrained decoding APIs; structured intermediate outputs (Sign, Magnitude) compiled into final text (sketched below)
- Assumptions/dependencies: Widespread availability of constrained decoding; reliable intermediate extraction.
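A minimal illustration of compiling a (Sign, Magnitude) intermediate into a grammar-checked final answer, with a deterministic upstream check deciding the sign; the schema and names are illustrative:

```python
import re

ANSWER_GRAMMAR = re.compile(r"-?\d+")  # final answers must be a signed integer

def compile_final(a: int, b: int, magnitude_text: str) -> str:
    """Compose the final answer; the sign comes from a deterministic check."""
    magnitude = re.sub(r"\D", "", magnitude_text) or "0"
    sign = "-" if a < b else ""        # enforce the sign when a < b
    final = f"{sign}{magnitude}"
    assert ANSWER_GRAMMAR.fullmatch(final)
    return final

print(compile_final(42, 55, "13"))  # -> -13
```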