
Can LLMs subtract numbers? (2511.02795v1)

Published 4 Nov 2025 in cs.LG and cs.CL

Abstract: We present a systematic study of subtraction in LLMs. While prior benchmarks emphasize addition and multiplication, subtraction has received comparatively little attention despite being structurally distinct as a non-commutative operation. We evaluate eight pretrained LLMs spanning four families on addition and subtraction problems. Our experiments reveal that subtraction accuracy lags behind addition by a wide margin. We find that the errors for ($a-b$) are concentrated in cases where ($a<b$). In such cases, LLMs frequently produce the correct magnitude but omit the negative sign. Probing analyses show that LLMs internally encode whether results should be negative, yet this information is often not reflected in generated outputs. We further test well-known techniques such as few-shot learning and instruction-tuning to see if they can improve the LLMs' performance. Our results suggest that while few-shot prompting yields modest gains, the instruction-tuned models achieve near-perfect accuracies in generating the negative sign. Together, these findings provide a clearer characterization of the limitations and recoverability of LLMs' arithmetic capabilities in subtraction.

Summary

  • The paper shows that LLMs achieve significantly lower accuracy on subtraction than on addition, particularly when handling negative results.
  • It employs synthetic datasets and both zero-shot and few-shot settings to systematically benchmark eight LLMs spanning four model families.
  • Instruction tuning is shown to markedly improve subtraction performance by addressing the error of omitting the negative sign.

Subtraction in LLMs: Systematic Evaluation and Error Analysis

Motivation and Background

This paper addresses a notable gap in the evaluation of LLMs' arithmetic capabilities by focusing on subtraction, a non-commutative operation that has received limited attention compared to addition and multiplication. The authors argue that subtraction presents unique challenges for LLMs due to its reliance on operand order and the necessity for robust positional representations, especially in cases requiring borrowing. The paper systematically benchmarks eight open-source LLMs from four families (Gemma-2, Qwen3, OLMo-2, Llama-3) on both addition and subtraction tasks, with a particular emphasis on the structural and representational demands of subtraction.

Experimental Design

The evaluation protocol involves generating synthetic datasets of integer arithmetic problems, with operands sampled uniformly from the range of single-token numbers for each model's tokenizer. The datasets are balanced for a > b and a < b cases, and multiple prompt variants are used to control for spurious correlations. Both zero-shot and n-shot (few-shot) settings are considered, and inference is performed using greedy decoding for pretrained models and recommended sampling strategies for instruction-tuned variants. The paper also extends the analysis to multi-token numbers and probes the internal representations of LLMs to assess their encoding of sign information.
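
As an illustration of this protocol, the sketch below generates a balanced set of subtraction problems with a few prompt variants. The operand range, templates, and field names are assumptions for illustration, not the paper's exact artifacts.

```python
import random

# Hypothetical single-token operand range (e.g., 0-999 for some tokenizers);
# the actual range depends on each model's tokenizer.
OPERAND_RANGE = (0, 999)

# A few prompt variants to control for wording effects (illustrative templates).
PROMPT_TEMPLATES = [
    "{a} - {b} =",
    "Compute {a} - {b}.",
    "What is {a} minus {b}? Answer with a single number.",
]

def make_balanced_subtraction_set(n_pairs: int, seed: int = 0):
    """Generate n_pairs problems with a > b and n_pairs with a < b."""
    rng = random.Random(seed)
    lo, hi = OPERAND_RANGE
    examples = []
    while len(examples) < 2 * n_pairs:
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        if a == b:
            continue  # skip ties so every example is clearly positive or negative
        # Alternate between positive-result and negative-result cases to stay balanced.
        want_negative = len(examples) % 2 == 1
        if (a < b) != want_negative:
            a, b = b, a
        template = rng.choice(PROMPT_TEMPLATES)
        examples.append({
            "prompt": template.format(a=a, b=b),
            "a": a,
            "b": b,
            "target": a - b,  # ground truth, including sign
        })
    return examples

if __name__ == "__main__":
    for ex in make_balanced_subtraction_set(3):
        print(ex["prompt"], "->", ex["target"])
```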

Key Findings

Subtraction vs. Addition Performance

Across all evaluated LLMs, subtraction accuracy is consistently and substantially lower than addition accuracy under matched complexity. For example, Qwen3-8B and OLMo-2-32B achieve near-perfect scores on addition but only approximately 57% and 56% on subtraction, respectively. Smaller models perform poorly on both tasks, but the performance gap between addition and subtraction is pronounced in larger models.

Asymmetry in Operand Order

A strong asymmetry is observed in subtraction performance: LLMs are highly accurate when a > b (result is positive) but their accuracy collapses when a < b (result is negative). In several cases, accuracy drops below 5% for negative results, even in models that are otherwise highly competent at addition and positive subtraction.

Error Analysis: Omission of Negative Sign

The dominant failure mode for a < b is the omission of the negative sign. When accuracy is measured without regard to sign, scores increase dramatically, often exceeding 90%. This indicates that LLMs frequently compute the correct magnitude but fail to output the negative sign, suggesting a disconnect between internal representation and output generation.
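
A minimal sketch of the two metrics implied here, exact (sign-sensitive) accuracy versus magnitude-only accuracy; the regex-based answer parser and record format are assumptions for illustration.

```python
import re

def parse_last_integer(text: str):
    """Pull the last signed integer out of free-form model output (simplistic parser)."""
    matches = re.findall(r"-?\d+", text)
    return int(matches[-1]) if matches else None

def sign_and_magnitude_accuracy(records):
    """Return (exact accuracy, magnitude-only accuracy) over (target, output_text) pairs."""
    exact = magnitude = total = 0
    for target, output_text in records:
        pred = parse_last_integer(output_text)
        total += 1
        if pred is None:
            continue
        exact += pred == target
        magnitude += abs(pred) == abs(target)  # credit correct magnitude even if the sign is dropped
    return exact / total, magnitude / total

if __name__ == "__main__":
    demo = [(-13, "The answer is 13"), (-7, "-7"), (42, "42")]
    print(sign_and_magnitude_accuracy(demo))  # exact ~0.67, magnitude-only 1.0
```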

Probing Internal Representations

Linear probes trained on the final layer activations of representative LLMs (Gemma-2 9B, Llama-3.1-8B, Qwen3-8B) achieve near-perfect accuracy in predicting whether the result should be positive or negative. This demonstrates that LLMs internally encode the sign information, but this knowledge is not reliably surfaced in the generated output.
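
A minimal sketch of such a probe, assuming final-layer activations (e.g., the hidden state at the last prompt token) have already been extracted into a matrix X with binary sign labels y. The extraction step and the array contents here are placeholders, and scikit-learn's logistic regression stands in for whichever linear probe the authors used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X: (n_examples, hidden_size) final-layer activations (assumed precomputed elsewhere).
# y: 1 if the true result a - b is negative, else 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4096))   # random placeholder so the demo runs standalone
y = rng.integers(0, 2, size=1000)   # placeholder sign labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000)  # a linear probe: one linear layer plus sigmoid
probe.fit(X_train, y_train)
# Near-perfect on real activations per the paper; roughly chance on these random placeholders.
print("probe accuracy:", probe.score(X_test, y_test))
```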

Few-shot and Instruction-tuned Improvements

Few-shot prompting yields modest and inconsistent improvements in subtraction accuracy for pretrained LLMs, with some models showing moderate gains and others remaining unstable. In contrast, instruction-tuned LLMs achieve near-perfect accuracy on subtraction, including cases where a < b. The authors attribute this improvement to exposure to subtraction data during instruction fine-tuning, as confirmed by analysis of the OLMo-2 instruction-tuning dataset.
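
As a concrete illustration, a few-shot prompt for this setting might be assembled as below; the exemplar format and the deliberate mix of positive- and negative-result examples are illustrative choices, not necessarily the paper's exact prompts.

```python
def build_few_shot_prompt(query_a: int, query_b: int, shots=None) -> str:
    """Prepend solved exemplars (including negative results) to the query equation."""
    if shots is None:
        # Mix of positive- and negative-result exemplars so the sign pattern is shown in-context.
        shots = [(9, 4), (3, 8), (12, 5), (6, 14)]
    lines = [f"{a} - {b} = {a - b}" for a, b in shots]
    lines.append(f"{query_a} - {query_b} =")
    return "\n".join(lines)

print(build_few_shot_prompt(7, 15))
# 9 - 4 = 5
# 3 - 8 = -5
# 12 - 5 = 7
# 6 - 14 = -8
# 7 - 15 =
```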

Implications and Future Directions

The findings highlight subtraction as an inherently more challenging task for LLMs than addition, primarily due to the non-commutative nature of the operation and the requirement to generate negative results. The systematic omission of the negative sign, despite correct internal encoding, points to a representational-generation mismatch that warrants further investigation. Instruction tuning is shown to be highly effective in bridging this gap, suggesting that targeted data augmentation and fine-tuning strategies can substantially improve LLMs' arithmetic capabilities.

From a practical perspective, these results underscore the importance of including subtraction (and negative-result cases) as a standard diagnostic in LLM evaluation benchmarks. Theoretical implications include the need to better understand the mechanisms by which LLMs transfer internal representations to output, particularly for tasks involving sign and magnitude. Future research should explore architectural modifications, decoding strategies, and training objectives that explicitly address this mismatch. Additionally, the extension of these findings to multi-token numbers and more complex arithmetic operations remains an open area for further study.

Conclusion

This paper provides a comprehensive analysis of subtraction in LLMs, revealing a persistent challenge in generating negative results despite correct internal computation. The observed representational-generation disconnect and the efficacy of instruction tuning have significant implications for both the evaluation and improvement of LLMs' numerical reasoning abilities. Subtraction should be prioritized as a diagnostic tool in future research, and targeted fine-tuning approaches offer a promising path toward closing the gap in arithmetic competence.

Explain it Like I'm 14

Overview

This paper asks a simple-sounding question: Can LLMs—the AI systems that write and read text—correctly do subtraction? People have studied how well LLMs add and multiply, but subtraction hasn’t been checked as carefully. The authors find that subtraction is surprisingly hard for these models, especially when the answer should be negative (below zero).

What questions did the researchers try to answer?

The paper focuses on clear, basic questions:

  • Do LLMs solve subtraction as well as they solve addition?
  • Does the order of the numbers matter (for example, a - b versus b - a)?
  • Where do models make mistakes—are they getting the size of the answer wrong, or just missing the minus sign?
  • Can simple tricks like giving examples (few-shot prompting) or using instruction-tuned models fix these problems?

How did they test this?

The team tested eight different LLMs from four families (Gemma-2, Qwen3, OLMo-2, Llama-3). They created many subtraction and addition problems with controlled settings so the tests were fair.

To make this easy to understand, here’s what they did:

  • Tokens: LLMs read text in pieces called “tokens.” Some numbers (like 7 or 42) may be a single token; long numbers can be multiple tokens. The main tests used single-token numbers to keep things simple.
  • Prompts: They asked the models to solve problems using five different styles of instructions (from short equations to more wordy directions) to make sure the results weren’t just due to wording.
  • Zero-shot vs few-shot:
    • Zero-shot means the model is asked the question with no examples.
    • Few-shot means the model is shown a few solved examples first.
  • Instruction-tuned models: These are LLMs that were trained to follow instructions better (like “explain your answer” or “give the final number”). This extra training often includes math problems.
  • Probing: Think of this like “reading the model’s mind.” The researchers looked at the model’s internal signals to see if it “knew” the answer should be negative, even if it didn’t write the minus sign. They did this with simple classifiers called linear probes.

They also checked the same patterns with multi-token numbers (longer numbers) and a related expression, -b+a, to see if the problem was specifically subtraction or just producing negative numbers.

What did they find, and why is it important?

Here are the main findings, explained simply:

  • Subtraction is harder than addition. Many models got almost perfect scores on addition but did much worse on subtraction—often 30–50 percentage points lower.
  • The biggest trouble is with negative answers. When a < b, the result of a - b should be negative. In these cases, LLMs often calculated the right size (like “13”) but forgot to put the minus sign (“-13”).
  • The problem isn’t just subtraction—it’s the minus sign. Even with expressions like -b+a (which should also be negative in the same cases), models showed the same mistake: right magnitude, missing minus.
  • Inside, the models “know” the answer should be negative. Probing showed the models’ internal states almost perfectly predicted whether the result should be negative, but the minus sign often didn’t appear in the final output. So, they represent the idea correctly but fail to express it.
  • Few-shot examples help a bit, but not consistently. Showing models several examples sometimes raised accuracy, but results were uneven and depended on the model.
  • Instruction-tuned models fix it. Models trained to follow instructions did much better—often near-perfect—at including the negative sign and solving subtraction correctly, even if their base versions struggled.

This matters because it shows a specific weakness in LLMs: they can “think” the right thing but fail to write it correctly, especially the negative sign. That’s risky if you rely on LLMs for calculations.

What does this mean for the future?

  • Better testing: Subtraction (and negative numbers) should be a standard part of evaluating LLMs’ math skills, not just addition and multiplication.
  • Training improvements: Instruction tuning—especially with math datasets that include negative answers—can make a big difference. Future training should focus on helping models correctly output signs, not just compute magnitudes.
  • Safer use: If you use LLMs for math or coding, be careful with negative results. Automatic checks (like verifying the sign) may be needed.
  • Research direction: There’s a gap between what models represent internally and what they output. Bridging that gap (for example, with better decoding strategies or alignment techniques) could improve reliability.

In short, the paper shows that subtraction reveals a special weakness of LLMs: they often drop the minus sign when the answer is negative. The good news is that instruction-tuned models can fix this, which points to practical ways to improve AI’s basic math skills.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

The following list summarizes what remains missing, uncertain, or unexplored in the paper, with concrete directions for future work:

  • Cross-model comparability is confounded by tokenizer ranges and sample sizes (e.g., Qwen3/Gemma-2: single-digit tokens vs Llama/OLMo: up to 999). Normalize operand ranges (via multi-token inputs or standardized numeric serialization) to assess subtraction capability fairly across families.
  • Operands are restricted to positive integers. Evaluate cases with negative operands (e.g., a - (-b), (-a) - b), decimals, fractions, scientific notation, and other numeric formats to see whether sign errors persist beyond integer subtraction.
  • Magnitude scope is limited (up to 999 for some models; multi-token up to 3 tokens). Test longer numbers (more digits), length generalization, and borrowing over many positions to quantify scalability and robustness.
  • Only synthetic, equation-style prompts are used. Assess subtraction in realistic contexts (word problems, narratives, tables), and under formatting variations (spaces, parentheses, mixed operators, order-of-operations).
  • Answer formatting may bias outputs (e.g., boxed answers). Systematically vary the required output format (plain number, JSON field, strict regex) to measure the impact on negative sign generation and parsing reliability.
  • Decoding settings are narrow (greedy for pretrained; heterogeneous defaults for instruction-tuned). Compare standardized decoding modes (greedy, top-k/top-p, beam, constrained decoding) and logit controls (e.g., bias for -) to isolate generation effects on the sign.
  • Few-shot design is under-specified (example order, coverage, and ratio of negative-result cases). Probe systematically: vary shot counts, place negative examples first/last, skew the distribution, and include explicit reminders (“include the sign”) to quantify in-context learning effects on sign correctness.
  • The linear probe is only applied at the final layer. Conduct layer-wise probing, representation similarity analysis, and causal interventions (e.g., activation/attention patching) to pinpoint where sign information fails to propagate to the output.
  • Error analysis focuses on missing negative signs; magnitude errors are not deeply characterized. Quantify and categorize magnitude mistakes (borrow failures, digit-level misalignments, off-by-one) for a > b and a < b to identify non-sign arithmetic weaknesses.
  • Chain-of-thought and scratchpad prompting are not directly evaluated in pretrained models. Test whether explicit reasoning steps or external tools (calculator calls) reliably correct negative sign omissions and magnitude errors.
  • Instruction-tuned vs pretrained comparisons are confounded by differing system prompts and sampling setups. Re-run instruction-tuned models under matched decoding and prompts to isolate the true effect of instruction tuning.
  • The source of instruction-tuning gains is speculative and only examined for OLMo-2. Audit fine-tuning corpora across families (where possible) to correlate negative-number exposure with performance; run ablations (remove subtraction data) to test causality.
  • Closed-source LLMs are not evaluated. Test widely-used proprietary models (where available) to establish whether the negative-result failure mode generalizes and whether instruction tuning consistently resolves it.
  • Tokenization of negative numbers is not analyzed. Examine how "-" and negative numerals are tokenized across models, their frequency in pretraining corpora, and token-level probabilities for "-" at the first output position to diagnose sign suppression; a minimal tokenization check is sketched after this list.
  • Unicode symbol variants and locale effects are ignored. Evaluate the minus sign vs hyphen-minus, spacing conventions, and multilingual prompts to understand cross-script sign generation reliability.
  • No targeted interventions are explored to bridge representation-to-generation mismatch. Try logit steering, output-verification-and-rewrite (use hidden states to decide sign), or constrained decoding that enforces sign when a < b.
  • Multi-step expressions mixing addition and subtraction, parentheses, and precedence are not tested. Assess whether sign omissions persist in more complex arithmetic sequences and whether error compounding occurs.
  • Statistical robustness is limited (three seeds; no confidence intervals). Expand seeds, report confidence intervals, and perform significance tests to quantify variance and ensure conclusions are stable.
  • Dataset balancing (equal a > b and a < b) may not reflect real-world distributions. Study performance under skewed or natural distributions and check for distributional shift effects.
  • The “robust parsing mechanism” is not specified or validated. Release the parser and evaluate sensitivity (e.g., whitespace, extraneous tokens, different minus signs) to rule out evaluation artifacts that might penalize correct negative outputs.
  • Potential alignment/RLHF effects on the - token are unexamined. Investigate whether post-training discourages leading - outputs (e.g., formatting norms) and whether de-aligned or base models exhibit fewer sign omissions.
  • Relationship between subtraction difficulty and known arithmetic circuits (e.g., Fourier features, positional encodings) is not probed. Test architectural/training choices (embeddings, positional encodings, curriculum) that might specifically improve sign handling in subtraction.
  • Minimal supervised fixes are not attempted. Evaluate small, targeted fine-tuning or reinforcement learning to correct sign generation and test generalization across prompts, magnitudes, and numeric formats.
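
The tokenization audit mentioned above could start with something as simple as the following sketch, using Hugging Face's AutoTokenizer. The checkpoint names are examples only (some require access approval), and interpreting the resulting token lists is left to the analyst.

```python
from transformers import AutoTokenizer

# Example checkpoints; any causal LM tokenizer can be substituted.
MODELS = ["meta-llama/Llama-3.1-8B", "Qwen/Qwen3-8B"]

for name in MODELS:
    tok = AutoTokenizer.from_pretrained(name)
    # Compare bare digits, hyphen-minus, leading space, and the Unicode minus sign (U+2212).
    for text in ["13", "-13", " -13", "\u221213"]:
        pieces = tok.tokenize(text)
        print(f"{name:35s} {text!r:10s} -> {pieces}")
```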

Glossary

  • Borrowing: The carry-over process used in digit-wise subtraction when a digit in the minuend is smaller than the corresponding digit in the subtrahend. "subtraction accentuates better positional representations to facilitate accurate borrowing."
  • Chain-of-thought (CoT) prompting: A prompting technique that elicits step-by-step reasoning in LLMs before producing an answer. "LLMs show strong performance on arithmetic tasks, particularly when aided by chain-of-thought (CoT) prompting and its variants \cite{wei2022chain, kojima2022large, imani-etal-2023-mathprompter, li-etal-2025-exposing}."
  • Decoding stage: The generation phase in an LLM where hidden representations are converted into output tokens. "this knowledge is not faithfully transferred to the decoding stage."
  • Few-shot prompting: Providing a small number of solved examples in the prompt to improve task performance. "few-shot prompting yields modest gains"
  • Greedy sampling: A decoding method that selects the highest-probability token at each step without randomness. "For pretrained LLMs, we use greedy sampling with a temperature of 0"
  • H100 GPUs: NVIDIA’s high-performance accelerators used for training and serving large models. "We run all experiments on 4x H100 GPUs using vLLM~\cite{kwon2023efficient} without any quantization."
  • In-context learning: An LLM’s ability to learn task behavior from examples provided directly in the prompt. "allowing us to probe in-context learning abilities of LLMs."
  • Instruction fine-tuning: A post-training phase where models are optimized on instruction-following datasets to improve alignment and task performance. "We speculate that these gains come from the instruction fine-tuning stage."
  • Instruction tuning: Training that teaches models to follow natural language instructions, often improving reasoning and formatting. "Instruction tuning improves subtraction performance to levels comparable with addition in a zero-shot setting."
  • Linear probes: Simple linear classifiers trained on model activations to test whether specific information is encoded. "we trained linear probes on the final layer activations of three representative LLMs"
  • Multi-token numbers: Numeric values that are split across multiple tokens by a model’s tokenizer. "However, we also analyze multi-token numbers in Appendix \ref{sec:oov_apdx} that show similar results."
  • Non-commutative operation: An operation where swapping operands changes the result. "Subtraction differs from addition in an important structural way as it is a non-commutative operation ($a-b \neq b-a$)."
  • Positional representations: Internal encodings that capture the position of tokens or digits, crucial for operations like subtraction. "subtraction accentuates better positional representations to facilitate accurate borrowing."
  • Probing analyses: Methods that inspect model internals (e.g., activations) to determine whether certain information is represented. "Probing analyses show that LLMs internally encode whether results should be negative, yet this information is often not reflected in generated outputs."
  • Quantization: Reducing model precision (e.g., weights/activations) to save memory and compute, often at some accuracy cost. "without any quantization."
  • Robust parsing mechanism: A reliable procedure for extracting structured outputs (like numbers) from free-form model text. "We extract the final numerical answer from the LLMs' generated text using a robust parsing mechanism."
  • System prompt: A default or initial instruction provided to an instruction-tuned model to guide its behavior. "Additionally, we use the default system prompt for each instruction-tuned LLM and sample up to 500 new tokens."
  • Temperature: A sampling parameter that controls randomness in token selection during decoding. "with a temperature of 0"
  • Tokenizer Range: The contiguous set of numerical values that a tokenizer encodes as single tokens. "Tokenizer Range indicates the continuous range of numeric values that are single tokens for each LLM."
  • vLLM: A system for efficient LLM serving that manages memory and attention to speed inference. "using vLLM~\cite{kwon2023efficient} without any quantization."
  • Zero-shot: Performing a task without any in-prompt examples, relying only on model priors and the given instruction. "Zero-shot prompts contained only the query equation formatted under one of the five variants."

Practical Applications

Immediate Applications

Below are deployable applications that leverage the paper’s findings on LLMs’ systematic omission of negative signs in subtraction (especially when a < b), the demonstrated efficacy of instruction tuning, and the observed internal encoding of sign despite generation errors.

  • Sign-safe arithmetic middleware for AI products
    • Sectors: software, finance, ecommerce, customer support, education
    • What: Wrap LLM outputs with a “calculator-first” or “sign-check” middleware that: (1) extracts operands, (2) deterministically computes the result (including sign), and (3) reconciles with the LLM’s text before returning an answer; a minimal sketch follows at the end of this list.
    • Tools/products/workflows:
    • “Calculator-first” tool use in agent frameworks (function-calling to Python/JS eval)
    • Output validators that reject/substitute answers if sign disagrees with deterministic compute
    • UI nudges that show sign and magnitude separately, e.g., “Sign: NEG, Magnitude: 123 → -123”
    • Assumptions/dependencies: Access to function-calling/tool use; reliable operand parsing; works best when tasks contain explicit numeric expressions.
  • Numeracy QA gates in MLOps and model selection
    • Sectors: software (model evaluation), finance/accounting (risk), education (edtech)
    • What: Add a “subtraction stress test” focusing on a < b and “-b+a” templates in evaluation harnesses to gate model promotion and provider choice.
    • Tools/products/workflows:
    • Test suites with balanced a > b vs a < b cases and prompt variants from the paper
    • “Magnitude-only” vs “magnitude-plus-sign” metrics to localize sign failures
    • Assumptions/dependencies: Benchmarking infra; reproducible prompting (greedy decoding, zero temp).
  • Prefer instruction-tuned models for arithmetic-facing user flows
    • Sectors: finance, productivity apps, assistants, customer support
    • What: Route math-prone tasks to instruction-tuned variants (which the paper shows achieve near-perfect negative-sign handling).
    • Tools/products/workflows:
    • Policy-based router in orchestration layer (if “math-like” → use IT model)
    • Prompt styles that require short final numeric answers (e.g., boxed or strict schemas)
    • Assumptions/dependencies: Availability of instruction-tuned models; cost/latency budgets.
  • Prompt patches to mitigate sign omissions (stopgap)
    • Sectors: chat assistants, education, SMB tooling
    • What: Use minimal few-shot or structured prompts requiring explicit sign fields: “Sign (POS/NEG): _; Magnitude: _; Final: __”
    • Tools/products/workflows:
    • Templates that require sign token first, then value (schema-constrained decoding)
    • Assumptions/dependencies: Few-shot effects are modest/inconsistent per the paper; treat as mitigation, not a fix.
  • Accounting-/finance-safe output formats and checks
    • Sectors: finance, accounting, fintech
    • What: Standardize outputs to display negative values redundantly (e.g., “-123 (negative)”), or in parentheses per accounting style “(123)”, then reconcile with a deterministic calculator.
    • Tools/products/workflows:
    • Double-entry validations; sign-consistency checks across subtotals and totals
    • Assumptions/dependencies: Domain formatting conventions; availability of deterministic recompute.
  • Data curation recipes for post-training
    • Sectors: academia, model training orgs, open-source labs
    • What: Augment instruction-tuning data with targeted negative-result subtraction, including diverse phrasings (-b+a, word problems with debts/temperatures/losses).
    • Tools/products/workflows:
    • Simple generators mirroring the paper’s balanced design and prompt variants
    • Assumptions/dependencies: Access to post-training; dataset licensing/quality control.
  • “Sign-aware” evaluation widgets for educators and learners
    • Sectors: education
    • What: Classroom tools that highlight the non-commutativity of subtraction and the borrowing/sign distinction; auto-detect when an LLM got the magnitude right but sign wrong.
    • Tools/products/workflows:
    • Edtech widgets that parse LLM answers and annotate errors as “sign-only” vs “magnitude”
    • Assumptions/dependencies: Simple integration with calculators; clear pedagogy.
  • Safety and compliance guidance for procurement and deployment
    • Sectors: policy, compliance, regulated industries (finance, healthcare billing)
    • What: Add “negative-result arithmetic” to AI procurement checklists and acceptance criteria; require tool use for numeric outputs.
    • Tools/products/workflows:
    • Model accept/reject criteria; audit logs for when tool use corrected a sign
    • Assumptions/dependencies: Organizational buy-in; mapping to existing risk frameworks.
  • Developer hygiene in code assistants: test synthesis emphasizing a < b
    • Sectors: software engineering
    • What: Code assistants auto-generate unit tests that emphasize negative-return paths and a < b subtraction cases to catch downstream logic bugs.
    • Tools/products/workflows:
    • Test generators that bias toward negative outcomes; mutation tests for sign handling
    • Assumptions/dependencies: CI integration; basic LLM prompt control.
  • Consumer guidance: math tasks routed to calculators by default
    • Sectors: daily life, productivity
    • What: Configure personal assistants and note-taking apps to hand off arithmetic to calculators and return signed results verbatim.
    • Tools/products/workflows:
    • Shortcut/automation rules: “If prompt contains arithmetic → call calculator”
    • Assumptions/dependencies: App supports function calls; user consent for tool use.
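
For the sign-safe middleware idea above, a minimal sketch of the reconcile step might look like the following; the function names, the regex-based operand extraction, and the repair policy are all assumptions for illustration rather than a production design.

```python
import re

def extract_subtraction(prompt: str):
    """Find the first 'a - b' expression in the user prompt (deliberately narrow for illustration)."""
    m = re.search(r"(-?\d+)\s*-\s*(-?\d+)", prompt)
    return (int(m.group(1)), int(m.group(2))) if m else None

def reconcile_answer(prompt: str, llm_text: str) -> str:
    """Deterministically recompute the subtraction and override the LLM if sign or value disagrees."""
    operands = extract_subtraction(prompt)
    if operands is None:
        return llm_text  # nothing to verify; pass the model output through
    expected = operands[0] - operands[1]
    m = re.search(r"-?\d+", llm_text)
    if m is None or int(m.group()) != expected:
        return f"{expected}  (corrected by calculator check)"
    return llm_text

# Example: the model returns the right magnitude but drops the minus sign.
print(reconcile_answer("What is 7 - 20?", "The answer is 13"))  # -> "-13  (corrected by calculator check)"
```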

Long-Term Applications

These require further research, model access changes, scaling, or productization before widespread deployment.

  • Representation-to-decoding alignment methods (probe-guided decoding)
    • Sectors: software, AI platforms
    • What: Since models internally encode the correct sign, research methods that use internal probes to steer logits during decoding so the sign surfaces in output.
    • Tools/products/workflows:
    • Logit-steering hooks; activation conditioning; “representation-aligned decoding” APIs
    • Assumptions/dependencies: Access to hidden states/weights; provider cooperation.
  • Architecture and tokenizer design for robust signed numbers
    • Sectors: AI model development
    • What: Explore tokenizers/architectures where negative numbers are unified tokens or where sign is a structured feature, improving sign retention.
    • Tools/products/workflows:
    • Tokenizer redesign; embedding features that disentangle sign and magnitude
    • Assumptions/dependencies: Pretraining costs; backward compatibility.
  • Standardized numeracy certification and regulatory tests
    • Sectors: policy, compliance
    • What: Establish industry standards for “numeracy reliability,” with mandatory subtraction/negative-result test suites for LLMs used in finance, billing, and auditing.
    • Tools/products/workflows:
    • Certification programs; third-party audits; reporting of “magnitude vs sign” error rates
    • Assumptions/dependencies: Consensus on metrics; regulator adoption.
  • Curriculum and training strategies for non-commutative operations and borrowing
    • Sectors: academia, AI training organizations
    • What: Incorporate targeted curricula that emphasize non-commutativity and borrowing into pretraining/post-training to generalize beyond subtraction (e.g., signed deltas, gradients).
    • Tools/products/workflows:
    • Synthetic curricula; adversarial data that forces sign fidelity under varied phrasing
    • Assumptions/dependencies: Demonstrated transfer to broader reasoning tasks.
  • Universal “math tool-first” agent patterns for enterprise
    • Sectors: finance, operations, supply chain, analytics
    • What: Enterprise agents that always call verified tools for any numeric transformations (subtractions, deltas, balance updates) with audit trails for sign correctness.
    • Tools/products/workflows:
    • Enterprise orchestration patterns; tool-call policies; verifiers (e.g., MathPrompter-style aggregation)
    • Assumptions/dependencies: Tool coverage; latency budgets; system integration.
  • Robust numeric understanding in long contexts and multi-token ranges
    • Sectors: analytics, legal/contract analysis, procurement
    • What: Extend reliability to multi-token numbers and long documents (the paper shows similar trends, but performance drops), ensuring sign fidelity in large-context operations (e.g., balance sheets, invoices).
    • Tools/products/workflows:
    • Long-context numeric resolvers; structured extraction plus verified arithmetic passes
    • Assumptions/dependencies: Long-context models; scalable parsers.
  • UI/UX standards for “sign-critical” decisions
    • Sectors: fintech, ecommerce, logistics
    • What: Design patterns that require explicit confirmation when a negative value drives an action (refunds, write-downs, penalties), with sign-aware highlighting and cross-checks.
    • Tools/products/workflows:
    • Dual-channel display (sign + magnitude); confirmatory dialogs; risk-tiered flows
    • Assumptions/dependencies: Product changes; user training.
  • Adversarial testing and security hardening against sign exploitation
    • Sectors: security, marketplaces, dynamic pricing
    • What: Red-team suites that probe prompt manipulations causing sign flips or omissions, preventing mispricing or fraud in LLM-mediated workflows.
    • Tools/products/workflows:
    • Adversarial benchmarks targeting a < b and negative deltas; anomaly detection on signed outputs
    • Assumptions/dependencies: Access to logs; feedback loops.
  • Domain-adapted instruction tuning for specialized numeracy
    • Sectors: healthcare billing, energy metering, accounting standards
    • What: Fine-tune models with domain-specific subtraction cases (e.g., net energy consumption, dosage adjustments, write-offs) and strict output schemas.
    • Tools/products/workflows:
    • Domain corpora with abundant negative results; schema-constrained decoding
    • Assumptions/dependencies: High-quality labeled data; governance for sensitive data.
  • Research on decoding constraints and grammar-guided generation for signed numbers
    • Sectors: AI platforms
    • What: Grammar-based decoding that enforces the presence/absence of a minus sign based on upstream deterministic checks or structured intermediate representations; a minimal sketch follows after this list.
    • Tools/products/workflows:
    • Constrained decoding APIs; structured intermediate outputs (Sign, Magnitude) compiled into final text
    • Assumptions/dependencies: Widespread availability of constrained decoding; reliable intermediate extraction.
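
As a sketch of the constrained-decoding idea, the following shows how a logit mask could force or forbid a leading minus token once a deterministic check has established the correct sign. The token id, vocabulary size, and hook point are placeholders, since real integration depends on the serving stack.

```python
import numpy as np

MINUS_TOKEN_ID = 12  # placeholder id for the "-" token in a hypothetical vocabulary

def constrain_first_token(logits: np.ndarray, result_should_be_negative: bool) -> np.ndarray:
    """Mask the logits for the first generated token so the sign matches a deterministic check."""
    constrained = logits.copy()
    if result_should_be_negative:
        # Force the minus sign by suppressing every other token.
        mask = np.full_like(constrained, -np.inf)
        mask[MINUS_TOKEN_ID] = constrained[MINUS_TOKEN_ID]
        return mask
    # Result is non-negative: forbid a leading minus.
    constrained[MINUS_TOKEN_ID] = -np.inf
    return constrained

# Toy demo with a 20-token vocabulary.
logits = np.random.default_rng(0).normal(size=20)
print(int(np.argmax(constrain_first_token(logits, True))))   # always MINUS_TOKEN_ID (12)
print(int(np.argmax(constrain_first_token(logits, False))))  # never 12
```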
