Adversarial Math Word Problems
- Adversarial math word problems are specially constructed problem instances that introduce subtle linguistic and numerical modifications to challenge automated reasoning systems.
- Construction methodologies such as numeric perturbation, linguistic reordering, and distractor addition have reduced solver accuracy by amounts ranging from roughly 26% (relative) to over 40 percentage points, and have achieved attack success rates of up to 100% in some cases.
- Defense strategies like adversarial training, structural graph modeling, and chain-of-thought approaches improve resilience by 7-8%, yet achieving human-level consistency remains a challenge.
Adversarial math word problems are mathematical word problem instances intentionally constructed or perturbed to confound automated solvers, particularly neural solvers and large language models (LLMs), by disrupting their reliance on shallow heuristics, surface-level cues, or fixed reasoning patterns. The current research trajectory on adversarial math word problems spans robust benchmarking, data augmentation, attack and defense methodologies, and foundational studies of the semantic and symbolic underpinnings of mathematical reasoning in natural language.
1. Defining Adversarial Math Word Problems
Adversarial math word problems (MWPs) present challenges by introducing minimal yet potent modifications—lexical, syntactic, or numerical—that preserve the core mathematical solution for a human but degrade automated solver performance. These adversarial instances may involve paraphrasing, question reordering, extraneous or reordered information, relevant and irrelevant distractors, or perturbation of numeric values while maintaining solution validity. Their purpose is to reveal, and quantitatively evaluate, the limits of automated reasoning systems’ generalization and true understanding beyond memorized language patterns or dataset bias (Kumar et al., 2021, Xie et al., 27 Feb 2024, Anantheswaran et al., 30 May 2024).
2. Adversarial Construction Methodologies
A variety of adversarial construction strategies have been employed:
- Linguistic Reordering and Paraphrasing: Reordering the position of the question within the narrative or generating linguistically diverse paraphrases of problem statements can lead to marked performance drops in neural solvers. For instance, paraphrased or reordered MWPs have been shown to reduce solver accuracy by over 40 percentage points (Kumar et al., 2021).
- Numeric Perturbation via Abstract Syntax Trees (ASTs): Directly editing numeric literals in the computational AST of a solution, while preserving the operational structure, yields adversarial variants. Controlled constraints preserve numerical types, digit counts, or order of magnitude, ensuring logical coherence. These numeric attacks have proven more effective than baseline rephrasing, achieving attack success rates of up to 100% on some LLMs (Xie et al., 27 Feb 2024). A minimal code sketch of this style of perturbation appears after this list.
- Addition of Irrelevant Numerical Variables: Frameworks such as ProbleMathic generate adversarial MWPs by adding spurious numeric values with disjoint units (e.g., temperature in a length/volume MWP) that act purely as distractors. This strategy creates structurally invariant but semantically noisier problems, resulting in relative accuracy drops of ~26% for strong LLMs (Anantheswaran et al., 30 May 2024).
- Reverse Operation and Logic-level Augmentation: Approaches like RODA swap the roles of known and unknown quantities, and invert the direction of solution equations, thereby introducing new “knowledge points” and alternative solution strategies. This logic-level augmentation both enriches training signals and increases robustness to adversarial presentation (Liu et al., 2020).
- Extension of Context: The E-GSM dataset and CoLeG metrics assess adversarial resilience as the narrative context length increases. When extraneous but semantically consistent information is embedded, solvers’ performance typically degrades unless models are explicitly tuned for “condition retrieval” (Xu et al., 23 May 2024).
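To make the AST-based numeric attack concrete, the sketch below uses Python's built-in `ast` module to rewrite integer literals in a solution expression while preserving its operator structure and each literal's digit count. It is an illustrative reconstruction of the general idea, not the exact pipeline of Xie et al. (27 Feb 2024); the specific constraints and the `perturb_solution` helper are assumptions for demonstration.

```python
import ast
import random

class NumericPerturber(ast.NodeTransformer):
    """Rewrite positive integer literals in a parsed solution expression
    while preserving their digit count (order of magnitude), leaving the
    operator structure of the AST untouched."""

    def visit_Constant(self, node):
        if isinstance(node.value, int) and node.value > 0:
            digits = len(str(node.value))
            lo, hi = 10 ** (digits - 1), 10 ** digits - 1
            return ast.copy_location(ast.Constant(random.randint(lo, hi)), node)
        return node

def perturb_solution(expr: str) -> str:
    """Return a numerically perturbed variant of a solution expression,
    e.g. '12 * 4 + 7' -> something like '58 * 3 + 9' (structure preserved)."""
    tree = ast.parse(expr, mode="eval")
    tree = ast.fix_missing_locations(NumericPerturber().visit(tree))
    return ast.unparse(tree)

if __name__ == "__main__":
    random.seed(0)
    print(perturb_solution("12 * 4 + 7"))
```

Because only `Constant` nodes are rewritten, the perturbed expression remains executable, so a ground-truth answer for the adversarial variant can be recomputed automatically.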
3. Evaluations and Robustness Metrics
Robustness of MWP solvers to adversarial attacks is measured primarily via:
- Original Accuracy (OA): Accuracy on unperturbed problems.
- Attack Accuracy (AA): Accuracy after adversarial modification.
- Attack Success Rate (ASR): The percentage of originally solved problems that fail post-attack, i.e., ASR = 1 - (AA/OA). A toy computation follows this list.
- Relative Performance Drop: Quantifies the magnitude of accuracy reduction attributable to adversarial distraction or perturbation (e.g., ~26% on ProbleMathic’s adversarial set (Anantheswaran et al., 30 May 2024)).
- CoLeG-E and CoLeG-R: Efficacy and robustness metrics for extended-context adversarial settings, representing the average accuracy across increasingly elaborate narrative versions (Xu et al., 23 May 2024).
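As a worked toy example, the snippet below computes these metrics from hypothetical per-problem correctness flags; the data and variable names are illustrative only. Note that, under the definition above, ASR coincides with the relative performance drop (OA - AA) / OA.

```python
# Hypothetical correctness flags for the same 8 problems, before and
# after adversarial perturbation (illustrative data, not real results).
orig_ok     = [True, True, False, True, True, False, True, True]
attacked_ok = [True, False, False, True, False, False, False, True]

OA = sum(orig_ok) / len(orig_ok)          # Original Accuracy   -> 0.75
AA = sum(attacked_ok) / len(attacked_ok)  # Attack Accuracy     -> 0.375
ASR = 1 - AA / OA                         # Attack Success Rate -> 0.50

print(f"OA={OA:.2f}  AA={AA:.2f}  ASR={ASR:.0%}")  # OA=0.75  AA=0.38  ASR=50%
```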
High transferability of adversarial examples is observed: samples that fail one LLM often confound other models as well, revealing systematic weaknesses and shared vulnerabilities, especially for complex, multi-step MWPs (Zhou et al., 2023, Xie et al., 27 Feb 2024). Human evaluations confirm the validity and semantic fidelity of adversarial constructions (Kumar et al., 2021).
4. Defense Strategies and Robust Solver Architectures
Advanced solvers and training paradigms aim to mitigate adversarial susceptibility:
- Structural and Semantic Graph Modeling: Unit Dependency Graphs (UDGs) (Roy et al., 2016), Heterogeneous Line Graph Transformers (HLGT) (Hu et al., 2022), and symbolic reasoning modules such as NS-Solver (Qin et al., 2021) encode explicit semantic relationships, units, and symbolic constraints, thereby reducing sensitivity to surface-level linguistic changes.
- Memory-Augmented and Analogical Learning: Techniques that recall similar prior problems, e.g., REAL (Huang et al., 2021), or that generate multiple linguistic or logical variants for voting, e.g., parametric majority vote over paraphrases (Raiyan et al., 2023), promote robust decision-making by leveraging analogical generalization.
- Numeracy-Augmented Pretraining: MWP-BERT (Liang et al., 2021) and similar models inject numerical properties and contextual signals into representations, improving resilience to subtle numerical changes.
- Self-Consistent and Verifier-based Approaches: Training with self-consistent reasoning (SCR (Xiong et al., 2022)) or integrating a stand-alone verification module that ranks candidate solutions (Cobbe et al., 2021) improves system robustness by counteracting spurious correlations and output distribution drift.
- Adversarial Training: Direct fine-tuning with adversarial samples—whether derived from numeric perturbations, linguistic noise, or irrelevant distractors—yields 7–8% performance improvements on challenging datasets (Anantheswaran et al., 30 May 2024).
- Chain-of-Thought Externalization and Execution: Frameworks such as POET (Lin, 26 May 2025) explicitly generate stepwise equations and delegate computation to symbolic tools (e.g., SymPy via Python), minimizing cumulative LLM arithmetic errors even when the problem has been subjected to adversarial perturbation. A minimal sketch of this delegation pattern follows.
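The following sketch illustrates the delegation pattern, assuming the model emits simple assignment steps as plain text; the step format, the parsing logic, and the `llm_generated_steps` placeholder are assumptions for illustration rather than POET's actual interface.

```python
# Sketch of chain-of-thought externalization: the language model is assumed
# to emit symbolic equations as text, and all arithmetic is delegated to
# SymPy rather than performed by the model itself.
import sympy as sp

# Hypothetical stand-in for model output on a simple two-step problem.
llm_generated_steps = [
    "total = 12 * 4",      # e.g. "12 boxes of 4 apples each"
    "answer = total + 7",  # "... plus 7 loose apples"
]

env: dict[str, sp.Expr] = {}
for step in llm_generated_steps:
    name, rhs = (s.strip() for s in step.split("=", 1))
    # sympify evaluates the right-hand side symbolically, substituting
    # previously bound variables, so no arithmetic is done by the LLM.
    env[name] = sp.sympify(rhs, locals=env)

print(env["answer"])  # 55
```

The design point is that a numeric perturbation to the problem changes only the literals in the emitted steps; the execution backend computes the answer exactly, removing arithmetic slips as a failure mode.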
5. Implications for Dataset Design and Future Evaluation
Recent research underscores that widely used benchmarks (Math23K, GSM8K, SVAMP, MultiArith, etc.) provide insufficient coverage of real-world adversarial phenomena unless explicitly augmented (Kumar et al., 2021, Xie et al., 27 Feb 2024, Anantheswaran et al., 30 May 2024). Purpose-built datasets such as RobustMath, PARAMAWPS, ProbleMathic, GSM-8K-Adv, E-GSM, and DiverseMath23K have emerged to bridge this gap, often featuring adversarially crafted paraphrases, numeric perturbations, extended contexts, and irrelevant distractors (Zhou et al., 2023, Raiyan et al., 2023, Xie et al., 27 Feb 2024, Xu et al., 23 May 2024, Anantheswaran et al., 30 May 2024). These datasets serve as benchmarks for stress-testing both the versatility and the failure modes of MWP solvers.
6. Broader Applications and Theoretical Directions
Adversarial MWP research exemplifies the interplay between natural language understanding, symbolic reasoning, and numerical cognition within NLP systems. Progress in this area has direct implications for robust educational assessment, secure AI-assisted homework tools, and trustworthy quantitative reasoning in high-stakes domains. Structural approaches—especially those embedding explicit graph- and unit-level logic—point toward integration with broader symbolic or neuro-symbolic reasoning systems.
Theoretical advancements center on aligning model representations with human-interpretable, context-invariant logic and on reducing the reliance of LLMs on spurious, non-causal statistical correlations. This research also points to the persistent challenge of “data sparsity” for highly compositional and context-sensitive adversarial examples, motivating both automated data augmentation (via logic- or syntax-based inversion and extension) and new forms of modular, verifiable solver architectures.
7. Limitations and Open Problems
Despite notable improvements, persistent limitations remain:
- No architecture to date attains reliability on adversarial benchmarks that approaches human-level consistency.
- Numeric and linguistic perturbations can still selectively degrade accuracy, even in math-tuned or verifier-augmented systems (Xie et al., 27 Feb 2024, Anantheswaran et al., 30 May 2024).
- Existing defense strategies, such as adversarial training, may struggle with out-of-distribution adversarial constructions or extreme context extension without further scaling.
- Transferability analyses indicate that shared vulnerabilities in LLMs are not straightforwardly addressed by scaling alone (Xie et al., 27 Feb 2024, Zhou et al., 2023).
- Many robustness techniques (e.g., those based on structural graph representations) require high-quality semantic role labeling or unit extraction, which may themselves be brittle under adversarial attack (Hu et al., 2022, Roy et al., 2016).
Future directions include improved identification and filtering of irrelevant content, more nuanced adversarial data augmentation, dynamically adaptive solvers, and the development of meta-evaluation frameworks that can systematically probe solver resiliency at every stage of the reasoning pipeline.