
Adversarial Math Word Problem Generation (2402.17916v3)

Published 27 Feb 2024 in cs.CL and cs.AI

Abstract: LLMs have significantly transformed the educational landscape. As current plagiarism detection tools struggle to keep pace with LLMs' rapid advancements, the educational community faces the challenge of assessing students' true problem-solving abilities in the presence of LLMs. In this work, we explore a new paradigm for ensuring fair evaluation -- generating adversarial examples which preserve the structure and difficulty of the original questions aimed for assessment, but are unsolvable by LLMs. Focusing on the domain of math word problems, we leverage abstract syntax trees to structurally generate adversarial examples that cause LLMs to produce incorrect answers by simply editing the numeric values in the problems. We conduct experiments on various open- and closed-source LLMs, quantitatively and qualitatively demonstrating that our method significantly degrades their math problem-solving ability. We identify shared vulnerabilities among LLMs and propose a cost-effective approach to attack high-cost models. Additionally, we conduct automatic analysis to investigate the cause of failure, providing further insights into the limitations of LLMs.

Generating Adversarial Math Word Problems to Challenge LLMs

Introduction to Adversarial Generation in Educational Contexts

LLMs have made significant strides in solving Math Word Problems (MWPs), a development that presents both opportunities and challenges for education. While these models can assist learning and problem-solving, they also raise concerns about fair student evaluation and the potential for academic dishonesty. This paper introduces a novel approach to generating MWPs that LLMs struggle to solve: editing the numeric values within a problem while preserving its original structure and difficulty. The goal is not simply to create problems that LLMs get wrong, but to ensure that the resulting problems remain relevant and educationally useful.

Methodological Insights

The methodology leverages Abstract Syntax Trees (ASTs) to systematically modify MWPs, focusing on the alteration of numeric values. Rather than applying superficial adversarial prompt modifications, the approach generates problems structurally, with a mechanism that keeps the generated adversarial examples coherent with the original problems and thereby preserves educational integrity. Three generation methods (M1, M2, M3) are proposed, varying in how strictly they preserve the original problem's difficulty and format. M3 emerges as the primary method because it balances the generation of challenging adversarial examples with the preservation of educational value. A minimal sketch of the numeric-editing idea follows.
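To make the mechanism concrete, here is a minimal sketch of AST-based numeric editing, assuming each MWP comes with a ground-truth arithmetic expression (as in GSM8K-style annotations). The function name and the sampling range are illustrative assumptions, not the paper's exact implementation.

```python
import ast
import random

def edit_numbers(expression: str, problem_text: str, rng: random.Random):
    """Swap each numeric literal in the annotated expression (mirroring the
    change in the problem text), then re-evaluate the expression tree to
    obtain the new ground-truth answer."""
    tree = ast.parse(expression, mode="eval")
    new_text = problem_text
    for node in ast.walk(tree):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            old = node.value
            new = rng.randint(2, 99)  # illustrative sampling range
            node.value = new
            # Mirror the edit in the surface text (first occurrence only).
            new_text = new_text.replace(str(old), str(new), 1)
    # Safe only because the expression comes from a trusted annotation.
    new_answer = eval(compile(tree, "<expr>", "eval"))
    return new_text, new_answer

rng = random.Random(0)
text = "Ali has 3 boxes with 4 apples each. How many apples does he have?"
print(edit_numbers("3 * 4", text, rng))
```

In the paper's framing, candidates produced this way would then be screened, for example against the answer-range and format constraints of the more restrictive methods, before being used against a target model.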

Experimental Results

Through comprehensive experiments on a mix of open- and closed-source LLMs, the paper shows a significant decline in math problem-solving performance across multiple models when they are exposed to the adversarial examples. Notably, even the most restrictive generation method, M3, produces a considerable degradation in performance, suggesting a practical route to educational assignments that resist LLM problem solving. The paper also examines universal attacks and model transferability, identifying shared vulnerabilities among LLMs and proposing a cost-effective strategy for attacking high-cost models.
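As a sketch of how such a cost-effective attack could work in practice, the loop below screens adversarial candidates on a cheap model and queries the expensive target only with candidates that already fool the cheap one. The solver callbacks are hypothetical stubs standing in for real LLM API calls.

```python
def transfer_attack(candidates, solve_with_cheap_model, solve_with_target_model):
    """Keep only candidates that fool the cheap model AND the target model."""
    surviving = []
    for problem, gold_answer in candidates:
        if solve_with_cheap_model(problem) == gold_answer:
            continue  # cheap model still solves it; unlikely to transfer
        # Only candidates that broke the cheap model reach the paid API.
        if solve_with_target_model(problem) != gold_answer:
            surviving.append((problem, gold_answer))
    return surviving

# Toy usage with stub "models"; real callbacks would query LLM endpoints.
cands = [("Q1 ...", 12), ("Q2 ...", 7)]
print(transfer_attack(cands, lambda q: 12, lambda q: 0))  # [('Q2 ...', 7)]
```

The screening step aims to exploit the shared vulnerabilities the paper reports: edits that break one model often transfer, so the expensive model sees far fewer queries.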

Further Analysis

The paper extends its analysis to investigate the characteristics of MWPs that contribute to LLM failures. It highlights the impact of specific problem features, like the number of operations and the presence of division, on model performance. A notable takeaway is the significant influence of the answer’s value range on correctness, underscoring the nuanced relationship between problem complexity and LLMs’ problem-solving prowess.
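Such an analysis presupposes extracting features automatically from each problem. A minimal sketch under the same expression-annotation assumption as above (the feature names are ours):

```python
import ast

def mwp_features(expression: str) -> dict:
    """Count operations, flag division, and measure answer magnitude."""
    tree = ast.parse(expression, mode="eval")
    ops = [n for n in ast.walk(tree) if isinstance(n, ast.BinOp)]
    answer = eval(compile(tree, "<expr>", "eval"))  # trusted annotation only
    return {
        "num_operations": len(ops),
        "has_division": any(isinstance(o.op, ast.Div) for o in ops),
        "answer_magnitude": abs(answer),
    }

print(mwp_features("(120 / 4) + 7 * 3"))
# -> {'num_operations': 3, 'has_division': True, 'answer_magnitude': 51.0}
```

Regressing correctness against features like these is the kind of automatic analysis that surfaces effects such as the answer-range sensitivity noted above.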

Ethical and Educational Implications

The research highlights two ethical facets of LLM use in education. On one hand, it aids in creating LLM-resilient educational materials that encourage genuine learning. On the other, it prompts a reflective discourse on maintaining educational equity in light of advancing AI capabilities. These findings emphasize the need for continuous innovation in educational tools and methodologies to keep pace with rapid advancements in AI technology.

Conclusion and Future Directions

This paper sets a foundation for generating adversarial examples in education that challenge LLMs while maintaining the integrity and relevance of educational assessments. The research not only contributes to the understanding of LLMs' limitations in solving MWPs but also proposes methodologies that can be applied beyond the math domain. As LLMs continue to evolve, so must the strategies for leveraging these technologies in educational contexts, ensuring they complement rather than compromise the learning process.

In conclusion, this work not only showcases a method for stress-testing the capabilities of LLMs in educational settings but also opens the door to further exploration of secure and fair ways to use AI in academic evaluation. This exploration is crucial for ensuring that advancements in AI enrich the educational experience rather than undermine its fundamental objectives.

Authors (4)
  1. Roy Xie
  2. Chengxuan Huang
  3. Junlin Wang
  4. Bhuwan Dhingra