
Frugal LMs Trained to Invoke Symbolic Solvers Achieve Parameter-Efficient Arithmetic Reasoning (2312.05571v2)

Published 9 Dec 2023 in cs.AI and cs.LG

Abstract: Large language models (LLMs) exhibit zero-shot mathematical reasoning capacity as a behavior emergent with scale, commonly manifesting as chain-of-thought (CoT) reasoning. However, multiple empirical findings suggest that this prowess is exclusive to LLMs of exorbitant size (beyond 50 billion parameters). Meanwhile, educational neuroscientists suggest that symbolic algebraic manipulation be introduced around the same time as arithmetic word problems, so as to modularize language-to-formulation, symbolic manipulation of the formulation, and endgame arithmetic. In this paper, we start with the hypothesis that much smaller LMs, which are weak at multi-step reasoning, can achieve reasonable arithmetic reasoning if arithmetic word problems are posed as a formalize-then-solve task. In our architecture, which we call SYRELM, the LM serves as a translator, mapping natural-language arithmetic questions into a formal-language (FL) description. A symbolic solver then evaluates the FL expression to obtain the answer. A small frozen LM, equipped with an efficient low-rank adapter, is capable of generating FL expressions that incorporate natural-language descriptions of the arithmetic problem (e.g., variable names and their purposes, formal expressions combining variables, etc.). We adopt policy-gradient reinforcement learning to train the adapted LM, informed by the non-differentiable symbolic solver. This marks a sharp departure from recent developments in tool-augmented LLMs, in which the external tools (e.g., calculator, Web search) are essentially detached from the learning phase of the LM. SYRELM shows massive improvements (e.g., a +30.65 absolute-point accuracy gain on the SVAMP dataset using the GPT-J 6B model) over base LMs, while keeping our testbed easy to diagnose, interpret, and within reach of most researchers.
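The formalize-then-solve loop described in the abstract can be sketched in a few lines. This is a hypothetical illustration, not the paper's actual FL grammar or training code: the function names, the toy `var = expr` format, and the binary reward are all assumptions. The key idea it demonstrates is that the symbolic solver sits outside the gradient path, so its verdict on the LM-generated FL program can only reach the adapter as a policy-gradient reward.

```python
# Toy sketch of a formalize-then-solve pipeline (names and FL format are
# illustrative, not SYRELM's actual specification).

def solve_fl(fl_program: str) -> float:
    """Evaluate a toy FL description: 'var = expr' lines, ending in 'answer = expr'."""
    env: dict[str, float] = {}
    for line in fl_program.strip().splitlines():
        name, expr = line.split("=", 1)
        # A stand-in for the symbolic solver; builtins are disabled for safety.
        env[name.strip()] = eval(expr, {"__builtins__": {}}, env)
    return env["answer"]

def reward(fl_program: str, gold: float) -> float:
    """Binary reward: the non-differentiable signal for policy-gradient RL."""
    try:
        return 1.0 if abs(solve_fl(fl_program) - gold) < 1e-6 else 0.0
    except Exception:
        return 0.0  # malformed FL programs earn no reward

# Example word problem: "Sam had 5 apples and bought 3 more; how many now?"
# A (hypothetical) FL translation the adapted LM might emit:
fl = """
apples_start = 5
apples_bought = 3
answer = apples_start + apples_bought
"""
print(reward(fl, 8))  # 1.0
```

Because the reward is computed by executing the program rather than by comparing token sequences, the LM is trained toward FL outputs that are semantically correct, not merely plausible-looking.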

Authors (6)
  1. Subhabrata Dutta (24 papers)
  2. Joykirat Singh (8 papers)
  3. Ishan Pandey (5 papers)
  4. Sunny Manchanda (5 papers)
  5. Soumen Chakrabarti (52 papers)
  6. Tanmoy Chakraborty (224 papers)
Citations (4)