
Guiding Language Model Reasoning with Planning Tokens (2310.05707v4)

Published 9 Oct 2023 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs have recently attracted considerable interest for their ability to perform complex reasoning tasks, such as chain-of-thought (CoT) reasoning. However, most of the existing approaches to enhance this ability rely heavily on data-driven methods, while neglecting the structural aspects of the model's reasoning capacity. To encourage a more structural generation of CoT steps, we propose a hierarchical generation scheme: we let the LM generate a planning token at the start of each reasoning step, intuitively serving as a high-level plan of the current step, and add their embeddings to the model parameters. Our approach requires a negligible increase in trainable parameters (0.001%) and can be applied through either full fine-tuning or a more parameter-efficient scheme. We demonstrate our method's effectiveness by applying it to three different LLMs, showing notable accuracy improvements across three math word problem datasets and one multihop QA dataset with respect to standard fine-tuning baselines.

Guiding LLM Math Reasoning with Planning Tokens

The paper "Guiding LLM Math Reasoning with Planning Tokens" introduces a novel approach for enhancing the mathematical reasoning capabilities of LLMs through the use of planning tokens. The authors address a notable limitation in existing LLMs: despite their proficiency in managing discrete reasoning steps, these models often exhibit inconsistency across an entire reasoning chain. This inconsistency undermines the ability of LLMs to perform complex reasoning tasks reliably.

Methodology and Approach

The proposed solution introduces a planning token at the beginning of each reasoning step, and the embeddings of these tokens are added to the model parameters as new trainable vectors. The tokens act as guides that encapsulate a high-level plan of the current step, helping the model maintain coherence across multiple reasoning steps. The scheme adds only about 0.001% more trainable parameters, making it a computationally inexpensive enhancement.
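
A minimal sketch of this setup, not the authors' released code, shows how planning tokens could be added to an off-the-shelf causal LM with Hugging Face transformers; the token names (`<plan_0>`, ...), the number of plan types, and the base checkpoint are illustrative assumptions.

```python
# Hedged sketch: add planning tokens as new special tokens so that their
# embedding rows are the only genuinely new parameters.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"  # illustrative base checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# One planning token per high-level "plan type" (the count is an assumption).
planning_tokens = [f"<plan_{i}>" for i in range(8)]
tokenizer.add_special_tokens({"additional_special_tokens": planning_tokens})
model.resize_token_embeddings(len(tokenizer))  # appends rows for the new tokens

# The added embedding rows are a tiny fraction of the total parameter count.
new_params = len(planning_tokens) * model.get_input_embeddings().embedding_dim
total_params = sum(p.numel() for p in model.parameters())
print(f"new / total parameters: {new_params / total_params:.6%}")
```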

The authors incorporate these planning tokens through either conventional full fine-tuning or a more parameter-efficient scheme, and apply the method to several LLMs. The idea is grounded in recent theoretical results suggesting that adding intermediate tokens increases the reasoning capacity of transformers: by lengthening the chain of thought (CoT), the model gains more capacity to resolve complex reasoning problems such as those found in mathematics.
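
As one possible parameter-efficient instantiation (a sketch under assumptions, not the paper's exact configuration), LoRA adapters could be combined with keeping the enlarged embedding matrix trainable via `peft`; the hyperparameters below are generic defaults.

```python
# Hedged sketch of a parameter-efficient variant, continuing from the previous
# snippet (`model` already has the planning tokens added).
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # typical Llama attention projections
    modules_to_save=["embed_tokens"],     # keep the enlarged embedding trainable
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_cfg)
peft_model.print_trainable_parameters()  # only adapters + embeddings are trainable
```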

Experiments and Results

The experimental validation covers three math word problem datasets: GSM8K, AQUA, and MATH. Three LLMs are evaluated: Phi 1.5, Llama2 (7B), and Llama2 (13B). Adding planning tokens yielded notable accuracy improvements over standard fine-tuning across all datasets and models, with an average gain of 3.3 percentage points.
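
For context on how such accuracies are typically measured (an illustrative helper, not the paper's evaluation script), GSM8K-style references mark the final answer with a `####` delimiter, so exact-match accuracy can be computed by comparing the extracted final numbers.

```python
# Illustrative sketch: exact-match accuracy on GSM8K-style answers.
import re

def final_answer(text: str) -> str | None:
    """Return the number following the last '####' marker, if present."""
    matches = re.findall(r"####\s*([-+]?[\d,\.]+)", text)
    return matches[-1].replace(",", "") if matches else None

def accuracy(predictions: list[str], references: list[str]) -> float:
    hits = sum(final_answer(p) == final_answer(r)
               for p, r in zip(predictions, references))
    return hits / len(references)
```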

The results demonstrate that planning tokens enhance the ability of LLMs to solve math problems, with the largest improvements observed on longer and more complex reasoning chains. The learned soft Q-VAE approach to inferring planning tokens consistently outperformed the simpler heuristic assignment schemes, showcasing the advantage of learned planning specialization.
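
As an example of the simpler heuristic side of that comparison (a sketch under assumptions; the paper's learned soft Q-VAE inference is more involved than this), one clustering-based assignment embeds each reasoning step and uses the resulting cluster id as the planning-token type.

```python
# Hedged sketch: assign planning tokens by clustering step embeddings.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

encoder = SentenceTransformer("all-mpnet-base-v2")  # illustrative encoder choice
steps = [
    "Each box holds 12 apples, so 4 boxes hold 4 * 12 = 48 apples.",
    "She gives away 15 apples, leaving 48 - 15 = 33 apples.",
]
embeddings = encoder.encode(steps)

kmeans = KMeans(n_clusters=2, n_init="auto", random_state=0).fit(embeddings)
# Prefix each step with the planning token of its cluster before fine-tuning.
annotated = [f"<plan_{c}> {s}" for c, s in zip(kmeans.labels_, steps)]
```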

Implications and Future Work

The findings have practical implications for developing LLMs that are more robust and reliable in handling structured reasoning tasks, beyond simple information synthesis. The introduction of planning tokens could extend beyond mathematical reasoning tasks, potentially enhancing LLM performance in a variety of domains requiring coherent multi-step logic.

Future research may explore variants of planning tokens across different problem types or further develop the latent inference approach to obtain more expressive and accurate planning tokens. Additionally, exploring interpretability in the use of planning tokens might provide insights into the internal reasoning strategies of LLMs, facilitating a better understanding of how these models can mimic human-like reasoning processes.

In summary, the integration of planning tokens into LLMs offers a simple yet effective means of enhancing their reasoning capacity with minimal computational overhead. This methodological shift points toward models that reason more coherently, opening the door to broader applications of AI in logic-intensive domains.

Authors (6)
  1. Xinyi Wang (152 papers)
  2. Lucas Caccia (22 papers)
  3. Oleksiy Ostapenko (10 papers)
  4. Xingdi Yuan (46 papers)
  5. Alessandro Sordoni (53 papers)
  6. William Yang Wang (254 papers)
Citations (9)