
Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models (2403.02178v2)

Published 4 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: In reasoning tasks, even a minor error can cascade into an incorrect result, leading to suboptimal performance of LLMs in such domains. Earlier fine-tuning approaches sought to mitigate this by leveraging more precise supervisory signals from human labeling, larger models, or self-sampling, albeit at a high cost. In contrast, we develop a method that avoids external resources and instead introduces perturbations to the input. Our training approach randomly masks certain tokens within the chain of thought, a technique we found to be particularly effective for reasoning tasks. When applied to fine-tuning Llama-2-7B on GSM8K, this method achieved a 5% improvement in GSM8K accuracy and a 10% improvement in GSM-IC accuracy over standard supervised fine-tuning, with only a few lines of code modified. Furthermore, it is complementary to existing methods: when integrated with explicit data augmentation methods, it yields improvements across five datasets produced by various augmentation strategies, as well as across two different base models. We further investigate the mechanisms behind this improvement through case studies and quantitative analysis, which suggest that our approach may better help the model capture long-distance dependencies, especially those related to the question. This, in turn, may deepen the model's understanding of the premises in the question and the preceding steps. Our code is available on GitHub.
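
The core training trick described above, randomly masking tokens inside the chain-of-thought portion of each training example while the loss is still computed against the original, unmasked targets, can be sketched in a few lines. The snippet below is an illustrative approximation rather than the authors' released implementation; the mask probability, the placeholder mask-token id, and the span boundaries are assumptions made for the example.

import random
from typing import List

MASK_TOKEN_ID = 0      # hypothetical id of a reserved [MASK]/pad token
MASK_PROB = 0.2        # hypothetical masking ratio (a tunable hyperparameter)

def mask_chain_of_thought(input_ids: List[int],
                          cot_start: int,
                          cot_end: int,
                          mask_prob: float = MASK_PROB) -> List[int]:
    """Randomly replace tokens inside the chain-of-thought span of the
    model *input* with a mask token. The labels stay equal to the original
    ids, so the objective remains standard next-token prediction."""
    masked = list(input_ids)
    for i in range(cot_start, cot_end):
        if random.random() < mask_prob:
            masked[i] = MASK_TOKEN_ID
    return masked

# Toy usage: positions 2..5 are treated as the reasoning steps.
example_ids = [101, 7, 42, 13, 99, 5, 88, 102]
perturbed_ids = mask_chain_of_thought(example_ids, cot_start=2, cot_end=6)
print(perturbed_ids)

In an actual fine-tuning loop, the perturbed ids would be fed as the model input while the loss is computed against the unmasked targets, which is why the change amounts to only a few lines on top of standard supervised fine-tuning.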

Authors (9)
  1. Changyu Chen (19 papers)
  2. Xiting Wang (42 papers)
  3. Ting-En Lin (28 papers)
  4. Ang Lv (19 papers)
  5. Yuchuan Wu (33 papers)
  6. Xin Gao (208 papers)
  7. Ji-Rong Wen (299 papers)
  8. Rui Yan (250 papers)
  9. Yongbin Li (128 papers)
Citations (8)