Can LLMs Learn from Previous Mistakes? Investigating LLMs' Errors to Boost for Reasoning (2403.20046v2)

Published 29 Mar 2024 in cs.CL

Abstract: Recent works have shown that LLMs benefit from fine-tuning on gold-standard Chain-of-Thought (CoT) rationales or from using them as correct examples in few-shot prompting. While humans can indeed imitate correct examples, learning from our mistakes is another vital aspect of human cognition. Hence, a question naturally arises: can LLMs learn and benefit from their mistakes, especially in their reasoning? This study investigates this question from both the prompting and the model-tuning perspectives. We begin by introducing CoTErrorSet, a new benchmark with 609,432 questions, each paired with both correct and erroneous references and annotated with the types of and reasons for such mistakes. To explore the usefulness of these mistakes, we design two methods: (1) Self-rethinking prompting guides LLMs to reconsider whether they have made similar mistakes before; and (2) Mistake tuning fine-tunes models on both correct and incorrect reasoning, rather than tuning them only on ground-truth rationales as in the traditional methodology. We conduct a series of experiments showing that LLMs can benefit from mistakes in both settings. The two methods offer potentially cost-effective strategies that leverage errors to enhance reasoning capabilities, at a cost significantly lower than creating meticulously hand-crafted gold references. Finally, we provide a thorough analysis of the reasons behind LLMs' errors, pointing to challenges that future research needs to overcome. CoTErrorSet will be published soon at https://github.com/YookiTong/Learn-from-Mistakes-CotErrorSet.
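To make the mistake-tuning idea concrete, below is a minimal Python sketch of how correct and incorrect rationales from a CoTErrorSet-style record could be turned into a combined fine-tuning set. The record fields, the tag strings ([CORRECT RATIONALE] / [INCORRECT RATIONALE]), and the demo data are illustrative assumptions, not the paper's exact data format or training recipe.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical record shape for a CoTErrorSet-style entry; the field names
# and error-type label are assumptions for illustration only.
@dataclass
class CoTErrorRecord:
    question: str
    correct_rationale: str
    incorrect_rationale: str
    error_type: str  # e.g. a calculation or logical error label

def build_mistake_tuning_pairs(records: List[CoTErrorRecord]) -> List[dict]:
    """Turn each record into two supervised examples: one tagged as a correct
    rationale and one tagged as an incorrect rationale, so a model is
    fine-tuned on both correct and incorrect reasoning rather than on
    ground-truth rationales alone."""
    examples = []
    for r in records:
        examples.append({
            "input": f"[CORRECT RATIONALE] {r.question}",
            "target": r.correct_rationale,
        })
        examples.append({
            "input": f"[INCORRECT RATIONALE] {r.question}",
            "target": r.incorrect_rationale,
        })
    return examples

if __name__ == "__main__":
    demo = [CoTErrorRecord(
        question="If a pen costs $2 and a notebook costs $3, what do 2 pens and 1 notebook cost?",
        correct_rationale="2 * 2 = 4; 4 + 3 = 7; the total is $7.",
        incorrect_rationale="2 + 3 = 5; 5 * 2 = 10; the total is $10.",
        error_type="calculation order error",
    )]
    for ex in build_mistake_tuning_pairs(demo):
        print(ex)
```

The tags serve only to mark which reasoning "domain" each example belongs to, so that incorrect rationales inform the model about failure modes without being rewarded as correct answers; the actual conditioning scheme used in the paper may differ.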

Authors (6)
  1. Yongqi Tong (8 papers)
  2. Dawei Li (75 papers)
  3. Sizhe Wang (24 papers)
  4. Yujia Wang (29 papers)
  5. Fei Teng (134 papers)
  6. Jingbo Shang (141 papers)
Citations (31)