Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving (2405.12205v1)

Published 20 May 2024 in cs.AI and cs.LG

Abstract: Metacognitive knowledge refers to humans' intuitive knowledge of their own thinking and reasoning processes. Today's best LLMs clearly possess some reasoning processes. The paper gives evidence that they also have metacognitive knowledge, including the ability to name skills and procedures to apply given a task. We explore this primarily in the context of math reasoning, developing a prompt-guided interaction procedure to get a powerful LLM to assign sensible skill labels to math questions, followed by having it perform semantic clustering to obtain coarser families of skill labels. These coarse skill labels look interpretable to humans. To validate that these skill labels are meaningful and relevant to the LLM's reasoning processes, we perform the following experiments. (a) We ask GPT-4 to assign skill labels to training questions in the math datasets GSM8K and MATH. (b) When using an LLM to solve the test questions, we present it with the full list of skill labels and ask it to identify the skill needed. Then it is presented with randomly selected exemplar solved questions associated with that skill label. This improves accuracy on GSM8K and MATH for several strong LLMs, including code-assisted models. The methodology presented is domain-agnostic, even though this article applies it to math problems.

Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving

Abstract

The paper "Metacognitive Capabilities of LLMs: An Exploration in Mathematical Problem Solving" investigates whether LLMs possess metacognitive knowledge and how this knowledge can be harnessed to improve mathematical problem-solving. Specifically, it explores the LLMs' ability to recognize and label the skills required to solve mathematical questions and use these labels to enhance problem-solving accuracy across different tasks and LLMs.

Introduction

The paper situates itself within the landscape of natural language processing and mathematical reasoning, acknowledging that while LLMs have exhibited significant advancements in general and domain-specific tasks, their capabilities in mathematical problem-solving are still fraught with limitations. The core concept investigated is metacognition—defined as thinking about one's own thinking processes—which, if present in LLMs, could be used to improve their performance in solving math problems.

Methodology

The primary methodology adopted in the paper includes the following steps:

  1. Skill Labelling: A powerful LLM (e.g., GPT-4) is prompted to label each question in a math dataset with the specific skill required to solve it. The prompts encourage the LLM to generate fine-grained, descriptive skill labels.
  2. Skill Clustering: After generating numerous skill labels, the same LLM performs semantic clustering to group these fine-grained skills into broader, more manageable categories. Each cluster of skills is assigned a descriptive label, thereby creating a "Skill Exemplar Repository."
  3. Inference: When solving a test question, the model is given the list of skill labels, identifies the relevant skill, and retrieves corresponding exemplars from the repository. These exemplars are then provided in-context to aid in solving the question (see the sketch after this list).
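The following is a minimal sketch of this three-step pipeline, assuming a generic `query_llm` helper and plain-text prompts; the function names, prompt wording, and retrieval parameter `k` are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import defaultdict

def query_llm(prompt: str) -> str:
    """Placeholder for a call to a strong LLM such as GPT-4."""
    raise NotImplementedError

def label_skills(questions):
    """Step 1: ask the LLM for a fine-grained skill label for each question."""
    return {
        q: query_llm(f"Name the single math skill needed to solve:\n{q}").strip()
        for q in questions
    }

def cluster_skills(fine_labels):
    """Step 2: semantically cluster fine-grained labels into coarse skill families."""
    unique = sorted(set(fine_labels.values()))
    response = query_llm(
        "Group these skill labels into broader families, one family per line, "
        "formatted as 'family: label1; label2; ...':\n" + "\n".join(unique)
    )
    coarse_of = {}
    for line in response.splitlines():
        family, _, members = line.partition(":")
        for member in members.split(";"):
            coarse_of[member.strip()] = family.strip()
    return coarse_of

def build_repository(train_questions, solutions, fine_labels, coarse_of):
    """Skill Exemplar Repository: coarse skill -> solved training exemplars."""
    repo = defaultdict(list)
    for q in train_questions:
        repo[coarse_of[fine_labels[q]]].append((q, solutions[q]))
    return repo

def solve_with_skill_exemplars(test_question, repo, k=4):
    """Step 3: identify the needed skill, retrieve exemplars, solve in-context."""
    skill = query_llm(
        "Pick the single most relevant skill from this list:\n"
        + "\n".join(repo)
        + f"\nQuestion: {test_question}"
    ).strip()
    pool = repo.get(skill, [])
    exemplars = random.sample(pool, k=min(k, len(pool)))
    context = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return query_llm(f"{context}\n\nQ: {test_question}\nA:")
```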

Results

The experiments were conducted on various datasets, including the GSM8K dataset, which covers grade-school math problems, and the MATH dataset, known for its high difficulty. The findings demonstrate:

  • Accuracy Improvements: On the GSM8K dataset, skill-exemplar-based in-context examples improved performance significantly over standard Chain-of-Thought (CoT) prompting, achieving an overall accuracy of 94.31%, rising to 95.38% with self-consistency (maj@5); a brief majority-vote sketch follows this list.
  • Enhanced Problem Solving on MATH Dataset: By utilizing skill-based in-context examples, the approach outperformed CoT prompting by 11.6% on average, indicating strong benefits across diverse mathematical topics such as Algebra, Geometry, and Probability.
  • Program-Based Enhancements: Integrating skill-based text examples with program-aided solutions (PAL) improved PAL performance by 7.52% on the MATH dataset.
  • Transferability: Skills identified by GPT-4 also improved performance of weaker LLMs like Mixtral 8x7B, and skills labeled on GSM8K were beneficial for other math word problem datasets, confirming the transferability and robustness of the skill knowledge across models and datasets.
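As a concrete illustration of the self-consistency (maj@5) figure reported above, the sketch below samples five independent skill-exemplar solutions and takes a majority vote over their final answers. It reuses `solve_with_skill_exemplars` from the earlier sketch, and `extract_answer` is a hypothetical parser for the final answer; both are assumptions, not the paper's code.

```python
from collections import Counter

def extract_answer(solution: str) -> str:
    """Hypothetical parser: take the last line of the generated solution as the answer."""
    return solution.strip().splitlines()[-1]

def self_consistency(test_question, repo, n_samples=5):
    """maj@5: sample several solutions and return the most common final answer."""
    answers = [
        extract_answer(solve_with_skill_exemplars(test_question, repo))
        for _ in range(n_samples)
    ]
    return Counter(answers).most_common(1)[0][0]
```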

Analysis

The paper identifies specific ways in which skill-based prompts enhance LLM performance:

  • Main Skill Success: The paper demonstrates that the primary advantage of skill-based prompting lies in its ability to reduce main skill errors, thereby allowing the model to focus more effectively on the pertinent mathematical concept.
  • Reduction of Secondary Errors: The approach also shows a reduction in secondary skill errors and calculation errors, highlighting a broader improvement in problem-solving accuracy.

Implications and Future Work

These findings have both practical and theoretical implications:

  • Practical Applications: Educators and developers of educational technologies can use similar methodologies to enhance the learning and teaching capabilities of AI systems, integrating metacognitive skill recognition and application.
  • Theoretical Insights: The research contributes to a deeper understanding of LLM metacognition, suggesting that these models possess a level of self-awareness regarding the skills they employ, which can be harnessed to improve their efficacy.
  • Future Developments: Future work may explore the application of this methodology to other problem-solving domains beyond mathematics, aiming to further generalize the findings. Additionally, the enhancement of skill annotation techniques and finer granularity of skills remains a promising area for exploration.

In conclusion, this paper provides compelling evidence that LLMs possess metacognitive knowledge that can be systematically leveraged to enhance their problem-solving abilities. The use of skill exemplars, validated through meticulous experimentation, underscores a novel approach to augmenting LLM capabilities in mathematical reasoning, with potential applications spanning far beyond this domain.

Authors (10)
  1. Aniket Didolkar
  2. Anirudh Goyal
  3. Nan Rosemary Ke
  4. Michal Valko
  5. Timothy Lillicrap
  6. Danilo Rezende
  7. Yoshua Bengio
  8. Michael Mozer
  9. Sanjeev Arora
  10. SiYuan Guo