Hint-before-Solving Prompting: Guiding LLMs to Effectively Utilize Encoded Knowledge (2402.14310v1)

Published 22 Feb 2024 in cs.CL

Abstract: LLMs have recently showcased remarkable generalizability in various domains. Despite their extensive knowledge, LLMs still face challenges in efficiently utilizing encoded knowledge to develop accurate and logical reasoning processes. To mitigate this problem, we introduced Hint-before-Solving Prompting (HSP), which guides the model to generate hints (e.g., specific knowledge or key ideas) for solving the problem and then generate solutions containing intermediate reasoning steps. Since HSP is orthogonal to prompting methods (e.g., Chain-of-Thought (CoT)), we applied HSP to CoT, Least-to-Most, Plan-and-Solve, and Standard promptings. The results of extensive experiments on 6 reasoning benchmarks and 4 open-source LLMs demonstrate that HSP can effectively improve the accuracy of reasoning tasks: (1) By applying high-quality hint-enhanced HSP to CoT prompting, Llama2-70B-Chat shows an improvement of 9.7. (2) Beyond exploring training-free LLM capabilities, we built the HSPMATH dataset based on HSP and fine-tuned Llemma-7B, reaching 64.3 accuracy, surpassing GPT-3.5 and WizardMath-13B. We make our code and dataset publicly available at \url{https://github.com/jinlanfu/HSP}.

Summary

  • The paper presents the Hint-before-Solving Prompting framework, which instructs LLMs to generate hints prior to solving to enhance reasoning accuracy.
  • It empirically demonstrates a 9.7-point accuracy gain when high-quality hints are paired with CoT prompting, as well as significant improvements from fine-tuning on the HSPMATH dataset.
  • The study highlights that the method’s effectiveness scales with model capacity and hint quality, enabling modular integration of external cues for better performance.

Hint-before-Solving Prompting: Enhancing LLM Knowledge Utilization

The paper "Hint-before-Solving Prompting: Guiding LLMs to Effectively Utilize Encoded Knowledge" (2402.14310) presents a comprehensive paper on explicitly guiding LLMs to leverage their internalized knowledge during reasoning tasks. The authors introduce Hint-before-Solving Prompting (HSP), a framework whereby LLMs are first instructed to generate or consider task-specific hints prior to producing solutions. This mechanism is explored in conjunction with established reasoning prompt paradigms such as Chain-of-Thought (CoT), Least-to-Most (LtM), Plan-and-Solve (PS), and standard direct-answer prompting.

Methodological Overview

HSP is formulated as an orthogonal extension to existing prompting methods: after presenting a query, the model, rather than immediately generating a direct answer or reasoning sequence, is first prompted to output a contextually relevant hint. The hint is intended to focus the model’s attention on key knowledge, strategies, or decompositional cues pertinent to problem-solving. In the experimental formulation, HSP is instantiated either as a one-stage approach, where the hint and solution are produced in a single generation, or as a two-stage variant (HSP2), in which the model outputs the hint first and the solution in a subsequent generation.
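
To make the two variants concrete, the sketch below shows how one-stage and two-stage HSP prompts might be assembled. It is a minimal illustration under stated assumptions, not the paper's released code: the exact instruction wording and the few-shot demonstrations used in the experiments are omitted, and the generation backend is abstracted as a plain callable.

```python
# A minimal sketch of HSP-style prompt construction, assuming a generic
# text-in/text-out generation callable. Instruction wording and few-shot
# demonstrations from the paper are intentionally omitted.
from typing import Callable


def hsp_one_stage_prompt(question: str) -> str:
    """One-stage HSP: hint and solution are requested in a single generation."""
    return (
        f"Question: {question}\n"
        "First give a hint (key knowledge or idea) for solving the problem, "
        "then solve it step by step and state the final answer.\n"
        "Hint:"
    )


def hsp_two_stage_solve(question: str, generate: Callable[[str], str]) -> str:
    """Two-stage HSP (HSP2): generate the hint first, then condition the
    solution generation on that hint."""
    hint = generate(
        f"Question: {question}\n"
        "Give a concise hint (key knowledge or idea) useful for solving this problem.\n"
        "Hint:"
    )
    return generate(
        f"Question: {question}\n"
        f"Hint: {hint}\n"
        "Using the hint, solve the problem step by step and state the final answer.\n"
        "Solution:"
    )


if __name__ == "__main__":
    dummy = lambda prompt: "<model output>"  # stand-in so the sketch runs offline
    print(hsp_one_stage_prompt("If 3x + 2 = 11, what is x?"))
    print(hsp_two_stage_solve("If 3x + 2 = 11, what is x?", dummy))
```

Because the two-stage variant produces the hint in a separate call, the same interface also admits hints supplied by an external model rather than by the solver itself.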

Empirical evaluation encompasses six diverse reasoning benchmarks (mathematical and commonsense) and multiple open-source LLMs spanning parameter scales from 7B to 70B, as well as a mixture-of-experts model (Mixtral-8x7B-Instruct). A notable contribution is the introduction of the HSPMATH dataset containing 75,000 hint-enhanced samples for fine-tuning.
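
The released HSPMATH schema is not described in this summary, so the record below is only a hypothetical illustration of what a hint-augmented fine-tuning sample could look like; the field names and the example problem are invented for clarity and are not taken from the dataset.

```python
# Hypothetical hint-augmented fine-tuning record. Field names and the example
# problem are invented for illustration; they are not taken from HSPMATH.
import json

record = {
    "question": "A shop sells 48 apples on Monday and half as many on Tuesday. "
                "How many apples does it sell in total?",
    "hint": "Compute Tuesday's sales as half of Monday's, then add the two days.",
    "solution": "Monday: 48 apples. Tuesday: 48 / 2 = 24 apples. Total: 48 + 24 = 72.",
    "answer": "72",
}

# During supervised fine-tuning, the model would be trained to emit the hint
# followed by the solution, conditioned on the question.
print(json.dumps(record, indent=2))
```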

Key Findings and Numerical Results

The core empirical claims are supported by extensive ablation and comparative studies:

  • HSP consistently improves accuracy under standard and CoT prompting. For instance, pairing high-quality, externally generated hints with CoT on Llama2-70B-Chat yields a 9.7-point accuracy improvement on reasoning tasks.
  • Supervised fine-tuning on hint-augmented datasets produces nontrivial improvements over both baseline and established closed-source models. Llemma-7B fine-tuned on HSPMATH achieves an accuracy of 64.3% on GSM8K, outperforming GPT-3.5 (57.1%) and WizardMath-13B (63.9%) under equivalent settings.
  • Effectiveness scales with model capacity: Larger models (13B, 70B) benefit more from HSP than smaller ones, indicating a synergy between model size and the ability to self-generate salient hints.
  • On challenging tasks (e.g., the MATH dataset), HSP’s utility depends on LLM competence. For sufficiently capable models such as Mixtral-8x7B-Instruct, HSP improves performance even in the self-consistency regime (see the sketch after this list), particularly for complex question types and higher difficulty levels. Lower-capacity models struggle to generate useful hints without external intervention.
  • Hint quality is a critical factor. Incorporating externally generated hints from a more capable model (GPT-4) leads to further improvements. This effect is most pronounced for weaker LLMs, narrowing the performance gap with stronger models.
  • Solutions generated after hinting are generally more concise and focused, especially in mathematical domains, suggesting more efficient internal reasoning.
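
For reference, the fragment below sketches how HSP could be combined with self-consistency: several hint-plus-solution generations are sampled and the final answer is chosen by majority vote. The answer-extraction heuristic and the sampling callable are assumptions made for this illustration, not details from the paper.

```python
# Sketch: HSP combined with self-consistency. Several hint+solution samples are
# drawn and the final answer is chosen by majority vote over extracted answers.
import re
from collections import Counter
from typing import Callable, Optional


def extract_answer(solution: str) -> Optional[str]:
    numbers = re.findall(r"-?\d+(?:\.\d+)?", solution)  # take the last number as the answer
    return numbers[-1] if numbers else None


def hsp_self_consistency(question: str, sample: Callable[[str], str], k: int = 8) -> Optional[str]:
    prompt = (
        f"Question: {question}\n"
        "Give a hint, then solve the problem step by step and state the final answer.\n"
        "Hint:"
    )
    answers = [extract_answer(sample(prompt)) for _ in range(k)]
    answers = [a for a in answers if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None
```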

Contradictory and Nuanced Observations

Although HSP is beneficial for standard and CoT prompting, its integration with planning- or decomposition-centric methods such as Plan-and-Solve and Least-to-Most produces inconsistent or marginal improvements. The authors attribute this to the interaction between externally provided hints and the internal planning mechanism, which sometimes leads to misaligned or redundant reasoning steps.

On the most demanding tasks, such as advanced mathematical problem solving, not all LLMs are able to autonomously generate effective hints—highlighting a competence threshold for HSP’s self-directed variant. Nevertheless, when access to high-quality, externally sourced hints is available, even smaller models realize substantial gains.

Practical and Theoretical Implications

This work introduces a fundamentally modular approach to enhancing LLM reasoning by operationalizing the intermediate use of hints. While prior work has pursued external retrieval or post-hoc verification, the HSP paradigm demonstrates that structured, contextually-attuned scaffolding at the prompt level can substantially bridge the reasoning performance gap, with minimal modification to architecture or training regime. The release of HSPMATH provides a new resource for benchmarking fine-tuning strategies in mathematical reasoning.

From a methodological standpoint, two implications are salient:

  • Prompt-level modularization: By separating hint generation from reasoning, HSP facilitates the integration of external knowledge sources and the potential for pipeline architectures (e.g., specialist hint generators feeding into LLM solvers; see the sketch after this list), enabling new multi-agent or composite reasoning systems.
  • Adaptive scaling: Given the observed model-size dependency, HSP can be selectively deployed for tasks and models where reasoning generalization is not yet robust. For more capable LLMs, HSP can be used to expose weaknesses in internal knowledge representation, guiding targeted fine-tuning.
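
A minimal sketch of such a pipeline is given below, with the hint generator and solver passed in as plain callables; the prompt wording is illustrative and no particular model API is assumed.

```python
# Sketch of a modular hint pipeline: a separate (possibly stronger) model
# produces the hint, and a solver model consumes it. Both backends are plain
# callables, so no specific provider API is assumed.
from typing import Callable


def hinted_pipeline(question: str,
                    hint_model: Callable[[str], str],
                    solver_model: Callable[[str], str]) -> str:
    hint = hint_model(
        f"Provide a concise hint (key fact or strategy) for solving:\n{question}"
    )
    return solver_model(
        f"Question: {question}\n"
        f"Hint: {hint}\n"
        "Solve the problem step by step using the hint and give the final answer."
    )
```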

Future Directions

Several avenues arise from this paper:

  • External hint generation at scale: Automating high-quality hint provision (perhaps via more advanced LLMs or retrieval-augmented modules) could further democratize robust reasoning, especially for lightweight models or resource-constrained deployments.
  • Integration with planning/graph-based reasoning paradigms: Adapting HSP to interleave with more structured problem decompositions may address the observed integration inefficiencies, especially for multi-step algorithmic or scientific reasoning.
  • Adaptive prompting: Dynamically invoking hint generation only for queries that trigger indicators of model uncertainty or multi-hop requirements could optimize resource utilization and latency in production systems; a minimal gating sketch follows this list.
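
One way such gating could look is sketched below: a few cheap direct samples are drawn first, and the more expensive HSP path is invoked only when they disagree. The sample budget, agreement threshold, and prompt wording are illustrative assumptions rather than recommendations from the paper.

```python
# Sketch of adaptive hint invocation: draw a few cheap direct samples and fall
# back to the HSP path only when they disagree. Budget, threshold, and prompt
# wording are illustrative choices, not values from the paper.
from collections import Counter
from typing import Callable


def adaptive_hsp(question: str,
                 sample_direct: Callable[[str], str],
                 solve_with_hint: Callable[[str], str],
                 k: int = 3,
                 agreement_threshold: float = 0.67) -> str:
    drafts = [sample_direct(f"Question: {question}\nAnswer:") for _ in range(k)]
    top, count = Counter(drafts).most_common(1)[0]
    if count / k >= agreement_threshold:
        return top                    # answers agree: skip hint generation
    return solve_with_hint(question)  # answers disagree: pay for the HSP path
```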

Conclusion

Hint-before-Solving Prompting substantiates the claim that explicit intermediate supervision—via contextually focused hints—facilitates more precise and efficient reasoning in LLMs. The method’s compatibility with standard prompting, demonstrable empirical gains, and straightforward integration make it a promising tool for both research and practical deployments in knowledge-intensive domains. Its efficacy is closely tied to both model capacity and hint quality, underscoring the interplay between prompt engineering and model architecture in the ongoing advancement of general-purpose LLMs.
