
Self-Refine Instruction-Tuning for Aligning Reasoning in Language Models (2405.00402v1)

Published 1 May 2024 in cs.CL

Abstract: The alignments of reasoning abilities between smaller and larger LLMs are largely conducted via Supervised Fine-Tuning (SFT) using demonstrations generated from robust LLMs. Although these approaches deliver more performant models, they do not show sufficiently strong generalization ability as the training only relies on the provided demonstrations. In this paper, we propose the Self-refine Instruction-tuning method that elicits Smaller LLMs to self-refine their abilities. Our approach is based on a two-stage process, where reasoning abilities are first transferred between LLMs and Small LLMs (SLMs) via Instruction-tuning on demonstrations provided by LLMs, and then the instructed models Self-refine their abilities through preference optimization strategies. In particular, the second phase operates refinement heuristics based on the Direct Preference Optimization algorithm, where the SLMs are elicited to deliver a series of reasoning paths by automatically sampling the generated responses and providing rewards using ground truths from the LLMs. Results obtained on commonsense and math reasoning tasks show that this approach significantly outperforms Instruction-tuning in both in-domain and out-domain scenarios, aligning the reasoning abilities of Smaller and Larger LLMs.

Exploring Refinement: Advancing Small LLMs through Self-refine Instruction-tuning

An Overview of the Problem

LLMs come in various sizes, from small, nimble variants to colossal, data-hungry leviathans. The larger models, such as GPT-3.5, have shown an impressive ability to handle complex reasoning tasks by breaking them down into manageable, sequential thought processes, a tactic known as Chain-of-Thought (CoT) prompting. However, these larger models face adoption hurdles due to their size and computational cost.
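To make the idea concrete, here is a minimal sketch of a few-shot Chain-of-Thought prompt in Python; the worked example and the prompt template are invented for illustration and are not taken from the paper.

```python
# Illustrative few-shot Chain-of-Thought prompt (not the paper's exact template).
COT_EXAMPLE = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend a worked example so the model is nudged to reason step by step."""
    return f"{COT_EXAMPLE}\nQ: {question}\nA: Let's think step by step."

print(build_cot_prompt("A farmer has 12 cows and sells 5. How many remain?"))
```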

In contrast, Small LLMs (SLMs) are easier to handle but traditionally lag in performing complex cognitive tasks without explicit step-by-step guidance. The paper I discuss here introduces an innovative approach called Self-refine Instruction-tuning. This methodology aims to enhance the reasoning capability of SLMs by learning from the 'thought process' exhibited by LLMs, followed by a self-refinement stage to further improve their understanding.

The Methodology Insight

The research paper presents a two-part method to boost the reasoning powers of smaller LLMs using a system that involves both instruction and self-refinement:

Phase 1: Instruction-tuning

The initial phase sets the stage: the SLMs are fine-tuned on demonstrations generated by LLMs. These demonstrations showcase how to solve specific problems step by step, aligning the student models (SLMs) with their teacher models (LLMs) in terms of reasoning paths.
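As a rough illustration of this phase, the sketch below shows how a teacher LLM's step-by-step demonstration might be packaged into a (prompt, completion) pair for supervised fine-tuning of the student SLM; the field names and prompt template are assumptions made for illustration, not the paper's exact format.

```python
# Sketch of turning teacher (LLM) demonstrations into SFT training pairs.
# Field names and the prompt template are illustrative assumptions.
from typing import TypedDict

class Demonstration(TypedDict):
    question: str
    rationale: str   # step-by-step reasoning produced by the teacher LLM
    answer: str

def to_sft_example(demo: Demonstration) -> dict:
    """Format one teacher demonstration as a (prompt, completion) pair
    that a student SLM can be fine-tuned on with a standard causal-LM loss."""
    prompt = f"Question: {demo['question']}\nAnswer: Let's think step by step."
    completion = f" {demo['rationale']} The answer is {demo['answer']}."
    return {"prompt": prompt, "completion": completion}

demo = Demonstration(
    question="If a train travels 60 km in 1.5 hours, what is its average speed?",
    rationale="Speed is distance divided by time: 60 / 1.5 = 40 km/h.",
    answer="40 km/h",
)
print(to_sft_example(demo))
```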

Phase 2: Self-refinement via Direct Preference Optimization

Once equipped with foundational reasoning skills distilled from LLMs, the SLMs enter a self-refinement phase. This stage uses Direct Preference Optimization (DPO), a preference-optimization method that avoids the explicit reward modeling of reinforcement-learning-based alignment, to fine-tune their problem-solving abilities. During refinement, the model samples several of its own reasoning paths for a problem and builds preference pairs by scoring them against ground truths provided by the teacher LLM, encouraging iterative self-improvement without constant human supervision.
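The sketch below shows the standard DPO objective on a single preference pair, following the original DPO formulation rather than the paper's exact implementation; the sequence log-probabilities are assumed to come from the student (policy) model and a frozen reference copy, with the "chosen" path being one that agrees with the teacher-provided ground truth.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Standard DPO loss over sequence log-probabilities of a preferred
    (e.g. correct reasoning path) vs. a dispreferred (incorrect) response."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # Push the policy to assign relatively higher likelihood to the chosen path.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy example: log-probs of two sampled reasoning paths, scored against ground truth.
loss = dpo_loss(
    policy_chosen_logp=torch.tensor([-12.3]),
    policy_rejected_logp=torch.tensor([-11.8]),
    ref_chosen_logp=torch.tensor([-12.5]),
    ref_rejected_logp=torch.tensor([-12.0]),
)
print(loss.item())
```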

Standout Results and Practical Implications

The paper quantitatively demonstrates that Self-refine Instruction-tuning convincingly outperforms traditional instruction-tuning across various reasoning tasks, both in-domain (tasks aligned with the training examples) and out-of-domain (tasks that diverge from them). This indicates not just improved reasoning skills but also an enhanced ability to generalize these skills to varied contexts, a significant advantage for deploying SLMs in real-world applications where flexibility and adaptability are crucial.

What's Next in AI?

The method proposes a systematic way to export high-quality reasoning capabilities from more powerful models to less demanding ones, potentially democratizing access to high-level AI reasoning. Looking forward, this methodology could lead to broader adoption of AI in diverse fields, from enhancing educational tools to powering intuitive user interfaces in software applications.

The continuous evolution of this self-refinement process may also prompt more robust forms of AI that can learn and adapt in live environments, ultimately requiring less human intervention in training sophisticated models.

The Big Picture

Self-refine Instruction-tuning appears to be a promising avenue for bridging the capability gap between LLMs and SLMs. By leveraging the sophisticated reasoning strategies of their larger counterparts, smaller models can potentially serve more complex roles than previously deemed feasible, all while maintaining operational and resource efficiency.

This research showcases a practical roadmap for enhancing the generalization capability of AI without continually expanding the model size, steering us toward a future where smaller, smarter models could become ubiquitous collaborators in cognitive tasks.

Authors (2)
  1. Leonardo Ranaldi (18 papers)
  2. André Freitas (3 papers)
Citations (3)