
Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning (2406.14283v4)

Published 20 Jun 2024 in cs.AI

Abstract: LLMs have demonstrated impressive capability in many natural language tasks. However, the auto-regressive generation process makes LLMs prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning. In this paper, by casting multi-step reasoning of LLMs as a heuristic search problem, we aim to alleviate the pathology by introducing Q*, a general, versatile and agile framework for guiding LLMs decoding process with deliberative planning. By learning a plug-and-play Q-value model as heuristic function for estimating expected future rewards, our Q* can effectively guide LLMs to select the most promising next reasoning step without fine-tuning LLMs for the current task, which avoids the significant computational overhead and potential risk of performance degeneration on other tasks. Extensive experiments on GSM8K, MATH and MBPP demonstrate the superiority of our method, contributing to improving the reasoning performance of existing open-source LLMs.

Authors (7)
  1. Chaojie Wang (28 papers)
  2. Yanchen Deng (10 papers)
  3. Shuicheng Yan (275 papers)
  4. Zhiyi Lyu (2 papers)
  5. Liang Zeng (31 papers)
  6. Jujie He (6 papers)
  7. Bo An (128 papers)
Citations (20)

Summary

Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning

Overview

This paper introduces Q*, a novel framework designed to enhance the multi-step reasoning capabilities of LLMs through deliberative planning. Multi-step reasoning is essential for tasks such as solving math word problems and generating code. However, the auto-regressive generation mechanism leaves LLMs susceptible to errors, hallucinations, and inconsistencies: outputs are built incrementally, so an early mistake can propagate through every subsequent step.

Key Contributions

The authors propose Q* to alleviate the inadequacies of standard LLMs in tasks requiring sequential reasoning. The main innovations and contributions of this work can be summarized as follows:

  1. Formalization as MDP: Multi-step reasoning for LLMs is cast as a Markov Decision Process (MDP), where each state represents the current input prompt concatenated with the reasoning trace so far.
  2. Estimation of Optimal Q-values: Various strategies are proposed for estimating the optimal Q-values of state-action pairs, including offline reinforcement learning, rollouts, and leveraging stronger LLMs.
  3. Deliberative Planning using A*: Q* integrates A* search to guide LLMs by selecting the most promising next reasoning step, leveraging plug-and-play Q-value models as heuristic functions.
  4. Experimental Validation: Extensive experiments on benchmark datasets GSM8K, MATH, and MBPP validate the superior performance of Q* over baselines.

Technical Approach

The framework utilizes A* search for heuristic-based exploration of reasoning paths. The Q* algorithm estimates the expected rewards (Q-values) for different steps within the reasoning process, which guide the model in selecting the most promising subsequent steps. These Q-values are calculated without domain-specific heuristics or laborious task-specific fine-tuning, thus making Q* generalizable across diverse multi-step reasoning tasks.
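
A minimal best-first search sketch of this idea follows; it illustrates A*-style expansion guided by a Q-value heuristic rather than reproducing the paper's implementation. The callables propose_steps, q_value, is_terminal, and path_utility are hypothetical placeholders for the frozen LLM's candidate next steps, the plug-and-play Q-value model, a termination check, and the aggregated utility of the partial reasoning trace.

```python
import heapq
import itertools
from typing import Callable, List

# Illustrative sketch only: A*-style decoding with a Q-value heuristic.
# All callables are hypothetical stand-ins, not the paper's actual components.
def a_star_decode(
    question: str,
    propose_steps: Callable[[str], List[str]],   # frozen LLM: state -> candidate next steps
    q_value: Callable[[str, str], float],        # heuristic h(s, a): estimated future reward
    is_terminal: Callable[[str], bool],          # does the state end with a final answer?
    path_utility: Callable[[List[str]], float],  # aggregated utility g of steps so far
    max_expansions: int = 100,
) -> str:
    counter = itertools.count()                  # tie-breaker so the heap never compares states
    # Frontier entries: (-f, tie, state_text, steps_so_far); heapq pops the highest f first.
    frontier = [(0.0, next(counter), question, [])]
    best_state = question
    for _ in range(max_expansions):
        if not frontier:
            break
        _, _, state, steps = heapq.heappop(frontier)
        if is_terminal(state):
            return state                         # full reasoning trace including the answer
        best_state = state
        for step in propose_steps(state):
            new_steps = steps + [step]
            new_state = state + "\n" + step
            # f = g + h: utility of the partial trace plus estimated future reward
            f = path_utility(new_steps) + q_value(state, step)
            heapq.heappush(frontier, (-f, next(counter), new_state, new_steps))
    return best_state
```

In practice one would also cap the frontier size (beam-style) and batch the LLM and Q-value calls; those engineering choices are orthogonal to the search logic sketched here.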

Formalization of Multi-step Reasoning as MDP

In the MDP formalization:

  • States s_t are composed of the input prompt and the reasoning steps completed up to time t.
  • Actions a_t denote the subsequent reasoning step.
  • Rewards R measure how well the task is solved and are applied in a delayed manner, based on the final solution's accuracy.
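
In standard reinforcement-learning notation, this setup can be sketched as follows. This is a hedged reconstruction from the summary above, not the paper's exact formulation; in particular, how the accumulated utility g and the Q-value heuristic are weighted and combined is an assumption here.

```latex
% Hedged sketch of the implied formal structure (assumptions noted above).
\begin{align*}
  s_t &= (q,\, a_1, \dots, a_{t-1})
      && \text{state: prompt } q \text{ plus reasoning steps so far} \\
  Q^*(s_t, a_t) &= R(s_t, a_t) + \max_{a_{t+1}} Q^*(s_{t+1}, a_{t+1})
      && \text{Bellman optimality under deterministic transitions} \\
  f(s_t, a_t) &= g(s_t) + Q^*(s_t, a_t)
      && \text{A*-style score: accumulated utility } g \text{ plus heuristic}
\end{align*}
```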

Estimation Methods for Optimal Q-values

To derive a reliable heuristic function for real-time application, the paper explores:

  • Offline Reinforcement Learning: Q-values are refined iteratively via Fitted Q-Iteration, applying the Bellman equation to a dataset of previously sampled trajectories.
  • Learning from Rollout: By generating numerous rollouts from each intermediate state, the highest-rewarding trajectories are utilized to compute Q-value labels.
  • Completion with Stronger LLMs: State-action pairs are enriched by completing trajectories using more capable LLMs like GPT-4.
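
As a concrete illustration of the rollout option, here is a minimal sketch. The helper names (sample_rollouts, final_reward) are hypothetical stand-ins for sampling several completions from the frozen LLM and scoring a finished trajectory with the delayed task reward; the paper's actual labelling pipeline may aggregate these differently.

```python
from typing import Callable, List, Tuple

# Minimal sketch of rollout-based Q-value labelling (assumed helpers, see above).
def q_value_labels(
    state: str,
    candidate_steps: List[str],
    sample_rollouts: Callable[[str, int], List[str]],  # state -> K completed trajectories
    final_reward: Callable[[str], float],              # finished trajectory -> delayed reward
    num_rollouts: int = 8,
) -> List[Tuple[str, float]]:
    """Label each candidate step with the best reward reachable from it."""
    labels = []
    for step in candidate_steps:
        next_state = state + "\n" + step
        rollouts = sample_rollouts(next_state, num_rollouts)
        # Take the highest-rewarding completion as the Q-value label for (state, step),
        # mirroring the "highest-rewarding trajectories" described above.
        best = max(final_reward(traj) for traj in rollouts)
        labels.append((step, best))
    return labels
```

The resulting (state, step, label) pairs would then be used to fit the plug-and-play Q-value model that serves as the heuristic during decoding.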

Experimental Results

GSM8K

For math word problem-solving in the GSM8K dataset:

  • Q* outperformed other approaches by leveraging a process reward model (PRM) to score intermediate steps and a Q-value model (QVM) to guide reasoning.
  • Notably, Q* achieved an accuracy of 80.8% with Llama-2-7b, outperforming baselines that include Best-of-N and PPO-based fine-tuning.

MATH

In the MATH dataset, using Llama-2-7b and DeepSeek-Math-7b models:

  • Q* exhibited significant accuracy improvement over baseline methods, reaching up to 55.4%.
  • These results illustrate the efficacy of Q* in complex, multi-step mathematical reasoning beyond the capacities of the underlying LLM alone.

MBPP

For code generation in the MBPP dataset:

  • Q* demonstrated a marked performance enhancement, achieving 77.0% accuracy using the CodeQwen1.5-7b-Chat model.
  • The heuristic-led approach effectively managed the multi-step nature of code generation, surpassing conventional Best-of-N methods.

Implications and Future Prospects

The Q* framework's generalizability and efficiency present a notable advancement in the domain of LLMs for multi-step reasoning tasks. By bypassing laborious task-specific fine-tuning and extensive reliance on domain-specific knowledge, Q* offers a scalable solution adaptable across various domains.

Future developments could explore:

  • Extending Q* to other domains necessitating multi-step logical reasoning, such as scientific research synthesis or financial modeling.
  • Enhancing the efficiency of Q-value estimation to further reduce computational overhead.
  • Integrating real-time feedback mechanisms to dynamically adapt and improve Q-values during deployment.

In conclusion, Q* significantly improves the competency of LLMs in addressing multi-step reasoning problems, offering a promising direction for further advancements in AI-driven problem-solving frameworks.
