Alphazero-like Tree-Search can Guide Large Language Model Decoding and Training (2309.17179v2)

Published 29 Sep 2023 in cs.LG, cs.AI, and cs.CL

Abstract: Recent works like Tree-of-Thought (ToT) and Reasoning via Planning (RAP) aim to augment the reasoning capabilities of LLMs by using tree-search algorithms to guide multi-step reasoning. These methods rely on prompting a pre-trained model to serve as a value function and focus on problems with low search depth. As a result, these methods will not work in domains where the pre-trained LLM does not have enough knowledge to serve as an effective value function or in domains that require long-horizon planning. To address these limitations, we present an AlphaZero-like tree-search learning framework for LLMs (termed TS-LLM), systematically illustrating how tree-search with a learned value function can guide LLM decoding. TS-LLM distinguishes itself in two key ways. (1) Leveraging a learned value function and AlphaZero-like algorithms, our approach can be generally adaptable to a wide range of tasks, LLMs of any size, and tasks of varying search depths. (2) Our approach can guide LLMs during both inference and training, iteratively improving the LLM. Empirical results across reasoning, planning, alignment, and decision-making tasks show that TS-LLM outperforms existing approaches and can handle trees with a depth of 64.

AlphaZero-Like Tree-Search Can Guide LLM Decoding and Training

The paper presents TS-LLM, a framework that integrates AlphaZero-like tree-search methods into both the decoding and training of LLMs. The work aims to strengthen models' reasoning, planning, alignment, and decision-making capabilities by addressing the limitations of earlier approaches such as Tree-of-Thought (ToT) and Reasoning via Planning (RAP).

Core Contributions

  1. Integration of Tree-Search Algorithms with LLMs: TS-LLM employs tree-search algorithms inspired by AlphaZero, using a learned value function to guide the LLM during both inference and training. This departs from prior methods that relied on prompting a pre-trained LLM to act as the value function, which restricted them to tasks with shallow search depth (a minimal decoding sketch follows this list).
  2. Scalability and Versatility: The framework supports a wide array of tasks and model sizes, and can operate on search trees up to a depth of 64. This allows TS-LLM to handle complex tasks requiring extensive analytical depth and long-horizon planning.
  3. Enhanced Training Paradigm: TS-LLM goes beyond inference-time improvement, proposing a training paradigm in which improved trajectories from tree search guide further training by combining policy distillation and value-function learning.
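
The sketch below illustrates what such value-guided decoding can look like: an AlphaZero-style Monte Carlo Tree Search (PUCT selection, expansion, value backup) over LLM-proposed reasoning steps. It is a minimal illustration under stated assumptions, not the paper's code; `policy_step_logprobs`, `value_estimate`, and `is_terminal` are hypothetical stand-ins for the policy LLM, the learned value network, and a task-specific termination check.

```python
import math
import random
from dataclasses import dataclass, field

C_PUCT = 1.0  # exploration constant in the PUCT selection rule


@dataclass
class Node:
    state: str                      # partial solution: prompt plus reasoning steps so far
    prior: float = 1.0              # policy prior of the action that produced this node
    visits: int = 0
    value_sum: float = 0.0
    children: list = field(default_factory=list)

    @property
    def q(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0


def policy_step_logprobs(state: str, k: int = 4):
    """Hypothetical LLM call: propose k candidate next steps with log-probabilities."""
    return [(f"{state} | step{i}", math.log(1.0 / k)) for i in range(k)]


def value_estimate(state: str) -> float:
    """Hypothetical learned value network scoring a partial solution in [-1, 1]."""
    return random.uniform(-1.0, 1.0)


def is_terminal(state: str, max_depth: int = 64) -> bool:
    """Stop expanding once the trajectory reaches the maximum search depth."""
    return state.count("step") >= max_depth


def select_child(node: Node) -> Node:
    """PUCT rule: trade off the mean value Q against prior-weighted exploration."""
    sqrt_total = math.sqrt(node.visits)
    return max(
        node.children,
        key=lambda c: c.q + C_PUCT * c.prior * sqrt_total / (1 + c.visits),
    )


def simulate(root: Node) -> None:
    """One MCTS simulation: select a leaf, expand it, evaluate it, back the value up."""
    path, node = [root], root
    while node.children:
        node = select_child(node)
        path.append(node)
    if not is_terminal(node.state):
        for text, logp in policy_step_logprobs(node.state):
            node.children.append(Node(state=text, prior=math.exp(logp)))
    value = value_estimate(node.state)
    for n in reversed(path):
        n.visits += 1
        n.value_sum += value


def search(prompt: str, num_simulations: int = 32) -> str:
    """Run simulations from the prompt and return the most-visited first step."""
    root = Node(state=prompt)
    for _ in range(num_simulations):
        simulate(root)
    best = max(root.children, key=lambda c: c.visits)
    return best.state
```

In the training phase described in item 3, the trajectories favored by such a search would then serve as targets for policy distillation and value-function learning; a schematic of that loop appears later in this summary.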

Empirical Evaluation

The paper reports empirical results showing that TS-LLM outperforms existing strategies in several domains. In complex reasoning tasks in particular, the value-guided tree search provides a pronounced edge over simpler search procedures such as depth-first or breadth-first search.

Numerical Results and Claims

The research claims that TS-LLM outperforms existing baselines in domains such as planning and decision-making. Numerical evaluations indicate that its ability to conduct deeper searches translates into better performance on tasks of varying complexity, and the empirical outcomes suggest a scalable improvement over conventional LLM decoding and fine-tuning methodologies.

Theoretical and Practical Implications

Theoretically, this work proposes a paradigm shift by systematically bringing well-studied tree-search algorithms from domains such as board games into LLM decoding and training, an area primarily dominated by gradient-based learning. Practically, TS-LLM stands to benefit the many domains where LLMs are applied, driving improved performance through enhanced reasoning capabilities.
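
As a rough illustration of how search-generated data can feed that gradient-based learning, the loop below alternates search-based trajectory collection with supervised updates. This is a schematic sketch, not the paper's implementation: `tree_search_decode`, `distill_policy`, and `fit_value` are hypothetical stand-ins for AlphaZero-like decoding, distillation on searched trajectories, and value regression toward observed returns.

```python
import random


def tree_search_decode(prompt: str) -> tuple[str, float]:
    """Stand-in for value-guided tree search: returns a trajectory and its final reward."""
    return prompt + " -> searched answer", random.uniform(0.0, 1.0)


def distill_policy(trajectories: list[tuple[str, float]]) -> float:
    """Stand-in for fine-tuning the policy on search-improved trajectories; returns a pseudo-loss."""
    return sum(1.0 - reward for _, reward in trajectories) / len(trajectories)


def fit_value(trajectories: list[tuple[str, float]]) -> float:
    """Stand-in for regressing the value function toward observed returns; returns a pseudo-loss."""
    mean = sum(reward for _, reward in trajectories) / len(trajectories)
    return sum((reward - mean) ** 2 for _, reward in trajectories) / len(trajectories)


def improve(prompts: list[str], iterations: int = 3) -> None:
    """Iterative improvement: search collects data, then policy and value are updated."""
    for it in range(iterations):
        trajectories = [tree_search_decode(p) for p in prompts]
        policy_loss = distill_policy(trajectories)   # policy distillation step
        value_loss = fit_value(trajectories)         # value-function learning step
        print(f"iteration {it}: policy_loss={policy_loss:.3f} value_loss={value_loss:.3f}")


improve(["Solve: 12 * 7 = ?", "Plan: reach the goal in a gridworld"])
```

In practice each stand-in would be replaced by LLM fine-tuning and value-network training steps, with the updated models fed back into the next round of search.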

Future Directions

The integration of tree-search methods into LLMs opens a range of interesting future explorations:

  • Algorithmic Refinements: Investigating more sophisticated tree-search algorithms could yield further performance improvements, especially in complex reasoning tasks.
  • Scaling: Addressing computational overheads associated with tree-search in LLMs might facilitate application to even larger models and datasets.
  • Generalization Across Domains: Assessing TS-LLM's effectiveness across a broader array of tasks, including those outside traditional LLM applications.

In summary, this work signifies an important step in enhancing LLMs' capabilities, with potential benefits spanning various AI applications. Through the innovative use of tree-search algorithms, it challenges the community to rethink traditional training and inference strategies within machine learning.

Authors (7)
  1. Xidong Feng
  2. Ziyu Wan
  3. Muning Wen
  4. Ying Wen
  5. Weinan Zhang
  6. Jun Wang
  7. Stephen Marcus McAleer