Stream of Search (SoS): Learning to Search in Language (2404.03683v1)

Published 1 Apr 2024 in cs.LG, cs.AI, and cs.CL

Abstract: LLMs are rarely shown fruitful mistakes while training. They then struggle to look beyond the next token, suffering from a snowballing of errors and struggling to predict the consequence of their actions several steps ahead. In this paper, we show how LLMs can be taught to search by representing the process of search in language, as a flattened string -- a stream of search (SoS). We propose a unified language for search that captures an array of different symbolic search strategies. We demonstrate our approach using the simple yet difficult game of Countdown, where the goal is to combine input numbers with arithmetic operations to reach a target number. We pretrain a transformer-based LLM from scratch on a dataset of streams of search generated by heuristic solvers. We find that SoS pretraining increases search accuracy by 25% over models trained to predict only the optimal search trajectory. We further finetune this model with two policy improvement methods: Advantage-Induced Policy Alignment (APA) and Self-Taught Reasoner (STaR). The finetuned SoS models solve 36% of previously unsolved problems, including problems that cannot be solved by any of the heuristic solvers. Our results indicate that LLMs can learn to solve problems via search, self-improve to flexibly use different search strategies, and potentially discover new ones.

Summary

  • The paper introduces the Stream of Search (SoS) framework, which represents the full process of search, mistakes and backtracking included, as a flattened string in a unified language.
  • Pretraining on diverse search trajectories improves search accuracy on Countdown by 25% over training on optimal trajectories alone.
  • Finetuning with the policy improvement methods APA and STaR lets the model solve 36% of previously unsolved problems, including some unsolvable by any of the heuristic solvers.

Exploring the Stream of Search: A Framework for Learning to Search within LLMs

Introduction to Stream of Search

Problem-solving with LLMs often neglects an aspect integral to human learning and creativity: the ability to explore, make mistakes, and learn from them. Most models are trained on a clean, mistake-free diet of data, which limits their ability to recognize errors or to explore alternative solutions. This paper proposes a novel framework, Stream of Search (SoS), which teaches LLMs the art of search and backtracking by representing the entire search process, productive dead ends included, in a unified language. The method is demonstrated on the Countdown game, where it yields a significant improvement in solving ability over models trained solely on optimal solution paths.

Unified Language for Search

At the heart of the SoS framework is a systematic representation of search in a unified language, covering key operations such as exploration, backtracking, and pruning. By expressing these operations in language, the paper opens the door to models that can autonomously navigate problem spaces, engage with different strategies, and potentially invent new ones. Such a unified language for search not only enhances a model's problem-solving toolkit but also enriches its ability to learn in a more human-like, trial-and-error manner.
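To make the flattened representation concrete, here is a minimal sketch (illustrative only; the paper's actual trace vocabulary and heuristic solvers differ) of how a depth-first search over a toy Countdown instance can be serialized: every exploration step, dead end, and backtrack is appended to one linear text trace, the kind of sequence an SoS model is trained to predict.

```python
from itertools import combinations

# Allowed operations; division only when exact, results must be non-negative.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a // b if b != 0 and a % b == 0 else None,
}

def dfs(nums, target, trace):
    """Depth-first search over Countdown states, logging every step
    (exploration, success, backtracking) into a flat text trace."""
    trace.append(f"state: {sorted(nums)}")
    if target in nums:
        trace.append(f"goal reached: {target}")
        return True
    if len(nums) == 1:
        trace.append("dead end, backtrack")
        return False
    for i, j in combinations(range(len(nums)), 2):
        rest = [n for k, n in enumerate(nums) if k not in (i, j)]
        for x, y in ((nums[i], nums[j]), (nums[j], nums[i])):
            for op, fn in OPS.items():
                r = fn(x, y)
                if r is None or r < 0:
                    continue  # prune invalid moves
                trace.append(f"try: {x} {op} {y} = {r}")
                if dfs(rest + [r], target, trace):
                    return True
    trace.append("exhausted, backtrack")
    return False

trace = []
dfs([6, 4, 3], 24, trace)
print("\n".join(trace))  # the flattened "stream of search"
```

On this instance the trace dead-ends on 6 + 4 and 6 - 4 before succeeding with 6 * 4 = 24, so the failed branches and the backtracking remain visible in the training string; these are exactly the "fruitful mistakes" the abstract refers to.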

Training and Evaluation

The Countdown game, chosen because it is simple to state yet hard to solve, served as the proving ground for SoS. A transformer-based model was trained from scratch on a dataset of search trajectories generated by heuristic solvers employing diverse strategies. Compared to models trained only on optimal paths, SoS pretraining increased search accuracy by 25%. This gain speaks to the value of exposing models to the full, messy process of search and decision-making rather than only to its polished end product.
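Part of what makes Countdown a good testbed is that outcomes are machine-checkable, so search accuracy can be measured automatically. A hedged sketch of such a verifier (illustrative, not the paper's code; it assumes for illustration that a solution must use every input number exactly once):

```python
import ast
import operator

# Map AST operator nodes to arithmetic functions.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def _leaves(node):
    """Collect the integer leaves of an arithmetic expression AST."""
    if isinstance(node, ast.Constant):
        return [node.value]
    if isinstance(node, ast.BinOp):
        return _leaves(node.left) + _leaves(node.right)
    raise ValueError("unsupported expression")

def _evaluate(node):
    """Recursively evaluate a parsed arithmetic expression."""
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    raise ValueError("unsupported expression")

def solves(expr: str, inputs: list[int], target: int) -> bool:
    """True iff expr uses exactly the input numbers and reaches the target."""
    tree = ast.parse(expr, mode="eval").body
    return sorted(_leaves(tree)) == sorted(inputs) and _evaluate(tree) == target

print(solves("(25 - 18) * (50 - 47)", [25, 18, 50, 47], 21))  # True
```

A checker like this is all that is needed both to score a model's trajectories and, later, to filter self-generated data for finetuning.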

Policy Improvement Techniques

Building on SoS pretraining, the paper tests the model's capacity for self-improvement with two policy improvement methods: Advantage-Induced Policy Alignment (APA) and the Self-Taught Reasoner (STaR). The finetuned SoS models solved 36% of previously unsolved problems, including some beyond the reach of any of the heuristic solvers used to generate the training data. This result underscores the potential of LLMs to transcend the limits of their initial training data through policy improvement.
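The STaR half of this loop is simple to state. The schematic below (hypothetical stubs rather than the paper's implementation; APA, an RL-based method, is omitted) samples traces from the current model, keeps those whose final answer verifies, and finetunes on the kept traces, so the model bootstraps from its own correct searches:

```python
# Schematic STaR-style iteration; the three stubs stand in for a real
# decoder, verifier, and training loop and must be filled in.

def sample_trace(model, problem):
    """Stub: autoregressively decode one stream-of-search trace."""
    raise NotImplementedError

def trace_solves(trace, problem) -> bool:
    """Stub: check that the trace's final expression hits the target,
    e.g. with a verifier like `solves` above."""
    raise NotImplementedError

def finetune(model, examples):
    """Stub: one supervised pass over (problem, correct trace) pairs."""
    raise NotImplementedError

def star_iteration(model, problems, samples_per_problem=8):
    """One round of self-improvement: sample, filter, finetune."""
    kept = []
    for problem in problems:
        for _ in range(samples_per_problem):
            trace = sample_trace(model, problem)
            if trace_solves(trace, problem):
                kept.append((problem, trace))
                break  # one verified trace per problem is enough
    return finetune(model, kept)
```

Iterating this loop, plausibly because correct traces recombine moves from different heuristic strategies, is what allows the model to solve instances that no single solver in the training mix could.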

Implications and Future Directions

The Stream of Search framework reshapes our understanding of what LLMs can do in planning, problem-solving, and learning. By training models to engage with the messy, exploratory process of search, we can unlock more dynamic and versatile problem-solving abilities. This research provides a tangible step toward equipping LLMs with the tools for internal search and discovery, and it lays the groundwork for future systems that can learn, adapt, and innovate in more human-like ways.

Furthermore, the implications for practical applications are vast, ranging from enhanced problem-solving in specific domains to the development of more generalized AI capable of tackling a broader spectrum of challenges. As we continue to push the boundaries of what LLMs can achieve, frameworks like SoS will be instrumental in guiding their evolution towards more sophisticated and creative forms of intelligence.

Concluding Thoughts

The Stream of Search framework marks a significant advance in LLM research. By embedding the intricacies of search within language itself, it opens new avenues for models to learn, grow, and innovate. Looking ahead, the prospect of models trained this way discovering entirely new search strategies, or solving problems that have long evaded algorithmic solutions, is a testament to the untapped potential of LLMs.
