Stream of Search (SoS): Learning to Search in Language (2404.03683v1)
Abstract: LLMs are rarely shown fruitful mistakes while training. They then struggle to look beyond the next token, suffering from a snowballing of errors and struggling to predict the consequence of their actions several steps ahead. In this paper, we show how LLMs can be taught to search by representing the process of search in language, as a flattened string -- a stream of search (SoS). We propose a unified language for search that captures an array of different symbolic search strategies. We demonstrate our approach using the simple yet difficult game of Countdown, where the goal is to combine input numbers with arithmetic operations to reach a target number. We pretrain a transformer-based LLM from scratch on a dataset of streams of search generated by heuristic solvers. We find that SoS pretraining increases search accuracy by 25% over models trained to predict only the optimal search trajectory. We further finetune this model with two policy improvement methods: Advantage-Induced Policy Alignment (APA) and Self-Taught Reasoner (STaR). The finetuned SoS models solve 36% of previously unsolved problems, including problems that cannot be solved by any of the heuristic solvers. Our results indicate that LLMs can learn to solve problems via search, self-improve to flexibly use different search strategies, and potentially discover new ones.
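To make the idea of a flattened search trace concrete, here is a minimal sketch of a depth-first Countdown solver that records every explored state, dead end, and backtrack in a single string. It is an illustration only, not the paper's heuristic solvers or its trace format. The function name `countdown_dfs`, the log vocabulary (`state`, `try`, `dead end, backtrack`, `goal reached`), and the ` | ` separator are all hypothetical. The sketch also assumes that inputs are positive integers, that every input number must be used, and that intermediate results must stay positive integers.

```python
from itertools import combinations


def countdown_dfs(numbers, target):
    """Depth-first search over Countdown states, logging every step.

    Returns (solved, trace). The trace is one flattened string that keeps
    the mistakes and backtracking, not just the final solution path.
    The trace wording is a hypothetical format chosen for illustration.
    """
    log = []

    def apply_ops(a, b):
        # Candidate results for a pair with a >= b.
        # Assumption: only positive integer results are allowed.
        yield f"{a}+{b}", a + b
        yield f"{a}*{b}", a * b
        if a - b > 0:
            yield f"{a}-{b}", a - b
        if b != 0 and a % b == 0:
            yield f"{a}/{b}", a // b

    def search(state):
        log.append(f"state {state}")
        if len(state) == 1:
            # Assumption: all inputs must be combined before checking the goal.
            if state[0] == target:
                log.append("goal reached")
                return True
            log.append("dead end, backtrack")
            return False
        for i, j in combinations(range(len(state)), 2):
            a, b = max(state[i], state[j]), min(state[i], state[j])
            rest = [x for k, x in enumerate(state) if k not in (i, j)]
            for expr, value in apply_ops(a, b):
                log.append(f"try {expr}={value}")
                if search(rest + [value]):
                    return True
        return False

    solved = search(list(numbers))
    return solved, " | ".join(log)


if __name__ == "__main__":
    solved, trace = countdown_dfs([2, 3, 4], 10)
    print(solved)
    print(trace)
```

A model trained only on the optimal trajectory would see just the successful operations. A trace like the one produced here also exposes the failed branches, which is the kind of signal the abstract argues is missing from standard training data.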