
Learning to Reason via Program Generation, Emulation, and Search (2405.16337v3)

Published 25 May 2024 in cs.CL and cs.AI

Abstract: Program synthesis with large language models (LMs) has unlocked a large set of reasoning abilities; code-tuned LMs have proven adept at generating programs that solve a wide variety of algorithmic symbolic manipulation tasks (e.g. word concatenation). However, not all reasoning tasks are easily expressible as code, e.g. tasks involving commonsense reasoning, moral decision-making, and sarcasm understanding. Our goal is to extend an LM's program synthesis skills to such tasks and evaluate the results via pseudo-programs, namely Python programs where some leaf function calls are left undefined. To that end, we propose Code Generation and Emulated EXecution (CoGEX). CoGEX works by (1) training LMs to generate pseudo-programs, (2) teaching them to emulate their generated program's execution, including those leaf functions, allowing the LM's knowledge to fill in the execution gaps; and (3) using them to search over many programs to find an optimal one. To adapt the CoGEX model to a new task, we introduce a method for performing program search to find a single program whose pseudo-execution yields optimal performance when applied to all the instances of a given dataset. We show that our approach yields large improvements compared to standard in-context learning approaches on a battery of tasks, both algorithmic and soft reasoning. This result thus demonstrates that code synthesis can be applied to a much broader class of problems than previously considered. Our released dataset, fine-tuned models, and implementation can be found at \url{https://github.com/nweir127/CoGEX}.


The paper "Learning to Reason via Program Generation, Emulation, and Search" proposes a methodology named CoGEX (Code Generation and Emulated Execution). The approach aims to extend the reasoning capabilities of large language models (LMs) from strictly algorithmic tasks to softer reasoning challenges such as commonsense reasoning, moral decision-making, and understanding sarcasm. LMs trained for program synthesis excel at technical computations but are less suited to subjective or nuanced reasoning tasks. CoGEX addresses this by generating what the authors term "pseudo-programs": Python programs in which some leaf function calls are left undefined, allowing the LM to draw on its latent knowledge to fill the gaps during emulated execution.

Methodology

CoGEX operates by training LMs on pseudo-programs, enabling them to generate and "emulate" their execution. This involves creating code scripts that incorporate both definable reasoning processes and placeholders for more ambiguous reasoning steps. The LM predicts the outcomes of these placeholders during execution, simulating how these steps would resolve given the contextual information in the model's knowledge base.
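The pseudo-program idea above can be illustrated with a minimal sketch. The function names (`emulate`, `detect_sarcasm`, `get_tone`) and the lookup table standing in for the LM's latent knowledge are hypothetical, not from the paper: concrete control flow is ordinary Python, while the leaf calls would, in CoGEX, be resolved by the model's emulated execution rather than by real code.

```python
def emulate(leaf_name, *args):
    """Stand-in for LM emulation of an undefined leaf function.

    In CoGEX the model itself predicts these return values; here a
    hand-written lookup table plays that role for illustration.
    """
    knowledge = {
        ("get_tone", "Oh great, another Monday."): "negative",
        ("literal_sentiment", "Oh great, another Monday."): "positive",
    }
    return knowledge[(leaf_name, *args)]

def detect_sarcasm(utterance):
    # Pseudo-program: the structure is concrete Python, but the
    # semantics of the leaf calls are filled in by emulation.
    tone = emulate("get_tone", utterance)
    literal = emulate("literal_sentiment", utterance)
    # Heuristic: sarcasm = literal sentiment clashes with actual tone.
    return tone != literal

print(detect_sarcasm("Oh great, another Monday."))  # prints True
```

The key property is that the program's branching and composition are checkable Python, while the "soft" judgments live entirely in the emulated leaves.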

The authors introduce a program search mechanism named CoTACS, which allows for task adaptation by identifying a general program that best fits a dataset. CoTACS employs a methodical search over potential programs generated by CoGEX to find one that optimizes performance across different data instances without updating LM parameters. This step embodies a crucial shift from solving tasks by instance-specific processing toward a broader application of a single generalizable program.
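The search step can be sketched as follows, assuming candidate programs are represented as plain Python callables standing in for generated pseudo-programs; the function names and the toy even/odd task are illustrative, not the paper's actual benchmarks. The idea is simply to score each candidate by pseudo-executing it on every instance of a small dev set and keep the single best program, with no parameter updates.

```python
def cotacs_search(candidates, dev_set):
    """Return the candidate program whose outputs best match the dev set."""
    def score(program):
        return sum(program(x) == y for x, y in dev_set) / len(dev_set)
    return max(candidates, key=score)

# Toy task: classify an integer as "even" or "odd".
dev_set = [(1, "odd"), (2, "even"), (3, "odd"), (4, "even")]
candidates = [
    lambda x: "even",                      # trivial constant baseline
    lambda x: "odd" if x % 2 else "even",  # correct program
    lambda x: "even" if x % 2 else "odd",  # inverted program
]

best = cotacs_search(candidates, dev_set)
print(best(7))  # prints odd
```

Once found, the single winning program is reused for all test instances of the task, which is what distinguishes this search from per-instance program generation.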

Results

Experiments spanning a variety of reasoning tasks—ranging from symbolic manipulation to commonsense questions—demonstrate that CoGEX outperforms baseline models, including few-shot examples from off-the-shelf LMs and instruction-tuned Alpaca models. Remarkably, CoGEX shows a significant improvement in less conventional reasoning tasks where traditional program synthesis would be ineffectual.

For instance, on tasks demanding numerical operations such as "sum of large numbers," CoGEX is markedly effective relative to established natural-language reasoning paradigms. Additionally, despite its code-centric process, CoGEX retains strong performance on text-oriented tasks like emotion classification and commonsense reasoning, showcasing its flexibility and broader applicability.

Implications and Future Directions

This approach suggests a new trajectory for AI reasoning development, bridging the gap between hard-coded reasoning processes and the softer, context-dependent inferences needed for human-like cognition. The integration of code generation and emulation allows LMs to apply programmatic reasoning frameworks to widely diverse task domains, effectively broadening the class of problems addressable by program synthesis methods.

Future work could refine these pseudo-programs for deeper contextual understanding and improve program generalization across a wider variety of datasets. Further gains could also come from improving the emulation process the LM performs during pseudo-execution, bringing its reasoning closer to faithful program semantics. This strategy points toward LMs that can operate autonomously across an expanded range of problem-solving contexts.

Overall, CoGEX represents a significant advance in LM reasoning, showing how code generation, paired with emulated execution and program search, can yield more sophisticated, adaptable AI systems.

Authors (5)
  1. Nathaniel Weir (17 papers)
  2. Muhammad Khalifa (24 papers)
  3. Linlu Qiu (14 papers)
  4. Orion Weller (30 papers)
  5. Peter Clark (108 papers)