
If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents (2401.00812v2)

Published 1 Jan 2024 in cs.CL

Abstract: The prominent LLMs of today differ from past LLMs not only in size, but also in the fact that they are trained on a combination of natural language and formal language (code). As a medium between humans and computers, code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity. In this survey, we present an overview of the various benefits of integrating code into LLMs' training data. Specifically, beyond enhancing LLMs in code generation, we observe that these unique properties of code help (i) unlock the reasoning ability of LLMs, enabling their applications to a range of more complex natural language tasks; (ii) steer LLMs to produce structured and precise intermediate steps, which can then be connected to external execution ends through function calls; and (iii) take advantage of code compilation and execution environment, which also provides diverse feedback for model improvement. In addition, we trace how these profound capabilities of LLMs, brought by code, have led to their emergence as intelligent agents (IAs) in situations where the ability to understand instructions, decompose goals, plan and execute actions, and refine from feedback are crucial to their success on downstream tasks. Finally, we present several key challenges and future directions of empowering LLMs with code.

Overview of Integrating Code into LLM Training

Training data plays a pivotal role in shaping what LLMs can do, and including formal language, i.e. code, in that data greatly broadens their capabilities. Code is by nature structured and logically consistent, and models that learn these qualities carry them over to other tasks. The paper examines how code training not only improves LLMs' core programming abilities but also strengthens their general problem-solving skills and connects them to a variety of external execution ends.

Enhancements in LLM Abilities

Programming Proficiency

One of the most noticeable improvements from code integration is in programming competence itself. Code-trained LLMs both write and assess code well, surpassing earlier models on standard benchmarks. They handle multiple programming languages and have been applied to tasks ranging from software development to data analysis. This proficiency extends to evaluating code for errors and performance, indicating an understanding of code that goes beyond mere generation.

Complex Reasoning Capabilities

Another critical enhancement is in sophisticated reasoning skills. LLMs show improved performance in tasks that require methodical, step-by-step processing—key elements of programming logic. This has been evident in models showing improved 'chain of thought' capabilities, which allow them to break down complex tasks into smaller, more manageable steps, resulting in better decision-making processes.
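This style of reasoning is often made concrete by having the model express its intermediate steps as an executable program rather than free-form text (as in program-of-thoughts or PAL-style prompting, discussed in the survey). The snippet below is a hand-written stand-in for such model output, not actual generation; the word problem and its numbers are invented for illustration.

```python
# A hand-written stand-in for what a code-trained LLM might emit when asked:
# "Pens cost $3 each, with a 10% discount on orders over 20 units.
#  What do 25 pens cost?"
# Writing the reasoning as code makes every step explicit and executable.

def solve() -> float:
    unit_price = 3.0
    quantity = 25
    subtotal = unit_price * quantity           # 75.0
    discount = 0.10 if quantity > 20 else 0.0  # discount rule as a branch
    total = subtotal * (1 - discount)
    return total

answer = solve()
print(answer)  # 67.5
```

Because the steps are code, the final answer comes from running the program rather than from the model's arithmetic, which is where much of the reliability gain comes from.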

Structured Knowledge Capture

LLMs trained with code also show advanced abilities to understand and generate content that involves structured information. The models can handle diverse forms of structured data, such as HTML or tabular representations. This ability is particularly important for interpreting multimedia information and enhances the LLMs' interaction with graph-structured data.
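One way this structured-output ability is exploited (in the spirit of approaches like Code4Struct, cited in the survey) is to have the model emit structured knowledge as code instead of free text. The schema and example below are hypothetical, chosen only to illustrate the idea:

```python
# Illustrative only: a hypothetical event schema for structured extraction.
# A code-trained LLM asked to extract the event in "Alice sold the car to Bob"
# might emit the constructor call below rather than a free-text answer,
# guaranteeing a machine-readable, type-checked result.
from dataclasses import dataclass, field

@dataclass
class Event:
    trigger: str
    participants: list = field(default_factory=list)

extracted = Event(trigger="sold", participants=["Alice", "Bob"])
print(extracted.trigger)  # sold
```

The class definition doubles as an output contract: any emission that does not parse as a valid constructor call can be rejected automatically.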

LLMs as Intelligent Agents

Decision Making and Execution

Through code training, LLMs have developed into more sophisticated IAs. They can perceive environments, plan, and execute actions more effectively, thanks to their improved ability to process structured information. The models can also use code to interact with multiple tools and APIs, allowing them to perform a wider range of tasks, including those that involve complex interplays between different modalities and physical-world interactions.
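The tool-use pattern described above can be sketched in a few lines: the model emits a structured function call, and a thin runtime dispatches it to a registered tool. The tool names and JSON call format here are assumptions for illustration, not any particular framework's API:

```python
# A minimal sketch of code-mediated tool use. The model's role is reduced
# to producing a well-formed call; the runtime handles actual execution.
import json

# Hypothetical tool registry; real agents would register search, math,
# robot-control, or other APIs here.
TOOLS = {
    "add": lambda a, b: a + b,
    "upper": lambda s: s.upper(),
}

def dispatch(model_output: str):
    """Parse a JSON function call emitted by the model and execute it."""
    call = json.loads(model_output)
    fn = TOOLS[call["name"]]
    return fn(*call["args"])

# Suppose the model produced this intermediate step:
result = dispatch('{"name": "add", "args": [2, 3]}')
print(result)  # 5
```

The precision of code is what makes this work: a structured call either parses and runs or fails loudly, unlike an ambiguous natural-language request.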

Self-improvement through Feedback

An important aspect of intelligent behavior is the ability to learn from feedback and adapt. Code-driven environments let LLMs collect highly reliable and precise feedback, such as compiler errors, exceptions, and test results, which is essential for refining their performance. This creates a self-reinforcing loop: better performance opens up avenues for receiving more feedback, which in turn drives further improvement.
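A toy version of this execute-and-refine loop is sketched below. The list of candidate programs stands in for successive model generations; in a real agent, the feedback string would be fed back into the next prompt. All names here are invented for the example:

```python
# Toy sketch of execution feedback: run each candidate program against a
# test case and capture the failure signal that would guide the next attempt.
def run_with_feedback(candidates, test_input, expected):
    feedback = None
    for src in candidates:
        env = {}
        try:
            exec(src, env)                       # "compile and execute"
            got = env["double"](test_input)
            if got == expected:
                return src, feedback             # success: return working code
            feedback = f"wrong output: {got!r}"  # precise signal for a retry
        except Exception as e:
            feedback = f"error: {e}"             # execution errors also inform
    return None, feedback

broken = "def double(x): return x + 1"   # first (buggy) generation
fixed = "def double(x): return x * 2"    # refined generation
winner, _ = run_with_feedback([broken, fixed], 3, 6)
print(winner == fixed)  # True
```

Unlike human ratings, this signal is exact and cheap to collect, which is why code execution environments are such a productive source of training and refinement feedback.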

Challenges and Future Directions

Despite this progress, key questions remain open. For one, it is not clear exactly how much of the gain in reasoning ability is directly attributable to code pre-training, as opposed to other factors such as scale or instruction tuning. What reasoning capabilities lie beyond those code can confer is another area inviting exploration. Finally, methods for fully leveraging feedback in multi-turn interactions are still maturing, with avenues such as reinforcement learning yet to be thoroughly explored.

Conclusion

The survey presents a robust argument for the value of including code in LLM training. It underscores the significant impact this inclusion has on the models' capabilities and their application as IAs. Nevertheless, it also offers a perspective on the challenges that remain, providing a catalyst for further research in this domain.

Authors: Ke Yang, Jiateng Liu, John Wu, Chaoqi Yang, Yi R. Fung, Sha Li, Zixuan Huang, Xu Cao, Xingyao Wang, Yiquan Wang, Heng Ji, ChengXiang Zhai