Code Simulation Challenges for Large Language Models (2401.09074v4)

Published 17 Jan 2024 in cs.LG, cs.AI, cs.CL, and cs.PL

Abstract: Many reasoning, planning, and problem-solving tasks share an intrinsic algorithmic nature: correctly simulating each step is a sufficient condition to solve them correctly. This work studies to what extent LLMs can simulate coding and algorithmic tasks to provide insights into general capabilities in such algorithmic reasoning tasks. We introduce benchmarks for straight-line programs, code that contains critical paths, and approximate and redundant instructions. We further assess the simulation capabilities of LLMs with sorting algorithms and nested loops and show that a routine's computational complexity directly affects an LLM's ability to simulate its execution. While the most powerful LLMs exhibit relatively strong simulation capabilities, the process is fragile, seems to rely heavily on pattern recognition, and is affected by memorisation. We propose a novel off-the-shelf prompting method, Chain of Simulation (CoSm), which instructs LLMs to simulate code execution line by line/follow the computation pattern of compilers. CoSm efficiently helps LLMs reduce memorisation and shallow pattern recognition while improving simulation performance. We consider the success of CoSm in code simulation to be inspirational for other general routine simulation reasoning tasks.

Authors (8)
  1. Emanuele La Malfa (21 papers)
  2. Christoph Weinhuber (6 papers)
  3. Orazio Torre (2 papers)
  4. Fangru Lin (6 papers)
  5. Anthony Cohn (5 papers)
  6. Nigel Shadbolt (40 papers)
  7. Michael Wooldridge (59 papers)
  8. Samuele Marro (11 papers)
Citations (5)

Summary

Overview of LLM Capabilities in Code Simulation

The performance of LLMs in simulating the execution of computer programs is a significant area of interest: a model's ability to turn algorithmic instructions into a correct step-by-step simulation is an indicator of how far it can act as a computational model. This paper evaluates several prominent LLMs, including GPT-3.5-Turbo, GPT-4, Jurassic-Ultra, LLaMA2-70B, and CodeLlama-34b-Instruct, to assess the fidelity of their code simulation.

Methodological Approach

The investigation begins with straight-line programs (simple programs without branches or loops) and extends to more complex structures: code with critical paths, redundant instructions, nested loops, and sorting algorithms. Code snippets were written in Python and tested in a zero-shot setting, where the model receives no prior examples. A novel prompting method, Chain of Simulation (CoSm), was employed to enforce sequential, line-by-line execution by the LLMs and to mitigate issues stemming from the models' memorization tendencies.
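To make the setup concrete, the following minimal sketch (a hypothetical snippet, not taken from the paper's benchmarks) shows the kind of straight-line program involved; simulating it amounts to tracking the variable state after each assignment and reporting the final printed value, which is the line-by-line discipline CoSm asks the model to follow.

```python
# Hypothetical straight-line program: no branches, no loops.
# A faithful simulator tracks the state after every assignment.
x = 7
y = x * 3          # state: x = 7, y = 21
z = y - x          # state: z = 14
x = z % 5          # state: x = 4
print(x + y + z)   # expected output: 39
```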

Results from Simulations

Results indicated that LLMs' accuracy diminishes as code length increases: models struggle to maintain the program state over longer sequences, pointing to limits in internal memory and sensitivity to computational complexity. Notably, GPT-4 provided the most reliable simulation of straight-line programs. However, when code contained critical paths or required fault-tolerant simulation, even sophisticated LLMs such as GPT-4 failed to isolate and execute only the instructions that actually determine the output.
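For intuition, the sketch below (an assumed illustration, not one of the paper's benchmark instances) shows what a critical path looks like: only two assignments determine the printed value, while the remaining lines are redundant and could be skipped by an ideal simulator.

```python
# Hypothetical program with a critical path.
a = 2                               # critical: feeds the final result
b = a + 5                           # critical: feeds the final result
c = b * 10                          # redundant: never used again
d = sum(i * i for i in range(100))  # redundant and comparatively costly
print(b - a)                        # expected output: 5
```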

Memorization versus Simulation

An essential part of the paper focuses on the tension between memorization and simulation in LLMs, where memorization often undermines executional accuracy. This is exemplified by variations of well-known algorithms: models performed well on popular routines such as the Fibonacci sequence but faltered on slight deviations from them, such as the Padovan sequence. The CoSm prompting method showed promising results for line-by-line code execution simulation, outperforming standard prompting techniques, especially on such varied algorithmic tasks.
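The Fibonacci/Padovan contrast can be made concrete with a short sketch; the iterative formulations and the Padovan initial conditions below (P(0) = P(1) = P(2) = 1, P(n) = P(n-2) + P(n-3)) are one common convention and are illustrative assumptions rather than the paper's benchmark code.

```python
# Two closely related recurrences: Fibonacci is ubiquitous in training data,
# while Padovan differs only slightly, which is where memorization-driven
# "simulation" tends to slip.
def fibonacci(n: int) -> int:
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def padovan(n: int) -> int:
    # Convention assumed here: P(0) = P(1) = P(2) = 1, P(n) = P(n-2) + P(n-3).
    p0, p1, p2 = 1, 1, 1
    for _ in range(n):
        p0, p1, p2 = p1, p2, p0 + p1
    return p0

print([fibonacci(i) for i in range(8)])  # [0, 1, 1, 2, 3, 5, 8, 13]
print([padovan(i) for i in range(8)])    # [1, 1, 1, 2, 2, 3, 4, 5]
```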

Conclusion and Acknowledgements

This research concludes that while LLMs show promise on certain code simulation tasks, they fall short of reliably simulating a digital device once the algorithm's computational complexity grows. Current models lean on pattern recognition rather than genuine stepwise computation, and simulation errors become more frequent as programs grow longer and more complex. The authors advocate continued exploration of LLMs' simulation capabilities, especially in light of their memorization and pattern-recognition tendencies.

Acknowledgments noted support from various funders and institutions, such as the Economic and Social Research Council, the Alan Turing Institute, and partnerships with the UK government, reflecting the collaborative effort behind this work.