Code Simulation Challenges for Large Language Models (2401.09074v4)
Abstract: Many reasoning, planning, and problem-solving tasks share an intrinsic algorithmic nature: correctly simulating each step is a sufficient condition for solving them correctly. This work studies to what extent LLMs can simulate coding and algorithmic tasks, in order to gain insight into their general capabilities on such algorithmic reasoning tasks. We introduce benchmarks for straight-line programs, code containing critical paths, and approximate and redundant instructions. We further assess the simulation capabilities of LLMs on sorting algorithms and nested loops, and show that a routine's computational complexity directly affects an LLM's ability to simulate its execution. While the most powerful LLMs exhibit relatively strong simulation capabilities, the process is fragile, appears to rely heavily on pattern recognition, and is affected by memorisation. We propose a novel off-the-shelf prompting method, Chain of Simulation (CoSm), which instructs LLMs to simulate code execution line by line, following the computation pattern of compilers. CoSm efficiently helps LLMs reduce memorisation and shallow pattern recognition while improving simulation performance. We consider the success of CoSm on code simulation encouraging for other reasoning tasks that involve simulating general routines.
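To make the task concrete, below is a minimal sketch of what a code-simulation query of this kind can look like: a short straight-line program whose output the model must predict, paired with a line-by-line tracing instruction in the spirit of CoSm. The program, the prompt wording, and helper names such as `ground_truth_output` are illustrative assumptions, not the paper's actual benchmark or prompt.

```python
# Illustrative sketch of a code-simulation query (not the paper's exact benchmark):
# the model is shown a short straight-line program and asked to predict its output,
# which we can verify by actually executing the snippet.

import io
import contextlib

# A straight-line program: no branches or loops, so correct simulation
# only requires tracking variable values step by step.
STRAIGHT_LINE_PROGRAM = """\
x = 7
y = x * 3
x = y - 5
print(x + y)
"""

# Hypothetical CoSm-style instruction: ask the model to simulate execution
# line by line, reporting the program state after each statement, before answering.
COSM_PROMPT = (
    "Simulate the following program line by line, like an interpreter. "
    "After each line, write the values of all variables. "
    "Finally, report exactly what the program prints.\n\n"
    + STRAIGHT_LINE_PROGRAM
)

def ground_truth_output(program: str) -> str:
    """Execute the program and capture stdout to obtain the reference answer."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(program, {})  # toy straight-line code only; not for untrusted input
    return buffer.getvalue().strip()

if __name__ == "__main__":
    print("Prompt sent to the LLM:\n", COSM_PROMPT, sep="")
    print("Expected answer:", ground_truth_output(STRAIGHT_LINE_PROGRAM))  # -> 37
```

An LLM answering such a query correctly must track the intermediate state (x=7, y=21, x=16) rather than pattern-match on superficially similar code it may have memorised.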
Authors: Emanuele La Malfa, Christoph Weinhuber, Orazio Torre, Fangru Lin, Anthony Cohn, Nigel Shadbolt, Michael Wooldridge, Samuele Marro