
Can Language Models Pretend Solvers? Logic Code Simulation with LLMs (2403.16097v2)

Published 24 Mar 2024 in cs.AI, cs.LO, and cs.SE

Abstract: Transformer-based LLMs have demonstrated significant potential in addressing logic problems. Capitalizing on the great capabilities of LLMs for code-related activities, several frameworks leveraging logical solvers for logic reasoning have been proposed recently. While existing research predominantly focuses on viewing LLMs as natural language logic solvers or translators, their roles as logic code interpreters and executors have received limited attention. This study delves into a novel aspect, namely logic code simulation, which forces LLMs to emulate logical solvers in predicting the results of logical programs. To further investigate this novel task, we formulate three research questions: Can LLMs efficiently simulate the outputs of logic codes? What strengths arise along with logic code simulation? And what pitfalls? To address these inquiries, we curate three novel datasets tailored for the logic code simulation task and undertake thorough experiments to establish the baseline performance of LLMs in code simulation. Subsequently, we introduce a pioneering LLM-based code simulation technique, Dual Chains of Logic (DCoL). This technique advocates a dual-path thinking approach for LLMs and has demonstrated state-of-the-art performance compared to other LLM prompt strategies, achieving a notable 7.06% improvement in accuracy with GPT-4 Turbo.

Exploring the Frontier of Logic Code Simulation with LLMs

Introduction

In recent advancements within artificial intelligence and software engineering, the potential of transformer-based LLMs to tackle logic problems has become increasingly evident. This paper ventures into the relatively unexplored territory of using LLMs not just as tools for understanding or translating logic code but as simulators that predict the outcomes of logical programs. By formulating novel research questions and introducing new datasets and a dedicated method, the paper assesses the capacity of LLMs to act as logic solvers themselves.

Logic Code Simulation with LLMs

At the core of this research lies the question of whether LLMs can effectively simulate logic codes, essentially emulating the output that would result from executing the logic within a program. This involves comprehending the program's logic, engaging in logic reasoning, and converting the reasoning process back into the expected outcome of code execution. Through extensive experimentation using the newly curated datasets tailored specifically for logic code simulation, this paper unveils the groundbreaking technique Dual Chains of Logic (DCoL). This method significantly outperforms existing strategies in logic code simulation, marking a substantial step forward in the capabilities of LLMs.
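
To make the task concrete, the sketch below shows the kind of input/output pair involved: a small logic program written with the z3-solver Python bindings, whose satisfiability verdict the LLM is asked to predict from the source text alone. This is a minimal illustration under stated assumptions; the prompt wording is not the paper's exact template, and the tiny program is not drawn from its datasets.

```python
# Minimal sketch of the logic code simulation task (illustrative prompt wording,
# not the paper's template). Requires the z3-solver package.
from z3 import Int, Solver, unsat

# A tiny logic program: two contradictory integer constraints.
logic_code = """
x = Int('x')
s = Solver()
s.add(x > 10)
s.add(x < 5)
print(s.check())
"""

# Ground truth, obtained by actually running the solver.
x = Int('x')
s = Solver()
s.add(x > 10, x < 5)
assert s.check() == unsat  # the two constraints cannot both hold

# In logic code simulation, the LLM never runs the solver: it receives only the
# program text and must predict the verdict (sat / unsat / unknown).
prompt = (
    "You are simulating the Z3 solver. Read the following program and "
    "predict what it prints, without executing it.\n\n" + logic_code
)
print(prompt)  # expected model answer: "unsat"
```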

Dataset and Experimentation

Unique to this paper, new datasets derived from the solver community are introduced, namely Z3Tutorial, Z3Test, and SMTSim, gathering diverse logic simulation problems. These datasets are used to systematically evaluate various LLMs, including GPT-3.5 Turbo, GPT-4 Turbo, and LLaMA-2-13B, on the proposed logic code simulation task. Beyond the benchmarks, the Dual Chains of Logic (DCoL) technique encourages LLMs to follow a dual-path reasoning approach, as sketched below. This method substantially improves the models' accuracy and robustness on code simulation tasks, with GPT-4 Turbo gaining a remarkable 7.06% in accuracy.
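
The core idea of DCoL, reasoning separately along a "satisfiable" path and an "unsatisfiable" path before committing to a verdict, can be sketched as follows. This is an assumption-laden outline, not the authors' implementation: the prompt wording is illustrative and query_llm is a placeholder for whatever chat model is being evaluated.

```python
# Hedged sketch of a dual-path (DCoL-style) prompt. The exact wording used in
# the paper is not reproduced; query_llm is a caller-supplied placeholder.
from typing import Callable

def dual_chains_prompt(logic_code: str) -> str:
    """Build a prompt that asks the model to reason along both verdicts."""
    return (
        "You are simulating a logic solver on the program below.\n\n"
        f"{logic_code}\n\n"
        "Chain 1: Assume the constraints are satisfiable. Try to construct a "
        "concrete assignment that satisfies every constraint.\n"
        "Chain 2: Assume the constraints are unsatisfiable. Try to derive a "
        "contradiction from the constraints.\n"
        "Finally, compare the two chains and answer with exactly one of: "
        "sat, unsat, unknown."
    )

def simulate(logic_code: str, query_llm: Callable[[str], str]) -> str:
    """Ask the model for a verdict and normalize its free-form answer."""
    answer = query_llm(dual_chains_prompt(logic_code)).lower()
    for verdict in ("unsat", "sat", "unknown"):  # check "unsat" before "sat"
        if verdict in answer:
            return verdict
    return "unknown"
```

Exploring both hypotheses before deciding mirrors how a solver either finds a model or derives a contradiction, which is plausibly why the dual-path framing helps.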

Findings and Implications

The experiments reveal intriguing insights into the capabilities and limitations of current LLMs in simulating logic code. GPT-series models show a strong aptitude for logic simulation, highlighting their advanced comprehension and reasoning abilities. The LLaMA models, though effective, produce a noticeably higher incidence of "unknown" outcomes, pointing to a clear area for model refinement.

A notable strength of LLMs, as identified in this paper, is their capacity to process and simulate logic codes even in the presence of syntax errors, showcasing a remarkable level of robustness and flexibility. Moreover, the paper highlights LLMs' potential in transcending some of the theoretical limitations inherent to traditional solvers, providing a promising avenue for future developments in logic problem-solving.
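
As a hypothetical illustration of this robustness (the snippet is not taken from the paper's datasets), consider a logic program with a small syntax error: the solver pipeline rejects it before any reasoning happens, whereas an LLM reading the source can still recover the intended constraints and predict the verdict of the repaired program.

```python
# Hypothetical example: a syntactically broken program that a solver pipeline
# rejects, yet whose intent is still recoverable by a reader. Requires z3-solver.
from z3 import Int, Solver

broken_program = """
x = Int('x'          # missing closing parenthesis: this text cannot be parsed
s = Solver()
s.add(x > 10, x < 5)
print(s.check())
"""

# Executing the broken text fails before the solver ever sees the constraints.
try:
    exec(broken_program)
except SyntaxError as e:
    print(f"Solver pipeline rejects the program: {e}")

# The intended constraints are nevertheless clear, and their verdict is unsat.
x = Int('x')
s = Solver()
s.add(x > 10, x < 5)
print(s.check())  # prints: unsat
```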

Looking Ahead

While the results of this paper are promising, they also underscore the challenges and complexities of logic code simulation with LLMs. The DCoL method represents a significant advancement, yet there remains ample scope for refinement and exploration. Future work will aim not only to enhance the performance and applicability of DCoL but also to extend its utility beyond the field of logic solvers. The integration of LLMs with additional knowledge retrieval and storage techniques could pave the way for practical applications that efficiently simulate complex logic programs in real-life scenarios.

Conclusion

This paper marks a pivotal moment in the exploration of LLMs' capabilities as logic code simulators. By proposing a novel task, introducing a dedicated framework, and systematically evaluating the performance across various datasets and models, the research opens up new horizons in the application of LLMs in software engineering and beyond. The findings not only provide a solid foundation for future inquiry but also inspire the continued evolution of AI-driven logic simulation methodologies.

Authors (7)
  1. Minyu Chen
  2. Guoqiang Li
  3. Ling-I Wu
  4. Ruibang Liu
  5. Yuxin Su
  6. Xi Chang
  7. Jianxin Xue