
Multi-Programming Language Sandbox for LLMs (2410.23074v2)

Published 30 Oct 2024 in cs.SE and cs.CL

Abstract: We introduce MPLSandbox, an out-of-the-box multi-programming language sandbox designed to provide unified and comprehensive feedback from compilers and analysis tools for LLMs. It can automatically identify the programming language of the code and compile and execute it within an isolated sub-sandbox to ensure safety and stability. In addition, MPLSandbox integrates both traditional and LLM-based code analysis tools, providing a comprehensive analysis of generated code. MPLSandbox can be effortlessly integrated into the training and deployment of LLMs to improve the quality and correctness of their generated code. It also helps researchers streamline their workflows for various LLM-based code-related tasks, reducing development cost. To validate the effectiveness of MPLSandbox, we integrate it into training and deployment approaches, and also employ it to optimize workflows for a wide range of real-world code-related tasks. Our goal is to enhance researcher productivity on LLM-based code-related tasks by simplifying and automating workflows through delegation to MPLSandbox.
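
As a heavily simplified illustration of the first step in that pipeline, automatic language identification, the sketch below guesses a snippet's language from surface syntax. The function name, keyword patterns, and fallback choice are assumptions made for this example; they are not taken from MPLSandbox's actual classifier.

```python
# Illustrative only: a naive surface-syntax heuristic for guessing the
# language of a code snippet before dispatching it to a sub-sandbox.
# The patterns and fallback below are assumptions, not MPLSandbox's classifier.
import re

def guess_language(code: str) -> str:
    if re.search(r"^\s*#include\s*[<\"]", code, re.M):
        return "cpp"
    if "System.out.println" in code or re.search(r"\bpublic\s+class\s+\w+", code):
        return "java"
    if "package main" in code and re.search(r"\bfunc\s+\w+\s*\(", code):
        return "go"
    if re.search(r"^\s*def\s+\w+\s*\(", code, re.M) or re.search(r"^\s*import\s+\w+", code, re.M):
        return "python"
    if "console.log" in code or re.search(r"\bfunction\s+\w+\s*\(", code):
        return "javascript"
    return "bash"  # fallback: treat unrecognized snippets as shell scripts

print(guess_language("def add(a, b):\n    return a + b"))  # -> python
```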

Summary

  • The paper introduces MPLSandbox, a tool that integrates multi-language support and isolated sub-environments to enhance LLM-generated code reliability.
  • The tool employs a distributed architecture and compiler feedback to significantly improve code accuracy metrics such as Pass@1 and Pass@10.
  • MPLSandbox enables self-correction and optimization of generated code, streamlining software development and supporting reinforcement-learning-based policy optimization.

Overview of "Multi-Programming Language Sandbox for LLMs"

The paper "Multi-Programming Language Sandbox for LLMs" introduces MPLSandbox, a novel tool designed to enhance the reliability and quality of code generated by LLMs. This capability is especially relevant given the increasing application of LLMs in software development tasks, where the accuracy and efficiency of generated code are imperative. MPLSandbox addresses the challenges of integrating multi-language support and comprehensive code analysis in a single framework, providing a robust solution for developers and researchers.

Key Features and Contributions

MPLSandbox is characterized by several notable features that distinguish it from existing sandboxes:

  1. Security and Stability: The sandbox constructs isolated sub-environments for different programming languages, so safety is maintained even if the generated code contains vulnerabilities or bugs, and the external environment is protected from harm during execution (a minimal sketch of such isolated execution appears after this list).
  2. Multi-Language Support: Unlike typical sandboxes that target a single programming language, MPLSandbox supports multiple languages, including Python, Java, C++, C#, Bash, Go, JavaScript, and TypeScript. This capability drastically reduces the development cost of setting up individual environments for different languages.
  3. Usability and Extensibility: The tool is designed to seamlessly integrate various code analysis and compiler feedback tools for each programming language. Furthermore, MPLSandbox provides templates that allow users to incorporate additional tools, thereby expanding its applicability.
  4. Distributed Architecture: The tool can be deployed as a distributed system, remaining efficient in large-scale settings such as extensive LLM training runs.
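
To make the isolation idea in item 1 concrete, the following minimal sketch runs each snippet in a disposable, language-specific Docker container with network access disabled and CPU/memory caps. The container images, resource limits, and the `run_isolated` helper are illustrative assumptions, not MPLSandbox's actual interface.

```python
# Hypothetical sketch: run an untrusted snippet in a throwaway, language-specific
# Docker container with resource limits. Image names, limits, and this helper
# are assumptions for illustration; they are not MPLSandbox's real API.
import os
import subprocess
import tempfile

RUNNERS = {
    # language: (container image, filename inside /work, command to run)
    "python": ("python:3.11-slim", "snippet.py", ["python", "/work/snippet.py"]),
    "bash":   ("bash:5",           "snippet.sh", ["bash", "/work/snippet.sh"]),
    # Java, C++, Go, etc. would each add their own image and compile/run command.
}

def run_isolated(code: str, language: str, timeout: int = 10):
    image, filename, cmd = RUNNERS[language]
    with tempfile.TemporaryDirectory() as work:
        with open(os.path.join(work, filename), "w") as f:
            f.write(code)
        docker_cmd = [
            "docker", "run", "--rm",
            "--network", "none",       # untrusted code gets no network access
            "--memory", "256m",        # cap memory
            "--cpus", "0.5",           # cap CPU
            "-v", f"{work}:/work:ro",  # mount the snippet read-only
            image, *cmd,
        ]
        # If execution exceeds the timeout, subprocess raises TimeoutExpired;
        # a production sandbox would also stop the container explicitly.
        return subprocess.run(docker_cmd, capture_output=True, text=True, timeout=timeout)

result = run_isolated("print('hello from the sandbox')", "python")
print(result.returncode, result.stdout.strip(), result.stderr)
```

Per-language container images keep the isolation boundary at the process and filesystem level, which is one common way to realize the "isolated sub-sandbox" behavior described in the abstract.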

Experimental Validation

The paper reports extensive experiments to validate the effectiveness of MPLSandbox across multiple scenarios:

  • Inference-Time Verification: Used as a verifier, the sandbox evaluates the correctness of model-generated code across programming languages, yielding significant improvements in accuracy metrics such as Pass@1 and Pass@10 (the standard estimator behind these metrics is sketched after this list).
  • Reinforcement Learning Enhancement: MPLSandbox supplies compiler feedback as a supervision signal for policy optimization in LLMs, demonstrating notable performance gains on code generation tasks.
  • Self-Correction and Optimization: Integrating code analysis for self-correction highlights the sandbox's ability to automatically refine and improve generated code, reducing complexity and improving maintainability.
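
For reference, the Pass@1 and Pass@10 figures cited in the first bullet are conventionally computed with the unbiased Pass@k estimator introduced alongside HumanEval (Chen et al., 2021): given n sampled solutions per problem, of which c pass all sandbox-verified unit tests, it estimates the probability that at least one of k drawn samples is correct. The snippet below implements that well-known estimator; it is provided for clarity and is not code from the paper, and the example numbers are hypothetical.

```python
# Standard unbiased pass@k estimator: pass@k = 1 - C(n - c, k) / C(n, k),
# where n candidate programs were sampled and c of them passed all unit tests
# (as verified by executing them in the sandbox).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k subset is guaranteed to contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 200 samples per problem, 37 verified correct.
print(round(pass_at_k(n=200, c=37, k=1), 3))   # 0.185
print(round(pass_at_k(n=200, c=37, k=10), 3))  # substantially higher with 10 tries
```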

Implications and Future Work

The introduction of MPLSandbox presents significant implications for both practical and theoretical aspects of AI and software development:

  • Practical Applications: MPLSandbox's comprehensive multilingual support, coupled with its secure and stable environment, provides robust infrastructure for developers and researchers to build, test, and deploy LLM-generated code. This advancement can streamline workflows in software engineering tasks such as bug fixing, unit test generation, and code translation.
  • Theoretical Advancements: The framework offers a structured approach for integrating compiler feedback into LLM training, suggesting a promising direction for future research on enhancing LLM performance with real-world execution data (an illustrative reward-shaping sketch follows this list).
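
One way to picture that direction is to collapse compiler and unit-test feedback from the sandbox into a scalar reward for RL-style policy optimization. The reward shaping below is purely illustrative and is not the paper's training recipe.

```python
# Illustrative assumption: map sandbox feedback to a scalar reward for RL-style
# fine-tuning (e.g., PPO). This shaping is a sketch, not MPLSandbox's recipe.
def reward_from_feedback(compiled: bool, tests_passed: int, tests_total: int) -> float:
    if not compiled:
        return -1.0                      # penalize code that does not even compile
    if tests_total == 0:
        return 0.0                       # nothing to verify
    return tests_passed / tests_total    # partial credit for partially correct code

print(reward_from_feedback(compiled=True, tests_passed=3, tests_total=4))  # 0.75
```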

In conclusion, MPLSandbox is a significant contribution to the application of LLMs in software engineering. It not only reduces the complexity of employing LLMs for code-related tasks but also provides a structured pathway for future investigations into improving code quality through comprehensive compiler and analysis feedback.
