Multi-Programming Language Sandbox for LLMs (2410.23074v2)
Abstract: We introduce MPLSandbox, an out-of-the-box multi-programming-language sandbox designed to provide unified and comprehensive feedback from compilers and analysis tools to LLMs. It automatically identifies the programming language of a code snippet, then compiles and executes it within an isolated sub-sandbox to ensure safety and stability. In addition, MPLSandbox integrates both traditional and LLM-based code analysis tools, providing a comprehensive analysis of generated code. MPLSandbox can be effortlessly integrated into the training and deployment of LLMs to improve the quality and correctness of their generated code. It also helps researchers streamline their workflows for various LLM-based code-related tasks, reducing development costs. To validate the effectiveness of MPLSandbox, we integrate it into training and deployment approaches, and also employ it to optimize workflows for a wide range of real-world code-related tasks. Our goal is to enhance researcher productivity on LLM-based code-related tasks by simplifying and automating workflows through delegation to MPLSandbox.
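To make the described pipeline concrete, below is a minimal Python sketch of the workflow the abstract outlines: detect a snippet's language, execute it in an isolated process with a timeout, and return the results as a unified feedback object. All names here (`detect_language`, `run_in_sandbox`, `SandboxReport`, `INTERPRETERS`) are hypothetical illustrations, not MPLSandbox's actual API; the real system layers containerized per-language sub-sandboxes and traditional plus LLM-based analysis tools on top of bare execution.

```python
import os
import subprocess
import tempfile
from dataclasses import dataclass

# Hypothetical mapping from detected language to an interpreter command.
INTERPRETERS = {"python": ["python3"], "javascript": ["node"]}

@dataclass
class SandboxReport:
    """Unified feedback bundle returned to the LLM (illustrative only)."""
    language: str
    stdout: str
    stderr: str
    returncode: int

def detect_language(code: str) -> str:
    # Toy heuristic standing in for MPLSandbox's automatic
    # language-identification step.
    return "python" if "def " in code or "import " in code else "javascript"

def run_in_sandbox(code: str, timeout: float = 5.0) -> SandboxReport:
    lang = detect_language(code)
    with tempfile.NamedTemporaryFile("w", suffix=".src", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        # Isolation here is only a subprocess with a timeout; the paper's
        # system executes each language inside its own isolated sub-sandbox.
        proc = subprocess.run(
            INTERPRETERS[lang] + [path],
            capture_output=True, text=True, timeout=timeout,
        )
        return SandboxReport(lang, proc.stdout, proc.stderr, proc.returncode)
    finally:
        os.unlink(path)

if __name__ == "__main__":
    print(run_in_sandbox("print(sum(range(10)))"))
```

In a training loop such as RL from compiler feedback, the `SandboxReport` fields (exit code, stderr) would serve as the reward signal or repair prompt for the LLM.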