Automatic Generation of Python Programs Using Context-Free Grammars (2403.06503v1)
Abstract: In recent years, data has emerged as the new gold, serving as a powerful tool for creating intelligent systems. However, procuring high-quality data remains challenging, especially for code. To address this, we developed TinyPy Generator, a tool that generates random Python programs using a context-free grammar. The generated programs are guaranteed to be correct by construction. Our system uses custom production rules, written in Backus-Naur Form (BNF), to recursively generate code. This allows us to generate code at different levels of complexity, ranging from programs containing only assignments to programs containing conditionals and loops. The tool enables large-scale Python code generation for a wide range of applications. It is particularly useful in machine learning, where it can generate substantial amounts of Python code for training Python LLMs. Researchers studying programming languages can also use it to create datasets for their experiments, for example to validate the robustness of code interpreters or compilers. Unlike existing research, we have open-sourced our implementation, which allows customization according to user needs and extends potential usage to other languages.
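The recursive expansion of production rules described in the abstract can be illustrated with a minimal sketch. The grammar below is a hypothetical toy example, not the TinyPy Generator rule set; the names `GRAMMAR`, `generate`, and `max_depth` are assumptions introduced only for illustration. Expressions are built from digit literals alone so that every generated program runs without referencing an undefined variable.

```python
import random

# Hypothetical toy grammar in a BNF-like form (NOT the actual TinyPy rules).
# Each nonterminal maps to a list of alternatives; each alternative is a
# sequence of symbols. The first alternative of every rule is non-recursive,
# so it can serve as a fallback that guarantees termination.
GRAMMAR = {
    "<program>": [["<assign>", "<assign>", "<print>"]],
    "<assign>": [["<var>", " = ", "<expr>", "\n"]],
    "<print>": [["print(", "<expr>", ")\n"]],
    "<expr>": [["<term>"], ["<term>", " ", "<op>", " ", "<term>"]],
    "<term>": [["<digit>"], ["(", "<expr>", ")"]],
    "<op>": [["+"], ["-"], ["*"]],
    "<var>": [["a"], ["b"], ["c"]],
    "<digit>": [[str(d)] for d in range(10)],
}


def generate(symbol, rng, depth=0, max_depth=6):
    """Recursively expand `symbol` into a string of terminals."""
    if symbol not in GRAMMAR:
        return symbol  # terminal: emit as-is
    alternatives = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth limit, always take the non-recursive alternative.
        production = alternatives[0]
    else:
        production = rng.choice(alternatives)
    return "".join(generate(s, rng, depth + 1, max_depth) for s in production)


# Usage: a seeded generator makes the output reproducible.
program = generate("<program>", random.Random(0))
print(program)
```

Raising or lowering `max_depth`, or adding rules for conditionals and loops, is one way such a sketch could be extended toward the complexity levels the abstract mentions; how the actual tool controls complexity is not specified here.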