AI Coders Are Among Us: Rethinking Programming Language Grammar Towards Efficient Code Generation (2404.16333v2)
Abstract: As we enter the era of large language models (LLMs), AI models have emerged as another important audience for programming languages, alongside humans and machines. LLMs now perform well in coding competitions and can even write programs like developers to solve various tasks, including mathematical problems. However, the grammar and layout of current programs are designed to cater to the needs of human developers: many grammar and formatting tokens exist solely to make code easier for humans to read. While helpful for people, this design imposes unnecessary computational work on LLMs, since every token they consume or produce costs computational resources. To improve inference efficiency and reduce computational costs, we propose the concept of AI-oriented grammar, which represents code in a way that better suits the working mechanism of AI models. Code written with AI-oriented grammar discards formatting and uses the minimum number of tokens needed to convey code semantics. To demonstrate the feasibility of this concept, we explore and implement the first AI-oriented grammar for Python, named SimPy. SimPy is crafted by revising the original Python grammar through a series of heuristic rules. Programs written in SimPy maintain AST structures identical to those of standard Python. This allows not only execution via a modified AST parser but also seamless transformation between Python and SimPy programs, so human developers and LLMs can use Python and SimPy, respectively, when they need to collaborate. In our experiments, SimPy reduces token usage by 13.5% for CodeLlama and 10.4% for GPT-4 compared with Python on the same set of code-related tasks. Moreover, these models maintain or even improve their performance when using SimPy instead of Python for these tasks.
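To make the core idea concrete, here is a minimal, illustrative sketch. It is not the paper's actual SimPy grammar; the compact token choices are hypothetical. It only demonstrates the property the abstract relies on: a formatting-free rendering of a program can parse to the same AST as its conventional Python counterpart, which is what permits execution through a modified parser and lossless conversion between the two forms.

```python
# Illustrative sketch only: the compact rendering below is NOT SimPy,
# just a hand-made example of dropping formatting tokens while
# preserving the AST structure.
import ast

PYTHON_SOURCE = """
def add(a, b):
    result = a + b
    return result
"""

# A hypothetical "compact" rendering with no indentation or newlines.
COMPACT_SOURCE = "def add(a,b):result=a+b;return result"

# Both renderings parse to structurally identical ASTs -- the invariant
# that allows round-tripping between a human-oriented and an
# AI-oriented representation of the same program.
human_ast = ast.parse(PYTHON_SOURCE)
compact_ast = ast.parse(COMPACT_SOURCE)
assert ast.dump(human_ast) == ast.dump(compact_ast)

# ast.unparse (Python 3.9+) regenerates readable Python from either AST,
# which is one way a developer-facing view could be recovered.
print(ast.unparse(compact_ast))
```

A real AI-oriented grammar goes further than whitespace removal (e.g., replacing verbose grammar tokens according to heuristic rules), but the AST-equivalence check above is the invariant that the abstract describes.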
- DeepSeek AI. 2023. DeepSeek Coder: Let the Code Write Itself. https://github.com/deepseek-ai/DeepSeek-Coder.
- SantaCoder: don’t reach for the stars! arXiv preprint arXiv:2301.03988 (2023).
- Google DeepMind AlphaCode Team. 2023. AlphaCode 2 Technical Report. https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf.
- Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning. PMLR, 2397–2430.
- AutoFocus: Interpreting Attention-Based Neural Networks by Code Perturbation. 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE) (2019), 38–41. https://api.semanticscholar.org/CorpusID:208877064
- A Theory of Dual Channel Constraints. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: New Ideas and Emerging Results (Seoul, South Korea) (ICSE-NIER ’20). Association for Computing Machinery, New York, NY, USA, 25–28. https://doi.org/10.1145/3377816.3381720
- Harrison Chase. 2022. LangChain. https://github.com/langchain-ai/langchain
- Evaluating Large Language Models Trained on Code. ArXiv abs/2107.03374 (2021). https://api.semanticscholar.org/CorpusID:235755472
- CAT-probing: A Metric-based Approach to Interpret How Pre-trained Models for Programming Language Attend Code Structure. In Conference on Empirical Methods in Natural Language Processing. https://api.semanticscholar.org/CorpusID:252780909
- Nadezhda Chirkova and Sergey Troshin. 2023. CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code. ArXiv abs/2308.00683 (2023). https://api.semanticscholar.org/CorpusID:252600018
- Trends in AI inference energy consumption: Beyond the performance-vs-parameter laws of deep learning. Sustainable Computing: Informatics and Systems 38 (2023), 100857.
- CodeBERT: A Pre-Trained Model for Programming and Natural Languages. ArXiv abs/2002.08155 (2020). https://api.semanticscholar.org/CorpusID:211171605
- Seymour Ginsburg and Joseph S. Ullian. 1966. Ambiguity in context free languages. J. ACM 13 (1966), 62–89. https://api.semanticscholar.org/CorpusID:5851601
- Google. 2023. Bard. https://bard.google.com/.
- Dick Grune and Ceriel J. H. Jacobs. 2007. Parsing Techniques - A Practical Guide. In Monographs in Computer Science. https://api.semanticscholar.org/CorpusID:33077869
- Large Language Models for Software Engineering: A Systematic Literature Review. arXiv:2308.10620 [cs.SE]
- The Wall Street Journal. [n. d.]. AI's Costly Buildup Could Make Early Products a Hard Sell. https://www.wsj.com/tech/ai/ais-costly-buildup-could-make-early-products-a-hard-sell-bdd29b9f.
- Program Translation via Code Distillation. ArXiv abs/2310.11476 (2023). https://api.semanticscholar.org/CorpusID:264289043
- Scaling Laws for Neural Language Models. ArXiv abs/2001.08361 (2020). https://api.semanticscholar.org/CorpusID:210861095
- Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code. 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE) (2020), 1073–1085. https://api.semanticscholar.org/CorpusID:211161525
- The Stack: 3 TB of permissively licensed source code. Preprint (2022).
- Bernard Lang. 1974. Deterministic Techniques for Efficient Non-Deterministic Parsers. In International Colloquium on Automata, Languages and Programming. https://api.semanticscholar.org/CorpusID:27069587
- StarCoder: may the source be with you! ArXiv abs/2305.06161 (2023). https://api.semanticscholar.org/CorpusID:258588247
- WizardCoder: Empowering Code Large Language Models with Evol-Instruct. ArXiv abs/2306.08568 (2023). https://api.semanticscholar.org/CorpusID:259164815
- Is Self-Attention Powerful to Learn Code Syntax and Semantics? ArXiv abs/2212.10017 (2022). https://api.semanticscholar.org/CorpusID:254877330
- Anthony Moi and Nicolas Patry. 2023. HuggingFace’s Tokenizers. https://github.com/huggingface/tokenizers
- CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. In International Conference on Learning Representations. https://api.semanticscholar.org/CorpusID:252668917
- OpenAI. 2023a. ChatGPT. https://chat.openai.com/.
- OpenAI. 2023b. GPT-3.5. https://platform.openai.com/docs/models/gpt-3-5.
- OpenAI. 2023c. GPT-4 Technical Report. ArXiv abs/2303.08774 (2023). https://api.semanticscholar.org/CorpusID:257532815
- Tim Peters. 2023. PEP 20 – The Zen of Python. https://peps.python.org/pep-0020/.
- Python. 2023. Full Grammar specification. https://docs.python.org/3/reference/grammar.html.
- Understanding neural code intelligence through program simplification. Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (2021). https://api.semanticscholar.org/CorpusID:235359051
- Syntax-guided program reduction for understanding neural code intelligence models. Proceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming (2022). https://api.semanticscholar.org/CorpusID:249191662
- Language Models are Unsupervised Multitask Learners. https://api.semanticscholar.org/CorpusID:160025533
- Replit. 2023. ReplitLM. https://github.com/replit/replitLM.
- Code Llama: Open Foundation Models for Code. ArXiv abs/2308.12950 (2023). https://api.semanticscholar.org/CorpusID:261100919
- Neural Machine Translation of Rare Words with Subword Units. ArXiv abs/1508.07909 (2015). https://api.semanticscholar.org/CorpusID:1114678
- Structural-semantics Guided Program Simplification for Understanding Neural Code Intelligence Models. Proceedings of the 14th Asia-Pacific Symposium on Internetware (2023). https://api.semanticscholar.org/CorpusID:263672536
- Significant Gravitas. [n. d.]. AutoGPT. https://github.com/Significant-Gravitas/AutoGPT
- Llama 2: Open Foundation and Fine-Tuned Chat Models. ArXiv abs/2307.09288 (2023). https://api.semanticscholar.org/CorpusID:259950998
- Tree-sitter. 2023. tree-sitter/tree-sitter: An incremental parsing system for programming tools. https://github.com/tree-sitter/tree-sitter.
- Sergey Troshin and Nadezhda Chirkova. 2022. Probing Pretrained Models of Source Codes. ArXiv abs/2202.08975 (2022). https://api.semanticscholar.org/CorpusID:246996634
- Guido van Rossum, Barry Warsaw, and Nick Coghlan. 2023. PEP 8 – Style Guide for Python Code. https://peps.python.org/pep-0008/.
- What Do They Capture? - A Structural Analysis of Pre-Trained Language Models for Source Code. 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE) (2022), 2377–2388. https://api.semanticscholar.org/CorpusID:246823289
- CodeT5+: Open Code Large Language Models for Code Understanding and Generation. ArXiv abs/2305.07922 (2023). https://api.semanticscholar.org/CorpusID:258685677
- CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. ArXiv abs/2109.00859 (2021). https://api.semanticscholar.org/CorpusID:237386541
- Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, Online, 38–45. https://www.aclweb.org/anthology/2020.emnlp-demos.6
- Robustness, Security, Privacy, Explainability, Efficiency, and Usability of Large Language Models for Code. ArXiv abs/2403.07506 (2024). https://api.semanticscholar.org/CorpusID:268364103
- An Extensive Study on Pre-Trained Models for Program Understanding and Generation. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis (Virtual, South Korea) (ISSTA 2022). Association for Computing Machinery, New York, NY, USA, 39–51. https://doi.org/10.1145/3533767.3534390
- TinyLlama: An Open-Source Small Language Model. arXiv:2401.02385 [cs.CL]
- Diet code is healthy: simplifying programs for pre-trained models of code. Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (2022). https://api.semanticscholar.org/CorpusID:250113729
- Probing model signal-awareness via prediction-preserving input minimization. Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (2020). https://api.semanticscholar.org/CorpusID:227227733