Evaluation of LLMs on Syntax-Aware Code Fill-in-the-Middle Tasks (2403.04814v3)
Abstract: We introduce Syntax-Aware Fill-In-the-Middle (SAFIM), a new benchmark for evaluating LLMs on the code Fill-in-the-Middle (FIM) task. This benchmark focuses on syntax-aware completions of program structures such as code blocks and conditional expressions, and includes 17,720 examples from multiple programming languages, sourced from recent code submissions after April 2022 to minimize data contamination. SAFIM provides a robust framework with various prompt designs and novel syntax-aware post-processing techniques, facilitating accurate and fair comparisons across LLMs. Our comprehensive evaluation of 15 LLMs shows that FIM pretraining not only enhances FIM proficiency but also improves Left-to-Right (L2R) inference using LLMs. Our findings challenge conventional beliefs and suggest that pretraining methods and data quality have more impact than model size. SAFIM thus serves as a foundational platform for future research in effective pretraining strategies for code LLMs. The evaluation toolkit and dataset are available at https://github.com/gonglinyuan/safim, and the leaderboard is available at https://safimbenchmark.com.
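To make the FIM task concrete, the sketch below shows how a syntax-aware FIM example might be constructed: a complete program is split at a syntactic unit (here, one conditional branch) into a prefix, a ground-truth middle, and a suffix, and the model is prompted in the common prefix-suffix-middle (PSM) format. The sentinel tokens and the splitting heuristic are illustrative assumptions, not the exact SAFIM implementation.

```python
# Illustrative sketch of a Fill-in-the-Middle (FIM) example in the
# prefix-suffix-middle (PSM) prompt format. Sentinel token spellings
# vary by model family; the <PRE>/<SUF>/<MID> markers here are assumptions.

code = """def fizzbuzz(n):
    for i in range(1, n + 1):
        if i % 15 == 0:
            print("FizzBuzz")
        elif i % 3 == 0:
            print("Fizz")
        else:
            print(i)
"""

# Split at a syntactic boundary: mask one branch body as the "middle".
middle = '            print("Fizz")\n'
split = code.index(middle)
prefix, suffix = code[:split], code[split + len(middle):]

# PSM-style prompt: the model is asked to generate the masked middle
# after the <MID> sentinel, conditioned on both prefix and suffix.
prompt = f"<PRE>{prefix}<SUF>{suffix}<MID>"
print(prompt)
# A correct completion reproduces the masked branch:
#             print("Fizz")
```

In a syntax-aware setup like the one the abstract describes, the split point is chosen from the program's parse tree (e.g., a whole code block or conditional expression) rather than at an arbitrary character offset, so the completion can be checked as a well-formed syntactic unit.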