Generative Software Engineering (2403.02583v2)

Published 5 Mar 2024 in cs.SE

Abstract: The rapid development of deep learning techniques, improved computational power, and the availability of vast training data have led to significant advances in pre-trained models and LLMs. Pre-trained models based on architectures such as BERT and the Transformer, as well as LLMs like ChatGPT, have demonstrated remarkable language capabilities and found applications in software engineering (SE). SE tasks fall into many categories, among which generative tasks attract the most attention from researchers: pre-trained models and LLMs possess powerful language representation and contextual awareness, enabling them to leverage diverse training data and adapt to generative tasks through fine-tuning, transfer learning, and prompt engineering. These advantages make them effective tools for generative tasks, in which they have demonstrated excellent performance. In this paper, we present a comprehensive literature review of generative tasks in SE using pre-trained models and LLMs. We categorize SE generative tasks according to software engineering methodologies and summarize the advanced pre-trained models and LLMs involved, as well as the datasets and evaluation metrics used. Additionally, we identify key strengths, weaknesses, and gaps in existing approaches, and propose potential research directions. This review aims to provide researchers and practitioners with an in-depth analysis of, and guidance on, the application of pre-trained models and LLMs to generative tasks in SE.
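Of the three adaptation strategies the abstract names, prompt engineering is the lightest-weight: an off-the-shelf code LLM is steered purely through its input, with no weight updates. Below is a minimal sketch of prompt-driven code generation, assuming the Hugging Face `transformers` library and the publicly available `Salesforce/codegen-350M-mono` checkpoint (a small code model chosen purely for illustration; it is not singled out by the survey, which covers many larger models):

```python
# A minimal sketch of prompt-driven code generation, not the paper's own method.
# Assumptions: `pip install transformers torch`; the checkpoint below is illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Salesforce/codegen-350M-mono"  # small public code model, for illustration only
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Prompt engineering: the task is specified entirely in the input text,
# here a natural-language comment plus a function signature to complete.
prompt = (
    "# Return the n-th Fibonacci number iteratively.\n"
    "def fib(n: int) -> int:\n"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,                       # cap the completion length
    do_sample=False,                         # greedy decoding for reproducibility
    pad_token_id=tokenizer.eos_token_id,     # silence the missing-pad-token warning
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Fine-tuning and transfer learning, by contrast, update the model's weights on task-specific data (e.g., code-comment pairs for summarization), trading additional compute and labeled data for tighter task alignment.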

Authors (9)
  1. Yuan Huang (85 papers)
  2. Yinan Chen (23 papers)
  3. Xiangping Chen (9 papers)
  4. Junqi Chen (8 papers)
  5. Rui Peng (79 papers)
  6. Zhicao Tang (1 paper)
  7. Jinbo Huang (2 papers)
  8. Furen Xu (2 papers)
  9. Zibin Zheng (194 papers)