Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code

Published 14 Nov 2023 in cs.CL, cs.AI, and cs.SE | (2311.07989v7)

Abstract: In this work we systematically review the recent advancements in software engineering with LLMs, covering 70+ models, 40+ evaluation tasks, 180+ datasets, and 900 related works. Unlike previous works, we integrate software engineering (SE) with NLP by discussing the perspectives of both sides: SE applies LLMs for development automation, while NLP adopts SE tasks for LLM evaluation. We break down code processing models into general LLMs represented by the GPT family and specialized models that are specifically pretrained on code, often with tailored objectives. We discuss the relations and differences between these models, and highlight the historical transition of code modeling from statistical models and RNNs to pretrained Transformers and LLMs, which is exactly the same course that had been taken by NLP. We also go beyond programming and review LLMs' application in other software engineering activities including requirement engineering, testing, deployment, and operations in an endeavor to provide a global view of NLP in SE, and identify key challenges and potential future directions in this domain. We keep the survey open and updated on GitHub at https://github.com/codefuse-ai/Awesome-Code-LLM.


References (694)

Citations (40)


Summary

The paper presents a comprehensive review of 70+ language models for code, bridging NLP and software engineering.
It traces the evolution from statistical models and RNNs to pretrained Transformers, and covers code-specific features such as abstract syntax trees (AST) and control flow graphs (CFG).
It surveys 40+ evaluation tasks and discusses emerging challenges and future benchmarks for AI-driven code processing.

Survey of LLMs for Code: Bridging NLP and Software Engineering

The integration of NLP with software engineering (SE) has led to notable advances in code processing with LLMs. This paper reviews 70+ models and their associated methodologies, centering on the interplay between general-purpose LLMs from NLP, such as the GPT family, and specialized models pretrained on code.

Historical Evolution and Model Architectures

The paper traces the evolution of code modeling from statistical models and RNNs to modern pretrained Transformers, the same course that NLP itself followed. Along the way, LLMs progressively incorporated code-specific training, with models such as Codex leading to tools such as GitHub Copilot. The work divides LLMs for code into general-purpose models and models explicitly pretrained on code, and discusses the relations and differences between them.

Code-Specific Features and Methodologies

The authors examine code-specific features such as abstract syntax trees (AST), control flow graphs (CFG), and unit tests, and how these are integrated into LLM training. Special attention is given to how these features contribute to understanding and generating code, and to the contrast between methodologies designed specifically for software and those adapted from NLP.
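To make the notion of an AST feature concrete, here is a minimal sketch using Python's standard `ast` module. It parses a small function and flattens the tree into a pre-order sequence of node types, which is one simple way structural information could be exposed to a sequence model. The sample source and the helper names `preorder_node_types` and `node_type_counts` are assumptions made for illustration; the summary does not say how any particular model encodes ASTs.

```python
import ast
from collections import Counter

# Hypothetical sample program, used only for illustration.
SOURCE = '''
def add(a, b):
    if a < 0:
        return b
    return a + b
'''


def preorder_node_types(source: str) -> list[str]:
    """Flatten the AST of `source` into a pre-order list of node type names."""

    def visit(node: ast.AST):
        yield type(node).__name__
        for child in ast.iter_child_nodes(node):
            yield from visit(child)

    return list(visit(ast.parse(source)))


def node_type_counts(source: str) -> Counter:
    """Count how often each AST node type occurs in `source`."""
    return Counter(preorder_node_types(source))


if __name__ == "__main__":
    sequence = preorder_node_types(SOURCE)
    print(sequence[:6])
    print(node_type_counts(SOURCE).most_common(3))
```

The flattened sequence keeps structure that a plain token stream hides, such as the fact that the first `return` sits inside the `if` branch. A control flow graph would need additional analysis that the standard library does not provide.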

Evaluation Tasks and Benchmarks

The survey gives an overview of 40+ evaluation tasks, categorized by input-output modality, such as text-to-code and code-to-text. It describes how the focus has shifted from code understanding tasks, such as clone detection and defect analysis, to generation tasks, and discusses key evaluation metrics such as pass@k, which measures functional correctness by running generated code against unit tests. The paper also highlights repository-level evaluation, which assesses code generation and refinement in the context of a whole codebase rather than an isolated function.
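The summary does not say how pass@k is computed. A common choice is the unbiased estimator introduced alongside Codex and HumanEval: generate n samples per problem, count the c samples that pass all unit tests, and estimate pass@k as 1 - C(n-c, k) / C(n, k). The sketch below assumes that definition; the function names and the example numbers are illustrative and are not taken from the survey.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of generated samples
    c: number of those samples that pass every unit test
    k: sample budget being evaluated
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset contains no passing sample.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical results: three problems, 10 samples each.
    results = [(10, 3), (10, 0), (10, 10)]
    print(mean_pass_at_k(results, k=1))
    print(mean_pass_at_k(results, k=5))
```

For k = 1 the estimator reduces to the fraction of passing samples, c / n. Averaging the per-problem estimates, rather than pooling all samples, keeps easy problems from dominating the score.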

Emerging Challenges and Future Directions

The analysis surfaces critical challenges, in particular the continuing need for extensive datasets and diverse benchmarks. The study suggests that large-scale benchmarks built from real-world code could support more reliable evaluation and standardization across the field. The authors also anticipate the integration of LLMs into the entire software development cycle, extending their use beyond single tasks to software engineering practice as a whole.

Implications and Theoretical Considerations

The implications of this review are both theoretical and practical. Theoretically, it offers a scaffold for further exploration of model architectures and pretraining objectives that make effective use of both text and code. Practically, it guides the creation of robust models capable of capturing the syntactic and semantic nuances of different programming languages.

Conclusion

Bridging NLP and software engineering, this survey offers a panoramic view of how LLMs have transformed code processing. It underscores ongoing improvement and integration efforts that push towards increasingly sophisticated AI-driven software development workflows. Future developments may hinge on closer collaboration across disciplines as the boundary between traditional programming and AI-augmented development blurs.

The findings and analyses put forth in this paper offer a roadmap for future advancements in AI-driven code processing and pose intriguing questions for subsequent research and application in both NLP and SE domains.


