Learning Natural Coding Conventions (1402.4182v3)

Published 17 Feb 2014 in cs.SE

Abstract: Every programmer has a characteristic style, ranging from preferences about identifier naming to preferences about object relationships and design patterns. Coding conventions define a consistent syntactic style, fostering readability and hence maintainability. When collaborating, programmers strive to obey a project's coding conventions. However, one third of reviews of changes contain feedback about coding conventions, indicating that programmers do not always follow them and that project members care deeply about adherence. Unfortunately, programmers are often unaware of coding conventions because inferring them requires a global view, one that aggregates the many local decisions programmers make and identifies emergent consensus on style. We present NATURALIZE, a framework that learns the style of a codebase, and suggests revisions to improve stylistic consistency. NATURALIZE builds on recent work in applying statistical natural language processing to source code. We apply NATURALIZE to suggest natural identifier names and formatting conventions. We present four tools focused on ensuring natural code during development and release management, including code review. NATURALIZE achieves 94% accuracy in its top suggestions for identifier names and can even transfer knowledge about conventions across projects, leveraging a corpus of 10,968 open source projects. We used NATURALIZE to generate 18 patches for 5 open source projects: 14 were accepted.

Summary

The paper introduces Naturalize, a framework that leverages statistical NLP to infer and enforce coding conventions.
The methodology achieves 94% accuracy in its top suggestions for identifier names and 96% for formatting, which could streamline code review and improve developer productivity.
The framework supports cross-project learning: it transfers knowledge about conventions across codebases using a corpus of 10,968 open source projects.

Analyzing "Learning Natural Coding Conventions"

The paper "Learning Natural Coding Conventions" presents Naturalize, a framework that learns the stylistic conventions of a codebase and suggests revisions that improve consistency with them. It addresses a common problem: developers do not always follow a project's conventions, often because inferring them requires a global view of the codebase, and about one third of reviews of changes contain feedback about conventions. Building on statistical NLP applied to source code, the framework offers a data-driven way to maintain syntactic consistency, particularly in identifier naming and formatting.

Core Contributions

Naturalize frames coding conventions as a statistical learning problem rather than a set of static rules. The paper's view is that conventions are emergent consensus patterns rather than legislated constraints. Naturalize therefore uses a probabilistic model to capture the 'naturalness' of code, that is, how closely it conforms to the style of the surrounding codebase. Key contributions of the paper include:

Framework for Style Consistency: The primary function of Naturalize is to observe a codebase, detect prevailing naming and formatting styles, and propose alterations to harmonize code that deviates from these patterns.
Tools for Developer Productivity: The paper presents four tools built on Naturalize for use during development and release management, including automated pre-commit checks, an Eclipse IDE plugin, and code review assistance, that alert developers when their code departs from a project's conventions.
High Accuracy Rates: The framework achieves 94% accuracy in its top suggestions for identifier names and an average accuracy of 96% for formatting suggestions. These figures suggest it is practical for software maintenance and could reduce developer workload by cutting the volume of convention-related review feedback.
Cross-Project Learning Capabilities: The system can transfer conventions between projects by leveraging a large corpus of open-source projects. This cross-learning capability allows developers to infuse best practices and community consensus into their codebases, further promoting consistency.
Empirical Validation: Through empirical studies, the paper shows that developers care about coding conventions. Of particular interest is the finding that a substantial portion of code review feedback at Microsoft concerns adherence to such conventions, illustrating the real-world relevance and potential impact of this research.
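
The idea of scoring 'naturalness' with a probabilistic model and proposing more conventional names can be illustrated with a toy example. The sketch below is not the paper's implementation: it assumes a bigram language model with add-one smoothing over code tokens, and the function names (`train_bigram_counts`, `log_prob`, `suggest_name`), the three-snippet corpus, and the candidate list are all hypothetical, chosen only for illustration. It ranks candidate names for an identifier by the probability the model assigns to the snippet after renaming.

```python
import math
from collections import Counter

def train_bigram_counts(token_sequences):
    """Count unigrams and bigrams over tokenized code snippets."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in token_sequences:
        padded = ["<s>"] + list(tokens) + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def log_prob(tokens, unigrams, bigrams):
    """Add-one smoothed bigram log-probability of a token sequence."""
    vocab_size = len(unigrams) + 1  # +1 reserves mass for unseen tokens
    padded = ["<s>"] + list(tokens) + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        numerator = bigrams[(prev, cur)] + 1
        denominator = unigrams[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total

def suggest_name(tokens, old_name, candidates, unigrams, bigrams):
    """Rank candidate names, most 'natural' first, for one identifier."""
    scored = []
    for name in candidates:
        renamed = [name if t == old_name else t for t in tokens]
        scored.append((log_prob(renamed, unigrams, bigrams), name))
    scored.sort(reverse=True)
    return [name for _, name in scored]

# Hypothetical "codebase": three tokenized loop headers.
corpus = [
    "for ( int i = 0 ; i < n ; i ++ )".split(),
    "for ( int i = 0 ; i < len ; i ++ )".split(),
    "for ( int j = 0 ; j < n ; j ++ )".split(),
]
unigrams, bigrams = train_bigram_counts(corpus)

# A new snippet whose loop variable departs from the corpus convention.
snippet = "for ( int idx = 0 ; idx < n ; idx ++ )".split()
ranking = suggest_name(snippet, "idx", ["idx", "i", "j"], unigrams, bigrams)
print(ranking)  # prints ['i', 'j', 'idx']
```

Here the corpus uses `i` most often, so renaming `idx` to `i` yields the highest-probability snippet. A real system would need a far larger corpus, a stronger model, and a rule for when a suggestion is confident enough to show to the developer.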

Practical and Theoretical Implications

Theoretically, the research broadens our understanding of coding conventions within software engineering, viewing them through the lens of probabilistic modeling rather than prescriptive norms. Methodologically, it intersects software engineering with statistical learning, providing a new paradigm for automated software maintenance and readability enhancement.

Practically, Naturalize offers tangible benefits in software development environments by streamlining the revision process for coding style violations and ensuring stylistic uniformity across large and distributed teams. The adoption of this technology could markedly reduce the cognitive load on developers during code integration, leading to improvements in productivity and code quality.

Future Development

The paper hints at future explorations such as adapting Naturalize to other programming languages and integrating it with code editing and review tools like Gerrit. More advanced language models could further improve its ability to detect nuanced style conventions and suggest semantically rich identifiers in diverse contexts.

In conclusion, the paper makes a significant advance in automating code style enforcement. Its empirical results, including the acceptance of 14 of the 18 patches generated for 5 open source projects, point to Naturalize's effectiveness and its potential as a tool for improving code quality and maintainability.
