
A parallel corpus of Python functions and documentation strings for automated code documentation and code generation (1707.02275v1)

Published 7 Jul 2017 in cs.CL and cs.AI

Abstract: Automated documentation of programming source code and automated code generation from natural language are challenging tasks of both practical and scientific interest. Progress in these areas has been limited by the low availability of parallel corpora of code and natural language descriptions, which tend to be small and constrained to specific domains. In this work we introduce a large and diverse parallel corpus of a hundred thousand Python functions with their documentation strings ("docstrings") generated by scraping open source repositories on GitHub. We describe baseline results for the code documentation and code generation tasks obtained by neural machine translation. We also experiment with data augmentation techniques to further increase the amount of training data. We release our datasets and processing scripts in order to stimulate research in these areas.

Citations (153)

Summary

  • The paper introduces a parallel corpus of over 150K Python functions paired with docstrings to enable automated code documentation and generation.
  • Baseline experiments using neural machine translation models achieved BLEU scores of 13.84 for code-to-docstring and 10.24 for docstring-to-code tasks.
  • The study highlights the corpus's potential to drive future research in context-aware and syntax-informed methods for automated programming assistance.

Overview of a Parallel Corpus for Python Code Documentation and Generation

The paper by Miceli Barone and Sennrich presents a notable advancement in developing resources for automated code documentation and code generation. The core contribution is the introduction of a comprehensive parallel corpus composed of over 150,000 Python function definitions paired with their corresponding documentation strings, or docstrings. This corpus is constructed by mining open-source repositories on GitHub. The authors address the limitations of previous datasets by creating a resource that is both large and diverse, aiming to more accurately capture the complexity inherent in these tasks.

Existing Limitations and Proposed Solution

Prior efforts in creating datasets for code documentation have often been constrained by size, noise, or domain specificity. Existing corpora, such as the DJANGO dataset or those derived from trading card games, tend to be either small or structured in ways that do not fully reflect the intricacies of real-world code documentation. For instance, using pseudo-code in place of natural language descriptions, or relying on repetitive, templated code, yields overly simplistic datasets. These limitations make previous datasets artificially easy benchmarks, as evidenced by BLEU scores far higher than those typical of natural language translation tasks.

To overcome these challenges, the authors propose the use of naturally occurring docstrings, which provide rich, context-specific documentation directly from the codebase itself. By leveraging Python's intrinsic support for docstrings, their corpus captures a wide variety of functions and real-world documentation styles.
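Python's first-class support for docstrings is what makes this extraction feasible. As a minimal sketch of the idea (not the authors' released processing scripts), the standard-library `ast` module can pair each function definition with its docstring, keeping only documented functions:

```python
import ast

# Toy stand-in for a scraped source file; names here are illustrative.
SOURCE = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b

def helper():
    return 42
'''

def extract_pairs(source):
    """Yield (function_name, docstring) pairs for documented functions."""
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            doc = ast.get_docstring(node)
            if doc:  # skip undocumented functions like helper()
                pairs.append((node.name, doc))
    return pairs
```

In a real pipeline the function body (e.g., via `ast.unparse` or source slicing) would form the code side of the parallel pair; this sketch returns only the name for brevity.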

Baseline Experiments and Observations

The paper reports on baseline experiments conducted using neural machine translation models applied to this new dataset. The BLEU scores achieved (13.84 for code-to-docstring translation and 10.24 for docstring-to-code generation) are low compared with results reported on earlier corpora, highlighting the task's complexity. The authors also explore data augmentation using synthetic docstrings generated from unannotated code fragments, which yields modest performance improvements.
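BLEU, the metric behind these scores, is a modified n-gram precision combined with a brevity penalty. A minimal sentence-level variant can be sketched as follows; the add-one smoothing is an illustrative simplification, not the exact corpus-level metric the paper uses:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU with add-one smoothed n-gram precisions."""
    if not hypothesis:
        return 0.0
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        ref_counts = Counter(ngrams(reference, n))
        hyp_counts = Counter(ngrams(hypothesis, n))
        # clipped overlap: each hypothesis n-gram counts at most as
        # often as it appears in the reference
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        prec = (overlap + 1) / (total + 1)  # add-one smoothing avoids log(0)
        log_prec_sum += math.log(prec)
    # brevity penalty punishes hypotheses shorter than the reference
    bp = min(1.0, math.exp(1 - len(reference) / len(hypothesis)))
    return bp * math.exp(log_prec_sum / max_n)
```

A perfect match scores 1.0, while a one-token hypothesis against a six-token reference is heavily penalized by both precision and brevity, which illustrates why scores in the low teens indicate a genuinely hard translation task.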

These results underscore the significant difficulty presented by the proposed dataset, suggesting that it offers a more rigorous benchmark for evaluating code documentation and generation systems. The paper concludes that existing translation techniques may not suffice, motivating research into more sophisticated methods.

Implications and Future Directions

The implications of this work are both practical and theoretical. Practically, the availability of such a corpus can enhance the development of tools integrated into software development environments, offering functionalities such as automatically generated documentation or code snippets from natural language descriptions. Furthermore, the release of this dataset and corresponding scripts lays a valuable foundation for future research, encouraging the exploration of context-aware and syntax-informed approaches to automated code documentation and generation.

Theoretically, this research informs our understanding of the intersection between programming languages and natural language processing. By challenging existing models with a dataset that more faithfully represents human-like code documentation tasks, this work may contribute to advancements in neural architectures and learning paradigms capable of mirroring the nuanced reasoning evident in human programmers.

In conclusion, the introduction of a parallel corpus for Python code documentation represents a meaningful step toward comprehensive modeling of automated code translation tasks. This resource not only aims to foster innovation in automated programming assistance but may also inspire advancements in the broader AI community regarding complex, multimodal translation tasks.