Unsupervised Translation of Programming Languages: An Expert Overview
The paper "Unsupervised Translation of Programming Languages" by Lachaux et al. offers an innovative approach to transcompiling, effectively extending the achievements of neural machine translation (NMT) into the domain of programming languages. The authors propose an unsupervised neural transcompiler, TransCoder
, capable of translating functions between C++, Java, and Python using only monolingual source code data from publicly available GitHub repositories. This work marks a departure from traditional rule-based transcompilers, which require extensive manual effort to create handcrafted rewrite rules applied to abstract syntax trees (ASTs).
Key Contributions
- Unsupervised Training with Monolingual Data: TransCoder leverages a large corpus of monolingual source code and is trained with unsupervised machine translation techniques, namely denoising auto-encoding and iterative back-translation (see the back-translation sketch after this list). This removes the need for parallel datasets, which are scarce for programming languages, and means the approach can in principle extend to any language with a sufficiently large body of publicly available source code.
- Cross-Programming Language Pretraining: The research adapts cross-lingual language model (XLM) pretraining, built on a masked language modeling (MLM) objective, to align similar constructs across different programming languages, using tokens common to all of them, such as keywords and standard library function names, as anchor points (see the masking sketch after this list). This shared representation lays the foundation for semantically consistent translations without explicit parallel training.
- Empirical Validation and Data Release: Beyond training the model, the authors constructed and released a test set of 852 parallel functions, each equipped with unit tests to check translation correctness. Measured by computational accuracy, the fraction of translations that pass these unit tests (see the evaluation sketch after this list), the model shows significant improvements over commercial rule-based baselines; the authors argue this metric reflects translation quality more faithfully than traditional BLEU scores.
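As referenced in the first bullet, the sketch below illustrates one round of back-translation in simplified Python. The `model` object and its `translate`/`train_step` methods are hypothetical stand-ins, not the paper's released API; in the actual training, these updates are interleaved with denoising auto-encoding batches.

```python
def back_translation_round(model, java_corpus, python_corpus):
    """One round of back-translation between Java and Python (sketch)."""
    # 1. Translate monolingual Java functions into (possibly noisy) Python.
    synthetic_python = [model.translate(f, src="java", tgt="python")
                        for f in java_corpus]
    # 2. Train the Python -> Java direction on (synthetic Python, real Java)
    #    pairs: the target side is always human-written code.
    for src, tgt in zip(synthetic_python, java_corpus):
        model.train_step(src, tgt, src_lang="python", tgt_lang="java")

    # 3. Repeat in the opposite direction, so both directions improve
    #    as the quality of the synthetic sources improves.
    synthetic_java = [model.translate(f, src="python", tgt="java")
                      for f in python_corpus]
    for src, tgt in zip(synthetic_java, python_corpus):
        model.train_step(src, tgt, src_lang="java", tgt_lang="python")
```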
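The masking sketch below shows the core of the MLM objective in simplified form. Because C++, Java, and Python share many surface tokens, predicting masked tokens over a shared vocabulary naturally pulls the languages into a common embedding space; BPE tokenization and the full masking strategy are omitted, so treat this as a minimal sketch rather than the paper's exact procedure.

```python
import random

MASK = "<MASK>"

def mask_tokens(tokens, mask_prob=0.15, rng=random):
    """Randomly mask tokens for MLM pretraining (simplified)."""
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)   # token the model must recover
        else:
            masked.append(tok)
            targets.append(None)  # position not predicted
    return masked, targets

# Tokens such as "for", "if", "while", and "==" occur in all three
# languages, so they act as anchor points aligning the embeddings.
cpp_tokens = ["for", "(", "int", "i", "=", "0", ";",
              "i", "<", "n", ";", "++", "i", ")"]
print(mask_tokens(cpp_tokens))
```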
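Finally, the evaluation sketch: computational accuracy counts a translation as correct only if it behaves like the reference on a suite of unit tests. The harness below is a hypothetical simplification of that metric; the released evaluation scripts actually compile and execute each translated function against its tests.

```python
def computational_accuracy(translations, unit_tests):
    """Fraction of translated functions that pass their unit tests.

    `translations` maps a function id to translated source code;
    `unit_tests` maps the same id to a callable returning True/False
    (a hypothetical harness interface, assumed for illustration).
    """
    passed = sum(1 for fid, code in translations.items()
                 if unit_tests[fid](code))
    return passed / len(translations)
```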
Results and Analysis
The model demonstrates promising results, outperforming traditional rule-based transcompilation systems on computational accuracy, with beam search decoding contributing substantially to correctness. In Java to Python translation, for instance, computational accuracy rises sharply as the beam size grows, indicating that the model often generates a correct translation but does not rank it first.
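To make the decoding mechanism concrete, here is a minimal generic beam search; `step_scores` stands in for the decoder's next-token log-probabilities and is an assumed interface, not TransCoder's actual implementation.

```python
def beam_search(step_scores, beam_size, max_len, eos="</s>"):
    """Generic beam search over token sequences (sketch).

    `step_scores(prefix)` is assumed to return a dict mapping each
    candidate next token to its log-probability.
    """
    beams = [([], 0.0)]  # (token prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, logp in step_scores(prefix).items():
                candidates.append((prefix + [tok], score + logp))
        # Keep only the `beam_size` highest-scoring hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    return sorted(finished or beams, key=lambda c: c[1], reverse=True)
```

When unit tests are available at evaluation time, success can be granted if any of the N hypotheses passes; the paper's results suggest that much of the gap between the top-1 score and larger beams comes from correct translations the model generates but does not rank first.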
Despite this progress, the paper identifies several remaining failure modes. Many translation errors stem from issues that beam search alone cannot resolve, such as incorrect variable type inference or syntactic mismatches between source and target languages. Addressing these through architectural enhancements, syntactic constraints during decoding, or compiler feedback could further improve reliability and performance; a sketch of the compiler-feedback idea follows.
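As one concrete illustration of the compiler-feedback idea, the sketch below re-ranks beam candidates by whether they compile, falling back to the top-scoring hypothesis. This is a hypothetical post-processing filter suggested by the discussion above, not a component of TransCoder; a real harness would need to wrap bare functions in a compilable unit.

```python
import os
import subprocess
import tempfile

def first_compiling_candidate(candidates):
    """Return the highest-ranked Java candidate that at least compiles.

    `candidates` is assumed to be ordered by model score (best first).
    Hypothetical filter for illustration; compilation success does not
    guarantee semantic correctness.
    """
    for code in candidates:
        with tempfile.NamedTemporaryFile("w", suffix=".java",
                                         delete=False) as f:
            f.write(code)
            path = f.name
        try:
            ok = subprocess.run(["javac", path],
                                capture_output=True).returncode == 0
        finally:
            os.unlink(path)
        if ok:
            return code
    return candidates[0]  # fall back to the top-scoring hypothesis
```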
Practical and Theoretical Implications
From a practical standpoint, the authors' approach to unsupervised transcompilation could markedly reduce the cost and expertise required to migrate old codebases to modern languages. Furthermore, it democratizes the translation of code across languages, potentially facilitating more seamless open-source collaboration.
Theoretically, this approach marks a notable extension of NMT from natural language to the syntactically rigid domain of programming languages. The cross-lingual pretraining strategy introduced here could inspire further research into applying these methods to other structured data representations, broadening the range of contexts in which machine translation techniques apply.
Future Directions
The research opens up numerous avenues for future work. Enhancing the robustness of variable type inference, improving the handling of language-specific APIs, and adapting the model to other widely-used or emergent programming languages could significantly broaden the utility of unsupervised transcompilation.
Additionally, as the domain continues to evolve, mechanisms that detect and correct semantic errors, combined with iterative improvement through active learning, could bridge the gaps highlighted by current translation failures.
In conclusion, Lachaux et al.'s work presents a significant advancement in the field of code translation, offering an unsupervised method that challenges the existing paradigm of transcompilation. By harnessing the power of neural networks and monolingual learning, the research provides a compelling step toward fully automated codebase evolution and interoperability.