
Does BLEU Score Work for Code Migration? (1906.04903v1)

Published 12 Jun 2019 in cs.SE and cs.CL

Abstract: Statistical machine translation (SMT) is a fast-growing sub-field of computational linguistics. Until now, the most popular automatic metric to measure the quality of SMT is BiLingual Evaluation Understudy (BLEU) score. Lately, SMT along with the BLEU metric has been applied to a Software Engineering task named code migration. (In)Validating the use of BLEU score could advance the research and development of SMT-based code migration tools. Unfortunately, there is no study to approve or disapprove the use of BLEU score for source code. In this paper, we conducted an empirical study on BLEU score to (in)validate its suitability for the code migration task due to its inability to reflect the semantics of source code. In our work, we use human judgment as the ground truth to measure the semantic correctness of the migrated code. Our empirical study demonstrates that BLEU does not reflect translation quality due to its weak correlation with the semantic correctness of translated code. We provided counter-examples to show that BLEU is ineffective in comparing the translation quality between SMT-based models. Due to BLEU's ineffectiveness for code migration task, we propose an alternative metric RUBY, which considers lexical, syntactical, and semantic representations of source code. We verified that RUBY achieves a higher correlation coefficient with the semantic correctness of migrated code, 0.775 in comparison with 0.583 of BLEU score. We also confirmed the effectiveness of RUBY in reflecting the changes in translation quality of SMT-based translation models. With its advantages, RUBY can be used to evaluate SMT-based code migration models.

Citations (51)

Summary

  • The paper demonstrates that BLEU weakly correlates with semantic accuracy in SMT-based code migration.
  • It employs a large Java-C# dataset and counter-examples to expose mismatches between high BLEU scores and correct code functionality.
  • The research introduces RUBY, a metric that integrates lexical, syntactic, and semantic evaluations for more reliable model comparisons.

The paper empirically evaluates the effectiveness of the Bilingual Evaluation Understudy (BLEU) metric for assessing the quality of code migration performed by statistical machine translation (SMT) techniques. It posits that BLEU, which is conventionally used in natural language processing, may not be well-suited for evaluating code migration due to the strict syntactic and semantic requirements of programming languages.

The authors hypothesize that the BLEU score is ineffective for evaluating migrated source code because it focuses primarily on lexical phrase-to-phrase translation and neglects the syntactic and semantic aspects critical to code functionality. The paper aims to disprove BLEU's effectiveness by contradicting two necessary conditions: 1) BLEU reflects the semantic accuracy of migrated source code, and 2) BLEU effectively compares translation quality between SMT-based migration models.
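
To make this lexical focus concrete, the following minimal sketch (in Python, with invented token sequences rather than the paper's data) computes sentence-level BLEU as clipped n-gram precisions combined with a brevity penalty; a single flipped comparison operator inverts the loop's behavior yet leaves the score high.

import math
from collections import Counter

def modified_precision(reference, candidate, n):
    # Clipped n-gram precision: each candidate n-gram is credited at most as
    # often as it occurs in the reference.
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    if not cand:
        return 0.0
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    return clipped / sum(cand.values())

def bleu(reference, candidate, max_n=4):
    # Geometric mean of the 1..max_n precisions, scaled by a brevity penalty.
    ps = [modified_precision(reference, candidate, n) for n in range(1, max_n + 1)]
    if min(ps) == 0:
        return 0.0
    bp = 1.0 if len(candidate) >= len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(math.log(p) for p in ps) / max_n)

# The reference uses '<', the candidate uses '>': the loop's semantics are inverted,
# yet BLEU stays around 0.78 because almost all n-grams still match.
ref = "for ( int i = 0 ; i < n ; i ++ )".split()
cand = "for ( int i = 0 ; i > n ; i ++ )".split()
print(round(bleu(ref, cand), 3))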

To challenge these conditions, the authors employ counter-examples and specific model selections. For the first condition, they analyze models like lpSMT, which focuses on phrase translation and tends to produce code with high lexical accuracy but potentially incorrect semantics. Conversely, they examine mppSMT, which emphasizes structure and semantic accuracy but may exhibit a wide range of BLEU scores. For the second condition, they introduce an artificial model, p-mppSMT, derived from mppSMT, which generates migrated code with similar BLEU scores but differing program semantics.

The experimental setup involves a dataset of 34,209 pairs of Java and C# methods that were manually migrated by developers. The SMT-based migration models are applied to these methods, and a subset of 375 randomly selected pairs is manually evaluated for semantic accuracy, with a semantic score assigned based on the functional similarity between the migrated code and the reference code.

The results indicate a weak correlation between BLEU scores and semantic scores. Specifically, the correlation coefficient is 0.523 for mppSMT and 0.570 for lpSMT. The paper provides examples where high BLEU scores do not correspond to high semantic accuracy, and vice versa, thus invalidating the first necessary condition. Furthermore, the paired t-test conducted on mppSMT and p-mppSMT, which have equivalent BLEU scores, reveals significant differences in their semantic scores.
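
The statistical machinery involved is standard; as a hedged sketch (with synthetic placeholder scores rather than the paper's data, and assuming a Pearson-style correlation, which the summary does not specify), the checks look roughly like this:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder scores for a sample of migrated methods; the paper uses 375 manually judged pairs.
bleu_scores = rng.uniform(0.2, 0.9, size=375)
semantic_scores = np.clip(bleu_scores + rng.normal(0.0, 0.25, size=375), 0.0, 1.0)

# Correlation between the automatic metric and human semantic correctness.
r, p = stats.pearsonr(bleu_scores, semantic_scores)
print(f"correlation r={r:.3f} (p={p:.2g})")

# Paired t-test over the same test set scored for two models
# (e.g. mppSMT vs. the perturbed p-mppSMT): equal BLEU need not imply equal semantics.
model_a = semantic_scores
model_b = np.clip(model_a - rng.uniform(0.0, 0.2, size=375), 0.0, 1.0)
t, p_t = stats.ttest_rel(model_a, model_b)
print(f"paired t-test t={t:.2f}, p={p_t:.2g}")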

The authors introduce RUBY, a novel metric that integrates lexical, syntactical, and semantic representations of source code. This multi-level metric evaluates the similarity between migrated code and expected results at different abstraction levels, including lexemes, Abstract Syntax Trees (ASTs), and Program Dependence Graphs (PDGs). The RUBY score is determined by the similarity score at the highest representation level that can be constructed for both versions of the code. The paper demonstrates that RUBY achieves a higher correlation coefficient with the semantic correctness of migrated code (0.775) compared to BLEU (0.583).

The paper details the design of RUBY, including the calculation of string similarity (STS), tree similarity (TRS), and graph similarity (GRS) using string edit distance, tree edit distance, and graph edit distance, respectively. The RUBY metric is defined as:

RUBY(R, T) = \begin{cases} GRS(R, T), & \text{if PDGs are applicable} \\ TRS(R, T), & \text{if ASTs are applicable} \\ STS(R, T), & \text{otherwise} \end{cases}

where:

  • R is the reference code,
  • T is the translated code,
  • GRS is the graph similarity,
  • TRS is the tree similarity,
  • STS is the string similarity.
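
A minimal sketch of this fallback logic appears below. Only the string-level similarity (STS) is fully implemented, as a normalized token-level edit distance; pdg_similarity and ast_similarity are hypothetical placeholders standing in for the PDG/AST construction and the graph and tree edit distances the paper uses.

def string_similarity(ref_tokens, trans_tokens):
    # STS: 1 - (token-level Levenshtein distance / length of the longer sequence).
    m, n = len(ref_tokens), len(trans_tokens)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_tokens[i - 1] == trans_tokens[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return 1.0 - dp[m][n] / max(m, n, 1)

def pdg_similarity(ref_code, trans_code):
    # GRS, hypothetical placeholder: would build PDGs and compare them via graph edit distance.
    raise NotImplementedError("PDG construction and comparison are not sketched here")

def ast_similarity(ref_code, trans_code):
    # TRS, hypothetical placeholder: would build ASTs and compare them via tree edit distance.
    raise NotImplementedError("AST construction and comparison are not sketched here")

def ruby(reference_code, translated_code):
    # Use the highest representation level that can be built for both programs,
    # falling back when PDGs or ASTs are not applicable (e.g. code that does not parse).
    for level in (pdg_similarity, ast_similarity):
        try:
            return level(reference_code, translated_code)
        except Exception:
            continue
    return string_similarity(reference_code.split(), translated_code.split())

print(ruby("int x = a + b ;", "int x = a - b ;"))  # falls back to STS in this sketch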

The experimental results on the alternative metrics (STS, TRS, GRS) show that the correlation coefficients with semantic scores increase with higher levels of abstraction. GRS exhibits the highest correlation, followed by TRS and STS. However, GRS and TRS are not always applicable due to limitations in constructing PDGs and ASTs from migrated code.

The effectiveness of RUBY in model comparison is validated through t-tests on mppSMT, p-mppSMT, lpSMT, and GNMT. The results demonstrate that RUBY's assessment of model quality aligns with the semantic scores. A further analysis using the Random Sample Consensus (RANSAC) algorithm identifies subsets of results where RUBY effectively reflects translation quality and those where it does not.
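
As a rough illustration of that analysis (using synthetic data and scikit-learn's default RANSAC settings, not the paper's configuration), one can fit a robust linear relation between metric values and semantic scores and read off the inlier/outlier split:

import numpy as np
from sklearn.linear_model import RANSACRegressor

rng = np.random.default_rng(0)
# Synthetic placeholder: metric values loosely tracking semantic scores, plus mismatching cases.
metric = rng.uniform(0.3, 1.0, size=375)
semantic = np.clip(0.9 * metric + rng.normal(0.0, 0.05, size=375), 0.0, 1.0)
semantic[:40] = rng.uniform(0.0, 1.0, size=40)  # cases where the metric misleads

ransac = RANSACRegressor(random_state=0)  # default linear base estimator
ransac.fit(metric.reshape(-1, 1), semantic)
inliers = ransac.inlier_mask_
print(f"metric tracks quality on {inliers.sum()} results; fails on {(~inliers).sum()}")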

In summary, the paper concludes that BLEU is ineffective for code migration due to its weak correlation with semantic accuracy and inconsistencies in model comparison. It introduces RUBY as an alternative metric that integrates lexical, syntactic, and semantic representations, offering improved reliability in evaluating SMT-based code migration models.