Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree (2002.08653v1)

Published 20 Feb 2020 in cs.SE and cs.AI

Abstract: Code clones are semantically similar code fragment pairs that are syntactically similar or different. Detection of code clones can help to reduce the cost of software maintenance and prevent bugs. Numerous approaches to detecting code clones have been proposed previously, but most of them focus on detecting syntactic clones and do not work well on semantic clones with different syntactic features. To detect semantic clones, researchers have tried to adopt deep learning for code clone detection to automatically learn latent semantic features from data. In particular, to leverage grammar information, several approaches used abstract syntax trees (AST) as input and achieved significant progress on code clone benchmarks in various programming languages. However, these AST-based approaches still cannot fully leverage the structural information of code fragments, especially semantic information such as control flow and data flow. To leverage control and data flow information, in this paper, we build a graph representation of programs called flow-augmented abstract syntax tree (FA-AST). We construct FA-AST by augmenting original ASTs with explicit control and data flow edges. Then we apply two different types of graph neural networks (GNN) to FA-AST to measure the similarity of code pairs. To the best of our knowledge, we are the first to apply graph neural networks to the domain of code clone detection. We apply our FA-AST and graph neural networks to two Java datasets: Google Code Jam and BigCloneBench. Our approach outperforms the state-of-the-art approaches on both Google Code Jam and BigCloneBench tasks.

Authors

Wenhan Wang
Ge Li
Bo Ma
Xin Xia
Zhi Jin

Citations (226)

Summary

Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree

The paper introduces an approach to code clone detection that applies Graph Neural Networks (GNN) to Flow-Augmented Abstract Syntax Trees (FA-AST). It targets semantic code clones: program fragments that implement similar functionality but may differ in syntax. The approach combines syntactic and semantic information to improve detection of such clones.

Detecting syntactically and semantically similar code helps reduce software maintenance costs and prevent bugs from propagating. Traditional methods focus primarily on syntactic similarity, leaving semantic clone detection comparatively underdeveloped. In response, the paper applies GNNs to FA-AST so that control flow and data flow are considered alongside the conventional AST structure.

The paper builds a hybrid code representation, FA-AST, by augmenting the original AST with explicit control flow and data flow edges. Two types of GNN, Gated Graph Neural Networks (GGNN) and Graph Matching Networks (GMN), are applied to model these enriched graphs. The networks compute vector representations of code fragments, which are then used to score the similarity of code pairs.
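The idea of augmenting an AST with flow edges can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses Python's standard `ast` module as a stand-in parser (the paper works on Java), and the function name `build_flow_augmented_graph` and the edge-type labels (`child`, `parent`, `next_sibling`, `next_use`, `cond_true`, `cond_false`, `loop_enter`, `loop_back`) are hypothetical names chosen here, not the paper's exact edge set.

```python
import ast


def build_flow_augmented_graph(source):
    """Parse `source` and return (node_labels, edges).

    edges is a list of (src_id, dst_id, edge_type) triples.
    Edge types are illustrative assumptions, not the paper's definitions.
    """
    tree = ast.parse(source)
    node_labels = []   # integer node id -> AST node type name
    edges = []
    ids = {}           # id(ast node) -> integer node id
    last_use = {}      # variable name -> node id of its previous occurrence

    def visit(node):
        node_id = len(node_labels)
        node_labels.append(type(node).__name__)
        ids[id(node)] = node_id

        # Data-flow style edge: chain successive occurrences of a variable.
        if isinstance(node, ast.Name):
            if node.id in last_use:
                edges.append((last_use[node.id], node_id, "next_use"))
            last_use[node.id] = node_id

        previous = None
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.expr_context):
                continue  # skip Load/Store markers to keep the graph small
            child_id = visit(child)
            # Plain AST structure, stored in both directions.
            edges.append((node_id, child_id, "child"))
            edges.append((child_id, node_id, "parent"))
            if previous is not None:
                edges.append((previous, child_id, "next_sibling"))
            previous = child_id

        # Control-flow style edges for branches and loops.
        if isinstance(node, ast.If):
            cond = ids[id(node.test)]
            edges.append((cond, ids[id(node.body[0])], "cond_true"))
            if node.orelse:
                edges.append((cond, ids[id(node.orelse[0])], "cond_false"))
        elif isinstance(node, ast.While):
            cond = ids[id(node.test)]
            edges.append((cond, ids[id(node.body[0])], "loop_enter"))
            edges.append((ids[id(node.body[-1])], cond, "loop_back"))
        return node_id

    visit(tree)
    return node_labels, edges


example = "x = 1\nif x > 0:\n    y = x\nelse:\n    y = 0\n"
labels, edges = build_flow_augmented_graph(example)
flow_edges = [e for e in edges if e[2] not in ("child", "parent", "next_sibling")]
print(len(labels), "nodes;", len(flow_edges), "flow edges")
```

A GNN would then consume the typed edge list, typically with separate parameters per edge type; that part is omitted here.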

The experiments use two Java datasets: Google Code Jam (GCJ) and BigCloneBench. The results indicate a substantial improvement over existing methods, especially in detecting Type-4 (semantic) clones. On GCJ, FA-AST combined with GMN achieves an F1-score of 0.98. On BigCloneBench, the method also outperforms prior approaches, most clearly in recall and F1-score, which indicates that it captures semantic similarities that traditional methods miss.
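To make the evaluation metrics concrete, the following sketch scores pairs of fragment embeddings and computes precision, recall, and F1. It rests on assumptions not stated in this summary: cosine similarity as the scoring function, a fixed decision threshold of 0.5, and a handful of made-up two-dimensional vectors and labels that are toy data, not results from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def precision_recall_f1(predicted, actual):
    """Compute precision, recall, and F1 from boolean predictions and labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy data: (embedding of fragment A, embedding of fragment B, is_clone label).
# These vectors are invented for illustration only.
pairs = [
    ([1.0, 0.0], [1.0, 0.1], True),
    ([0.5, 0.5], [0.4, 0.6], True),
    ([1.0, 0.0], [0.0, 1.0], False),
    ([1.0, 0.2], [0.9, 0.4], False),   # similar vectors, not a clone
    ([1.0, 0.0], [-1.0, 0.2], True),   # clone, but dissimilar vectors
]
THRESHOLD = 0.5  # assumed decision threshold

predicted = [cosine_similarity(a, b) > THRESHOLD for a, b, _ in pairs]
actual = [label for _, _, label in pairs]
precision, recall, f1 = precision_recall_f1(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

F1 is the harmonic mean of precision and recall, so a high F1 requires both few false positives and few missed clones.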

The paper discusses the implications of applying GNNs to code clone detection, in particular how explicit control flow and data flow edges deepen the semantic analysis. It also points to the potential for extending graph-based representations to other programming languages, which would broaden applicability and could benefit code comprehension tools.

Future research could refine the GNN architecture or combine this approach with token sequences or other program analysis techniques. Extending the methodology to larger datasets and more programming languages would further test how well it generalizes. The work illustrates the potential of machine learning techniques such as GNNs for complex software engineering problems.
