Advanced Detection of Source Code Clones via an Ensemble of Unsupervised Similarity Measures (2405.02095v2)
Abstract: Accurately determining code similarity is crucial in many software development tasks. For example, identifying code duplicates can be essential for software maintenance. This research introduces a novel ensemble learning approach for code similarity assessment that combines multiple unsupervised similarity measures. The key idea is that a diverse set of similarity measures can complement one another and mitigate individual weaknesses, leading to improved performance. Preliminary results show that while the Transformer-based CodeBERT and its variant GraphCodeBERT are undoubtedly the best option when abundant training data is available, on specific small datasets (up to 500 samples) our ensemble achieves similar results, without sacrificing the interpretability of the resulting solution, and with a much lower training-related carbon footprint. The source code of this approach can be downloaded from https://github.com/jorge-martinez-gil/ensemble-codesim.
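To make the ensemble idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes three illustrative unsupervised measures (token-set Jaccard, character n-gram overlap, and token-sequence matching via Python's `difflib`) and combines them with a plain weighted mean, whereas the paper describes a learned ensemble. All function names, the tokenizer, the default weights, and the 0.5 threshold are hypothetical choices made for illustration.

```python
import re
from difflib import SequenceMatcher


def tokens(code):
    """Split code into identifiers, numbers, and single punctuation characters."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def jaccard_tokens(a, b):
    """Jaccard index over the sets of tokens of both snippets."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def char_ngrams(text, n=3):
    """Set of character n-grams after collapsing whitespace."""
    text = re.sub(r"\s+", " ", text.strip())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def ngram_similarity(a, b, n=3):
    """Jaccard index over character n-gram sets."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    return len(ga & gb) / len(ga | gb)


def sequence_similarity(a, b):
    """Order-aware similarity of the two token sequences."""
    return SequenceMatcher(None, tokens(a), tokens(b)).ratio()


# Hypothetical pool of base measures; the paper's actual pool may differ.
MEASURES = (jaccard_tokens, ngram_similarity, sequence_similarity)


def similarity_vector(a, b):
    """One score per base measure; this is the ensemble's input."""
    return [measure(a, b) for measure in MEASURES]


def ensemble_score(a, b, weights=None):
    """Weighted mean of the base scores (assumed stand-in for a learned combiner)."""
    scores = similarity_vector(a, b)
    weights = weights or [1.0] * len(scores)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def is_clone(a, b, threshold=0.5):
    """Flag a pair as a clone when the combined score reaches the threshold."""
    return ensemble_score(a, b) >= threshold


if __name__ == "__main__":
    original = "int sum(int[] xs) { int s = 0; for (int x : xs) s += x; return s; }"
    renamed = "int total(int[] values) { int t = 0; for (int v : values) t += v; return t; }"
    unrelated = 'class Greeter { void hello() { System.out.println("hi"); } }'
    print(similarity_vector(original, renamed))
    print(ensemble_score(original, renamed), ensemble_score(original, unrelated))
```

Because each base measure stays visible in `similarity_vector`, the contribution of every measure to a decision can be inspected, which is the kind of interpretability the abstract refers to. With labeled pairs available, the fixed weights could be replaced by a trained combiner over the same vector.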