The Adverse Effects of Code Duplication in Machine Learning Models of Code (1812.06469v6)

Published 16 Dec 2018 in cs.SE and cs.LG

Abstract: The field of big code relies on mining large corpora of code to perform some learning task. A significant threat to this approach has been recently identified by Lopes et al. (2017) who found a large amount of near-duplicate code on GitHub. However, the impact of code duplication has not been noticed by researchers devising machine learning models for source code. In this work, we explore the effects of code duplication on machine learning models showing that reported performance metrics are sometimes inflated by up to 100% when testing on duplicated code corpora compared to the performance on de-duplicated corpora which more accurately represent how machine learning models of code are used by software engineers. We present a duplication index for widely used datasets, list best practices for collecting code corpora and evaluating machine learning models on them. Finally, we release tools to help the community avoid this problem in future research.

Authors (1)
  1. Miltiadis Allamanis (40 papers)
Citations (298)

Summary

The Impact of Code Duplication on Machine Learning Models of Source Code

The paper "The Adverse Effects of Code Duplication in Machine Learning Models of Code" by Miltiadis Allamanis addresses a crucial yet often overlooked issue in the domain of machine learning applications on source code—code duplication. It has been previously noted that substantial amounts of near-duplicate code exist on platforms such as GitHub, posing potential risks to the integrity of data-driven approaches in software engineering. This research provides a detailed analysis of how such duplication can skew the performance evaluation of machine learning models tailored for coding tasks and suggests best practices for mitigating these adverse effects.

Key Insights and Results

The paper identifies code duplication as a significant factor that can artificially inflate reported performance metrics, sometimes by up to 100%, when models are trained and tested on duplicate-laden corpora. This inflation is particularly concerning because it can mislead researchers into believing a model is more effective than it would be in practice, where the code a deployed model encounters is rarely a near-duplicate of its training data.

The paper describes a method for deduplicating datasets and applies it to ten publicly available code corpora, revealing duplication rates that underscore the pervasiveness of the issue. For example, the Concode dataset and several Python datasets exhibit high duplication levels. The author also releases tooling to help detect and quantify duplicates in future research.
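The released tooling detects near-duplicate files; the Python sketch below illustrates one common approach to the same idea, comparing Jaccard similarity over identifier and literal tokens. The tokenizer, threshold, and function names here are illustrative assumptions rather than the paper's exact implementation.

```python
import re
from itertools import combinations

def token_fingerprint(source: str) -> set:
    """Reduce a source file to its set of identifier/numeral tokens.

    Deliberately simple and illustrative; a real tool would use a proper
    per-language lexer and also consider literals.
    """
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*|\d+", source))

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

def find_near_duplicates(files: dict, threshold: float = 0.8):
    """Return pairs of file names whose token-set similarity meets an
    assumed threshold (0.8 here is illustrative, not the paper's value)."""
    prints = {name: token_fingerprint(src) for name, src in files.items()}
    return [
        (f1, f2)
        for f1, f2 in combinations(prints, 2)
        if jaccard(prints[f1], prints[f2]) >= threshold
    ]
```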

Theoretical and Empirical Validation

The paper explores the theoretical implications of code duplication, framing it as a violation of the i.i.d. (independent and identically distributed) assumption underlying standard machine learning evaluation. Duplication introduces bias in both the training and testing phases: when a test example is a near-duplicate of a training example, the model is rewarded for memorization rather than generalization, inflating its apparent accuracy.
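A toy illustration of this effect, not drawn from the paper's experiments: a model that merely memorizes its training examples appears far stronger when copies of training items leak into the test set. All names and numbers below are made up for the demonstration.

```python
# Toy demonstration: memorization masquerades as accuracy whenever
# test items duplicate training items.
from random import Random

rng = Random(0)
corpus = [f"snippet_{i}" for i in range(1000)]
labels = {s: rng.randint(0, 1) for s in corpus}      # arbitrary binary labels

train = corpus[:800]
test_dedup = corpus[800:]                            # no overlap with training
test_dup = corpus[800:900] + rng.sample(train, 100)  # 50% leaked duplicates

memorized = {s: labels[s] for s in train}            # the "model": pure lookup

def accuracy(test_set):
    # Predict the memorized label if the item was seen in training,
    # otherwise fall back to a constant guess.
    correct = sum(memorized.get(s, 0) == labels[s] for s in test_set)
    return correct / len(test_set)

print("accuracy on duplicated test set:   ", accuracy(test_dup))    # ~0.75
print("accuracy on de-duplicated test set:", accuracy(test_dedup))  # ~0.50
```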

The work includes empirical studies using several machine learning models of code, including neural language models, PHOG, JSNice, and code2vec. These experiments show that model capacity, the nature of the task, and dataset characteristics all influence how strongly duplication affects evaluation. Overall, larger-capacity models and metrics involving rare code elements, such as identifiers, are the most susceptible to duplication-induced bias.

Implications and Future Directions

The adverse effects documented in this paper suggest several avenues for improvement. The author proposes best practices for data collection and dataset management, emphasizing deduplication tailored to the application's true data distribution.
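One such practice, deduplicating before splitting so that no near-duplicate pair straddles the train/test boundary, can be sketched as a cluster-level split. The union-find clustering and function names below are assumptions for illustration, not the author's released tooling.

```python
from random import Random

def duplicate_clusters(files, duplicate_pairs):
    """Group files into clusters via union-find over near-duplicate pairs."""
    parent = {f: f for f in files}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in duplicate_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for f in files:
        clusters.setdefault(find(f), []).append(f)
    return list(clusters.values())

def cluster_split(files, duplicate_pairs, test_fraction=0.2, seed=0):
    """Assign whole duplicate clusters to a single split, so duplicates
    never appear on both sides of the train/test boundary."""
    clusters = duplicate_clusters(files, duplicate_pairs)
    Random(seed).shuffle(clusters)
    cutoff = int(len(files) * (1 - test_fraction))
    train, test, assigned = [], [], 0
    for cluster in clusters:
        (train if assigned < cutoff else test).extend(cluster)
        assigned += len(cluster)
    return train, test
```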

Additionally, the research underscores the need to develop models and evaluation methodologies that account for the evolutionary nature of software development, which inherently produces correlated data points through code reuse and cloning.

This paper is instrumental in urging the community to consider how duplication affects model evaluation and development in the burgeoning field of machine learning for code. Future work should focus on enhancing model robustness, better integrating deduplication tools, and exploring novel approaches that harness duplication constructively to automate and augment software engineering practices.