
SourcererCC: Scaling Code Clone Detection to Big Code (1512.06448v1)

Published 20 Dec 2015 in cs.SE

Abstract: Despite a decade of active research, there is a marked lack in clone detectors that scale to very large repositories of source code, in particular for detecting near-miss clones where significant editing activities may take place in the cloned code. We present SourcererCC, a token-based clone detector that targets three clone types, and exploits an index to achieve scalability to large inter-project repositories using a standard workstation. SourcererCC uses an optimized inverted-index to quickly query the potential clones of a given code block. Filtering heuristics based on token ordering are used to significantly reduce the size of the index, the number of code-block comparisons needed to detect the clones, as well as the number of required token-comparisons needed to judge a potential clone. We evaluate the scalability, execution time, recall and precision of SourcererCC, and compare it to four publicly available and state-of-the-art tools. To measure recall, we use two recent benchmarks, (1) a large benchmark of real clones, BigCloneBench, and (2) a Mutation/Injection-based framework of thousands of fine-grained artificial clones. We find SourcererCC has both high recall and precision, and is able to scale to a large inter-project repository (250MLOC) using a standard workstation.

Authors (5)
  1. Hitesh Sajnani (5 papers)
  2. Vaibhav Saini (6 papers)
  3. Jeffrey Svajlenko (3 papers)
  4. Chanchal K. Roy (55 papers)
  5. Cristina V. Lopes (16 papers)
Citations (515)

Summary

  • The paper introduces a token-based clone detection tool that efficiently scales to hundreds of millions of lines of code using an optimized inverted index and filtering heuristics.
  • The approach achieves high recall and precision by effectively detecting near-miss clones, even in extensive inter-project repositories.
  • The tool employs two filtering heuristics to reduce candidate comparisons, significantly cutting computational costs and improving detection efficiency.


The paper presents SourcererCC, a token-based clone detector built to scale to very large source code repositories, with a focus on near-miss clones, where the copied code has been edited significantly. The tool targets three clone types and combines an optimized inverted index with filtering heuristics based on token ordering, which lets it handle large inter-project repositories on a standard workstation.
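As a rough illustration of the index-then-verify idea described above, the following Python sketch builds an inverted index from tokens to code blocks, uses it to look up candidate blocks, and then checks each candidate by counting shared tokens. It is a minimal sketch under stated assumptions, not the tool's implementation: the whitespace tokenizer, the function names, the sample blocks, the 0.8 threshold, and the exact similarity rule (shared tokens relative to the larger block) are all choices made for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(code):
    # Crude whitespace tokenizer, a stand-in for a real lexer (assumption).
    return code.split()


def build_index(blocks):
    """Map each token to the set of block ids that contain it."""
    index = defaultdict(set)
    for block_id, tokens in blocks.items():
        for token in set(tokens):
            index[token].add(block_id)
    return index


def overlap(a, b):
    """Number of tokens shared by two token lists, counting duplicates."""
    return sum((Counter(a) & Counter(b)).values())


def find_clones(query_tokens, blocks, index, theta=0.8):
    """Return ids of blocks whose overlap with the query meets the threshold."""
    # Step 1: the index yields only blocks sharing at least one token.
    candidates = set()
    for token in set(query_tokens):
        candidates |= index.get(token, set())
    # Step 2: verify each candidate with a full token comparison.
    clones = []
    for block_id in sorted(candidates):
        other = blocks[block_id]
        # round() guards against float noise before taking the ceiling.
        needed = math.ceil(round(theta * max(len(query_tokens), len(other)), 9))
        if overlap(query_tokens, other) >= needed:
            clones.append(block_id)
    return clones


# Hypothetical sample data: "b" is "a" with one identifier renamed.
blocks = {
    "a": tokenize("int sum = 0 ; for i in xs : sum += i ; return sum"),
    "b": tokenize("int total = 0 ; for i in xs : total += i ; return total"),
    "c": tokenize("print hello world"),
}
index = build_index(blocks)
clones_loose = find_clones(blocks["a"], blocks, index, theta=0.8)
clones_strict = find_clones(blocks["a"], blocks, index, theta=0.9)
```

In this toy data, block "c" shares no token with the query, so the index never proposes it as a candidate. Without further filtering, any block sharing even one common token (such as a semicolon) would still be compared in full, which is the cost the paper's filtering heuristics are meant to reduce.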

Key Contributions

SourcererCC addresses the lack of scalable tools for large-scale clone detection, a pressing issue given the growth of open-source projects. The primary contributions include:

  1. Scalability: By combining an optimized inverted index with filtering heuristics, SourcererCC scales to hundreds of millions of lines of code (LOC) on a single standard workstation. It processes a 250MLOC repository within 4.5 days, outperforming several state-of-the-art tools.
  2. Accurate Detection: The paper reports high recall and precision for near-miss clones, so reused code blocks can still be identified after they have been edited. This matters because Type-3 clones, which often result from significant edits, are prevalent but hard to detect.
  3. Efficient Filtering: Two filtering heuristics are employed. The first reduces candidate comparisons using a sub-block filtering approach, and the second employs token position to further minimize unnecessary comparisons. These heuristics significantly lower computational costs and improve detection efficiency.
  4. Comprehensive Benchmarking: Recall is measured on two benchmarks: BigCloneBench, a large benchmark of real clones, and a Mutation/Injection-based framework of thousands of fine-grained artificial clones. On both, SourcererCC shows high recall and precision, matching or outperforming the other tools in the reported experiments.
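The two filtering heuristics in item 3 can be sketched roughly as follows. This is a simplified illustration based on standard prefix-filtering reasoning, not the paper's code. It assumes tokens in every block are sorted by a shared global order (rarest first). It also assumes two blocks count as clones when they share at least ceil(theta * max(len(a), len(b))) tokens. All function names, the sample blocks, and the 0.8 threshold are hypothetical. The position bound shown uses only the first shared token, and it scans whole blocks for clarity where a real tool would work incrementally through the index.

```python
import math
from collections import Counter


def required_overlap(theta, n):
    # round() guards against float products landing a hair above an integer.
    return math.ceil(round(theta * n, 9))


def sort_by_rarity(tokens, global_freq):
    # Rarest tokens first; ties broken by token text so all blocks agree.
    return sorted(tokens, key=lambda t: (global_freq[t], t))


def sub_block(sorted_tokens, theta):
    """Prefix of length n - ceil(theta * n) + 1 of a sorted block."""
    n = len(sorted_tokens)
    return sorted_tokens[: n - required_overlap(theta, n) + 1]


def passes_sub_block_filter(a_sorted, b_sorted, theta):
    # If two blocks meet the overlap threshold, their sub-blocks must share
    # at least one token, so pairs with disjoint sub-blocks can be skipped.
    return bool(set(sub_block(a_sorted, theta)) & set(sub_block(b_sorted, theta)))


def position_upper_bound(a_sorted, b_sorted):
    """Upper bound on the overlap, from the position of the first shared token.

    Tokens before the first shared token cannot contribute to the overlap,
    so the overlap is at most the number of tokens from that point on.
    """
    shared = set(a_sorted) & set(b_sorted)
    if not shared:
        return 0
    i = next(k for k, t in enumerate(a_sorted) if t in shared)
    j = next(k for k, t in enumerate(b_sorted) if t in shared)
    return min(len(a_sorted) - i, len(b_sorted) - j)


# Hypothetical sample data: "b" renames one identifier of "a"; "c" is unrelated
# but shares a few very common tokens with "a".
a = "int sum = 0 ; for i in xs : sum += i ; return sum".split()
b = "int total = 0 ; for i in xs : total += i ; return total".split()
c = "print hello world ; return 0".split()

global_freq = Counter(a + b + c)
a_sorted, b_sorted, c_sorted = (sort_by_rarity(t, global_freq) for t in (a, b, c))

keep_ab = passes_sub_block_filter(a_sorted, b_sorted, 0.8)
keep_ac = passes_sub_block_filter(a_sorted, c_sorted, 0.8)
bound_ac = position_upper_bound(a_sorted, c_sorted)
needed_ac = required_overlap(0.8, max(len(a), len(c)))
```

Here the pair (a, b) survives the sub-block filter, while (a, c) is rejected without a full comparison because the blocks share only frequent tokens, which sort to the end and fall outside both sub-blocks. The position bound for (a, c) is also far below the required overlap, so that check alone would be enough to discard the pair.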

Implications and Future Directions

The implications of SourcererCC extend into both academic and practical aspects of software engineering. The ability to accurately detect and manage clones can lead to enhanced software maintenance, reduced bugs, and improved software quality. From a practical standpoint, the tool's scalability and efficiency offer substantial benefits for real-world applications, ranging from open-source ecosystems to industrial software management.

Looking ahead, advances in machine learning for code analysis could be used to refine SourcererCC's heuristics and improve clone detection further. Combining AI techniques with large-scale software analytics may also enable more sophisticated and automated code management systems.

Overall, SourcererCC's methodology exemplifies a significant step forward in addressing the complexities of large-scale clone detection, offering a robust framework for managing the growing needs of software ecosystems. Further research may focus on optimizing the detection of more complex Type-3 and evolving Type-4 clones, possibly through the integration of AI-driven techniques.