Empirical Analysis of Approximate Clones

Updated 4 February 2026

The paper provides a detailed taxonomy and empirical metrics for approximate clones by quantifying similarity thresholds and fault rates across diverse domains.
It uses token-based, AST-based, and ordinal comparison methods to analyze near-miss code fragments and near-clone electoral candidates.
The study offers practical recommendations, including IDE integration and threshold calibration, to manage clone inconsistencies and mitigate associated risks.

Approximate clones are code fragments or objects that derive from a common ancestor but differ by small modifications. They are a persistent phenomenon in multiple domains, including software engineering, scientific computing platforms, quantum programming, and even social choice theory. They go by several names: type-3 (or “near-miss” or “inconsistent”) clones in code, and nearly-adjacent alternatives in voting systems. Empirical studies spanning large-scale open source repositories, quantum circuit libraries, and real-world elections have documented their prevalence, their measurable impacts, and the management challenges they pose.

1. Formal Definitions and Taxonomy

The canonical framework for classifying clones, due to Roy and Cordy, distinguishes among four types:

Type-1 (Exact) clones: Identical code fragments modulo whitespace and comments.
Type-2 (Renamed) clones: Fragments identical except for systematic renaming of identifiers, literals, or types.
Type-3 (Approximate, "Near-miss", or "Inconsistent") clones: Copied fragments subsequently edited via statement additions, deletions, or minor reorderings, bounded by an edit distance or a thresholded similarity metric. Typical token-based similarity thresholds range from 0.7 to 0.8, e.g.,

$S(A,B) = \frac{\lvert \mathrm{Tokens}(A) \cap \mathrm{Tokens}(B) \rvert}{\max(\lvert \mathrm{Tokens}(A) \rvert, \lvert \mathrm{Tokens}(B) \rvert)} \geq 0.75 \text{ or } 0.80$

Type-4 (Semantic) clones: Syntactically divergent but semantically equivalent fragments.

Empirical research consistently operationalizes approximate clones as fragment pairs that meet a relaxed similarity threshold after normalization and tokenization, e.g., $S(A,B) \geq 0.75$ for method-level Java code (Golubev et al., 2021) or $\delta(A,B) \geq 0.80$ for Python code cells (Källén et al., 2020). In voting theory, approximate clones are candidates whose adjacency in the rankings is violated on only a small fraction of ballots, or can be restored with few swaps per voter (Delemazure, 28 Jan 2026).
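The thresholded token similarity $S(A,B)$ defined above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used by any cited tool: it assumes a naive regex tokenizer, `#`-style line comments, and bag (multiset) semantics for the intersection. The names `tokenize`, `similarity`, and `is_approximate_clone` are hypothetical.

```python
import re
from collections import Counter

# Assumed tokenizer: identifiers, integer literals, or any single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code: str) -> Counter:
    """Drop '#' line comments (naively) and return a bag of tokens."""
    stripped = [line.split("#", 1)[0] for line in code.splitlines()]
    return Counter(TOKEN_RE.findall("\n".join(stripped)))


def similarity(a: str, b: str) -> float:
    """Shared tokens divided by the token count of the larger fragment."""
    ta, tb = tokenize(a), tokenize(b)
    larger = max(sum(ta.values()), sum(tb.values()))
    if larger == 0:
        return 0.0
    shared = sum((ta & tb).values())  # multiset intersection size
    return shared / larger


def is_approximate_clone(a: str, b: str, threshold: float = 0.75) -> bool:
    """True when the pair meets the similarity threshold."""
    return similarity(a, b) >= threshold


original = "x = foo(a, b)\nreturn x + 1"
edited = "x = foo(a, b)\ny = 2\nreturn x + 1"  # one inserted statement
print(similarity(original, edited))            # 12 shared tokens out of 15
print(is_approximate_clone(original, edited))
```

Exact and renamed copies also pass this check, so a detector would normally classify Type-1 and Type-2 pairs first and label only the remainder as Type-3.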

2. Detection Methodologies and Metrics

Detection of approximate clones relies on token-based, string-based, tree-based, or ordinal similarity, with implementations adapted to granularity and context:

Software repositories: SourcererCC and CCFinderSW are standard, extracting tokens after comment/whitespace removal and comparing all candidate pairs for similarity above a set threshold (e.g., $T=0.75$ (Golubev et al., 2021), $\theta=0.70$ (Manoku et al., 11 Jan 2025)).
Notebooks: Cell-level tokens are compared via set overlap with an 80% threshold (Källén et al., 2020).
Quantum software: Abstract syntax tree extraction and difflib-based similarity, with thresholding for Type-2/3 designation (Manoku et al., 11 Jan 2025).
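Social choice: Ordinal distances between candidates $x$ and $y$ are computed from the positions $\sigma_i(x)$ and $\sigma_i(y)$ they occupy in each of the $n$ voters' rankings (Delemazure, 28 Jan 2026):
- $\alpha$-deletion distance: $D^{\rm del}_{x,y} = n^{-1} |\{i: |\sigma_i(x)-\sigma_i(y)|>1\}|$, the fraction of voters who do not rank $x$ and $y$ adjacently
- $\beta$-swap distance: $D^{\rm swap}_{x,y} = n^{-1} \sum_{i=1}^n (|\sigma_i(x)-\sigma_i(y)|-1)$, the average number of candidates ranked between $x$ and $y$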
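The two ordinal distances above can be computed directly from a ranking profile. The sketch below assumes each ballot is a complete ranking given as a list of candidate names, most preferred first, so that a candidate's position is its list index. The function names and the four-ballot profile are invented for illustration.

```python
from typing import List, Sequence


def deletion_distance(profile: Sequence[List[str]], x: str, y: str) -> float:
    """Fraction of ballots on which x and y are NOT ranked adjacently."""
    violating = sum(1 for r in profile if abs(r.index(x) - r.index(y)) > 1)
    return violating / len(profile)


def swap_distance(profile: Sequence[List[str]], x: str, y: str) -> float:
    """Average number of candidates ranked strictly between x and y."""
    gaps = sum(abs(r.index(x) - r.index(y)) - 1 for r in profile)
    return gaps / len(profile)


# Illustrative profile: a and b are adjacent on two of four ballots.
profile = [
    ["a", "b", "c", "d"],  # adjacent, gap 0
    ["b", "a", "d", "c"],  # adjacent, gap 0
    ["a", "c", "b", "d"],  # one candidate between them
    ["a", "c", "d", "b"],  # two candidates between them
]
print(deletion_distance(profile, "a", "b"))  # 2 violating ballots / 4
print(swap_distance(profile, "a", "b"))      # (0 + 0 + 1 + 2) / 4
```

Under this reading, a pair would count as approximate clones at level $\alpha$ when its deletion distance is at most $\alpha$, and analogously for $\beta$ and the swap distance.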

Prevalence, fault rates, and clone maintenance (commit and co-change behavior) are quantified with the following metrics:

Clone prevalence ratio: $RI = |IC| / |C|$ (Juergens et al., 2017)
Fault rate in type-3 clones: $\mathrm{FaultRate}_{T3} = \lvert C^{T3}_F\rvert/\lvert C^{T3}\rvert$ (Wagner et al., 2016)
Commit and co-change ratios: $f_i = n_i/T_i$, $R_{co} = C_{co} / C_{total}$ (Yokomori et al., 2024)
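As a rough illustration of how such ratios are tallied, the sketch below computes three of them from toy data. The symbol readings are assumptions, since the text above does not spell them out: $IC$ is taken to be the set of inconsistent (Type-3) clone groups within all clone groups $C$, $C^{T3}_F$ the faulty Type-3 groups, and $R_{co}$ the share of commits touching either clone of a pair that touch both. The record layout and all counts are made up.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class CloneGroup:
    inconsistent: bool  # assumed meaning: members differ beyond renaming (Type-3)
    faulty: bool        # assumed meaning: a documented fault traces to the group


def ratio(numerator: int, denominator: int) -> float:
    """Safe division that returns 0.0 for an empty denominator."""
    return numerator / denominator if denominator else 0.0


def inconsistency_ratio(groups: List[CloneGroup]) -> float:
    """RI = |IC| / |C|."""
    return ratio(sum(g.inconsistent for g in groups), len(groups))


def type3_fault_rate(groups: List[CloneGroup]) -> float:
    """Faulty Type-3 groups divided by all Type-3 groups."""
    t3 = [g for g in groups if g.inconsistent]
    return ratio(sum(g.faulty for g in t3), len(t3))


def co_change_ratio(commits: List[Set[str]], a: str, b: str) -> float:
    """Commits changing both clones divided by commits changing either."""
    touching = [c for c in commits if a in c or b in c]
    both = [c for c in touching if a in c and b in c]
    return ratio(len(both), len(touching))


groups = [CloneGroup(True, True), CloneGroup(True, False),
          CloneGroup(False, False), CloneGroup(True, False)]
commits = [{"m1", "m2"}, {"m1"}, {"m3"}, {"m1", "m2", "m4"}]
print(inconsistency_ratio(groups))          # 3 of 4 groups are inconsistent
print(type3_fault_rate(groups))             # 1 of 3 inconsistent groups is faulty
print(co_change_ratio(commits, "m1", "m2"))  # 2 of 3 relevant commits touch both
```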

3. Empirical Results in Software Systems

Large-Scale Repository Studies

Across tens of thousands of Java projects, approximately 52% of method-level clone groups are approximate (Type-3), and only 35.4% of nontrivial methods have no clone at all (Golubev et al., 2021). In data-science notebooks, 79.7% of Python code snippets are in at least one approximate clone group; nearly half of all notebooks contain no unique code snippets (Källén et al., 2020).

Fault Induction and Maintenance Dynamics

In production Java and C# systems, approximately 17% of type-3 clone groups exhibit documented faults; among unintentional inconsistencies, this figure rises to about 50% (Wagner et al., 2016, Juergens et al., 2017). Fault densities in inconsistent regions can reach 43–91 faults/kLOC, far above typical averages (Juergens et al., 2017). Nevertheless, developer clone-awareness can mitigate risk (e.g., systems with IDE-integrated clone warnings have lower fault rates in type-3 clones (Wagner et al., 2016)).

In practice, approximate clones are modified in lockstep only about 53% of the time, and 10–20% of these co-changes involve potentially inconsistent (low-similarity) edits. As a result, 35–65% of clone pairs in the latest version are "concerning," i.e., they have drifted apart or received misaligned edits (Yokomori et al., 2024). Even though approximate clones are relatively stable, change propagation is therefore imperfect.

Clone Granularity and Evolution

Method-level approximate clones are over an order of magnitude more prevalent as indicators of inter-project code reuse than file-level exact clones (Golubev et al., 2021). Only 2.3% of exact method clones are contained within exact file clones, underscoring the need for fine-grained studies.

Auto-generated methods often inflate apparent clone counts and consist mainly of near-miss (not exact) clones; filtering by code size or temporal artifact is necessary to avoid skew (Golubev et al., 2021).

4. Domain-Specific Perspectives

Quantum and Scientific Computing

In the quantum software domain (notably Qiskit-based systems), approximate (Type-2/3) clones arise with higher frequency than in classical codebases: 20.5% of repositories exhibit Type-2/3 clones, with a global code density of $\sim 8\%$ (Manoku et al., 11 Jan 2025). These clones typically reflect repeated circuit construction, minor measurement changes, or adaptation across experiments. Fragment sizes are small (average $\bar s_{T23}=26.0$ tokens). Recommendations include domain-adapted clone detection integrating AST-based circuit structure and more expressive, semantically aware metrics (Manoku et al., 11 Jan 2025).
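One way to picture AST extraction combined with difflib-based similarity is sketched below, using only Python's standard `ast` and `difflib` modules. It is an assumption-laden illustration rather than the cited study's pipeline: each fragment is flattened into its sequence of AST node type names, which discards identifiers and literal values, and the sequences are compared with `difflib.SequenceMatcher`. The default threshold reuses the $\theta=0.70$ value quoted in the detection section, and the function names are hypothetical. The Qiskit-style snippets are parsed only, never executed, so Qiskit does not need to be installed.

```python
import ast
import difflib
from typing import List


def node_types(source: str) -> List[str]:
    """Flatten a fragment into AST node type names (names and literals dropped)."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def ast_similarity(a: str, b: str) -> float:
    """SequenceMatcher ratio over the two node-type sequences."""
    matcher = difflib.SequenceMatcher(None, node_types(a), node_types(b),
                                      autojunk=False)
    return matcher.ratio()


def is_type23_candidate(a: str, b: str, theta: float = 0.70) -> bool:
    """True when structural similarity meets the assumed threshold."""
    return ast_similarity(a, b) >= theta


bell = "qc = QuantumCircuit(2)\nqc.h(0)\nqc.cx(0, 1)\n"
renamed = "circ = QuantumCircuit(2)\ncirc.h(0)\ncirc.cx(0, 1)\n"
measured = renamed + "circ.measure_all()\n"

print(ast_similarity(bell, renamed))   # identical structure after renaming
print(ast_similarity(bell, measured))  # high, but below 1: one added statement
```

Because node type names ignore which gate is applied, this comparison cannot tell `qc.h(0)` from `qc.x(0)`, which is one example of the shallow-similarity limitation that motivates more semantically aware metrics.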

In quantum information itself, approximate clones are subject to a physical limit: quantum telecloning circuits can produce $M=9$ universal symmetric clones per input with theoretical mean fidelity $F_\mathrm{opt}=19/27 \approx 0.7037$; actual device realizations currently yield $\bar F \approx 0.59$ due to hardware error, establishing both a fundamental bound and a benchmark for NISQ platforms (Pelofske et al., 2022).
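The quoted optimum agrees with the standard fidelity bound for universal symmetric $1 \to M$ cloning of a single qubit. The general formula is not stated in the text above and is supplied here as background:

$F_\mathrm{opt}(1 \to M) = \frac{2M+1}{3M}, \qquad F_\mathrm{opt}(1 \to 9) = \frac{2 \cdot 9 + 1}{3 \cdot 9} = \frac{19}{27} \approx 0.7037$

The same expression gives $5/6$ for $M=2$ and tends to $2/3$ as $M \to \infty$, so the reported device value $\bar F \approx 0.59$ falls below even the many-clone limit.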

Social Choice and Elections

In elections, approximate clones are candidates who are nearly, but not perfectly, adjacent in most voters' rankings. Empirical analyses reveal that even a 10–20% deviation from perfect adjacency undermines the "independence of clones" property for many voting rules (e.g., IRV, Ranked Pairs), with strong clone-independence rates dropping from 100% to 80–90% as $\alpha$ (the fraction of violating voters) rises from 0 to 0.2 (Delemazure, 28 Jan 2026). In structured domains (e.g., figure-skating judging), approximate clones are common and can affect outcomes under voting rules whose robustness has been established only for perfect clones.

5. Tools, Management, and Practical Implications

Detection of approximate clones requires threshold calibration and careful handling of both false positives and negatives. Practice-oriented recommendations include:

Integrating lightweight, clone-aware warnings at commit time, utilizing patch-similarity metrics or ML models trained on prior co-change patterns (Yokomori et al., 2024).
Embedding clone-detection signals into IDE workflows to highlight cross-clone change dependencies (Wagner et al., 2016).
Adapting clone-detection thresholds and features to domain semantic needs, e.g., AST-based comparison in quantum circuits (Manoku et al., 11 Jan 2025).
Systematic refactoring of high-frequency approximate code clones into shared utilities, especially in exploratory scientific notebook environments (Källén et al., 2020).
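Threshold calibration, mentioned at the start of this section, can be pictured as a sweep over candidate thresholds against a small hand-labeled sample of fragment pairs. The sketch below is illustrative only: the labeled scores are invented, and `precision_recall` and `sweep` are hypothetical helpers, not part of any cited tool.

```python
from typing import Dict, Iterable, List, Tuple

ScoredPair = Tuple[float, bool]  # (similarity score, manually judged true clone?)


def precision_recall(pairs: List[ScoredPair], threshold: float) -> Tuple[float, float]:
    """Precision and recall of 'score >= threshold' as a clone predictor."""
    tp = sum(1 for s, y in pairs if s >= threshold and y)
    fp = sum(1 for s, y in pairs if s >= threshold and not y)
    fn = sum(1 for s, y in pairs if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def sweep(pairs: Iterable[ScoredPair],
          thresholds: Iterable[float] = (0.70, 0.75, 0.80)) -> Dict[float, Tuple[float, float]]:
    """Evaluate each candidate threshold on the same labeled sample."""
    sample = list(pairs)
    return {t: precision_recall(sample, t) for t in thresholds}


# Made-up labeled sample for illustration.
labeled = [(0.95, True), (0.82, True), (0.78, False),
           (0.76, True), (0.72, False), (0.60, False)]
for threshold, (p, r) in sweep(labeled).items():
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy sample, raising the threshold removes false positives at the cost of missing one true clone, which is the trade-off that calibration has to settle per domain.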

For social choice, the erosion of clone-independence under approximate cloning underscores the need for rule innovation or supplementary safeguards when near-duplicate candidates may be present (Delemazure, 28 Jan 2026).

6. Outstanding Challenges and Research Directions

Open questions and threats to validity pervade the empirical study of approximate clones:

Sensitivity of detection algorithms to similarity thresholds ($T$, $\theta$) and edit-distance cutoffs; optimal parameter selection remains context-dependent (Wagner et al., 2016, Juergens et al., 2017).
The generalizability of conclusions across programming languages, problem domains, and software maturity levels (Wagner et al., 2016, Manoku et al., 11 Jan 2025).
Effectiveness of proposed management tooling in reducing developer cognitive load without introducing excessive interruptions (Wagner et al., 2016).
Domain-specific semantic nuance, especially in quantum and scientific programming, where shallow textual similarity may not capture functional equivalence or meaningful divergence (Manoku et al., 11 Jan 2025).
In social choice, the search for voting rules that are robust to near—but not perfect—clones, and the operationalization of clone-proximity as a design axis for electoral systems (Delemazure, 28 Jan 2026).

In all surveyed domains, the empirical study of approximate clones reveals their ubiquity, measurable impact on reliability and maintainability, and the necessity for specialized tools and theoretical refinements to manage their consequences effectively.