BugLens: Compiler Bug Deduplication
- BugLens is a deduplication methodology that leverages bisection to pinpoint failure-inducing commits in compiler testing.
- It augments commit localization with optimization-trigger identification to outperform traditional techniques like Tamer and D3.
- Its low overhead and general applicability make BugLens a practical tool for reducing manual triage in large-scale, random compiler testing.
BugLens is a deduplication methodology designed to address the persistent problem of duplicate bug reports generated by random compiler testing. The approach is distinguished by its reliance on bisection, specifically the localization of failure-inducing commits in the compiler's version-controlled history (e.g., git), as the principal criterion for deduplication. To minimize false negatives, BugLens augments this process with lightweight identification of bug-triggering optimization passes, forming a combined metric that outperforms established program-analysis-based methods such as Tamer and D3 in empirical studies. The method is notable for its simplicity, low overhead, and high generalizability across compiler platforms.
1. Deduplication in Compiler Testing: Problem Context
Random testing is widely adopted in compiler validation and produces large numbers of test programs. Many of these programs trigger the same underlying compiler defect but appear as distinct failing test cases because of their syntactic diversity. This proliferation of duplicate reports, especially for miscompilation bugs, leads to considerable redundant manual effort in identifying and triaging unique bugs. Prior approaches to deduplication, such as feature extraction, static analysis, or runtime coverage tracing, usually incur high computational costs and are often tailored to specific compiler backends or to minimized test input formats.
BugLens is motivated by the need for a more practical and broadly applicable deduplication approach that operates robustly on large-scale, real-world compiler test data.
2. Methodology: Bisection-Based and Augmented Deduplication
BugLens centers on the use of bisection to localize each bug-triggering test program to its earliest failure-inducing commit:
- Bisection range initialization: For each test program, a “known-good” (where the bug is not present) and a “known-buggy” version are specified as the bounds for bisection.
- Commit localization: The built-in `git bisect` process is used to automate binary search within this range, identifying the commit $c_i$ that first induces the failure for program $p_i$.
- Distance computation: For any pair of test programs $p_i, p_j$, the distance employed for deduplication is:
$$ d_{\text{commit}}(p_i, p_j) = \left| \tau(c_i) - \tau(c_j) \right| $$
where $\tau(c)$ is the commit timestamp (or a comparable index).
If two failures are localized to different commits, they are presumed to be distinct bugs. This method substantially reduces the reliance on computationally intensive program analysis, enabling deduplication in a lightweight, scalable manner.
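To make the commit-localization step concrete, the following sketch drives `git bisect run` from Python for a single test program. It is a minimal illustration under stated assumptions, not the authors' implementation: the repository path, the helper script `check_bug.sh`, and the output parsing are hypothetical.

```python
import subprocess

def git(*args, repo="compiler-src"):
    """Run a git command in the compiler repository and return its stdout."""
    return subprocess.run(["git", *args], cwd=repo, check=True,
                          capture_output=True, text=True).stdout.strip()

def localize_commit(test_program, good_rev, bad_rev, repo="compiler-src"):
    """Return the first failure-inducing commit for one test program.

    good_rev: a revision known not to exhibit the bug ("known-good").
    bad_rev:  a revision known to exhibit the bug ("known-buggy").
    """
    git("bisect", "start", bad_rev, good_rev, repo=repo)
    try:
        # check_bug.sh is a hypothetical script that builds the compiler at the
        # checked-out revision, compiles/runs `test_program`, and exits 0 if the
        # bug is absent and 1 if it is present.
        out = git("bisect", "run", "./check_bug.sh", test_program, repo=repo)
    finally:
        git("bisect", "reset", repo=repo)
    # `git bisect run` reports "<sha> is the first bad commit" on success.
    for line in out.splitlines():
        if line.endswith("is the first bad commit"):
            return line.split()[0]
    return None
```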
To address the limitation that a single commit may introduce multiple distinct bugs (potential false negatives), BugLens incorporates an optimization-level refinement, sketched in code after this list:
- Using delta debugging, it derives a binary vector $O_i \in \{0,1\}^m$ for each test $p_i$, indicating which optimization passes are essential to reproduce the bug.
- A refined metric between two tests, $d_{\text{opt}}(p_i, p_j)$, is defined as:
$$ d_{\text{opt}}(p_i, p_j) = \sum_{k=1}^{m} \left| O_i[k] - O_j[k] \right| $$
- The combined deduplication metric is:
$$ d(p_i, p_j) = d_{\text{commit}}(p_i, p_j) + d_{\text{opt}}(p_i, p_j) $$
where $m$ is the number of optimization passes and $O_i[k]$, $O_j[k]$ are the vector components.
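A minimal sketch of this refinement follows. It assumes a fixed, ordered list of optimization flags, a one-flag-at-a-time ablation in place of full delta debugging, and an unweighted sum of the two distance components; the flag list and the `bug_reproduces` oracle are illustrative, not the paper's exact formulation.

```python
OPT_FLAGS = ["-ftree-vrp", "-fgcse", "-finline-functions", "-funroll-loops"]  # illustrative subset

def essential_flag_vector(test_program, bug_reproduces):
    """Return a 0/1 vector marking which optimization flags are needed to reproduce the bug.

    bug_reproduces(program, flags) -> bool is a caller-supplied oracle that compiles
    `program` with `flags` and checks for the miscompilation. A flag is marked
    essential (1) if dropping it makes the bug disappear; real delta debugging
    would minimize the flag set more systematically.
    """
    vector = []
    for i, _flag in enumerate(OPT_FLAGS):
        reduced = OPT_FLAGS[:i] + OPT_FLAGS[i + 1:]
        vector.append(1 if not bug_reproduces(test_program, reduced) else 0)
    return vector

def opt_distance(v_a, v_b):
    """Hamming distance between two essential-flag vectors."""
    return sum(abs(a - b) for a, b in zip(v_a, v_b))

def combined_distance(commit_ts_a, commit_ts_b, v_a, v_b):
    """Combined metric: commit (timestamp) distance plus optimization distance.

    Tests bisected to different commits get a nonzero commit distance; tests
    sharing a commit are separated only if their flag vectors differ.
    """
    return abs(commit_ts_a - commit_ts_b) + opt_distance(v_a, v_b)
```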
The Furthest-Point-First (FPF) algorithm is then used to prioritize triage, surfacing test programs most likely to reveal unique bugs early.
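A furthest-point-first ordering over the combined distance can be sketched as follows; the greedy loop below is the standard FPF scheme, with the distance function passed in as a parameter rather than taken from any specific BugLens API.

```python
def fpf_order(tests, distance):
    """Order test programs so that each next pick is furthest from all picks so far.

    tests:    list of test identifiers.
    distance: callable (t_a, t_b) -> float, e.g. the combined metric above.
    Triaging in this order tends to surface tests for distinct bugs early.
    """
    if not tests:
        return []
    order = [tests[0]]                      # arbitrary seed
    remaining = set(tests[1:])
    # Track each remaining test's distance to its nearest already-selected test.
    nearest = {t: distance(t, order[0]) for t in remaining}
    while remaining:
        pick = max(remaining, key=lambda t: nearest[t])
        remaining.remove(pick)
        order.append(pick)
        for t in remaining:
            nearest[t] = min(nearest[t], distance(t, pick))
    return order
```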
3. Empirical Evaluation and Results
BugLens was evaluated on four real-world datasets derived from historical GCC and LLVM compiler test corpora, focusing on miscompilation bugs, which often evade redundancy detection via signature matching or crash fingerprints.
Key empirical findings include:
- On the largest dataset (GCC-4.3.0), BugLens required, on average, examining 175.82 test programs to discover all 29 unique bugs, compared to 489.83 for Tamer and 453.21 for D3—a reduction of 64.11% and 61.21% in effort, respectively.
- On average, across all datasets, BugLens saves 26.98% and 9.64% human effort over Tamer and D3, respectively.
- The improvement in “wasted effort” (number of triaged test programs per new unique bug) is statistically significant.
- BugLens remains effective even without test input minimization, whereas Tamer and D3 are either inapplicable to unminimized programs or suffer substantial degradation on them.
| Dataset    | Test Programs | Unique Bugs | BugLens Effort | Tamer Effort | D3 Effort |
|------------|---------------|-------------|----------------|--------------|-----------|
| GCC-4.3.0  | 1,235         | 29          | 175.82         | 489.83       | 453.21    |
| GCC-4.4.0  | 647           | 11          | ...            | ...          | ...       |
| LLVM-2.8.0 | 80            | 5           | ...            | ...          | ...       |
4. Comparison with State-of-the-Art Techniques
BugLens distinguishes itself from Tamer (coverage-based grouping) and D3 (analysis of static, optimization, and runtime features) in several dimensions:
- Principle: Bisection and optimization-trigger mining versus heavyweight code coverage or static analysis.
- Implementation: Requires only standard version control (e.g., git) and compiler build infrastructure.
- Prerequisites: No requirement for test input minimization, code coverage instrumentation, or language-specific static analysis frameworks.
- Practicality: Shown to work robustly out-of-the-box on large, unminimized test corpora.
BugLens’s use of bisection as the primary deduplication signal exploits an existing, computationally efficient debugging mechanism, thereby enabling automatic, domain-independent deduplication.
5. Generalizability and Practical Considerations
BugLens is applicable to any scenario where:
- The target compiler is maintained in a version-controlled repository.
- Test programs, whether minimized or not, can be replayed against historical compiler versions.
- There is a need to deduplicate test failures across diverse sources (independent fuzzers, differing random seeds, or broad test space explorations).
The method does not assume or require language-specific tools or code coverage tracing. It maintains robust performance in real-world usage without mandatory test minimization, enabling rapid triage. The cost per test case is typically low (0.19 to 5.23 compiler builds per case), especially with caching and reuse of pre-built artifacts.
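Much of the per-test cost comes from rebuilding the compiler at revisions visited during bisection, so reusing builds across tests matters. The memoization sketch below is a plausible way to amortize that cost; the cache directory name and the `build_compiler` callback are assumptions, not part of BugLens itself.

```python
import os
import shutil

BUILD_CACHE = "build-cache"  # hypothetical directory of per-commit compiler builds

def cached_build(sha, build_compiler):
    """Return a path to the compiler built at commit `sha`, building it only once.

    build_compiler(sha) -> path is a caller-supplied function that checks out and
    builds the compiler, returning its install directory. Since many test programs
    bisect through overlapping commit ranges, caching builds keyed by commit hash
    amortizes the dominant cost across the whole corpus.
    """
    target = os.path.join(BUILD_CACHE, sha)
    if not os.path.isdir(target):
        built = build_compiler(sha)
        shutil.copytree(built, target)
    return target
```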
A plausible implication is that BugLens's efficiency in triaging reduces developer burden and accelerates feedback cycles in automated compiler testing infrastructures.
6. Limitations and Prospects
Known limitations include:
- Version history requirement: BugLens is only usable with compilers developed under a VCS, which may not be the case for some legacy or proprietary compilers.
- False negatives: When distinct bugs are co-introduced in a single commit, bisection alone cannot distinguish them—necessitating the optimization-level refinement step.
- False positives: In rare cases, distinct test programs may be misclassified as duplicates if sharing the same commit/optimization signature.
- Evaluated domains: All datasets come from open-source compilers (GCC, LLVM); applicability outside this context awaits further study.
The authors advise a two-stage workflow in practice: first, apply BugLens for initial deduplication, then minimize only those test programs selected for reporting, followed by post-fix filtering to remove now-redundant cases.
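That workflow can be expressed roughly as the sketch below; `buglens_dedup`, `minimize`, and `still_fails_after_fix` are placeholders for the deduplication, test-case reduction, and post-fix re-checking steps rather than real APIs.

```python
def triage_workflow(test_programs, buglens_dedup, minimize, still_fails_after_fix):
    """Two-stage triage: deduplicate first, minimize only what will be reported.

    buglens_dedup(tests)      -> representative tests, one per suspected bug.
    minimize(test)            -> a reduced test program suitable for a bug report.
    still_fails_after_fix(t)  -> True if `t` still fails once earlier fixes land.
    """
    representatives = buglens_dedup(test_programs)      # stage 1: cheap deduplication
    reports = [minimize(t) for t in representatives]    # stage 2: minimize only these
    # Post-fix filtering: drop reports made redundant by fixes for already-filed bugs.
    return [r for r in reports if still_fails_after_fix(r)]
```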
7. Summary Table: BugLens vs. Prior Work
| Feature                    | Tamer/D3                       | BugLens                                      |
|----------------------------|--------------------------------|----------------------------------------------|
| Principle                  | Coverage/static analysis       | Bisection + optimization-trigger detection   |
| Minimization required      | Yes                            | No                                           |
| Generality                 | Language/toolchain-specific    | Tool-agnostic (needs only VCS + compiler)    |
| Triage effort (GCC-4.3.0)  | High (~450–490 tests examined) | Low (~176 tests examined)                    |
| Practicality               | High overhead                  | Low overhead, fits existing workflows        |
Conclusion
BugLens advances compiler bug deduplication by leveraging standard bisection as a deduplication axis and augmenting it with lightweight optimization-triggering analysis to curtail false negatives. This methodology enables effective, efficient, and broadly applicable deduplication across compiler testing workflows, significantly reducing manual effort compared to prior art predicated on code analysis or coverage. Its findings advocate for the adoption of simple, version-control-based techniques as the default baseline for large-scale and domain-general bug deduplication in automated program testing.