Elixir: Effective OO Program Repair
- The paper introduces a novel generate-and-validate approach that expands the repair-expression space using aggressive method invocation synthesis.
- ELIXIR employs an expressive repair-expression language and a machine-learned ranking model to generate, rank, and validate candidate patches.
- Experimental evaluation on Defects4J and Bugs.jar demonstrates a significant boost in correct repairs compared to traditional repair tools.
Elixir is a generate-and-validate program repair technique for object-oriented (OO) languages, specifically motivated by the critical role of method invocations (MIs) in OO program structure and bug-fixing. This approach enables the synthesis of program patches that can aggressively incorporate method calls, markedly enlarging the repair-expression space and thereby addressing classes of OO bugs often out of reach for existing techniques. The ELIXIR system uses an expressive repair-expression language and a machine-learned ranking model to effectively generate, rank, and validate candidate patches, yielding significant improvements in the repair of real-world OO software defects (Saha et al., 2021).
1. Motivation: Method Invocations in Object-Oriented Repairs
Encapsulation in OO programming locates most data and operations behind public methods, making MIs such as obj.foo(a, b) the sole avenue for state access or mutation. Empirical analyses of large Java codebases (Eclipse JDT, Platform, BIRT) demonstrate that 57% of executable statements involve at least one MI, a figure substantially higher than the 33% seen in C programs. Moreover, 77% of one-line bug-fixes in such software involve MI changes—30–40% are stand-alone MI modifications, while others are embedded within conditional or assignment fixes.
Existing generate-and-validate repair tools are systematically limited in their ability to synthesize new or overloaded MIs. They typically:
- Rely on copy-pasting existing code snippets (e.g., jGenProg) and cannot generate novel MI expressions not already in the code,
- Apply constrained templates that do not synthesize new MIs or handle overloading (e.g., PAR),
- Restrict MI handling to a curated subset of side-effect-free, parameterless methods strictly for guards (e.g., NOPOL).
The underlying reason is that unrestricted MI enumeration is computationally prohibitive, sometimes yielding hundreds or thousands of valid options per site. Prior techniques therefore heavily restrict MI-based repair and miss many real patches as a result.
2. ELIXIR Framework and Repair-Expression Space
ELIXIR extends the classic four-step generate-and-validate paradigm via two principal advances: (a) a highly expressive repair-expression language that allows method calls on equal footing with variables, fields, and constants; and (b) a machine-learnt model to score and prioritize possible fixes for validation.
2.1 Framework Overview
Given a buggy program, a test suite (with at least one failing test), and an optional bug report, ELIXIR executes the following process:
- Step A: Fault localization using SBFL (e.g., Ochiai) to identify suspicious statements.
- Step B: Program transformation schemas (T1–T8) to generate candidate patches using the repair-expression language.
- Step C: Machine-learnt scoring of candidate patches based on contextual and semantic features.
- Step D: Validation via test-suite execution, returning the first plausible patch (i.e., one passing all tests).
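Step A relies on spectrum-based fault localization. As a minimal sketch of the Ochiai formula only, the snippet below ranks statements by suspiciousness. The statement labels and coverage counts are made-up illustration data, and the function name `ochiai` is a hypothetical helper, not part of ELIXIR.

```python
import math

def ochiai(failed_cover: int, passed_cover: int, total_failed: int) -> float:
    """Ochiai suspiciousness: ef / sqrt(total_failed * (ef + ep))."""
    denom = math.sqrt(total_failed * (failed_cover + passed_cover))
    return failed_cover / denom if denom else 0.0

# Hypothetical spectrum: statement -> (failing tests covering it, passing tests covering it)
spectrum = {"s1": (1, 9), "s2": (2, 1), "s3": (0, 5)}
TOTAL_FAILED = 2  # assumed number of failing tests in the suite

# Most suspicious statements come first; these are the repair locations tried first.
ranking = sorted(
    spectrum,
    key=lambda s: ochiai(*spectrum[s], TOTAL_FAILED),
    reverse=True,
)
```

A statement covered mostly by failing tests (here `s2`) rises to the top, while one never executed by a failing test scores zero.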
Transformation Schemas
| Schema | Transformation Type | Description |
|---|---|---|
| T1 | Type widening | int→long/float/double |
| T2 | Change return expr | Replace return with another compatible expr |
| T3/T4 | Conditional guards | Null or array/collection bounds guard |
| T5 | Boolean operator mutations | Relational/infix mutations (>, >=, <, <=, ==, !=) |
| T6 | Boolean predicate adjustments | Add/remove conjuncts/disjuncts |
| T7 | MI alteration | Replace object, method, arguments, or full MI |
| T8 | Insert new MI | Synthesize and insert arbitrary well-typed MI |
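To make one schema concrete, here is a minimal sketch of a T5-style relational operator mutation. It works on a toy `(lhs, op, rhs)` representation rather than a real Java AST, and the function name `mutate_relational` is hypothetical; it illustrates the idea, not ELIXIR's implementation.

```python
from typing import List

RELATIONAL_OPS = [">", ">=", "<", "<=", "==", "!="]

def mutate_relational(lhs: str, op: str, rhs: str) -> List[str]:
    """Return every variant of `lhs op rhs` with the operator swapped for another one."""
    if op not in RELATIONAL_OPS:
        raise ValueError(f"not a relational operator: {op}")
    return [f"{lhs} {alt} {rhs}" for alt in RELATIONAL_OPS if alt != op]

# Hypothetical buggy condition: `index < list.size()`
candidates = mutate_relational("index", "<", "list.size()")
```

Each returned string is one candidate patch for the condition; a real tool would also need to type-check the operands before accepting a mutation.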
2.2 Repair-Expression Construction
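Repair-expressions in ELIXIR are built from in-scope local variables, class fields, constants, and method invocations over them.
At a target location, ELIXIR systematically enumerates all combinations of in-scope locals, class fields, and accessible methods (including overloads), and builds all well-typed MI expressions up to a single composition depth. Formally, if $V$ denotes the available variables/fields/literals and $M$ the set of method signatures (with average arity $k$), the candidate expression set consists of $V$ together with every well-typed invocation $r.m(a_1, \dots, a_k)$ with $m \in M$ and $r, a_i \in V$. Before type filtering, its size is therefore on the order of $|V| + |M| \cdot |V|^{k+1}$.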
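The enumeration described above can be sketched as follows. This is an illustration under simplifying assumptions: types are matched by exact name (no subtyping or boxing), only depth-one invocations are built, and the `Method` record, `enumerate_mis` function, and the scope contents are all hypothetical.

```python
from itertools import product
from typing import Dict, List, NamedTuple, Tuple

class Method(NamedTuple):
    receiver_type: str
    name: str
    param_types: Tuple[str, ...]
    return_type: str

def enumerate_mis(atoms: Dict[str, str], methods: List[Method], wanted_type: str) -> List[str]:
    """Build every well-typed, depth-one invocation r.m(a1, ..., an) of the wanted type."""
    by_type: Dict[str, List[str]] = {}
    for name, typ in atoms.items():
        by_type.setdefault(typ, []).append(name)
    results = []
    for m in methods:
        if m.return_type != wanted_type:
            continue  # expression would not fit the hole being repaired
        arg_choices = [by_type.get(t, []) for t in m.param_types]
        for recv in by_type.get(m.receiver_type, []):
            for args in product(*arg_choices):
                results.append(f"{recv}.{m.name}({', '.join(args)})")
    return results

# Hypothetical scope at the repair location: atom name -> static type.
atoms = {"name": "String", "other": "String", "start": "int", "end": "int"}
methods = [
    Method("String", "substring", ("int",), "String"),
    Method("String", "substring", ("int", "int"), "String"),  # overload
    Method("String", "length", (), "int"),
]
candidates = enumerate_mis(atoms, methods, "String")
```

Even this tiny scope (four atoms, three signatures) yields a dozen `String`-typed candidates, which hints at why the candidate set grows quickly in real code and why ranking is needed.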
3. Machine-Learnt Patch Ranking
Due to the combinatorial explosion of candidate patches, ELIXIR employs a lightweight machine-learned ranking model to prioritize validation of the most promising candidates.
3.1 Classification Model
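Each patch with repair-expression $e$ is scored as:
$$\mathrm{score}(e) = \sigma\left(w_0 + \sum_{i=1}^{4} w_i f_i(e)\right)$$
where $\sigma$ is the logistic function, $w_0, \dots, w_4$ are learned weights, and $(f_1(e), \dots, f_4(e))$ is a four-dimensional feature vector.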
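As a small sketch of this kind of scoring, the snippet below applies a logistic function to a weighted sum of four feature values. The weights and feature values are placeholders chosen for illustration, not the values learned in the paper, and the intercept term is an assumption of the sketch.

```python
import math
from typing import Sequence

def logistic(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def score(features: Sequence[float], weights: Sequence[float], bias: float = 0.0) -> float:
    """Logistic-regression relevance score for one candidate repair-expression."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return logistic(bias + sum(w * f for w, f in zip(weights, features)))

# Placeholder weights for (distance, context similarity, bug-report similarity, frequency).
WEIGHTS = [1.5, 2.0, 1.0, 0.5]
patch_a = [0.9, 0.6, 0.4, 1.0]  # close to the bug, shares tokens with its context
patch_b = [0.1, 0.2, 0.0, 0.0]  # far away, few shared tokens

score_a = score(patch_a, WEIGHTS)
score_b = score(patch_b, WEIGHTS)
```

Because the logistic function is monotonic, only the ordering of the weighted sums matters for ranking; the squashing to (0, 1) mainly gives the score a probability-like reading.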
3.2 Features
- Distance Score: Proximity of the repair-expression's elements to the repair location within the source.
- Contextual Similarity: Jaccard similarity of CamelCase-split tokens in the repair-expression versus the surrounding code context.
- Bug Report Similarity: Jaccard similarity of repair-expression tokens with those in the bug report (if available).
- Context Frequency: Occurrence count of variables/fields from the repair-expression within a fixed number of lines of the repair location.
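The two similarity features rest on Jaccard similarity over CamelCase-split tokens. A minimal sketch of that computation follows; the tokenizer is an assumption (it keeps alphabetic runs only and ignores digits and underscores), and the example expression and context line are invented.

```python
import re
from typing import Set

def camel_tokens(code: str) -> Set[str]:
    """Split identifiers on CamelCase boundaries and lowercase the pieces."""
    tokens = set()
    for ident in re.findall(r"[A-Za-z]+", code):
        # An acronym run (e.g. "XML") or a word with an optional leading capital.
        for part in re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", ident):
            tokens.add(part.lower())
    return tokens

def jaccard(a: Set[str], b: Set[str]) -> float:
    """|a ∩ b| / |a ∪ b|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical repair-expression and surrounding context line.
expr_tokens = camel_tokens("maxValue.getValue()")
context_tokens = camel_tokens("int maxValue = computeMax(values);")
similarity = jaccard(expr_tokens, context_tokens)
```

The same `jaccard` helper would serve for the bug-report feature by tokenizing the report text instead of the code context.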
3.3 Training Process
Training uses 1,158 one-line bug-fixes from Bugs.jar, balancing "positive" (developer-chosen) and "negative" repair-expressions (4× oversampling of positives, ≈1,580 data points). Ridge-regularized logistic regression is implemented via WEKA, with 10-fold cross-validation. At inference, patches are sorted by predicted relevance, and only the top-N are validated.
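The inference step (sort by score, validate only the best-ranked candidates, stop at the first plausible patch) can be sketched as below. The `passes_all_tests` callback stands in for compiling the patched program and running the test suite; it, the function name, and the example scores are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

def first_plausible(
    scored: List[Tuple[str, float]],
    passes_all_tests: Callable[[str], bool],
    top_n: int,
) -> Optional[str]:
    """Validate the top_n highest-scored patches; return the first that passes all tests."""
    ranked = sorted(scored, key=lambda p: p[1], reverse=True)
    for patch, _ in ranked[:top_n]:
        if passes_all_tests(patch):
            return patch
    return None  # no plausible patch within the validation budget

# Hypothetical candidates with model scores, and a stub test oracle.
scored = [("a < b", 0.2), ("a <= b", 0.9), ("a != b", 0.7)]
oracle = lambda patch: patch in {"a != b", "a < b"}

found = first_plausible(scored, oracle, top_n=2)
missed = first_plausible(scored, oracle, top_n=1)
```

With a budget of two, the second-ranked candidate is reached and returned; with a budget of one, the search gives up, which shows how ranking quality and the size of N jointly determine whether a correct patch is ever validated.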
4. Experimental Evaluation
4.1 Datasets
- Defects4J [Just et al. 2014]: Commons-Math, Commons-Lang, Joda-Time, JFreeChart. 82 single-hunk bugs selected.
- Bugs.jar: Eight major Apache projects, filtered to 1,158 single-hunk bugs (each with buggy version, unit tests, developer patch, and report).
4.2 Baselines and Metrics
ELIXIR is benchmarked against ACS, HD-Repair, NOPOL, PAR' (a re-implementation), and jGenProg, plus two ELIXIR ablations: Elixir₁ (traditional patch space, no ML) and Elixir₂ (rich patch space, random top-N selection). Each generated patch is classified as "correct" (semantically matching the developer fix) or "incorrect plausible" (passes all tests but is not equivalent to the developer fix).
4.3 Results
Correct and Incorrect Repairs on Defects4J (each cell reads correct/incorrect):
| Subject | ELIXIR | ACS | HD-Repair | NOPOL | PAR' | jGenProg |
|---|---|---|---|---|---|---|
| Commons-Math | 12/7 | 12/4 | 6/(*) | 1/20 | 2/NR | 5/13 |
| Commons-Lang | 8/4 | 3/1 | 7/(*) | 3/4 | 1/NR | 0/0 |
| Joda-Time | 2/1 | 1/0 | 1/(*) | 0/1 | 0/NR | 0/7 |
| JFreeChart | 4/3 | 2/0 | 2/(*) | 1/5 | 0/NR | 0/2 |
| Total (82) | 26/15 | 18/5 | 16/(10*) | 5/30 | 3/NR | 5/22 |
Ablation Impact
| Variant | Repair-Exprs | Selection | Correct | Incorrect |
|---|---|---|---|---|
| Elixir₁ | Traditional (ACS-like) | None (no ML) | 14 | 16 |
| Elixir₂ | Extended (ELIXIR) | Random top-N | 13 | 5 |
| Elixir | Extended | Logistic reg | 26 | 15 |
Schema Contribution
| Schema | Correct | Incorrect |
|---|---|---|
| Change in MI (T7) | 12 | 6 |
| Boolean expr change | 6 | 8 |
| New MI insertion (T8) | 3 | 0 |
| Type widening | 2 | 0 |
| Return expr change | 2 | 0 |
| Null/size guard (T3/T4) | 1 | 1 |
Results on Bugs.jar (Sampled 127 single-hunk bugs)
- ELIXIR: 22 correct / 17 incorrect
- Elixir₁: 14 correct / 16 incorrect
Relative to the Elixir₁ baseline, this reflects a roughly 85% boost in correct repairs on Defects4J (from 14 to 26) and a 57% improvement on Bugs.jar (from 14 to 22).
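The percentages follow directly from the correct-repair counts reported above. As a quick arithmetic check (the helper name `relative_gain` is just for illustration):

```python
def relative_gain(baseline: int, improved: int) -> float:
    """Percentage increase of `improved` over `baseline`."""
    return 100.0 * (improved - baseline) / baseline

# Correct repairs: Elixir1 baseline versus full ELIXIR.
defects4j_gain = relative_gain(14, 26)  # (26 - 14) / 14
bugsjar_gain = relative_gain(14, 22)    # (22 - 14) / 14
```

The Defects4J gain works out to 12/14, about 85.7%, and the Bugs.jar gain to 8/14, about 57.1%, consistent with the quoted 85% and 57% figures.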
5. Insights, Limitations, and Future Directions
The primary insight is that the expressive MI-focused repair-expression space enables ELIXIR to address entire bug classes missed by prior tools. This efficacy is contingent on the ranking system’s ability to surface correct patches among hundreds or thousands of candidates. The model’s four features—locality, code-context similarity, bug-report alignment, and usage frequency—jointly capture signals demonstrated to be effective in automated repair, code completion, and bug localization.
ELIXIR's principal limitations include its restriction to single-hunk patches, its reliance on a bug report for the bug-report similarity feature, and the simplicity of its feature set and logistic model. As the repair-expression language and ranking model are Java-specific (implemented via Spoon and ASM), generalization to other OO languages would necessitate additional grammar and AST transformation work.
Potential extensions include:
- Integration with more sophisticated machine-learning models (e.g., random forests, neural models),
- Expansion to multi-location/method repairs,
- Cross-combination with oracle-based synthesis (e.g., Angelix/NOPOL),
- Extension to other OO languages by adapting language-aware grammars.
The results suggest that a generate-and-validate approach, augmented with aggressive MI synthesis and lightweight relevance ranking, can significantly expand the class of OO bugs amenable to fully automated repair (Saha et al., 2021).