Elixir: Effective OO Program Repair

Updated 17 March 2026

The paper introduces a novel generate-and-validate approach that expands the repair-expression space using aggressive method invocation synthesis.
ELIXIR employs an expressive repair-expression language and a machine-learned ranking model to generate, rank, and validate candidate patches.
Experimental evaluation on Defects4J and Bugs.jar demonstrates a significant boost in correct repairs compared to traditional repair tools.

Elixir is a generate-and-validate program repair technique for object-oriented (OO) languages, specifically motivated by the critical role of method invocations (MIs) in OO program structure and bug-fixing. This approach enables the synthesis of program patches that can aggressively incorporate method calls, markedly enlarging the repair-expression space and thereby addressing classes of OO bugs often out of reach for existing techniques. The ELIXIR system uses an expressive repair-expression language and a machine-learned ranking model to effectively generate, rank, and validate candidate patches, yielding significant improvements in the repair of real-world OO software defects (Saha et al., 2021).

1. Motivation: Method Invocations in Object-Oriented Repairs

Encapsulation in OO programming locates most data and operations behind public methods, making MIs such as obj.foo(a, b) the sole avenue for state access or mutation. Empirical analyses of large Java codebases (Eclipse JDT, Platform, BIRT) demonstrate that 57% of executable statements involve at least one MI, a figure substantially higher than the 33% seen in C programs. Moreover, 77% of one-line bug-fixes in such software involve MI changes—30–40% are stand-alone MI modifications, while others are embedded within conditional or assignment fixes.

Existing generate-and-validate repair tools are systematically limited in their ability to synthesize new or overloaded MIs. They typically:

Rely on copy-pasting existing code snippets (e.g., jGenProg) and cannot generate novel MI expressions not already in the code,
Apply constrained templates that do not synthesize new MIs or handle overloading (e.g., PAR),
Restrict MI handling to a curated subset of side-effect-free, parameterless methods strictly for guards (e.g., NOPOL).

The underlying obstacle is that unrestricted MI enumeration is computationally prohibitive, sometimes yielding hundreds or thousands of valid options per site. Prior techniques therefore restrict MI-based repair heavily and miss many real patches.

2. ELIXIR Framework and Repair-Expression Space

ELIXIR extends the classic four-step generate-and-validate paradigm via two principal advances: (a) a highly expressive repair-expression language that allows method calls on equal footing with variables, fields, and constants; and (b) a machine-learnt model to score and prioritize possible fixes for validation.

2.1 Framework Overview

Given a buggy program $P$, a test suite $T$ (with at least one failing test), and an optional bug report $R$, ELIXIR executes the following process:

Step A: Fault localization using SBFL (e.g., Ochiai) to identify suspicious statements.
Step B: Program transformation schemas ($\mathrm{T}_1 \ldots \mathrm{T}_8$) to generate candidate patches using the repair-expression language.
Step C: Machine-learnt scoring of candidate patches based on contextual and semantic features.
Step D: Validation via test-suite execution, returning the first plausible patch (i.e., one passing all tests).
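
As an illustration of Step A, the Ochiai suspiciousness of a statement is the number of failing tests covering it divided by the square root of (total failing tests × tests covering it). The sketch below computes this from per-statement coverage counts. The coverage dictionary and statement identifiers are hypothetical, and this is not ELIXIR's actual fault-localization code.

```python
import math

def ochiai(failed_cover, passed_cover, total_failed):
    """Ochiai score: failed_cover / sqrt(total_failed * (failed_cover + passed_cover))."""
    denom = math.sqrt(total_failed * (failed_cover + passed_cover))
    return failed_cover / denom if denom > 0 else 0.0

def rank_statements(coverage, total_failed):
    """coverage maps a statement id to (failing tests covering it, passing tests covering it)."""
    scores = {stmt: ochiai(ef, ep, total_failed) for stmt, (ef, ep) in coverage.items()}
    # Most suspicious statements first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical coverage data for a run with 2 failing tests.
coverage = {"Foo.java:10": (2, 0), "Foo.java:11": (2, 6), "Foo.java:20": (0, 8)}
ranking = rank_statements(coverage, total_failed=2)
```

Here the statement executed only by failing tests ranks first, and a statement never executed by a failing test scores zero.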

Transformation Schemas

Schema	Transformation Type	Description
T1	Type widening	int→long/float/double
T2	Change return expr	Replace return with another compatible expr
T3/T4	Conditional guards	Null or array/collection bounds guard
T5	Boolean operator mutations	Relational/infix mutations (>, $\geq$, <, $\leq$, ==, !=)
T6	Boolean predicate adjustments	Add/remove conjuncts/disjuncts
T7	MI alteration	Replace object, method, arguments, or full MI
T8	Insert new MI	Synthesize and insert arbitrary well-typed MI
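
To make one schema concrete, the following sketch mimics T5 by replacing a relational operator with each alternative. It works on strings purely for illustration, whereas a real implementation would rewrite AST nodes. The function name and the example expression are assumptions.

```python
RELATIONAL_OPS = [">", ">=", "<", "<=", "==", "!="]

def mutate_relational(lhs, op, rhs):
    """Return candidate expressions with `op` replaced by every other relational operator."""
    if op not in RELATIONAL_OPS:
        raise ValueError(f"not a relational operator: {op}")
    return [f"{lhs} {alt} {rhs}" for alt in RELATIONAL_OPS if alt != op]

# Hypothetical buggy condition: `index < list.size()`.
candidates = mutate_relational("index", "<", "list.size()")
```

Each relational expression yields five candidates, so this schema contributes little to the search space compared with the MI schemas T7 and T8.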

2.2 Repair-Expression Construction

Repair-expressions in ELIXIR follow the grammar:

$literal ::= boolean \mid number \mid null$
$variable ::= id$
$field ::= id.id$
$array ::= id[expression]$
$methodInvocation ::= id(args) \mid id.id(args)$

At a target location, ELIXIR systematically enumerates in-scope locals, class fields, and accessible methods (including overloads), and builds all well-typed MI expressions up to a single composition depth. Formally, if $V$ denotes the available variables, fields, and literals and $M$ the set of method signatures (with average arity $a$), the candidate expression set is $RE = V \cup \{f(e_1, \ldots, e_n) \mid f \in M,\ n = \text{arity}(f),\ (e_1, \ldots, e_n) \in V^n,\ \text{type-match}\}$, and $|RE| \approx v + m \cdot v^a$ for $v = |V|$, $m = |M|$. The estimate counts all argument tuples, so type matching can only shrink the actual set.
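
The enumeration can be sketched as follows. This is a minimal illustration that assumes exact type equality (no subtyping, boxing, or receiver objects) and uses hypothetical atoms and method signatures. It is not ELIXIR's implementation.

```python
from itertools import product

def enumerate_repair_exprs(atoms, methods):
    """atoms: expression -> type; methods: list of (name, param_types, return_type).
    Returns expression -> type for all atoms plus depth-1 well-typed calls."""
    exprs = dict(atoms)
    by_type = {}
    for name, typ in atoms.items():
        by_type.setdefault(typ, []).append(name)
    for name, params, ret in methods:
        # One pool of type-compatible atoms per parameter position.
        pools = [by_type.get(p, []) for p in params]
        for args in product(*pools):
            exprs[f"{name}({', '.join(args)})"] = ret
    return exprs

# Hypothetical scope: two int locals, one String local, one int literal.
atoms = {"i": "int", "j": "int", "s": "String", "0": "int"}
methods = [
    ("max", ("int", "int"), "int"),
    ("length", ("String",), "int"),
    ("now", (), "long"),
]
exprs = enumerate_repair_exprs(atoms, methods)
```

With 4 atoms and 3 methods this yields 15 expressions: 4 atoms, 9 calls to `max` (3 int atoms in each of 2 positions), 1 call to `length`, and 1 call to `now`. The quadratic growth for the two-argument method shows why ranking is needed once scopes contain dozens of atoms.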

3. Machine-Learnt Patch Ranking

Due to the combinatorial explosion of candidate patches, ELIXIR employs a lightweight machine-learned ranking model to prioritize validation of the most promising candidates.

3.1 Classification Model

Each patch $p$ with repair-expression $r$ is scored as:

$score(p \mid loc, R) = \sigma(w \cdot \phi(p, loc, R))$

where $\sigma(t) = 1/(1+e^{-t})$ is the logistic function, $w \in \mathbb{R}^4$ are learned weights, and $\phi$ is a four-dimensional feature vector.
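
A minimal sketch of this scoring function is shown below. The weight and feature values are made up for illustration, since the learned weights are not given here.

```python
import math

def sigmoid(t):
    """Numerically stable logistic function 1 / (1 + e^{-t})."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def score(weights, features):
    """score = sigma(w . phi) for equally sized weight and feature vectors."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    return sigmoid(sum(w * f for w, f in zip(weights, features)))

# Hypothetical weights and feature vectors (phi_1 .. phi_4) for two patches.
w = [1.5, 2.0, 0.5, 0.8]
patch_a = [0.9, 0.6, 0.0, 0.5]
patch_b = [0.1, 0.2, 0.0, 0.0]
score_a, score_b = score(w, patch_a), score(w, patch_b)
```

Because the logistic function is monotonic, ranking by `score` is equivalent to ranking by the raw dot product; the sigmoid only maps the result into (0, 1).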

3.2 Features

$\phi_1$ (Distance Score): Proximity of $r$’s elements to $loc$ within the source.
$\phi_2$ (Contextual Similarity): Jaccard similarity of CamelCase-split tokens in $r$ versus code context.
$\phi_3$ (Bug Report Similarity): Jaccard similarity of repair-expression tokens with those in $R$ (if available).
$\phi_4$ (Context Frequency): Occurrence count of variables/fields from $r$ within $\pm3$ lines of $loc$.
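
The token-based features can be approximated as follows. This sketch covers the Jaccard similarity used by $\phi_2$ and $\phi_3$ and the occurrence count of $\phi_4$; the tokenization rules, function names, and example strings are assumptions, and the distance score $\phi_1$ is omitted.

```python
import re

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
CAMEL_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")

def camel_tokens(text):
    """Split every identifier in `text` into lower-cased CamelCase sub-tokens."""
    tokens = set()
    for ident in IDENT.findall(text):
        for part in CAMEL_PART.findall(ident):
            tokens.add(part.lower())
    return tokens

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two token sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def context_frequency(names, context_lines):
    """Count whole-identifier occurrences of `names` in the given context window."""
    idents = IDENT.findall("\n".join(context_lines))
    return sum(1 for ident in idents if ident in names)

# Hypothetical repair expression and surrounding code.
repair_expr = "list.getMaxValue()"
context = ["int maxValue = list.size();", "return list;"]
similarity = jaccard(camel_tokens(repair_expr), camel_tokens(context[0]))
frequency = context_frequency({"list"}, context)
```

The same `jaccard` call applies to a bug report by tokenizing its text instead of the code context. The caller is responsible for passing only the lines within the desired window around the target location.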

3.3 Training Process

Training uses 1,158 one-line bug-fixes from Bugs.jar, balancing “positive” (developer-chosen) and “negative” repair-expressions (4× oversampling positives, ≈1,580 data points). Ridge-regularized logistic regression is implemented via WEKA, with 10-fold cross-validation. At inference, patches are sorted by predicted relevance, and the top $N = 50$ are validated.
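
The inference-time procedure, sorting by score, keeping the top $N$, and returning the first candidate that passes the tests, can be sketched as below. The candidate names, scores, and the `passes_tests` stub are hypothetical stand-ins for real patches and a real test run.

```python
def first_plausible_patch(candidates, score_fn, passes_tests, top_n=50):
    """Validate the top_n highest-scoring candidates in order.
    Returns the first one that passes the test suite, or None."""
    ranked = sorted(candidates, key=score_fn, reverse=True)
    for patch in ranked[:top_n]:
        if passes_tests(patch):
            return patch
    return None

# Hypothetical candidates with made-up scores; "b" and "d" pass the tests.
scores = {"a": 0.9, "b": 0.4, "c": 0.7, "d": 0.8}
plausible = {"b", "d"}
chosen = first_plausible_patch(scores, scores.get, plausible.__contains__)
too_narrow = first_plausible_patch(scores, scores.get, plausible.__contains__, top_n=1)
```

The cut-off trades validation cost against recall: with `top_n=1` only the highest-scoring candidate is tried, so a plausible patch ranked second is never validated.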

4. Experimental Evaluation

4.1 Datasets

Defects4J [Just et al. 2014]: Commons-Math, Commons-Lang, Joda-Time, JFreeChart. 82 single-hunk bugs selected.
Bugs.jar: Eight major Apache projects, filtered to 1,158 single-hunk bugs (each with buggy version, unit tests, developer patch, and report).

4.2 Baselines and Metrics

ELIXIR is benchmarked against ACS, HD-Repair, NOPOL, PAR’ (a re-implementation of PAR), jGenProg, and two ELIXIR ablations: Elixir₁ (traditional patch space, no ML) and Elixir₂ (rich patch space, random top-N selection). Patches are classified as “correct” (semantically equivalent to the developer fix) or “incorrect plausible” (passing all tests but not equivalent).

4.3 Results

Correct and Incorrect Repairs on Defects4J (each cell reads correct/incorrect):

Subject	ELIXIR	ACS	HD-Repair	NOPOL	PAR'	jGenProg
Commons-Math	12/7	12/4	6/(*)	1/20	2/NR	5/13
Commons-Lang	8/4	3/1	7/(*)	3/4	1/NR	0/0
Joda-Time	2/1	1/0	1/(*)	0/1	0/NR	0/7
JFreeChart	4/3	2/0	2/(*)	1/5	0/NR	0/2
Total (82)	26/15	18/5	16/(10*)	5/30	3/NR	5/22

Ablation Impact

Variant	Repair-Exprs	Selection	Correct	Incorrect
Elixir₁	Traditional (ACS-like)	None (no ML)	14	16
Elixir₂	Extended (ELIXIR)	Random top-N	13	5
Elixir	Extended	Logistic reg	26	15

Schema Contribution

Schema	Correct	Incorrect
Change in MI (T7)	12	6
Boolean expr change	6	8
New MI insertion (T8)	3	0
Type widening	2	0
Return expr change	2	0
Null/size guard (T3/T4)	1	1

Results on Bugs.jar (Sampled 127 single-hunk bugs)

ELIXIR: 22 correct / 17 incorrect
Elixir₁: 14 correct / 16 incorrect

Relative to the Elixir₁ baseline, this reflects a boost of roughly 85% in correct repairs on Defects4J (from 14 to 26, i.e. 12/14 ≈ 0.857) and a 57% improvement on Bugs.jar (from 14 to 22).
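
The percentages follow directly from the counts in the tables above. The short check below recomputes them; the precision figures are derived here from the reported correct/incorrect counts and are not quoted from the paper.

```python
def relative_gain(baseline, improved):
    """Relative improvement of `improved` over `baseline`."""
    return (improved - baseline) / baseline

def precision(correct, incorrect):
    """Share of plausible patches that are correct."""
    return correct / (correct + incorrect)

defects4j_gain = relative_gain(14, 26)   # 12 / 14
bugsjar_gain = relative_gain(14, 22)     # 8 / 14
elixir_precision = precision(26, 15)     # ELIXIR on Defects4J
baseline_precision = precision(14, 16)   # Elixir₁ ablation
```

On these counts the full system improves both the number of correct repairs and the share of plausible patches that are correct (26 of 41 versus 14 of 30).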

5. Insights, Limitations, and Future Directions

The primary insight is that the expressive MI-focused repair-expression space enables ELIXIR to address entire bug classes missed by prior tools. This efficacy is contingent on the ranking system’s ability to surface correct patches among hundreds or thousands of candidates. The model’s four features—locality, code-context similarity, bug-report alignment, and usage frequency—jointly capture signals demonstrated to be effective in automated repair, code completion, and bug localization.

ELIXIR’s principal limitations include its restriction to single-hunk patches, reliance on a bug report for the bug-report similarity feature $\phi_3$, and the simplicity of its feature set and logistic model. As the repair-expression language and ranking model are Java-specific (implemented via Spoon and ASM), generalization to other OO languages would necessitate additional grammar and AST transformation work.

Potential extensions include:

Integration with more sophisticated machine-learning models (e.g., random forests, neural models),
Expansion to multi-location/method repairs,
Cross-combination with oracle-based synthesis (e.g., Angelix/NOPOL),
Extension to other OO languages by adapting language-aware grammars.

The results suggest that a generate-and-validate approach, augmented with aggressive MI synthesis and lightweight relevance ranking, can significantly expand the class of OO bugs amenable to fully automated repair (Saha et al., 2021).

References (1)

Elixir: Effective object-oriented program repair (2021)
