Hunk4J: Multi-Hunk Java Repair Dataset
- Hunk4J is a dataset and benchmark for multi-hunk program repair, curated from real-world Java projects with 372 bugs spanning diverse repair scenarios.
- It quantifies hunk divergence using lexical, structural, and file-level metrics while classifying patches by spatial proximity to assess repair complexity.
- Empirical insights from Hunk4J reveal challenges for LLM and agent-based APR systems, guiding improvements in repair strategies and resource management.
Hunk4J is a dataset and benchmark specializing in multi-hunk program repair, targeting defects whose developer-written patches span two or more disjoint code regions in real-world Java projects. It is designed to expose fundamental challenges and failure modes for contemporary automated program repair (APR) systems—especially LLM-based and agentic solvers—by providing structured, fine-grained annotation of patch heterogeneity, spatial distribution, and a comprehensive set of empirical repair results. Hunk4J was constructed from 372 multi-hunk bugs curated from Defects4J v2.0.1, covering 17 open-source Java projects and encompassing both lexically and semantically diverse repair scenarios. It underpins recent investigations into the behavioral limits of LLM repair, repair trajectory analytics, and divergence-aware agent architectures (Nashid et al., 4 Jun 2025, Nashid et al., 14 Nov 2025).
1. Dataset Composition and Annotation
Hunk4J is constructed from the Defects4J corpus, version 2.0.1, isolating all developer patches with at least two non-contiguous code modifications (hunks). Of the 835 available defects, 372 (44.6%) are classified as multi-hunk and included after manual triage and diff extraction. The dataset spans Java projects such as JacksonDatabind, Closure, Chart, Mockito, and Jsoup, representing domains like charting, CLI tools, JSON/XML parsing, HTML processing, utilities, and time/mocking frameworks.
The distribution of multi-hunk patches by file coverage and hunk count is as follows:
| File Scope | Hunk Count | # Bugs |
|---|---|---|
| Single-file | Two | 140 |
| Single-file | Three | 55 |
| Single-file | Four+ | 49 |
| Multi-file | Two | 37 |
| Multi-file | Three | 23 |
| Multi-file | Four+ | 68 |
| **Total** | | **372** |
Project-level characteristics include a median hunk count per bug of 3.0 (mean 3.86, max 47) and median file count per bug of 1.0 (mean 1.59, max 16).
For each bug, Hunk4J provides a structured JSON entry containing project name, bug ID, ground-truth patch diff, and granular annotation of hunks (location, source, file, enclosing method), plus enriched natural language context (issue title and summarized description). The dataset thus enables analysis of both code- and context-aware repair settings (Nashid et al., 4 Jun 2025).
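An entry of this kind can be consumed directly as JSON. The sketch below uses illustrative field names (the released dataset's exact schema may differ) to show the per-bug structure described above: project, bug ID, patch diff, per-hunk metadata, and natural-language context.

```python
import json

# Illustrative Hunk4J-style entry; field names here are assumptions that
# mirror the annotation categories described in the text, not the
# dataset's actual schema.
entry_json = """
{
  "project": "Closure",
  "bug_id": 14,
  "patch_diff": "--- a/src/... (unified diff elided)",
  "hunks": [
    {"file": "src/com/google/javascript/jscomp/A.java",
     "method": "process", "start_line": 120, "source": "..."},
    {"file": "src/com/google/javascript/jscomp/B.java",
     "method": "visit", "start_line": 45, "source": "..."}
  ],
  "issue_title": "NPE when traversing control flow",
  "issue_summary": "Summarized report text ..."
}
"""

entry = json.loads(entry_json)
files_touched = {h["file"] for h in entry["hunks"]}
print(entry["project"], entry["bug_id"],
      len(entry["hunks"]), "hunks in", len(files_touched), "files")
```

With per-hunk file and method fields available, hunk counts, file counts, and the spatial statistics reported above can all be derived with simple set operations.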
2. Hunk Divergence Metric
Central to Hunk4J is the hunk divergence metric, designed to quantify intra-patch heterogeneity and inform empirical studies on multi-hunk repair hardness.
Pairwise Divergence
For hunks $h_i$, $h_j$ in a patch, pairwise divergence combines three components:
- Lexical distance $d_{\text{lex}}(h_i, h_j)$: a normalized distance between $T_i$ and $T_j$, the token sequences of the two hunks.
- Structural distance $d_{\text{struct}}(h_i, h_j)$: depends on the AST distance between the hunks' enclosing syntactic contexts.
- File-hierarchy distance $d_{\text{file}}(h_i, h_j)$: the weight is $1$ for hunks in the same file and $2$ otherwise.
Patch-Level Divergence
For a patch $P$ with $n$ hunks, patch-level divergence averages the pairwise divergence over all $\binom{n}{2}$ hunk pairs:

$$\mathrm{Div}(P) = \frac{2}{n(n-1)} \sum_{i<j} d(h_i, h_j)$$
Empirical summary: lexical distance median 0.94 (mean 0.82), structural median 0.6 (mean 0.62), multi-file patch file distance median 0.25 (mean 0.25). Overall patch divergence spans $[0.00, 1.60]$, with 25% of patches exhibiting divergence in the upper tail of this range (Nashid et al., 4 Jun 2025).
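A minimal sketch of the patch-level computation, assuming a Jaccard distance over token sets for the lexical component and omitting the AST-based structural term (which requires parsed code); the weighting below is illustrative, not the paper's exact formulation:

```python
from itertools import combinations

def lexical_distance(tokens_a, tokens_b):
    """Jaccard distance over hunk token sets -- one plausible
    instantiation of the lexical component."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def file_distance(path_a, path_b):
    """File-level component, rescaled to [0, 1]: 0 if same file,
    1 otherwise (the text assigns weights 1 and 2)."""
    return 0.0 if path_a == path_b else 1.0

def patch_divergence(hunks, weights=(0.5, 0.5)):
    """Div(P): mean pairwise divergence over all hunk pairs.
    Each hunk is (token_list, file_path)."""
    pairs = list(combinations(hunks, 2))
    w_lex, w_file = weights
    total = 0.0
    for (ta, fa), (tb, fb) in pairs:
        total += w_lex * lexical_distance(ta, tb) + w_file * file_distance(fa, fb)
    return total / len(pairs)

hunks = [(["return", "x", ";"], "A.java"),
         (["return", "y", ";"], "A.java"),
         (["throw", "new", "Error"], "B.java")]
print(round(patch_divergence(hunks), 3))  # → 0.75
```

Averaging over pairs keeps Div(P) comparable across patches with different hunk counts, which is what makes the stratified analyses below possible.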
3. Spatial Proximity Classification
Multi-hunk patches are further categorized by spatial dispersion, formalized as a five-way proximity taxonomy reflecting the extent of the program hierarchy traversed by edits. For a patch $P$:
- Nucleus: all hunks lie in the same method.
- Cluster: all hunks lie in the same file (but not all in the same method).
- Orbit: all hunks lie in the same package (but not all in the same file).
- Sprawl and Fragment: hunks span multiple packages; the two classes are distinguished by $\ell$, the minimum length of the longest common path prefix among the hunks' file paths, with larger shared prefixes assigned to Sprawl and maximally dispersed patches to Fragment.
A patch is assigned to the most specific class whose condition all of its hunks satisfy.
The prevalence and mean divergence per class are shown below:
| Class | # Bugs | Mean Div(P) |
|---|---|---|
| Nucleus | 59 | 0.2548 |
| Cluster | 185 | 0.4280 |
| Orbit | 67 | 0.5628 |
| Sprawl | 50 | 0.6718 |
| Fragment | 11 | 0.7372 |
This classification provides an axis for stratified analysis of repair success and resource cost (Nashid et al., 4 Jun 2025).
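The taxonomy can be sketched as a classifier over hunk locations. In this sketch the package is approximated by the file's directory and the Sprawl/Fragment boundary by whether any path prefix is shared at all; both are assumptions rather than the paper's exact definitions:

```python
import os

def common_prefix_len(paths):
    """Number of leading path segments shared by all paths."""
    parts = [p.split("/") for p in paths]
    n = 0
    for segs in zip(*parts):
        if len(set(segs)) == 1:
            n += 1
        else:
            break
    return n

def proximity_class(hunks):
    """Five-way spatial classification sketch. Each hunk is
    (file_path, method_name)."""
    files = {f for f, _ in hunks}
    methods = {(f, m) for f, m in hunks}
    packages = {os.path.dirname(f) for f, _ in hunks}
    if len(methods) == 1:
        return "Nucleus"
    if len(files) == 1:
        return "Cluster"
    if len(packages) == 1:
        return "Orbit"
    # Assumed boundary: any shared directory prefix -> Sprawl, none -> Fragment.
    return "Sprawl" if common_prefix_len(list(files)) > 0 else "Fragment"

print(proximity_class([("src/a/A.java", "f"), ("src/b/B.java", "g")]))  # → Sprawl
```

Because each class strictly contains the next-tighter one, assignment reduces to testing conditions from most to least specific, as above.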
4. Benchmarks, Metrics, and Agent Evaluation
Hunk4J is integrated as a benchmark in Birch, an open-source APR evaluation platform. Birch provides reproducible infrastructure for:
- Extracting buggy programs and tests.
- Constructing standardized prompts (including NL context and failing test outputs).
- Well-formed patch application, compilation, and test execution.
- Supporting contextualized or retrieval-augmented LLM prompts (BM25, MiniLM, dense embedding-based).
- Varying scope (method/class/file/global) and feedback-driven repair loops.
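Birch's prompt-construction step can be illustrated with a toy template. The field names and template text below are assumptions for illustration, not Birch's actual format:

```python
def build_repair_prompt(entry, failing_tests, context_snippets=()):
    """Assemble a repair prompt from a Hunk4J-style entry plus failing-test
    output, optionally extended with retrieved context snippets."""
    parts = [
        f"Project: {entry['project']}  Bug: {entry['bug_id']}",
        f"Issue: {entry['issue_title']}",
        entry["issue_summary"],
        "Failing tests:",
        *(f"  - {t}" for t in failing_tests),
    ]
    if context_snippets:  # retrieval-augmented variant (e.g., BM25 hits)
        parts.append("Retrieved context:")
        parts.extend(context_snippets)
    parts.append("Produce a unified diff that fixes all affected hunks.")
    return "\n".join(parts)

prompt = build_repair_prompt(
    {"project": "Chart", "bug_id": 7, "issue_title": "Wrong index returned",
     "issue_summary": "TimePeriodValues returns the wrong max index."},
    ["org.jfree.data.time.TimePeriodValuesTests::testGetMaxMiddleIndex"])
print(prompt)
```

A feedback-driven loop simply re-invokes such a builder with the new failing-test output after each candidate patch is applied and tested.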
Agentic repair experiments using Hunk4J have involved standardized prompt templates, repository-level tool protocols (e.g., via Maple), and detailed trajectory logging (Nashid et al., 14 Nov 2025).
Repair performance is measured along multiple axes. For a bug $b$, agent $a$, and repair trajectory $\tau$:
- File-level localization success: 1 if all ground-truth buggy hunks lie in files the agent localized during $\tau$, else 0.
- Compilation success: 1 if a nonempty patch is produced and compiles.
- Repair accuracy: 1 if a patch passes all tests after repair.
- Regression reduction: the difference in failing-test count before and after repair.
Resource usage (input/output tokens, runtime) and fine-grained agent action categories (e.g., NAVIGATE, WRITE, SEARCH_CONTENT) are also measured.
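These per-run metrics can be computed from a trajectory record. The dictionary layout below is illustrative, not the benchmark's actual log schema:

```python
def repair_metrics(gt_files, trajectory):
    """Per-run metrics from a trajectory record: file-level localization,
    compilation, plausibility (all tests pass), and regression reduction.
    'trajectory' holds the files the agent edited, the produced patch,
    whether it compiled, and failing-test counts before/after."""
    localized = int(set(gt_files) <= set(trajectory["edited_files"]))
    compiled = int(bool(trajectory["patch"]) and trajectory["compiles"])
    plausible = int(compiled and trajectory["failing_after"] == 0)
    regression_reduction = trajectory["failing_before"] - trajectory["failing_after"]
    return {"localized": localized, "compiled": compiled,
            "plausible": plausible, "regression_reduction": regression_reduction}

m = repair_metrics(
    ["A.java", "B.java"],
    {"edited_files": ["A.java", "B.java", "C.java"], "patch": "diff --git ...",
     "compiles": True, "failing_before": 3, "failing_after": 0})
print(m)
```

Aggregating these booleans over all 372 bugs yields the per-agent rates reported in the next section.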
5. Empirical Insights on Multi-Hunk Repair
Experiments with LLMs (vanilla and retrieval-augmented) and code agents (Claude Code, Codex, Gemini-cli, Qwen Code) reveal several consistent findings:
- Single-turn LLMs: In the baseline setting, o4-mini achieves a 26.9% plausible@1 repair rate, GPT-4.1 22.0%, and open-source models around 12%. With augmentations (retrieval + feedback), plausible@1 for o4-mini rises to 35.8%, but no LLM fixes any Fragment-class bug in LLM-only mode.
- Agentic Systems: Claude Code repairs 93.3% of Hunk4J bugs, Codex 87.1%, Gemini-cli 41.7%, Qwen Code 25.8%, all under identical prompt regimes (Nashid et al., 14 Nov 2025).
- Divergence and Proximity Correlation: Success rates decrease as patch-level divergence increases and as proximity class becomes more dispersed. For example, agent fix rates by class (Claude/Codex/Gemini/Qwen):
| Class | Claude (%) | Codex (%) | Gemini (%) | Qwen (%) |
|---|---|---|---|---|
| Nucleus | 100 | 91.53 | 44.07 | 22.03 |
| Cluster | 93.51 | 84.32 | 48.11 | 31.35 |
| Orbit | 89.55 | 91.04 | 26.87 | 19.40 |
| Sprawl | 92.00 | 92.00 | 36.00 | 24.00 |
| Fragment | 81.82 | 63.64 | 18.18 | 0.00 |
- Fixed vs. Unfixed Divergence: For all agents, the median Div(P) of fixed bugs is significantly lower than that of unfixed bugs. This emphasizes that repair hardness is closely linked to cross-hunk lexical, structural, and file-level variety (Nashid et al., 4 Jun 2025, Nashid et al., 14 Nov 2025).
- Resource and Runtime Penalty: Failed repairs are consistently more resource-intensive. For example, Gemini-cli failed runs consume 343% more input and 190% more output tokens than successful repairs, with median repair time rising by 633.1 s. High agentic overheads (e.g., excessive NAVIGATE in Qwen, over-modification in Gemini-cli, and over-exploration in Claude) are observed across behavior logs (Nashid et al., 14 Nov 2025).
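The fixed-versus-unfixed comparison can be reproduced with a rank test. Below is a minimal pure-Python Mann–Whitney U statistic (no tie correction or normal approximation), applied to hypothetical Div(P) samples; the source presumably uses a standard statistical package for the full test:

```python
def mann_whitney_u(xs, ys):
    """U statistic counting, over all (x, y) pairs, how often x < y
    (ties count 0.5). U near len(xs)*len(ys) means xs is stochastically
    smaller than ys."""
    u = 0
    for x in xs:
        for y in ys:
            if x < y:
                u += 1
            elif x == y:
                u += 0.5
    return u

# Hypothetical Div(P) samples for fixed vs. unfixed bugs.
fixed = [0.12, 0.25, 0.31, 0.40]
unfixed = [0.45, 0.58, 0.66, 0.74]
u = mann_whitney_u(fixed, unfixed)
print(u, u / (len(fixed) * len(unfixed)))  # → 16 1.0
```

Here the normalized U of 1.0 means every fixed-bug divergence falls below every unfixed-bug divergence, the extreme version of the pattern the study reports.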
6. Tooling, Contextualization, and Future Implications
Hunk4J includes Birch for benchmarking and supports Maple, a repository-level Model Context Protocol facilitating AST-based retrieval and file/method/class localization. Empirical evaluation demonstrates that enabling repository interaction via Maple yields a 30% relative boost in Gemini-cli repair accuracy on a 50-bug sample, with largest improvements in localization and repair rates for more dispersed bugs. Direct LLM attempts remain ineffective for highly distributed Fragment-class bugs (Nashid et al., 14 Nov 2025).
The Hunk4J experience demonstrates a structural gap for current LLM and agentic repair strategies on real-world, heterogeneous, multi-hunk patches. It motivates development of divergence- and proximity-aware models, sophisticated context management, and explicit reasoning over patch structure.
7. Significance and Research Impact
Hunk4J is the first large-scale dataset to operationalize hunk divergence and spatial proximity as first-class, quantitatively tractable features for multi-hunk repair. It provides the community with:
- A reproducible source of real-world, multi-hunk repair challenges.
- Metrics and annotations for quantitative and stratified analysis of repair systems.
- A reference benchmark setting for evaluating divergence robustness, localization efficacy, and behavioral patterns in LLMs and agents.
The Hunk4J corpus and its associated benchmarking tools have become integral to recent studies investigating behavioral dynamics, trajectory-level analytics, and the architecture of repair-capable AI agents for challenging patch dispersion settings (Nashid et al., 4 Jun 2025, Nashid et al., 14 Nov 2025).