
BackportBench: Patch Backporting Benchmark

Updated 8 December 2025
  • BackportBench is a multilingual benchmark suite that rigorously evaluates automated patch backporting across Python, Java, and JavaScript ecosystems with Dockerized setups.
  • It systematically curates 202 real-world backporting tasks from the OSV database to test both traditional and LLM-based approaches using strict test-driven metrics.
  • Empirical results show that agentic LLM methods achieve higher resolve rates, though performance varies by language and task complexity.

BackportBench is a comprehensive multilingual benchmark suite designed to rigorously evaluate and compare automated techniques for patch backporting—the migration of security and bug-fix patches from newer mainline software releases to older maintenance branches. As the first repository-level benchmark to enable executable, test-driven evaluation across Python (PyPI), Java (Maven), and JavaScript (npm) ecosystems, BackportBench provides 202 real-world backporting tasks, each with a Dockerized environment and mapped to vulnerability-fixing git commits. It supports systematic assessment of both traditional and LLM-based automated backporting methods using outcome-driven metrics grounded in downstream test case validation (Zhong et al., 1 Dec 2025).

1. Benchmark Construction and Dataset Composition

BackportBench systematically curates backporting tasks by mining the OSV (Open Source Vulnerability) database (2024-08-29 dump) for three major language ecosystems: PyPI, Maven, and npm. The construction process involves deduplicating records by vulnerability alias, extracting vulnerability-fixing, SemVer-tagged GitHub commits, and retaining only backports that modify both test and non-test files within the affected package repository.

Initial filtering of 37,284 OSV records yielded 485 records (216 PyPI, 204 Maven, 65 npm), which manual validation expanded into 619 candidate commit pairs (a single record can map to multiple backport commit pairs). The benchmark ultimately focuses on the top four repositories per ecosystem (51.1% coverage), slicing each backport commit into a "test-patch" (test files only) and a "gold-patch" (non-test files). Dockerfiles lock dependency versions to the historic tagged release, and tests are executed to isolate FAIL→PASS transitions attributable to the gold-patch. Task inclusion strictly requires at least one FAIL→PASS test unaffected by common import/symbol errors, yielding 202 final backporting tasks: 112 Python, 67 Java, and 23 JavaScript.
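The slicing step can be pictured as a small git-based routine. The following is a minimal sketch under stated assumptions (the git CLI is available, a name-based heuristic identifies test files, and the helper names are invented for exposition); the benchmark's actual curation tooling may differ.

```python
# Illustrative sketch of splitting a backport commit into a test-patch and a
# gold-patch; the heuristics and helper names are assumptions, not the
# benchmark's actual tooling.
import subprocess

def is_test_file(path: str) -> bool:
    # Heuristic used only for illustration: any path mentioning "test".
    return "test" in path.lower()

def split_backport_commit(repo_dir: str, commit: str) -> tuple[str, str]:
    """Return (test_patch, gold_patch) for a backport commit."""
    changed = subprocess.run(
        ["git", "-C", repo_dir, "diff", "--name-only", f"{commit}^", commit],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    test_files = [f for f in changed if is_test_file(f)]
    code_files = [f for f in changed if not is_test_file(f)]

    def diff_for(paths: list[str]) -> str:
        if not paths:
            return ""
        return subprocess.run(
            ["git", "-C", repo_dir, "diff", f"{commit}^", commit, "--", *paths],
            capture_output=True, text=True, check=True,
        ).stdout

    # Applying only the test-patch to the old release and running the suite
    # reveals FAIL->PASS candidates; applying the gold-patch on top should
    # flip those tests to passing if the backport is valid.
    return diff_for(test_files), diff_for(code_files)
```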

The following table summarizes key per-ecosystem statistics:

| Metric | PyPI (mean / max) | Maven (mean / max) | npm (mean / max) |
| --- | --- | --- | --- |
| # non-test files | 3,538 / 4,330 | 2,922 / 10,000 | 300 / 1,716 |
| # non-test LOC | 1.66M / 1.89M | 1.22M / 4.74M | 197K / 451K |
| # files in gold patch | 5.8 / 15 | 4.8 / 24 | 2.8 / 6 |
| # lines in gold patch | 54.8 / 174 | 47.7 / 217 | 19.5 / 86 |
| # FAIL→PASS tests | 2.2 / 11 | 2.1 / 10 | 3.2 / 6 |
| # total tests | 57.6 / 735 | 297.7 / 1,365 | 73.4 / 219 |

(Zhong et al., 1 Dec 2025)

2. Task Definition, Oracle, and Evaluation Metrics

Each BackportBench task is defined by a tuple of three elements: the codebase of the unpatched release $C_{\text{old}}$, the patched release $C_{\text{new}}$, and the human-curated patch $P_{\text{new}}$. The objective is to synthesize a backported patch $P_{\text{old}} = f(C_{\text{old}}, C_{\text{new}}, P_{\text{new}})$ for $C_{\text{old}}$ such that, after application, all newly-passing tests ($T_{\mathrm{F2P}}$) and preserved tests ($T_{\mathrm{P2P}}$) pass:

$$T_{\mathrm{F2P}}(C_{\text{old}}') = \text{PASS}, \quad T_{\mathrm{P2P}}(C_{\text{old}}') = \text{PASS}$$
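Concretely, the oracle amounts to applying a candidate patch inside the task's Docker environment and re-running both test sets. The sketch below illustrates this check; the `task` helpers (checkout_old_release, apply_patch, run_tests) are hypothetical stand-ins for the harness internals, which are not detailed here.

```python
# Minimal sketch of the test oracle; the `task` helpers are hypothetical
# stand-ins for the Dockerized harness, not the benchmark's actual API.
from dataclasses import dataclass

@dataclass
class OracleResult:
    f2p_pass: bool  # every FAIL->PASS test now passes
    p2p_pass: bool  # every previously passing test still passes

def check_candidate(task, candidate_patch: str) -> OracleResult:
    workdir = task.checkout_old_release()          # C_old with the test-patch applied
    task.apply_patch(workdir, candidate_patch)     # candidate P_old
    f2p = task.run_tests(workdir, task.f2p_tests)  # list of booleans, one per test
    p2p = task.run_tests(workdir, task.p2p_tests)
    return OracleResult(f2p_pass=all(f2p), p2p_pass=all(p2p))
```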

The primary quantitative metric is the Resolve Rate:

$$\mathrm{ResolveRate} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\bigl[\text{all tests in } T_{\mathrm{F2P}} \text{ and } T_{\mathrm{P2P}} \text{ pass for task } i\bigr]$$

with N = 202 tasks. Non-resolving outcomes are further categorized as GenerationFailed, OnlyF2PFailed, OnlyP2PFailed, BothFailed, or Timeout, supporting fine-grained method diagnostics.
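Given per-task oracle outcomes, the Resolve Rate and failure categories can be tallied as in the following sketch; the outcome labels mirror the categories above, while the field names are illustrative assumptions rather than the official harness schema.

```python
# Sketch of metric aggregation over the benchmark tasks; the outcome labels
# follow the categories named above, the dictionary fields are illustrative.
from collections import Counter

def aggregate(outcomes: list[dict]) -> tuple[float, Counter]:
    """outcomes: one dict per task, e.g.
    {"generated": True, "timeout": False, "f2p_pass": True, "p2p_pass": False}"""
    categories = Counter()
    resolved = 0
    for o in outcomes:
        if not o["generated"]:
            categories["GenerationFailed"] += 1
        elif o["timeout"]:
            categories["Timeout"] += 1
        elif o["f2p_pass"] and o["p2p_pass"]:
            resolved += 1
        elif not o["f2p_pass"] and not o["p2p_pass"]:
            categories["BothFailed"] += 1
        elif not o["f2p_pass"]:
            categories["OnlyF2PFailed"] += 1
        else:
            categories["OnlyP2PFailed"] += 1
    return resolved / len(outcomes), categories
```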

3. Baselines and Automated Backport Techniques

BackportBench enables comprehensive evaluation of multiple backporting paradigms:

  • Traditional Patch-Porting Baselines:
    • Automating Zero-Shot (Pan et al. '24): Applies a fine-tuned LLM to function-level contexts, translating $P_{\text{new}} \rightarrow P_{\text{old}}$.
    • Mystique (Wu et al. '25): Guides porting via extracted semantic/syntactic function signatures. Both were originally developed for C/C++ and are adapted here for Java by function-wise patch translation and merge.
  • LLM-Based GitHub Issue-Resolution Techniques:
    • (M)SWE-agent: Leverages LLM agentic workflows with shell, grep, and file operations to iteratively inspect, edit, and test both the old and new versions; SWE-agent is used for Python, MSWE-agent for Java/JS.
    • (M)Agentless: Locates relevant code hierarchically, generates candidate patches and selects among them by maximizing passing tests, then validates the selection on the true test suite; the surrounding context and $P_{\text{new}}$ are included in the prompts.
    • Oracle Retrieval: Provides the LLM with the oracle file list and the human patch; aside from this oracle localization, patch generation follows the non-agentic (Agentless-style) pipeline.

All methods are assessed with GPT-5-Chat (2025-09-07), Claude Sonnet 4, and Qwen3-Coder (480B), enabling cross-architecture comparison.
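All of these baselines can be viewed as implementations of the single function signature from Section 2, differing only in how they search the repository and which model backs them. A minimal sketch of that shared interface, with illustrative (not official) names, is:

```python
# Illustrative shared interface for the baselines above; the Protocol and
# argument names are assumptions for exposition, not the benchmark's API.
from typing import Protocol

class BackportMethod(Protocol):
    def backport(self, c_old: str, c_new: str, p_new: str) -> str:
        """Given checkouts of the old and new releases (C_old, C_new) and the
        mainline patch P_new, return a candidate patch P_old as a unified
        diff against C_old, ready for the test oracle in Section 2."""
        ...
```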

4. Comparative Empirical Results

Performance differs sharply by language, patch complexity, and backporting strategy. Agentic LLM approaches consistently achieve the highest Resolve Rates, particularly for cases requiring logical or structural adaptation. The key results are:

| Method | Python | Java | JavaScript | Overall |
| --- | --- | --- | --- | --- |
| (M)SWE-agent (Claude 4) | 91.1% | 46.3% | 43.5% | 70.8% |
| (M)Agentless (GPT-5) | 68.8% | 41.8% | 30.4% | 55.4% |
| Oracle-Retrieve (GPT-5) | 75.9% | 62.7% | 60.9% | 69.8% |

A detailed analysis by adaptation category shows that:

  • Logical/structural changes (21.8% of tasks) are the hardest; SWE-agent (Claude 4) achieves 62.5% Resolve Rate here versus 21.9% for (M)Agentless (GPT-5).
  • Unexpectedly, "location-only" changes (where the original patch needs only to be applied at a different code location) are easier for SWE-agent than pure "no change" tasks.

On Java, Automating Zero-Shot resolves 58% of "no-change" cases but only 7.7% of "logical/structural" cases. Mystique generalizes poorly (1.5% overall). MSWE-agent (GPT-5) achieves 44.8% overall, with 30.8% on logical/structural tasks.

Procedural methods often generate patches that break previously passing tests (P2P), exposing limitations of equivalence-based evaluation.

5. Analysis of Benchmark Scope and Insights

BackportBench's 202 task set reveals several structural and practical characteristics:

  • High-quality backports are concentrated in PyPI (42.7%) and Maven (45.9%), much less so in npm (11.5%).
  • 47% of tasks address vulnerabilities rated as CVSS High/Critical, highlighting security relevance.
  • 91.9% of backports edit the same files as the original patch, while 21.8% require significant logical or structural adaptation.
  • Agentic LLM agents (SWE-agent) outperform procedural LLM pipelines, especially for nontrivial adaptations and context relocations.
  • All LLM techniques experience a significant performance drop on Java/JS compared to Python, underscoring open questions in cross-language, repo-level program synthesis.

Traditional function-level approaches fail to generalize beyond C/C++ and struggle to preserve regression-avoiding invariants, reinforcing the need for repository-level techniques integrating dynamic validation.

6. Limitations and Future Directions

BackportBench is subject to certain constraints:

  • It depends on hand-crafted Dockerfiles for environment setup, risking configuration drift.
  • Logical/structural task diversity remains limited; expansion with more complex tasks is needed.
  • Automated approaches remain challenged by cross-language code understanding and holistic diff-aware synthesis.
  • Future research directions include automating Docker environment curation, growing the corpus with harder real-world backports, developing retrieval-augmented or graph-based agents to exploit finer-grained code history, and evaluating the ecosystem impact of automated backporting on vulnerable-dependency propagation.

7. Significance and Availability

As the first multilingual, repository-level backporting benchmark with test-driven Docker validation, BackportBench fills a critical gap for rigorous measurement in automated patch porting across mainstream open-source software. By releasing the full task set with validated Dockerized environments and test suites, the benchmark aims to standardize evaluation and spark further innovations toward practical, safe, cross-ecosystem automated backporting (Zhong et al., 1 Dec 2025).
