Average Performance Gap Recovered (APGR)
- APGR is a normalized metric defined as the ratio of performance gain from an intervention to the original gap between baseline and reference, enabling cross-domain comparisons.
- It is applied in diverse areas such as language model alignment, adversarial elicitation, privacy-preserving protocols, and combinatorial optimization to assess incremental improvements.
- Empirical studies report APGR values ranging from roughly 3% for some multilingual tuning interventions to nearly 100% in privacy-preserving consensus, reflecting how completely each domain's interventions close their respective gaps.
Average Performance Gap Recovered (APGR) quantifies the fraction of an initial performance shortfall—relative to a gold-standard or higher-performing baseline—that is eliminated by a specified intervention, algorithmic improvement, or elicitation process. It is defined precisely as the ratio of the gain attributable to the intervention over the original gap between baseline and reference performances, aggregated as appropriate across tasks, languages, or instances. APGR appears across multiple domains, including LLM alignment, adversarial capability elicitation, privacy-preserving consensus protocols, and approximate combinatorial optimization, serving as a normalized, interpretable indicator of incremental progress or risk.
1. Formal Definition and Core Formulas
Let $P_{\mathrm{ref}}$ denote the reference performance (e.g., top-line accuracy, unrestricted capability), $P_{\mathrm{base}}$ the base performance (e.g., unaugmented model, decentralized protocol, or base greedy policy), and $P_{\mathrm{int}}$ the post-intervention performance. The absolute gap is $\Delta = P_{\mathrm{ref}} - P_{\mathrm{base}}$, and the gain from intervention is $G = P_{\mathrm{int}} - P_{\mathrm{base}}$. The per-instance Performance Gap Recovered is then:

$$\mathrm{PGR} = \frac{G}{\Delta} = \frac{P_{\mathrm{int}} - P_{\mathrm{base}}}{P_{\mathrm{ref}} - P_{\mathrm{base}}}$$
In multi-task or multi-language evaluations, per-task or per-language PGR values are averaged:

$$\mathrm{APGR} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{PGR}_i = \frac{1}{N} \sum_{i=1}^{N} \frac{P_{\mathrm{int},i} - P_{\mathrm{base},i}}{P_{\mathrm{ref},i} - P_{\mathrm{base},i}}$$
Domain-specific quantifications may utilize accuracy, rubric-based scores, mean-square error, or other scalar metrics as appropriate (Alhanai et al., 2024, Kaunismaa et al., 20 Jan 2026, Wang et al., 2023, Mastin et al., 2013).
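The definitions above reduce to a few lines of code. A minimal sketch (function and variable names are illustrative, not drawn from the cited papers):

```python
def pgr(reference, base, post):
    """Performance Gap Recovered: fraction of the (reference - base)
    shortfall that an intervention moving base -> post eliminates."""
    gap = reference - base
    if gap == 0:
        raise ValueError("no gap to recover: reference equals base")
    return (post - base) / gap

def apgr(triples):
    """Average PGR over per-task (reference, base, post) triples."""
    return sum(pgr(r, b, p) for r, b, p in triples) / len(triples)

# Two tasks: one recovers half its gap, the other all of it.
print(round(apgr([(0.9, 0.5, 0.7), (0.8, 0.6, 0.8)]), 3))  # 0.75
```

Note that PGR can exceed 1 (overshooting the reference) or be negative (the intervention hurts), which is why APGR is best reported alongside absolute performance levels.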
2. Applications in LLM Performance Parity
In multilingual benchmarking for LLMs, APGR measures the percentage of the original English-versus-non-English accuracy gap that is overcome via fine-tuning, cross-lingual transfer, or data quality filtering. If $\mathrm{Performance}_{\mathrm{en}}$ is the English benchmark accuracy and $\mathrm{Performance}_{\mathrm{lang}}$ is that for a target language, then for an intervention "method":

$$\mathrm{APGR}_{\mathrm{method}} = \frac{1}{N_{\mathrm{lang}}} \sum_{\mathrm{lang}} \frac{\mathrm{Performance}_{\mathrm{lang}}^{\mathrm{method}} - \mathrm{Performance}_{\mathrm{lang}}^{\mathrm{base}}}{\mathrm{Performance}_{\mathrm{en}} - \mathrm{Performance}_{\mathrm{lang}}^{\mathrm{base}}} \times 100\%$$
Empirical results demonstrate the following mean APGRs over eight African languages and multiple interventions (Alhanai et al., 2024):
| Intervention Type | Mean APGR (%) |
|---|---|
| Mono-lingual fine-tuning | 5.6 |
| High- vs. low-quality data | 5.4 |
| Cross-lingual transfer | 2.9 |
| Cultural appropriateness | 3.0 |
Results also document non-uniform APGR across languages, with domain alignment and data quality exerting substantial influence, and gains in high-resource or related languages surpassing those in typologically distant, lower-resource cases.
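Applied to the multilingual setting, the per-language computation looks like the sketch below. The language codes and accuracies are hypothetical placeholders, not the benchmark numbers reported by Alhanai et al.:

```python
# Hypothetical benchmark accuracies (illustrative only).
english = 0.82
scores = {
    # lang: (base accuracy, accuracy after fine-tuning)
    "sw": (0.41, 0.45),
    "yo": (0.35, 0.36),
    "ha": (0.38, 0.41),
}

def apgr_percent(en, per_lang):
    """Mean percentage of the English-vs-target gap closed, averaged
    over languages."""
    recovered = [
        100.0 * (post - base) / (en - base)
        for base, post in per_lang.values()
    ]
    return sum(recovered) / len(recovered)

print(f"mean APGR: {apgr_percent(english, scores):.1f}%")
```

Averaging per-language PGR values (rather than pooling raw accuracies) is what makes the metric comparable across languages with very different baseline gaps.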
3. Measurement of Elicitation Risk in Model Safeguard Circumvention
In adversarial machine learning, APGR quantifies the fraction of the difference in capability between a base open-source model and an unguarded frontier model that is bridged by exploiting outputs from a safeguarded system (Kaunismaa et al., 20 Jan 2026). For task $i$ out of $N$:

$$\mathrm{APGR} = \frac{1}{N} \sum_{i=1}^{N} \frac{P_{\mathrm{elicited},i} - P_{\mathrm{base},i}}{P_{\mathrm{frontier},i} - P_{\mathrm{base},i}}$$
Empirical studies on hazardous chemical synthesis tasks yield anchored-comparison APGR near 39% for Llama 3.3 70B, with APGR values scaling up to 71% as the underlying frontier model’s capability increases. This reveals that the hazardous information suppression provided by output-level safeguards can be partially circumvented via benign-appearing elicitation and fine-tuning cycles.
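A worked instance of the anchored comparison, using hypothetical rubric scores chosen to land near the reported 39% figure (the score values are invented for illustration):

```python
def gap_recovered(base, frontier, elicited):
    """Fraction of the base-to-frontier capability gap bridged by
    elicitation against the safeguarded system."""
    return (elicited - base) / (frontier - base)

# Hypothetical rubric scores on one task: base open model 2.0,
# unguarded frontier model 6.0, elicited system 3.56.
print(f"{gap_recovered(2.0, 6.0, 3.56):.0%}")  # 39%
```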
4. Accuracy–Privacy Trade-off Recovery in Distributed Consensus
In privacy-preserving average consensus, APGR tracks how much of the gap in mean-square error (MSE) between decentralized and centralized differentially-private protocols is closed by cryptographic and correlated-noise augmentations (Wang et al., 2023). The key outcome is that, by combining Paillier-based zero-sum shuffling with small centralized-scale perturbations, the decentralized protocol attains the same order of MSE as trusted centralized aggregation:

$$\mathrm{MSE}_{\mathrm{dec}} = \Theta\!\left(\mathrm{MSE}_{\mathrm{cent}}\right)$$
This indicates practical APGR near 100%, so long as specified cryptographic and topological conditions are satisfied.
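The intuition behind the near-100% recovery can be seen in a toy simulation: zero-sum correlated offsets cancel exactly when the network averages, while independent per-node noise does not. This sketch illustrates only the cancellation mechanism, not the paper's protocol (no encryption or differential-privacy accounting is modeled):

```python
import random

random.seed(0)
n = 50
values = [random.uniform(0, 1) for _ in range(n)]
true_avg = sum(values) / n

# Naive decentralized DP: each node adds independent noise, which
# does not cancel in the network average.
indep = [v + random.gauss(0, 0.5) for v in values]

# Correlated zero-sum offsets (the role played by Paillier-based
# zero-sum shuffling): they cancel exactly in the average.
offsets = [random.gauss(0, 0.5) for _ in range(n)]
zero_sum = [o - sum(offsets) / n for o in offsets]  # force sum to zero
corr = [v + z for v, z in zip(values, zero_sum)]

err_indep = abs(sum(indep) / n - true_avg)
err_corr = abs(sum(corr) / n - true_avg)
print(f"independent-noise error: {err_indep:.4f}")
print(f"zero-sum-noise error:    {err_corr:.2e}")
```

The zero-sum variant leaves the average exact up to floating point, which is why only the small centralized-scale perturbation contributes to the final MSE.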
5. Approximation Guarantee Improvements in Combinatorial Optimization
In stochastic analyses of rollout algorithms for the subset sum and knapsack problems, APGR is the proportion of the base greedy solution’s average gap to capacity (or, analogously, profit shortfall) recovered on average by a single iteration of rollout (Mastin et al., 2013):

$$\mathrm{APGR} = \frac{\mathbb{E}[G_{\mathrm{greedy}}] - \mathbb{E}[G_{\mathrm{rollout}}]}{\mathbb{E}[G_{\mathrm{greedy}}]}$$

with $G$ the residual gap to capacity for subset sum. After one “consecutive” rollout,

$$\frac{\mathbb{E}[G_{\mathrm{rollout}}]}{\mathbb{E}[G_{\mathrm{greedy}}]} \le 0.7$$

Thus, rollout methods provably realize at least a 30% mean reduction in gap. The “exhaustive” rollout variant achieves asymptotic APGR approaching $1$ as $n$ grows, nearly entirely eliminating the greedy policy’s deficit.
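The gap-recovery effect of a single rollout pass can be checked empirically. The sketch below, assuming uniform item sizes and an illustrative implementation of a consecutive-rollout policy (not the paper's exact algorithm), estimates APGR for subset sum by simulation:

```python
import random

def greedy_gap(items, cap):
    """Blind greedy for subset sum: take each item in order if it
    still fits; return the leftover capacity (the gap)."""
    for w in items:
        if w <= cap:
            cap -= w
    return cap

def rollout_gap(items, cap):
    """One pass of a consecutive-rollout policy: at each position,
    keep or skip the item according to which choice yields the
    better greedy completion on the remaining items."""
    for i, w in enumerate(items):
        rest = items[i + 1:]
        skip = greedy_gap(rest, cap)
        if w <= cap and greedy_gap(rest, cap - w) <= skip:
            cap -= w
    return cap

random.seed(1)
trials = 2000
g_sum = r_sum = 0.0
for _ in range(trials):
    items = [random.random() for _ in range(10)]
    g_sum += greedy_gap(items, 1.0)
    r_sum += rollout_gap(items, 1.0)

print(f"mean greedy gap:  {g_sum / trials:.4f}")
print(f"mean rollout gap: {r_sum / trials:.4f}")
print(f"estimated APGR:   {1 - r_sum / g_sum:.1%}")
```

Because the rollout decision at each step takes the better of the two greedy completions, its per-instance gap is never worse than blind greedy's, so the simulated APGR is always nonnegative.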
6. Comparative Table of APGR Metrics Across Domains
| Domain | Reference/Formula | Typical APGR Value |
|---|---|---|
| LLM Multilinguality | gain over original English–target gap | $2.9$–$5.6\%$ |
| Adversarial Elicitation | base-to-unguarded-frontier gap bridged | $24.7$–$71\%$ |
| DP Consensus | decentralized-vs.-centralized MSE gap closed | $\approx 100\%$ |
| Knapsack Rollout | mean greedy gap reduced by one rollout iteration | at least $30\%$ |
7. Interpretation, Limitations, and Best Practices
APGR offers a normalized, interpretable measure for comparing the effectiveness of diverse interventions relative to a persistent performance deficit, independent of the absolute scale of the endpoint metric. However, it depends critically on the definition of the baseline gap, the choice of performance metric, and statistical significance constraints. For trustworthy evaluation, APGR must be reported alongside absolute performance levels, calculated via robust, well-calibrated benchmarks, and disaggregated over relevant task or language axes. Its utility extends to risk assessment (e.g., model alignment red-teaming), ablation study normalization, and cross-method comparison, but assumptions embedded in baseline definitions and modeling choices must be made explicit to ensure valid comparison (Alhanai et al., 2024, Kaunismaa et al., 20 Jan 2026, Wang et al., 2023, Mastin et al., 2013).