
Average Performance Gap Recovered (APGR)

Updated 27 January 2026
  • APGR is a normalized metric defined as the ratio of performance gain from an intervention to the original gap between baseline and reference, enabling cross-domain comparisons.
  • It is applied in diverse areas such as language model alignment, adversarial elicitation, privacy-preserving protocols, and combinatorial optimization to assess incremental improvements.
  • Empirical studies show APGR values ranging from around 3% in multilingual tuning to nearly 100% in privacy-preserving consensus, reflecting domain-specific performance nuances.

Average Performance Gap Recovered (APGR) quantifies the fraction of an initial performance shortfall—relative to a gold-standard or higher-performing baseline—that is eliminated by a specified intervention, algorithmic improvement, or elicitation process. It is defined precisely as the ratio of the gain attributable to the intervention over the original gap between baseline and reference performances, aggregated as appropriate across tasks, languages, or instances. APGR appears across multiple domains, including LLM alignment, adversarial capability elicitation, privacy-preserving consensus protocols, and approximate combinatorial optimization, serving as a normalized, interpretable indicator of incremental progress or risk.

1. Formal Definition and Core Formulas

Let $P_{\mathrm{ref}}$ denote the reference performance (e.g., top-line accuracy, unrestricted capability), $P_{\mathrm{base}}$ the base performance (e.g., unaugmented model, decentralized protocol, or base greedy policy), and $P_{\mathrm{interv}}$ the post-intervention performance. The absolute gap is $G_{\mathrm{base}} = P_{\mathrm{ref}} - P_{\mathrm{base}}$, and the gain from the intervention is $G_{\mathrm{int}} = P_{\mathrm{interv}} - P_{\mathrm{base}}$. The Average Performance Gap Recovered is then:

$$\mathrm{APGR} = \frac{G_{\mathrm{int}}}{G_{\mathrm{base}}} \times 100\%$$

In multi-task or multi-language evaluations, per-task or per-language PGR values are averaged:

$$\mathrm{APGR} = \frac{1}{K}\sum_{i=1}^K \frac{P_{\mathrm{interv},i} - P_{\mathrm{base},i}}{P_{\mathrm{ref},i} - P_{\mathrm{base},i}} \times 100\%$$

Domain-specific quantifications may utilize accuracy, rubric-based scores, mean-square error, or other scalar metrics as appropriate (Alhanai et al., 2024, Kaunismaa et al., 20 Jan 2026, Wang et al., 2023, Mastin et al., 2013).
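
Both formulas translate directly into code. The following is a minimal sketch (function and argument names are illustrative, not drawn from the cited papers):

```python
def apgr(p_interv: float, p_base: float, p_ref: float) -> float:
    """Performance Gap Recovered for a single task, as a percentage.

    p_ref is the reference performance, p_base the pre-intervention
    baseline, and p_interv the post-intervention performance.
    """
    gap = p_ref - p_base
    if gap == 0:
        raise ValueError("reference equals baseline; the gap is undefined")
    return (p_interv - p_base) / gap * 100.0


def average_apgr(p_interv, p_base, p_ref) -> float:
    """Macro-average of per-task PGR values over K tasks (second formula)."""
    pgrs = [apgr(i, b, r) for i, b, r in zip(p_interv, p_base, p_ref)]
    return sum(pgrs) / len(pgrs)
```

For example, `apgr(0.7, 0.5, 0.9)` returns 50.0: the intervention recovered half of the original 0.4 gap. Note that the macro-average weights every task equally, so tasks with tiny baseline gaps can dominate the aggregate.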

2. Applications in LLM Performance Parity

In multilingual benchmarking for LLMs, APGR measures the percentage of the original English-versus-non-English accuracy gap that is overcome via fine-tuning, cross-lingual transfer, or data quality filtering. For instance, if $\mathrm{Performance}_{\mathrm{en}}$ is the English benchmark accuracy and $\mathrm{Performance}_{\mathrm{lang}}$ that for a target language, then for an intervention "method":

$$\mathrm{APGR}_{\mathrm{method},\,\mathrm{lang}} = \frac{\mathrm{Performance}_{\mathrm{method},\,\mathrm{lang}} - \mathrm{Performance}_{\mathrm{base},\,\mathrm{lang}}}{\mathrm{Performance}_{\mathrm{en}} - \mathrm{Performance}_{\mathrm{base},\,\mathrm{lang}}} \times 100\%$$

Empirical results demonstrate the following mean APGRs over eight African languages and multiple interventions (Alhanai et al., 2024):

| Intervention Type | Mean APGR (%) |
|---|---|
| Mono-lingual fine-tuning | 5.6 |
| High-quality vs. low-quality | 5.4 |
| Cross-lingual transfer | 2.9 |
| Cultural appropriateness | 3.0 |

Results also document non-uniform APGR across languages, with domain alignment and data quality exerting substantial influence, and gains in high-resource or related languages surpassing those in typologically distant, lower-resource cases.
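The per-language computation and its macro-average can be sketched as follows; the accuracies and language labels below are invented purely for illustration and are not taken from the cited study:

```python
# Hypothetical per-language accuracies, invented for illustration only.
base = {"lang_a": 0.42, "lang_b": 0.38}    # base model accuracy per language
tuned = {"lang_a": 0.45, "lang_b": 0.39}   # accuracy after the intervention
english = 0.78                             # English reference accuracy


def apgr_lang(lang: str) -> float:
    """Per-language Performance Gap Recovered, in percent."""
    return (tuned[lang] - base[lang]) / (english - base[lang]) * 100.0


# Mean APGR over the evaluated languages.
mean_apgr = sum(apgr_lang(lang) for lang in base) / len(base)
```

Because each language's gap appears in its own denominator, a small absolute gain in a language with a narrow English gap can yield a larger per-language APGR than a bigger gain in a language with a wide gap, which is one reason the paper reports non-uniform values across languages.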

3. Measurement of Elicitation Risk in Model Safeguard Circumvention

In adversarial machine learning, APGR quantifies the fraction of the difference in capability between a base open-source model and an unguarded frontier model that is bridged by exploiting outputs from a safeguarded system (Kaunismaa et al., 20 Jan 2026). For task $i$ of $K$:

$$\Delta_{\mathrm{orig},i} = P_{\mathrm{frontier},i} - P_{\mathrm{open},i}$$

$$\Delta_{\mathrm{rec},i} = P_{\mathrm{tuned},i} - P_{\mathrm{open},i}$$

$$\mathrm{APGR} = \frac{1}{K} \sum_{i=1}^K \frac{\Delta_{\mathrm{rec},i}}{\Delta_{\mathrm{orig},i}} \times 100\%$$

Empirical studies on hazardous chemical synthesis tasks yield anchored-comparison APGR near 39% for Llama 3.3 70B, with APGR values scaling up to 71% as the underlying frontier model’s capability increases. This reveals that the hazardous information suppression provided by output-level safeguards can be partially circumvented via benign-appearing elicitation and fine-tuning cycles.
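The per-task ratio-then-average structure above can be sketched directly; the scores below are hypothetical and serve only to show the computation:

```python
def elicitation_apgr(p_open, p_frontier, p_tuned) -> float:
    """Mean fraction of the open-vs-frontier capability gap recovered by
    elicitation and fine-tuning, averaged over K tasks, in percent."""
    ratios = [
        (tuned - open_) / (frontier - open_)
        for open_, frontier, tuned in zip(p_open, p_frontier, p_tuned)
    ]
    return sum(ratios) / len(ratios) * 100.0


# Hypothetical per-task scores, invented for illustration only.
recovered = elicitation_apgr(
    p_open=[0.20, 0.30],      # base open-source model
    p_frontier=[0.80, 0.70],  # unguarded frontier model
    p_tuned=[0.50, 0.50],     # open model after elicitation + fine-tuning
)
```

Averaging per-task ratios (rather than dividing summed gains by the summed gap) matches the formula above, but it is sensitive to tasks where $\Delta_{\mathrm{orig},i}$ is near zero, so such tasks are typically excluded or anchored separately.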

4. Accuracy–Privacy Trade-off Recovery in Distributed Consensus

In privacy-preserving average consensus, APGR tracks how much of the gap in mean-square error (MSE) between decentralized and centralized differentially-private protocols is closed by cryptographic and correlated-noise augmentations (Wang et al., 2023). The key outcome is that, by combining Paillier-based zero-sum shuffling with small centralized perturbations, one recovers the optimal $1/n^2$ scaling of MSE achieved by trusted centralized aggregation, in the sense that:

$$\lim_{g \to 0} \mathrm{MSE}_{\mathrm{DP\text{-}consensus}} = \mathrm{MSE}_{\mathrm{centralized}} + o(1)$$

This indicates practical APGR near 100%, so long as specified cryptographic and topological conditions are satisfied.
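In MSE terms, the gap-recovered computation mirrors the accuracy case with the direction reversed (lower is better). A sketch with invented numbers, assuming for illustration a decentralized baseline scaling like $1/n$ and a centralized optimum like $1/n^2$:

```python
def mse_apgr(mse_base: float, mse_improved: float, mse_central: float) -> float:
    """Percentage of the decentralized-vs-centralized MSE gap that is closed."""
    return (mse_base - mse_improved) / (mse_base - mse_central) * 100.0


# Illustrative numbers only: n = 100 agents, a fully decentralized DP
# protocol with MSE ~ 1/n, a trusted centralized aggregator at the optimal
# MSE ~ 1/n^2, and an augmented protocol landing just above the optimum.
n = 100
recovered = mse_apgr(
    mse_base=1.0 / n,
    mse_improved=1.05 / n**2,
    mse_central=1.0 / n**2,
)
```

With these numbers `recovered` is just under 100%, illustrating why near-complete gap recovery corresponds to matching the centralized MSE up to lower-order terms.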

5. Approximation Guarantee Improvements in Combinatorial Optimization

In stochastic analyses of rollout algorithms for the subset sum and knapsack problems, APGR is the proportion of the base greedy solution’s average gap to capacity (or, analogously, profit shortfall) recovered on average by a single iteration of rollout (Mastin et al., 2013):

$$\mathrm{APGR} = \frac{\mathbf{E}[G_b] - \mathbf{E}[G_r]}{\mathbf{E}[G_b]}$$

with $\mathbf{E}[G_b] = 1/3$ for subset sum. After one "consecutive" rollout,

$$\mathbf{E}[G_r] \leq 7/30 \implies \mathrm{APGR} \geq 0.30$$

Thus, rollout methods provably realize at least a 30% mean reduction in gap. The "exhaustive" rollout variant achieves asymptotic APGR approaching $1$ as $n$ grows, nearly entirely eliminating the greedy policy's deficit.
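The stated lower bound follows from the two expectations by exact rational arithmetic:

```python
from fractions import Fraction

# Mean gap of the base greedy policy for subset sum, and the proven upper
# bound on the mean gap after one "consecutive" rollout iteration.
E_Gb = Fraction(1, 3)
E_Gr_bound = Fraction(7, 30)

# Lower bound on APGR implied by the two quantities above:
# (1/3 - 7/30) / (1/3) = (3/30) / (10/30) = 3/10.
apgr_lower_bound = (E_Gb - E_Gr_bound) / E_Gb
```

Since $\mathbf{E}[G_r] \leq 7/30$ is an upper bound on the residual gap, the resulting $3/10$ is a floor on the recovered fraction, not its exact value.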

6. Comparative Table of APGR Metrics Across Domains

| Domain | Formula | Typical APGR Value |
|---|---|---|
| LLM multilinguality | $(\mathrm{Perf}_{\mathrm{method,lang}} - \mathrm{Perf}_{\mathrm{base,lang}})$ / original gap | 2.9–5.6% |
| Adversarial elicitation | $(P_{\mathrm{tuned}} - P_{\mathrm{open}})/(P_{\mathrm{frontier}} - P_{\mathrm{open}})$ | 24.7–61.5% |
| DP consensus | $(\mathrm{MSE}_{\mathrm{base}}-\mathrm{MSE}_{\mathrm{improved}})/(\mathrm{MSE}_{\mathrm{base}}-\mathrm{MSE}_{\mathrm{central}})$ | $\approx 100\%$ |
| Knapsack rollout | $(\mathbf{E}[G_b]-\mathbf{E}[G_r])/\mathbf{E}[G_b]$ | $\geq 30\%$ |

7. Interpretation, Limitations, and Best Practices

APGR offers a normalized, interpretable measure for comparing the effectiveness of diverse interventions relative to a persistent performance deficit, independent of the absolute scale of the endpoint metric. However, it depends critically on the definition of the baseline gap, the choice of performance metric, and statistical significance constraints. For trustworthy evaluation, APGR must be reported alongside absolute performance levels, calculated via robust, well-calibrated benchmarks, and disaggregated over relevant task or language axes. Its utility extends to risk assessment (e.g., model alignment red-teaming), ablation study normalization, and cross-method comparison, but assumptions embedded in baseline definitions and modeling choices must be made explicit to ensure valid comparison (Alhanai et al., 2024, Kaunismaa et al., 20 Jan 2026, Wang et al., 2023, Mastin et al., 2013).
