Average Performance Gap Recovered (APGR)
- APGR is a normalized metric defined as the ratio of performance gain from an intervention to the original gap between baseline and reference, enabling cross-domain comparisons.
- It is applied in diverse areas such as language model alignment, adversarial elicitation, privacy-preserving protocols, and combinatorial optimization to assess incremental improvements.
- Empirical studies report APGR values ranging from roughly 3% for some multilingual tuning interventions to nearly 100% in privacy-preserving consensus, reflecting how completely each domain's interventions close their respective gaps.
Average Performance Gap Recovered (APGR) quantifies the fraction of an initial performance shortfall—relative to a gold-standard or higher-performing baseline—that is eliminated by a specified intervention, algorithmic improvement, or elicitation process. It is defined precisely as the ratio of the gain attributable to the intervention over the original gap between baseline and reference performances, aggregated as appropriate across tasks, languages, or instances. APGR appears across multiple domains, including LLM alignment, adversarial capability elicitation, privacy-preserving consensus protocols, and approximate combinatorial optimization, serving as a normalized, interpretable indicator of incremental progress or risk.
1. Formal Definition and Core Formulas
Let $P_{\mathrm{ref}}$ denote the reference performance (e.g., top-line accuracy, unrestricted capability), $P_{\mathrm{base}}$ the base performance (e.g., unaugmented model, decentralized protocol, or base greedy policy), and $P_{\mathrm{int}}$ the post-intervention performance. The absolute gap is $\Delta = P_{\mathrm{ref}} - P_{\mathrm{base}}$, and the gain from intervention is $G = P_{\mathrm{int}} - P_{\mathrm{base}}$. The per-instance Performance Gap Recovered is then:

$$\mathrm{PGR} = \frac{G}{\Delta} = \frac{P_{\mathrm{int}} - P_{\mathrm{base}}}{P_{\mathrm{ref}} - P_{\mathrm{base}}}$$
In multi-task or multi-language evaluations, per-task or per-language PGR values are averaged:

$$\mathrm{APGR} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{PGR}_i = \frac{1}{N} \sum_{i=1}^{N} \frac{P_{\mathrm{int},i} - P_{\mathrm{base},i}}{P_{\mathrm{ref},i} - P_{\mathrm{base},i}}$$
Domain-specific quantifications may utilize accuracy, rubric-based scores, mean-square error, or other scalar metrics as appropriate (Alhanai et al., 2024, Kaunismaa et al., 20 Jan 2026, Wang et al., 2023, Mastin et al., 2013).
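The definitions above reduce to a few lines of code. A minimal sketch (function and variable names are illustrative, not drawn from the cited papers):

```python
def pgr(reference, base, post):
    """Performance Gap Recovered: fraction of the (reference - base)
    shortfall that an intervention moving base -> post eliminates."""
    gap = reference - base
    if gap == 0:
        raise ValueError("no gap to recover: reference equals base")
    return (post - base) / gap

def apgr(triples):
    """Average PGR over per-task (reference, base, post) triples."""
    return sum(pgr(r, b, p) for r, b, p in triples) / len(triples)

# Two tasks: one recovers half its gap, the other all of it.
print(round(apgr([(0.9, 0.5, 0.7), (0.8, 0.6, 0.8)]), 3))  # 0.75
```

Note that PGR can exceed 1 (overshooting the reference) or be negative (the intervention hurts), which is why APGR is best reported alongside absolute performance levels.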
2. Applications in LLM Performance Parity
In multilingual benchmarking for LLMs, APGR measures the percentage of the original English-versus-non-English accuracy gap that is overcome via fine-tuning, cross-lingual transfer, or data quality filtering. If $\mathrm{Performance}_{\mathrm{en}}$ is the English benchmark accuracy and $\mathrm{Performance}_{\mathrm{lang}}$ is that for a target language, then for an intervention "method":

$$\mathrm{APGR}_{\mathrm{method}} = \frac{1}{N_{\mathrm{lang}}} \sum_{\mathrm{lang}} \frac{\mathrm{Performance}_{\mathrm{lang}}^{\mathrm{method}} - \mathrm{Performance}_{\mathrm{lang}}^{\mathrm{base}}}{\mathrm{Performance}_{\mathrm{en}} - \mathrm{Performance}_{\mathrm{lang}}^{\mathrm{base}}} \times 100\%$$
Empirical results demonstrate the following mean APGRs over eight African languages and multiple interventions (Alhanai et al., 2024):
| Intervention Type | Mean APGR (%) |
|---|---|
| Mono-lingual fine-tuning | 5.6 |
| High- vs. low-quality data | 5.4 |
| Cross-lingual transfer | 2.9 |
| Cultural appropriateness | 3.0 |
Results also document non-uniform APGR across languages, with domain alignment and data quality exerting substantial influence, and gains in high-resource or related languages surpassing those in typologically distant, lower-resource cases.
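Applied to the multilingual setting, the per-language computation looks like the sketch below. The language codes and accuracies are hypothetical placeholders, not the benchmark numbers reported by Alhanai et al.:

```python
# Hypothetical benchmark accuracies (illustrative only).
english = 0.82
scores = {
    # lang: (base accuracy, accuracy after fine-tuning)
    "sw": (0.41, 0.45),
    "yo": (0.35, 0.36),
    "ha": (0.38, 0.41),
}

def apgr_percent(en, per_lang):
    """Mean percentage of the English-vs-target gap closed, averaged
    over languages."""
    recovered = [
        100.0 * (post - base) / (en - base)
        for base, post in per_lang.values()
    ]
    return sum(recovered) / len(recovered)

print(f"mean APGR: {apgr_percent(english, scores):.1f}%")
```

Averaging per-language PGR values (rather than pooling raw accuracies) is what makes the metric comparable across languages with very different baseline gaps.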
3. Measurement of Elicitation Risk in Model Safeguard Circumvention
In adversarial machine learning, APGR quantifies the fraction of the difference in capability between a base open-source model and an unguarded frontier model that is bridged by exploiting outputs from a safeguarded system (Kaunismaa et al., 20 Jan 2026). For task $i$ out of $N$:

$$\mathrm{APGR} = \frac{1}{N} \sum_{i=1}^{N} \frac{P_{\mathrm{elicited},i} - P_{\mathrm{base},i}}{P_{\mathrm{frontier},i} - P_{\mathrm{base},i}}$$
Empirical studies on hazardous chemical synthesis tasks yield anchored-comparison APGR near 39% for Llama 3.3 70B, with APGR values scaling up to 71% as the underlying frontier model’s capability increases. This reveals that the hazardous information suppression provided by output-level safeguards can be partially circumvented via benign-appearing elicitation and fine-tuning cycles.
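A worked instance of the anchored comparison, using hypothetical rubric scores chosen to land near the reported 39% figure (the score values are invented for illustration):

```python
def gap_recovered(base, frontier, elicited):
    """Fraction of the base-to-frontier capability gap bridged by
    elicitation against the safeguarded system."""
    return (elicited - base) / (frontier - base)

# Hypothetical rubric scores on one task: base open model 2.0,
# unguarded frontier model 6.0, elicited system 3.56.
print(f"{gap_recovered(2.0, 6.0, 3.56):.0%}")  # 39%
```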
4. Accuracy–Privacy Trade-off Recovery in Distributed Consensus
In privacy-preserving average consensus, APGR tracks how much of the gap in mean-square error (MSE) between decentralized and centralized differentially-private protocols is closed by cryptographic and correlated-noise augmentations (Wang et al., 2023). The key outcome is that, by combining Paillier-based zero-sum shuffling with small centralized-scale perturbations, the decentralized protocol attains the same order of MSE as trusted centralized aggregation:

$$\mathrm{MSE}_{\mathrm{dec}} = \Theta\!\left(\mathrm{MSE}_{\mathrm{cent}}\right)$$
This indicates practical APGR near 100%, so long as specified cryptographic and topological conditions are satisfied.
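The intuition behind the near-100% recovery can be seen in a toy simulation: zero-sum correlated offsets cancel exactly when the network averages, while independent per-node noise does not. This sketch illustrates only the cancellation mechanism, not the paper's protocol (no encryption or differential-privacy accounting is modeled):

```python
import random

random.seed(0)
n = 50
values = [random.uniform(0, 1) for _ in range(n)]
true_avg = sum(values) / n

# Naive decentralized DP: each node adds independent noise, which
# does not cancel in the network average.
indep = [v + random.gauss(0, 0.5) for v in values]

# Correlated zero-sum offsets (the role played by Paillier-based
# zero-sum shuffling): they cancel exactly in the average.
offsets = [random.gauss(0, 0.5) for _ in range(n)]
zero_sum = [o - sum(offsets) / n for o in offsets]  # force sum to zero
corr = [v + z for v, z in zip(values, zero_sum)]

err_indep = abs(sum(indep) / n - true_avg)
err_corr = abs(sum(corr) / n - true_avg)
print(f"independent-noise error: {err_indep:.4f}")
print(f"zero-sum-noise error:    {err_corr:.2e}")
```

The zero-sum variant leaves the average exact up to floating point, which is why only the small centralized-scale perturbation contributes to the final MSE.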
5. Approximation Guarantee Improvements in Combinatorial Optimization
In stochastic analyses of rollout algorithms for the subset sum and knapsack problems, APGR is the proportion of the base greedy solution’s average gap to capacity (or, analogously, profit shortfall) recovered on average by a single iteration of rollout (Mastin et al., 2013):

$$\mathrm{APGR} = \frac{\mathbb{E}[G_{\mathrm{greedy}}] - \mathbb{E}[G_{\mathrm{rollout}}]}{\mathbb{E}[G_{\mathrm{greedy}}]}$$

with $G$ the residual gap to capacity for subset sum. After one “consecutive” rollout,

$$\frac{\mathbb{E}[G_{\mathrm{rollout}}]}{\mathbb{E}[G_{\mathrm{greedy}}]} \le 0.7$$

Thus, rollout methods provably realize at least a 30% mean reduction in gap. The “exhaustive” rollout variant achieves asymptotic APGR approaching $1$ as $n$ grows, nearly entirely eliminating the greedy policy’s deficit.
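The gap-recovery effect of a single rollout pass can be checked empirically. The sketch below, assuming uniform item sizes and an illustrative implementation of a consecutive-rollout policy (not the paper's exact algorithm), estimates APGR for subset sum by simulation:

```python
import random

def greedy_gap(items, cap):
    """Blind greedy for subset sum: take each item in order if it
    still fits; return the leftover capacity (the gap)."""
    for w in items:
        if w <= cap:
            cap -= w
    return cap

def rollout_gap(items, cap):
    """One pass of a consecutive-rollout policy: at each position,
    keep or skip the item according to which choice yields the
    better greedy completion on the remaining items."""
    for i, w in enumerate(items):
        rest = items[i + 1:]
        skip = greedy_gap(rest, cap)
        if w <= cap and greedy_gap(rest, cap - w) <= skip:
            cap -= w
    return cap

random.seed(1)
trials = 2000
g_sum = r_sum = 0.0
for _ in range(trials):
    items = [random.random() for _ in range(10)]
    g_sum += greedy_gap(items, 1.0)
    r_sum += rollout_gap(items, 1.0)

print(f"mean greedy gap:  {g_sum / trials:.4f}")
print(f"mean rollout gap: {r_sum / trials:.4f}")
print(f"estimated APGR:   {1 - r_sum / g_sum:.1%}")
```

Because the rollout decision at each step takes the better of the two greedy completions, its per-instance gap is never worse than blind greedy's, so the simulated APGR is always nonnegative.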
6. Comparative Table of APGR Metrics Across Domains
| Domain | Reference/Formula | Typical APGR Value |
|---|---|---|
| LLM Multilinguality | gain over original English–target gap | $2.9$–$5.6\%$ |
| Adversarial Elicitation | base-to-unguarded-frontier gap bridged | $24.7$–$71\%$ |
| DP Consensus | decentralized-vs.-centralized MSE gap closed | $\approx 100\%$ |
| Knapsack Rollout | mean greedy gap reduced by one rollout iteration | at least $30\%$ |
7. Interpretation, Limitations, and Best Practices
APGR offers a normalized, interpretable measure for comparing the effectiveness of diverse interventions relative to a persistent performance deficit, independent of the absolute scale of the endpoint metric. However, it depends critically on the definition of the baseline gap, the choice of performance metric, and statistical significance constraints. For trustworthy evaluation, APGR must be reported alongside absolute performance levels, calculated via robust, well-calibrated benchmarks, and disaggregated over relevant task or language axes. Its utility extends to risk assessment (e.g., model alignment red-teaming), ablation study normalization, and cross-method comparison, but assumptions embedded in baseline definitions and modeling choices must be made explicit to ensure valid comparison (Alhanai et al., 2024, Kaunismaa et al., 20 Jan 2026, Wang et al., 2023, Mastin et al., 2013).