Generalized Average Ranking Scores (GARS)
- Generalized Average Ranking Scores (GARS) is a versatile, nonparametric ranking framework that aggregates diverse preference data through flexible mapping functions.
- It extends classical methods like Borda, Bradley–Terry, and PageRank by effectively handling incomplete top-lists and multicategory outcomes.
- GARS employs efficient approximation and debiased machine learning techniques to achieve robust statistical inference and scalable computation.
Generalized Average Ranking Scores (GARS) constitute a unified, nonparametric class of ranking metrics designed to aggregate preference data across a range of evaluation scenarios, including incomplete top-lists and pairwise comparisons with ties or multicategorical outcomes. GARS generalizes classical rank aggregation frameworks such as Borda, Bradley–Terry, and Rank Centrality by treating the item ranking problem as the task of mapping a family of itemwise pair preference probabilities to a low-dimensional score vector via a user-specified function. Efficient inferential and computational methods—such as constant-factor approximation algorithms or semiparametrically efficient estimators via debiased machine learning—are available for GARS in large-scale aggregation and evaluation contexts (Frauen et al., 29 Jan 2026, Mathieu et al., 2018).
1. Formal Definition and Motivation
Let $m$ denote the number of items to be ranked. For each context $X$, all ordered pairs $(i, j)$ with $i < j$ are considered; for each pair, a categorical label $Y_{ij} \in \{1, \dots, K\}$ is obtained, indicating the outcome ('$i$ beats $j$', '$j$ beats $i$', 'tie', etc.), with $K$ denoting the number of response categories. Write $p_{ij,k}(x) = \mathbb{P}(Y_{ij} = k \mid X = x)$ and let $p(x)$ collect all such conditional probabilities.
A Generalized Average Ranking Score (GARS) is specified as the expectation of a user-chosen mapping $g$ from the preference probabilities to a score vector, so that
$$\theta = \mathbb{E}\big[g(p(X))\big] \in \mathbb{R}^d,$$
where $d$ is typically $m$ or larger. The form of $g$ can instantiate classical ranking methods or incorporate more flexible, application-specific metrics (Frauen et al., 29 Jan 2026).
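To make the definition concrete, here is a minimal numerical sketch (the sizes, the probability model, and the choice of $g$ are all invented for illustration): a Borda-type GARS is computed by averaging an average-win-rate mapping over simulated contexts.

```python
import numpy as np

rng = np.random.default_rng(0)

m = 3             # number of items (hypothetical toy size)
n_contexts = 1000 # number of sampled contexts X

# Simulated pairwise win probabilities p_ij(x) per context.
# Shape (n_contexts, m, m); p[x, i, j] = P(i beats j | context x),
# built so that p[x, i, j] + p[x, j, i] = 1.
logits = rng.normal(size=(n_contexts, m, m))
p = 1.0 / (1.0 + np.exp(-(logits - logits.transpose(0, 2, 1)) / 2))

def g_borda(p_x):
    """Average-win-rate mapping: mean win probability of each item."""
    m = p_x.shape[0]
    off_diag = ~np.eye(m, dtype=bool)
    return np.array([p_x[i, off_diag[i]].mean() for i in range(m)])

# GARS: theta = E[g(p(X))], approximated by averaging over contexts.
theta = np.mean([g_borda(p[x]) for x in range(n_contexts)], axis=0)
print(theta)  # one score per item
```

Any other mapping $g$ (log-odds projections, stationary distributions, and so on) slots into the same expectation.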
2. Special Cases: Classical Rank Aggregation and Top-List Algorithms
GARS encompasses and extends several established ranking models:
- Borda (Average-Win-Rate): For binary comparisons ($K = 2$), the Borda score for item $i$ is
$$\theta_i = \frac{1}{m-1} \sum_{j \neq i} \mathbb{E}\big[p_{ij}(X)\big],$$
where $p_{ij}(x)$ denotes the probability that $i$ beats $j$ in context $x$. The overall score vector $\theta$ encodes the average likelihood that each item prevails over the others across prompts.
- Bradley–Terry Projections: For inference from log-odds, the mapping applies the logit transform $\mathrm{logit}(p_{ij}(x)) = \log\big(p_{ij}(x) / (1 - p_{ij}(x))\big)$ to the pairwise win probabilities. With appropriate incidence and projection matrices (an incidence matrix $B$ over the comparison graph and the associated least-squares projection $(B^\top B)^{+} B^\top$), GARS yields projected latent quality scores consistent with the BT model when its assumptions hold (Frauen et al., 29 Jan 2026).
- Rank Centrality/PageRank: GARS can represent stationary distributions over itemwise symmetrized transition matrices, so that the score vector $\pi$ solves $\pi^\top P = \pi^\top$, where the transition matrix $P$ is derived from preference probabilities. This accommodates PageRank-type aggregation (Frauen et al., 29 Jan 2026).
- Incomplete Top-List Aggregation: When only a subset of candidates is ranked in each input list, the pair for item $i$ is defined by $p_i$, the fraction of input lists in which $i$ appears, and $\bar{r}_i$, its average rank among the lists that contain it. The GARS vector for item $i$ is $(p_i, -\bar{r}_i)$, with items sorted lexicographically to produce a generalized Borda total order (Mathieu et al., 2018).
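As a small illustration of the Bradley–Terry projection case, the following sketch maps averaged win probabilities to latent scores via log-odds and the comparison-graph incidence matrix (the matrix `p_bar` and the mean-zero pinning are illustrative assumptions, not values from the papers):

```python
import numpy as np

# Hypothetical averaged pairwise win probabilities for m = 3 items:
# p_bar[i, j] = E[P(i beats j | X)], with p_bar[i, j] + p_bar[j, i] = 1.
m = 3
p_bar = np.array([
    [0.5, 0.7, 0.8],
    [0.3, 0.5, 0.6],
    [0.2, 0.4, 0.5],
])

# Incidence matrix B of the comparison graph: one row per pair (i, j)
# with i < j, mapping latent scores to score differences s_i - s_j.
pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
B = np.zeros((len(pairs), m))
for r, (i, j) in enumerate(pairs):
    B[r, i], B[r, j] = 1.0, -1.0

# Log-odds of each pair; under Bradley-Terry, logit(p_ij) = s_i - s_j.
log_odds = np.array([np.log(p_bar[i, j] / p_bar[j, i]) for i, j in pairs])

# Least-squares projection onto the BT score space, pinned to mean zero.
scores, *_ = np.linalg.lstsq(B, log_odds, rcond=None)
scores -= scores.mean()
print(np.argsort(-scores))  # ranking: best item first
```

When the BT model holds exactly, the log-odds lie in the column space of $B$ and the projection recovers the latent scores; otherwise it returns their best least-squares approximation.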
3. Algorithmic Construction and Approximation Guarantees
GARS supports algorithmic ranking procedures with rigorous worst-case approximation bounds. In the context of top-list aggregation (Mathieu et al., 2018):
- Two-Phase Generalized Borda Algorithm: Compute $p_i$ and $\bar{r}_i$ for all items; sort by non-increasing $p_i$, breaking ties by non-decreasing $\bar{r}_i$. This produces a ranking that is a constant-factor (6-approximation) solution to the top-list aggregation objective measured by expected Kendall-$\tau$ distance. The algorithm is efficient, running in near-linear time: a single pass over the lists plus an $O(m \log m)$ sort.
- PTAS Enhancement: By bucketing items according to $p_i$ and using the Mathieu–Schudy PTAS for full-ranking aggregation within each bucket, a $(1 + \varepsilon)$-approximation is achieved, for any fixed $\varepsilon > 0$, in time polynomial in the input size. This relies on concentration arguments and bucket-respecting optimality (Mathieu et al., 2018).
| Method | Approximation Ratio | Time Complexity |
|---|---|---|
| Two-phase Generalized Borda | 6 | linear scan plus $O(m \log m)$ sort |
| SCORE-THEN-PTAS | $1 + \varepsilon$ | polynomial for fixed $\varepsilon > 0$ |
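The two-phase rule can be sketched directly (a minimal reconstruction; the input format and the exact tie-breaking conventions are assumptions, since the paper's pseudocode is not reproduced here):

```python
def generalized_borda(top_lists, items):
    """Rank items from partial top-lists.

    Phase 1: compute, per item, the fraction of lists it appears in and
    its average (0-indexed) position among the lists containing it.
    Phase 2: sort by non-increasing appearance fraction, breaking ties
    by non-decreasing average position.
    """
    n = len(top_lists)
    appear = {i: 0 for i in items}
    pos_sum = {i: 0 for i in items}
    for lst in top_lists:
        for rank, item in enumerate(lst):
            appear[item] += 1
            pos_sum[item] += rank

    def key(i):
        frac = appear[i] / n
        avg_pos = pos_sum[i] / appear[i] if appear[i] else float("inf")
        return (-frac, avg_pos)

    return sorted(items, key=key)

# Toy input: four partial top-lists over three items.
lists = [["a", "b"], ["a", "c"], ["b", "a"], ["c"]]
print(generalized_borda(lists, ["a", "b", "c"]))
```

Both phases are a single pass plus a sort, matching the near-linear running time claimed above.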
4. Semiparametric Theory and Efficient Inference
For modern LLM evaluation and similar noisy, high-dimensional contexts, GARS is equipped with semiparametric efficiency theory (Frauen et al., 29 Jan 2026). The efficient influence function (EIF) of a GARS target $\theta = \mathbb{E}[g(p(X))]$ takes the form
$$\varphi(X, A, Y) = g(p(X)) - \theta + J_g(p(X)) \, \frac{A}{\pi(X)} \big(Y - p(X)\big),$$
with $Y$ encoded as a one-hot vector over the $K$ categories. Here $J_g$ is the Jacobian of $g$ w.r.t. $p$, $A$ is the pair labeling indicator, and $\pi(x)$ is the context-dependent labeling probability. Under regularity, the debiased estimator
$$\hat{\theta} = \frac{1}{n} \sum_{\ell=1}^{n} \Big[ g\big(\hat{p}(X_\ell)\big) + J_g\big(\hat{p}(X_\ell)\big) \, \frac{A_\ell}{\hat{\pi}(X_\ell)} \big(Y_\ell - \hat{p}(X_\ell)\big) \Big]$$
is asymptotically normal and achieves the semiparametric variance lower bound. Joint and coordinatewise confidence regions are constructed from the empirical covariance of EIF values.
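The one-step debiasing idea can be sketched in a single-pair toy model (the data-generating process, the known labeling probability, and the deliberately biased nuisance estimate are all illustrative assumptions; with $g$ the identity, its Jacobian is 1):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setting: theta = E[p(X)], the average win probability of one item
# over another, with g the identity mapping.
n = 5000
X = rng.uniform(-1, 1, size=n)
p_true = 1 / (1 + np.exp(-X))   # true P(i beats j | X)
pi = np.full(n, 0.5)            # labeling probability (known here)
A = rng.binomial(1, pi)         # was this pair labeled in context X?
Y = rng.binomial(1, p_true) * A # observed outcome when labeled

# A deliberately miscalibrated nuisance estimate, to show debiasing.
p_hat = np.clip(p_true + 0.05, 0, 1)

# Plug-in vs. one-step debiased estimator (Jacobian of identity g is 1).
theta_plugin = p_hat.mean()
theta_debiased = np.mean(p_hat + (A / pi) * (Y - p_hat))

print(theta_plugin, theta_debiased)  # true value is 0.5 by symmetry
```

The inverse-propensity-weighted residual term cancels the first-order bias of the nuisance estimate, which is the mechanism behind the efficiency claim above.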
5. Estimation Procedures in Practice
Practical estimation of GARS from preference datasets is implemented via:
- Cross-Fitting with Black-Box Learners: Data is divided into folds, with out-of-fold predictions for both $p(\cdot)$ (categorical classifier) and $\pi(\cdot)$ (binary classifier). Judges can be external machine learning models, incorporated as features ("judge-as-feature").
- Debiased One-Step Estimation: After nuisance prediction, scores are corrected via the EIF formula, yielding valid uncertainty quantification and robust estimates under both parametric and nonparametric conditions.
- Handling Ties and Rich Labels: Multicategory classifiers predict the $K$-category responses, and category weights in $g$ allow downstream methods to reweight outcomes (e.g., viewing ties as half-wins).
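The cross-fitting scheme can be sketched as follows; the binned win-rate learner stands in for an arbitrary black-box classifier, and the data-generating process is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: binary label Y for one pair with context X. The nuisance
# p(x) is fit out-of-fold, so no context's prediction sees its own label.
n, n_folds = 1000, 5
X = rng.uniform(0, 1, size=n)
Y = rng.binomial(1, 0.2 + 0.6 * X)

folds = np.array_split(rng.permutation(n), n_folds)
p_hat = np.empty(n)
bins = np.linspace(0, 1, 11)
which = np.digitize(X, bins) - 1  # bin index 0..9 per context
for k in range(n_folds):
    test_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
    # Stand-in learner: per-bin win rate fit on the training folds only
    # (any black-box classifier could be swapped in here).
    p_bin = np.array([
        Y[train_idx][which[train_idx] == b].mean()
        if np.any(which[train_idx] == b) else Y[train_idx].mean()
        for b in range(10)
    ])
    p_hat[test_idx] = p_bin[which[test_idx]]

print(p_hat.mean())  # out-of-fold estimate of E[p(X)]
```

The out-of-fold predictions `p_hat` are exactly what the one-step debiased estimator consumes as its nuisance inputs.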
6. Optimal Data Acquisition under Budget Constraints
GARS enables principled policies for preference data acquisition:
- A-Optimality: Minimizes total score variance. The optimal sampling policy for pair $(i, j)$ and context $x$ follows a square-root rule,
$$\eta^*(i, j \mid x) \propto \sqrt{\frac{\sigma^2_{ij}(x)}{c_{ij}(x)}},$$
where $\sigma^2_{ij}(x)$ is the (influence-weighted) variance of the label, $c_{ij}(x)$ is the acquisition cost, and the proportionality constant is selected by the budget constraint.
- D-Optimality: Minimizes the determinant of the covariance matrix. The policy is characterized by a fixed-point equation involving the full covariance structure $\Sigma$.
- Single-Pair-Per-Context Constraints: Admits “capped water-filling” policies with per-context dual variables. This supports allocation planning in LLM evaluation and other data-collection environments.
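A hedged sketch of the A-optimal allocation idea, using the classical Neyman square-root rule as a stand-in (the exact variance weighting in the paper's policy may differ); all variances, costs, and the budget below are invented:

```python
import numpy as np

# Per-pair label variances and acquisition costs (illustrative values).
var = np.array([0.25, 0.16, 0.09, 0.04])
cost = np.array([1.0, 1.0, 2.0, 0.5])
budget = 100.0

# Square-root rule: sample counts proportional to sqrt(variance / cost),
# normalized so that the total spend exhausts the budget.
weights = np.sqrt(var / cost)
scale = budget / np.sum(weights * cost)
n_labels = scale * weights

print(n_labels.round(2))  # labels to collect per pair
```

High-variance, low-cost pairs receive the most labels, which is the qualitative behavior the capped water-filling policies preserve under per-context constraints.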
7. Key Theoretical Results and Applications
- Asymptotic Properties: One-step debiased estimators are semiparametrically efficient and asymptotically normal for all GARS targets (Frauen et al., 29 Jan 2026).
- Algorithmic Guarantees: Two-phase generalized Borda and its PTAS refinement guarantee constant-factor and arbitrarily tight approximations, respectively, for top-list aggregation (Mathieu et al., 2018).
- Empirical Validation: Studies on synthetic and real datasets (including Chatbot Arena and MT-Bench) demonstrate the superiority of nonparametric GARS estimators over plug-in approaches, the validity of reported confidence regions, and actionable ranking differences in practical leaderboards.
- Judge Augmentation: Incorporating high-quality external judges as features yields a substantial reduction in estimation error.
- Robustness to Misspecification: Nonparametric GARS projection methods outperform strictly parametric (e.g., BT) scoring under model violation.
GARS provides a flexible, theoretically principled, and computationally efficient framework for preference-based ranking and aggregation across diverse technical domains, with rigorous guarantees both algorithmic and statistical (Frauen et al., 29 Jan 2026, Mathieu et al., 2018).