
Generalized Average Ranking Scores (GARS)

Updated 30 January 2026
  • Generalized Average Ranking Scores (GARS) is a versatile, nonparametric ranking framework that aggregates diverse preference data through flexible mapping functions.
  • It extends classical methods like Borda, Bradley–Terry, and PageRank by effectively handling incomplete top-lists and multicategory outcomes.
  • GARS employs efficient approximation and debiased machine learning techniques to achieve robust statistical inference and scalable computation.

Generalized Average Ranking Scores (GARS) constitute a unified, nonparametric class of ranking metrics designed to aggregate preference data across a range of evaluation scenarios, including incomplete top-lists and pairwise comparisons with ties or multicategorical outcomes. GARS generalizes classical rank aggregation frameworks such as Borda, Bradley–Terry, and Rank Centrality by treating the item ranking problem as the task of mapping a family of itemwise pair preference probabilities to a low-dimensional score vector via a user-specified function. Efficient inferential and computational methods—such as constant-factor approximation algorithms or semiparametrically efficient estimators via debiased machine learning—are available for GARS in large-scale aggregation and evaluation contexts (Frauen et al., 29 Jan 2026, Mathieu et al., 2018).

1. Formal Definition and Motivation

Let $K$ denote the number of items to be ranked. For each context $X \in \mathcal{X}$, all ordered pairs $(j,k)$ with $j, k \in \{1,\ldots,K\}$ are considered; for each pair, a categorical label $Y_{jkc}$ is obtained, indicating the outcome ('$j$ beats $k$', '$k$ beats $j$', 'tie', etc.), with $C$ denoting the number of response categories. Write $\mu_{jkc}(x) = P(Y_{jkc} = 1 \mid X = x)$ and let $\mu(x) \in [0,1]^{K \times K \times C}$ collect all such conditional probabilities.

A Generalized Average Ranking Score (GARS) is specified as the expectation of a user-chosen mapping $F : [0,1]^{K \times K \times C} \to \mathbb{R}^d$, so that

$$\theta = E[F(\mu(X))] \in \mathbb{R}^d$$

where $d$ is typically $K$ or larger. The form of $F$ can instantiate classical ranking methods or incorporate more flexible, application-specific metrics (Frauen et al., 29 Jan 2026).
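As a concrete illustration, the definition can be sketched as a Monte Carlo average of $F(\mu(X))$ over contexts. Everything here is a toy assumption, not from the papers: the logistic preference model `mu`, the latent quality vector, the context distribution, and the Borda-style mapping `F_borda`.

```python
import numpy as np

rng = np.random.default_rng(0)
K, C = 3, 2  # hypothetical: 3 items, binary pairwise outcomes

def mu(x):
    """Toy conditional preference tensor mu(x) in [0,1]^{K x K x C}.

    Entry [j, k, 0] is P(j beats k | X = x); entry [j, k, 1] is P(k beats j | X = x).
    The logistic form with context-scaled qualities is purely illustrative.
    """
    quality = np.array([0.0, 0.5, 1.0]) * x          # latent item qualities
    p = 1.0 / (1.0 + np.exp(quality[None, :] - quality[:, None]))  # sigmoid(q_j - q_k)
    return np.stack([p, 1.0 - p], axis=-1)

def F_borda(m):
    """Borda-style mapping F: average win probability of each item (C = 2)."""
    K = m.shape[0]
    scores = np.zeros(K)
    for j in range(K):
        others = [k for k in range(K) if k != j]
        scores[j] = sum(m[j, k, 0] + m[k, j, 1] for k in others) / (2 * (K - 1))
    return scores

# Monte Carlo estimate of theta = E[F(mu(X))] over contexts X ~ Uniform(0, 2)
X = rng.uniform(0, 2, size=5000)
theta = np.mean([F_borda(mu(x)) for x in X], axis=0)
print(theta)  # higher-quality items receive higher average ranking scores
```

Any choice of $F$ in place of `F_borda` yields another member of the GARS family; the expectation over contexts is what distinguishes GARS from a single-context score.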

2. Special Cases: Classical Rank Aggregation and Top-List Algorithms

GARS encompasses and extends several established ranking models:

  • Borda (Average-Win-Rate): For binary comparisons ($C = 2$), the Borda score for item $j$ is

$$F_j(\mu(x)) = \frac{1}{2(K-1)} \sum_{k \neq j} \left[\mu_{jk,1}(x) + \mu_{kj,2}(x)\right]$$

The overall score vector $\theta_j = E[F_j(\mu(X))]$ encodes the average probability that $j$ prevails over the other items across contexts.

  • Bradley–Terry Projections: For inference on the log-odds scale,

$$\ell_{jk}(x) = \frac{\operatorname{logit}(\mu_{jk,1}(x)) + \operatorname{logit}(\mu_{kj,2}(x))}{2}$$

With appropriate incidence and projection matrices ($B$, $L_0$, $H$), GARS yields projected latent quality scores consistent with the BT model when its assumptions hold (Frauen et al., 29 Jan 2026).

  • Rank Centrality/PageRank: GARS can represent stationary distributions over itemwise symmetrized transition matrices, so that the score vector $s(x)$ solves $s(x) \propto T(x)^\top s(x)$, where $T_{ij}(x)$ is derived from the preference probabilities. This accommodates PageRank-type aggregation (Frauen et al., 29 Jan 2026).
  • Incomplete Top-List Aggregation: When only a subset of candidates is ranked in each input list, the pair $(s(v), r(v))$ for $v \in V$ is defined by

$$s(v) = \Pr_{T \sim p}[\operatorname{rank}_T(v) < \infty], \qquad r(v) = \frac{1}{s(v)} \sum_{T\,:\,\operatorname{rank}_T(v) < \infty} p(T)\,\operatorname{rank}_T(v)$$

The GARS vector for $v$ is $(-s(v), r(v))$, with items sorted lexicographically to produce a generalized Borda total order (Mathieu et al., 2018).
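The $(s(v), r(v))$ statistics and the lexicographic sort above can be sketched directly from a collection of top-lists. This is a minimal illustration assuming uniform weights $p(T)$ when none are supplied; the helper name `generalized_borda` is ours.

```python
from collections import defaultdict

def generalized_borda(top_lists, weights=None):
    """Order items by the pair (-s(v), r(v)), lexicographically.

    top_lists: iterable of top-lists, each an ordered sequence of item ids.
    weights:   optional probabilities p(T); uniform over the lists if omitted.
    s(v) is the probability that v appears in a sampled list; r(v) is its
    average rank conditional on appearing. Illustrative sketch only.
    """
    n = len(top_lists)
    if weights is None:
        weights = [1.0 / n] * n
    universe = {v for T in top_lists for v in T}
    s = defaultdict(float)        # s(v) = Pr[rank_T(v) < infinity]
    r_num = defaultdict(float)    # accumulates p(T) * rank_T(v)
    for T, p in zip(top_lists, weights):
        for rank, v in enumerate(T, start=1):
            s[v] += p
            r_num[v] += p * rank
    r = {v: r_num[v] / s[v] for v in s}
    # non-increasing s(v), ties broken by non-decreasing average rank r(v)
    return sorted(universe, key=lambda v: (-s[v], r[v]))

order = generalized_borda([["a", "b"], ["a", "c", "b"], ["b", "a"]])
print(order)  # "a" and "b" appear in every list; "a" has the lower average rank
```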

3. Algorithmic Construction and Approximation Guarantees

GARS supports algorithmic ranking procedures with rigorous worst-case approximation bounds. In the context of top-list aggregation (Mathieu et al., 2018):

  • Two-Phase Generalized Borda Algorithm: Compute $s(v)$ and $r(v)$ for all items; sort by non-increasing $s(v)$, breaking ties by non-decreasing $r(v)$. This produces a ranking that is a constant-factor ($6$-approximation) solution to the top-list aggregation objective measured by expected Kendall-$\tau$ distance, and the algorithm runs in $O(\sum_T |T| + n\log n)$ time.
  • PTAS Enhancement: By bucketing items according to $s(v)$ and applying the Mathieu–Schudy PTAS for full-ranking aggregation within each bucket, a $(1+\epsilon)$-approximation is achieved in total time $O(n^3\log n\,(1+\log(1/\epsilon)) + n\exp(\exp(O(1/\epsilon))))$. This relies on concentration arguments and bucket-respecting optimality (Mathieu et al., 2018).
| Method | Approximation ratio | Time complexity |
| --- | --- | --- |
| Two-phase Generalized Borda | $6$ | $O(\sum_T \lvert T\rvert + n\log n)$ |
| SCORE-THEN-PTAS | $1+\epsilon$ | $O(n^3\log n \cdot \operatorname{polylog}(1/\epsilon) + n \cdot \exp(\exp(O(1/\epsilon))))$ |

4. Semiparametric Theory and Efficient Inference

For modern LLM evaluation and similar noisy, high-dimensional contexts, GARS is equipped with a semiparametric efficiency theory (Frauen et al., 29 Jan 2026). The efficient influence function (EIF) is

$$\phi(O, \eta, \theta) = F(\mu(X)) - \theta + \sum_{j \neq k} \frac{S_{jk}}{\pi_{jk}(X)}\, J_{jk}(\mu(X))\,[Y_{jk} - \mu_{jk}(X)]$$

Here $J_{jk}(\mu)$ is the Jacobian of $F$ with respect to $\mu_{jk}$, $S_{jk}$ is the pair-labeling indicator, and $\pi_{jk}(x) = P(S_{jk} = 1 \mid X = x)$ is the context-dependent labeling probability. Under regularity conditions, the debiased estimator $\hat{\theta}_{\mathrm{EIF}}$ is asymptotically normal and achieves the semiparametric variance lower bound. Joint and coordinatewise confidence regions are constructed from the empirical covariance of the EIF values.
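The final averaging and inference step can be sketched as follows, assuming the nuisance estimates $\hat\mu$, $\hat\pi$ and the Jacobian terms have already been computed (e.g. by cross-fitted learners) and collapsed into per-observation correction vectors; `debiased_estimate` is a hypothetical helper, not an API from the paper.

```python
import numpy as np

def debiased_estimate(F_vals, corrections, z=1.96):
    """One-step debiased GARS estimate with EIF-based confidence intervals.

    F_vals:      (n, d) plug-in values F(mu_hat(X_i)).
    corrections: (n, d) precomputed augmentation terms
                 sum_{j != k} [S_jk / pi_hat_jk(X_i)] J_jk(mu_hat(X_i)) (Y_jk - mu_hat_jk(X_i)).
    Returns the point estimate and coordinatewise normal confidence intervals.
    """
    n = F_vals.shape[0]
    phi = F_vals + corrections            # equals phi(O, eta, theta) + theta
    theta_hat = phi.mean(axis=0)          # solves (1/n) sum_i phi_i = 0 for theta
    se = phi.std(axis=0, ddof=1) / np.sqrt(n)   # EIF-based standard errors
    ci = np.stack([theta_hat - z * se, theta_hat + z * se], axis=-1)
    return theta_hat, ci
```

Because the correction terms are mean-zero at the truth, the same `phi` array also supplies the empirical covariance needed for joint confidence regions.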

5. Estimation Procedures in Practice

Practical estimation of GARS from preference datasets is implemented via:

  • Cross-Fitting with Black-Box Learners: The data are divided into $V \geq 2$ folds, with out-of-fold predictions for both $\mu_{jk}(x)$ (a categorical classifier) and $\pi_{jk}(x)$ (a binary classifier). External machine-learning judges can be incorporated as additional features (“judge-as-feature”).
  • Debiased One-Step Estimation: After nuisance prediction, scores are corrected via the EIF formula, yielding valid uncertainty quantification and robust estimates under both parametric and nonparametric conditions.
  • Handling Ties and Rich Labels: Multicategory classifiers predict the response categories, and category weights in $F$ let downstream methods use rich labels flexibly (e.g., treating ties as half-wins).
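The cross-fitting step above can be sketched generically: each observation's nuisance prediction comes from a model fit on the other folds only. The `fit`/`predict` interface is a hypothetical stand-in for any black-box learner of $\mu_{jk}(x)$ or $\pi_{jk}(x)$.

```python
import numpy as np

def cross_fit_predictions(X, Y, fit, predict, n_folds=2, seed=0):
    """Out-of-fold nuisance predictions for cross-fitting (V >= 2 folds).

    fit(X_train, Y_train) returns a model; predict(model, X_test) returns
    predictions for X_test. Each sample's prediction never uses its own fold.
    """
    n = len(X)
    rng = np.random.default_rng(seed)
    fold = rng.integers(0, n_folds, size=n)   # random fold assignment
    preds = np.empty(n)
    for v in range(n_folds):
        test = fold == v
        model = fit(X[~test], Y[~test])       # fit on out-of-fold data only
        preds[test] = predict(model, X[test])
    return preds

# usage with a trivial "learner" that predicts the training-fold mean
X = np.arange(100.0)
Y = (X > 50).astype(float)
p = cross_fit_predictions(X, Y,
                          fit=lambda X, Y: Y.mean(),
                          predict=lambda m, X: np.full(len(X), m))
```

In practice the trivial learner would be replaced by a flexible classifier; the point of the fold structure is that the debiased estimator retains valid inference even when such learners converge slowly.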

6. Optimal Data Acquisition under Budget Constraints

GARS enables principled policies for preference data acquisition:

  • A-Optimality: Minimizes the total score variance. The optimal sampling policy for pair $(j,k)$ and context $x$ is

$$\pi_{jk}^*(x) = \operatorname{clip}_{[\alpha,1]}\sqrt{\frac{\operatorname{tr}\!\left(J_{jk}(\mu(x))\, V_{jk}(\mu(x))\, J_{jk}(\mu(x))^\top\right)}{\lambda_A\, c_{jk}}}$$

where $V_{jk}(\mu(x))$ is the covariance of the label, $c_{jk}$ is the labeling cost, and $\lambda_A$ is chosen to satisfy the budget constraint.

  • D-Optimality: Minimizes the determinant of the score covariance. The policy $\pi^D_{jk}(x)$ is characterized by a fixed-point equation involving the full covariance structure $\Sigma(\pi^D)$.
  • Single-Pair-Per-Context Constraints: Admits “capped water-filling” policies with per-context dual variables, supporting allocation planning in LLM evaluation and other data-collection environments.
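The clipped square-root form of the A-optimal policy can be sketched numerically. As a simplifying assumption, the trace term $\operatorname{tr}(J V J^\top)$ is collapsed into a precomputed scalar `gain` per pair, and $\lambda_A$ is found by bisection on the budget constraint; the helper name `a_optimal_policy` is ours.

```python
import numpy as np

def a_optimal_policy(gain, cost, budget, alpha=0.01):
    """Clipped square-root A-optimal sampling probabilities.

    gain[i] stands in for tr(J V J^T) of pair i; cost[i] is its labeling
    cost c_jk. lambda_A is calibrated by bisection so that the expected
    spend sum_i pi_i * cost_i matches the budget. Illustrative sketch.
    """
    def policy(lam):
        return np.clip(np.sqrt(gain / (lam * cost)), alpha, 1.0)

    lo, hi = 1e-8, 1e8               # bracket lambda_A; spend is decreasing in it
    for _ in range(100):
        lam = np.sqrt(lo * hi)       # bisect on a log scale
        if np.sum(policy(lam) * cost) > budget:
            lo = lam                 # spending too much: increase lambda_A
        else:
            hi = lam
    return policy(np.sqrt(lo * hi))

pi = a_optimal_policy(np.array([1.0, 4.0, 9.0]), cost=np.ones(3), budget=1.5)
print(pi)  # allocation grows with the square root of the per-pair variance gain
```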

7. Key Theoretical Results and Applications

  • Asymptotic Properties: One-step debiased estimators are semiparametrically efficient and asymptotically normal for all GARS targets (Frauen et al., 29 Jan 2026).
  • Algorithmic Guarantees: The two-phase generalized Borda algorithm and its PTAS upgrade guarantee constant-factor and arbitrarily tight approximations, respectively, for top-list aggregation (Mathieu et al., 2018).
  • Empirical Validation: Studies on synthetic and real datasets (including Chatbot Arena and MT-Bench) demonstrate the superiority of nonparametric GARS estimators over plug-in approaches, yield valid confidence reports, and surface actionable ranking differences in practical leaderboards.
  • Judge Augmentation: Incorporating high-quality external judges as features yields substantial reductions in estimation error.
  • Robustness to Misspecification: Nonparametric GARS projection methods outperform strictly parametric (e.g., BT) scoring under model violation.

GARS provides a flexible, theoretically principled, and computationally efficient framework for preference-based ranking and aggregation across diverse technical domains, with rigorous algorithmic and statistical guarantees (Frauen et al., 29 Jan 2026, Mathieu et al., 2018).
