
Semiparametric analysis for paired comparisons with covariates

Published 31 Mar 2026 in stat.ME | (2603.29333v1)

Abstract: Statistical inference in parametric models (e.g., the Bradley–Terry model and its variants) for paired-comparison data has been explored in the high-dimensional regime, in which the number of items involved in paired comparisons diverges. However, parametric models are highly susceptible to model misspecification. To relax the assumption of known distributions and provide flexibility, we propose a semiparametric framework for modeling the merits of items and covariate effects (e.g., home-field advantage) by introducing latent random variables with unspecified distributions. As the number of parameters increases with the number of items, semiparametric inference is highly nontrivial. To address this issue, we employ a kernel-based least squares approach to estimate all unknown parameters. When each pair of items has a fixed number of comparisons and the number of items tends to infinity, we prove the consistency of all resulting estimators and derive their asymptotic normal distributions. To the best of our knowledge, this is the first study to conduct a semiparametric analysis of paired comparisons with an increasing dimension. We conduct simulations to evaluate the finite-sample performance of the proposed method and illustrate its practical utility by analyzing an NBA dataset.

Summary

  • The paper introduces a semiparametric framework that relaxes distributional assumptions on latent errors in paired comparison models.
  • It leverages kernel-based least squares and special regressors to achieve robust parameter estimation and identifiability in high-dimensional settings.
  • Empirical results, including an NBA application, demonstrate reduced bias and improved inference compared to traditional parametric methods.

Semiparametric Inference for High-Dimensional Paired Comparisons with Covariates

Introduction

The paper "Semiparametric analysis for paired comparisons with covariates" (2603.29333) presents a framework for semiparametric estimation in paired comparison models with covariates. It addresses the limitations of parametric specifications such as the Bradley–Terry and Thurstone models, particularly under model misspecification and in high-dimensional regimes where the number of items grows. The methodology introduces latent random variables with unspecified distributions and uses kernel-based least squares for estimation, which makes paired comparison analysis robust to the distribution of the latent noise.

Model Formulation and Identification

The central model extends classical paired comparison schemes by positing that, for items $i$ and $j$ and their $t$th comparison, the response $a_{ijt}$, indicating that $i$ beats $j$, is conditionally Bernoulli with:

$$P(a_{ijt} = 1 \mid \bm{X}_{ijt}, \gamma, \theta_i, \theta_j) = F(\theta_i - \theta_j + \bm{X}_{ijt}^\top \gamma),$$

where $F$ is a strictly increasing CDF satisfying $F(x) + F(-x) = 1$, the $\theta$'s are item-dependent merit parameters, and $\gamma$ encodes the effect of covariates $\bm{X}_{ijt}$ (e.g., home-field advantage, rest days). Notably, the distribution of the latent error is left unspecified, unlike classical approaches assuming logistic or normal errors.
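To make the data-generating mechanism concrete, the following is a minimal simulation sketch rather than code from the paper. It assumes standard Laplace latent noise as one example of a symmetric, non-logistic error distribution; the merit values, the home-field coefficient, and the function names `laplace_cdf` and `simulate_comparison` are all hypothetical.

```python
import math
import random


def laplace_cdf(x: float) -> float:
    """CDF of the standard Laplace distribution; it satisfies F(x) + F(-x) = 1."""
    return 1.0 - 0.5 * math.exp(-x) if x >= 0 else 0.5 * math.exp(x)


def simulate_comparison(theta_i, theta_j, x, gamma, rng):
    """Return 1 if item i beats item j in one simulated comparison, else 0."""
    signal = theta_i - theta_j + sum(xk * gk for xk, gk in zip(x, gamma))
    # The difference of two unit exponentials is standard Laplace noise.
    eps = rng.expovariate(1.0) - rng.expovariate(1.0)
    # a = 1{eps < signal}, so P(a = 1) = F(signal) with F the Laplace CDF.
    return 1 if eps < signal else 0


rng = random.Random(0)
theta = [0.8, 0.2]   # hypothetical merits of two items
gamma = [0.3]        # hypothetical home-field effect
x_home = [1.0]       # item 0 plays at home

n_rep = 50_000
wins = sum(
    simulate_comparison(theta[0], theta[1], x_home, gamma, rng)
    for _ in range(n_rep)
)
empirical = wins / n_rep
theoretical = laplace_cdf(theta[0] - theta[1] + gamma[0] * x_home[0])
print(f"empirical win rate {empirical:.3f}, F(signal) {theoretical:.3f}")
```

The empirical win rate should track the value of the noise CDF at the linear index, which is the only way the error distribution enters the model. Swapping in a different symmetric noise generator changes the link function while leaving the merits and covariate effect untouched.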

The identifiability of model parameters is nontrivial due to scale and shift invariance, and dependency on the support of the covariate distribution. The authors leverage the "special regressor" method to anchor identification: a continuous covariate with full support and positive coefficient is required, and normalization constraints (e.g., jj2) are imposed. Conditions for identifiability further require independence and symmetry of the latent noise and technical assumptions on the support and eigenstructure of the projected covariate design. The explicit use of a special regressor enables semiparametric identification even as the number of items diverges.
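The shift and scale invariance can be written out explicitly. The derivation below is the standard argument, not a reproduction of the paper's proof; the constant $c$ and the rescaled link $\tilde{F}$ are notation introduced here for illustration.

```latex
% Shift invariance: adding the same constant c to every merit
% leaves each pairwise contrast, and hence the model, unchanged.
(\theta_i + c) - (\theta_j + c) = \theta_i - \theta_j .

% Scale invariance: for any c > 0 define \tilde{F}(x) = F(x / c). Then
\tilde{F}\bigl(c\,\theta_i - c\,\theta_j + \bm{X}_{ijt}^\top (c\,\gamma)\bigr)
  = F\bigl(\theta_i - \theta_j + \bm{X}_{ijt}^\top \gamma\bigr),

% and \tilde{F} is again a strictly increasing CDF with
\tilde{F}(x) + \tilde{F}(-x) = F(x / c) + F(-x / c) = 1 .
```

Because $F$ is unspecified, the triples $(\theta, \gamma, F)$ and $(c\,\theta, c\,\gamma, \tilde{F})$ generate the same comparison probabilities, so the data alone cannot fix the location of the merits or the overall scale. This is why a location constraint on the merits and a scale normalization tied to the special regressor are needed.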

Estimation Procedure

Estimation proceeds by constructing a kernel-based plug-in estimator for the conditional density jj3 of the special regressor and forming a pseudo-outcome correction jj4 using importance weighting analogously to semiparametric methods in network models. The vector of parameters is then estimated by a penalized least squares fit to the projected outcome:

jj5

jj6

where jj7 encodes the pairwise contrasts, jj8 is the associated Laplacian, and jj9 projects onto the orthogonal complement of the space spanned by the merit contrasts. Kernel estimation of tt0 is performed using Nadaraya–Watson estimators with adaptive bandwidth selected via cross-validation on pseudo-moments.
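The role of the density weighting can be illustrated with a toy version of the special-regressor transformation. This is a sketch under strong simplifying assumptions and not the paper's estimator: the special regressor `v` is uniform with a known density (so no kernel estimation is needed), its coefficient is normalized to one, the index is a single hypothetical linear term `beta0 + beta1 * z` with no merit contrasts, and the latent noise is a bounded bimodal mixture chosen to be clearly non-logistic.

```python
import random


def pseudo_outcome(a: int, v: float, density_v: float) -> float:
    """Special-regressor transform: (a - 1{v > 0}) / f(v)."""
    return (a - (1 if v > 0 else 0)) / density_v


def ols_line(xs, ys):
    """Least-squares intercept and slope for a single regressor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


rng = random.Random(1)
L = 3.0                      # v ~ Uniform(-L, L); support covers the index plus noise
density_v = 1.0 / (2.0 * L)  # known density of v (estimated by kernels in practice)
beta0, beta1 = 0.5, 1.0      # hypothetical true coefficients

zs, ys = [], []
for _ in range(200_000):
    z = rng.uniform(-1.0, 1.0)
    v = rng.uniform(-L, L)
    # Symmetric, mean-zero, bimodal noise bounded in [-1, 1].
    eps = rng.choice((-0.7, 0.7)) + rng.uniform(-0.3, 0.3)
    a = 1 if v + beta0 + beta1 * z - eps > 0 else 0
    zs.append(z)
    ys.append(pseudo_outcome(a, v, density_v))

b0_hat, b1_hat = ols_line(zs, ys)
print(f"intercept {b0_hat:.3f}, slope {b1_hat:.3f}")
```

In this setup the conditional mean of the pseudo-outcome given `z` equals the linear index, so ordinary least squares on the transformed binary outcomes recovers the coefficients without ever specifying the noise distribution. The requirement that the support of `v` cover the range of the index plus noise mirrors the full-support condition on the special regressor.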

The computational complexity is tt1, which, while polynomial, may hinder scalability in much larger item sets; however, the method is practically feasible for sample sizes typical in sports analytics or journal ranking contexts.

Theoretical Guarantees

The authors establish strong theoretical properties for the proposed estimators as the number of items tends to infinity, with a fixed number of comparisons per pair:

  • Consistency: Under mild regularity and smoothness conditions on the covariates, density, and kernel, the estimators of both the merit parameters $\theta$ and the covariate effects $\gamma$ are uniformly consistent.
  • Asymptotic Normality: Central limit theorems are provided for linear contrasts of the merit parameters and regression coefficients, with explicit (albeit asymptotic) variance formulas capturing bias/variance trade-offs from kernel smoothing.
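Asymptotic normality is what licenses the usual Wald-type intervals for a contrast or a regression coefficient. The snippet below is a generic sketch of that step; the point estimate and standard error are made-up numbers, not values from the paper, and `wald_interval` is a hypothetical helper name.

```python
from statistics import NormalDist


def wald_interval(estimate: float, std_error: float, level: float = 0.95):
    """Two-sided normal-approximation interval: estimate +/- z * std_error."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return estimate - z * std_error, estimate + z * std_error


# Hypothetical estimate and standard error for a contrast theta_i - theta_j.
lo, hi = wald_interval(estimate=0.40, std_error=0.15)
print(f"95% interval: ({lo:.3f}, {hi:.3f})")
```

In practice the standard error would come from a plug-in version of the asymptotic variance formula, which is the part that depends on the kernel smoothing.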

These results generalize existing parametric analyses by removing the requirement for known latent error distributions; identifiability and inference remain valid under quite general noise generative models. The proofs adapt and extend semiparametric identification arguments from the social network literature.

Numerical Results and Empirical Application

Simulations examine finite-sample properties under various error distributions (normal, logistic, mixtures). The kernel-based semiparametric estimators yield negligible bias and appropriate frequentist coverage even under model misspecification, substantially outperforming maximum likelihood estimates from misspecified parametric models (e.g., the logistic Bradley–Terry model) in terms of bias. For the covariate-effect parameters, both methods perform comparably.

In application to 2018–2019 NBA data, the approach uncovers quantifiable home-field advantage (tt5, tt6) and a non-significant effect for back-to-back matches, aligning with domain knowledge. Estimated merit parameters recover plausible rankings, suggesting the method's usefulness in real-world paired competition evaluation.

Implications and Future Directions

The proposed framework addresses an important gap in paired comparison analysis, enabling inference when standard distributional assumptions are unwarranted or suspect. In practice, this strengthens the case for robust merit estimation in settings such as sports analytics, crowd labeling, recommendation systems, and journal rankings.

Theoretically, the work suggests avenues for generalizing paired comparison inference under relaxed assumptions, exploring more complex sparsity patterns (non-complete graphs), and extending to multiway or ordinal outcomes where nonparametric identification is challenging. On the algorithmic side, explicit closed-form covariance formulas facilitate uncertainty quantification and downstream statistical testing.

Potential developments include:

  • Extending the semiparametric framework to sparse graphs (Erdős–Rényi, degree-heterogeneous models), where invertibility of the information matrix is problematic.
  • Model averaging or adaptive special regressor selection, though dependent estimator correlation complicates theoretical analysis.
  • Addressing computational challenges for larger numbers of items via randomized algorithms or distributed estimation schemes.

Conclusion

This work provides a rigorous, flexible semiparametric methodology for high-dimensional paired comparison analysis with covariates, avoiding the risks of model misspecification and equipping practitioners with robust tools for merit and covariate effect estimation. The technical framework, theoretical guarantees, and empirical evidence collectively demonstrate its value over parametric alternatives, especially in real-world scenarios where distributional assumptions are unclear or easily violated.
