Semiparametric analysis for paired comparisons with covariates

Published 31 Mar 2026 in stat.ME | (2603.29333v1)

Abstract: Statistical inference in parametric models (e.g., the Bradley–Terry model and its variants) for paired-comparison data has been explored in the high-dimensional regime, in which the number of items involved in paired comparisons diverges. However, parametric models are highly susceptible to model misspecification. To relax the assumption of known distributions and provide flexibility, we propose a semiparametric framework for modeling the merits of items and covariate effects (e.g., home-field advantage) by introducing latent random variables with unspecified distributions. As the number of parameters increases with the number of items, semiparametric inference is highly nontrivial. To address this issue, we employ a kernel-based least squares approach to estimate all unknown parameters. When each pair of items has a fixed number of comparisons and the number of items tends to infinity, we prove the consistency of all resulting estimators and derive their asymptotic normal distributions. To the best of our knowledge, this is the first study to conduct a semiparametric analysis of paired comparisons with an increasing dimension. We conduct simulations to evaluate the finite-sample performance of the proposed method and illustrate its practical utility by analyzing an NBA dataset.



Summary

The paper introduces a semiparametric framework that relaxes distributional assumptions on latent errors in paired comparison models.
It leverages kernel-based least squares and special regressors to achieve robust parameter estimation and identifiability in high-dimensional settings.
Empirical results, including an NBA application, demonstrate reduced bias and improved inference compared to traditional parametric methods.

Semiparametric Inference for High-Dimensional Paired Comparisons with Covariates

Introduction

The paper "Semiparametric analysis for paired comparisons with covariates" (2603.29333) presents a semiparametric estimation framework for paired comparison models with covariates. It addresses the limitations of parametric specifications such as the Bradley–Terry and Thurstone models, particularly under model misspecification and in high-dimensional regimes where the number of items grows. The method introduces latent random variables with unspecified distributions and estimates the parameters by kernel-based least squares, which makes paired comparison analysis robust to the distributional assumptions placed on the latent noise.

Model Formulation and Identification

The central model extends classical paired comparison schemes. For items $i$ and $j$ and their $t$-th comparison, the response $a_{ijt}$, which indicates that $i$ beats $j$, is conditionally Bernoulli with:

$P(a_{ijt} = 1 \mid \bm{X}_{ijt}, \gamma, \theta_i, \theta_j) = F(\theta_i - \theta_j + \bm{X}_{ijt}^\top \gamma),$

where $F$ is a strictly increasing CDF satisfying $F(x) + F(-x) = 1$, the $\theta_i$ are item-specific merit parameters, and $\gamma$ encodes the effect of the covariates $\bm{X}_{ijt}$ (e.g., home-field advantage, rest days). Notably, the distribution of the latent error is left unspecified, unlike classical approaches that assume logistic or normal errors.
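The latent-variable reading of this model can be illustrated with a short simulation. This is a minimal sketch, not the authors' code: the function names, the $\pm 1$ home-field coding, the parameter values, and the particular mixture noise are all assumptions chosen for illustration. If the latent error has CDF $F$, drawing $a_{ijt} = 1\{\theta_i - \theta_j + \bm{X}_{ijt}^\top \gamma > \varepsilon_{ijt}\}$ reproduces the success probability above, and any symmetric error distribution satisfies $F(x) + F(-x) = 1$.

```python
import numpy as np

def symmetric_mixture_noise(rng, size):
    """Symmetric two-component normal mixture (neither logistic nor normal).
    Illustrative choice only; the model leaves the error law unspecified."""
    signs = rng.choice([-1.0, 1.0], size=size)
    return 1.5 * signs + 0.5 * rng.normal(size=size)

def simulate_outcomes(theta_i, theta_j, gamma, X, noise, rng):
    """Draw a = 1{theta_i - theta_j + X' gamma > eps} for each row of X.

    If eps has CDF F, then P(a = 1 | X) = F(theta_i - theta_j + X' gamma).
    """
    index = theta_i - theta_j + X @ gamma
    eps = noise(rng, X.shape[0])
    return (index > eps).astype(int)

rng = np.random.default_rng(0)
gamma = np.array([0.4])  # hypothetical home-field effect
# Assumed coding: +1 if item i plays at home, -1 if item j does.
X = rng.choice([-1.0, 1.0], size=(50_000, 1))
a = simulate_outcomes(1.0, 0.2, gamma, X, symmetric_mixture_noise, rng)

print("win rate of i at home:", a[X[:, 0] == 1].mean())
print("win rate of i away:   ", a[X[:, 0] == -1].mean())
```

Swapping in another symmetric sampler for `symmetric_mixture_noise` changes $F$ without touching the rest of the data-generating process, which is the sense in which the error distribution is a nuisance component.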

Identifying the model parameters is nontrivial because of scale and shift invariance and because identification depends on the support of the covariate distribution. The authors use the "special regressor" method to anchor identification: it requires a continuous covariate with full support and a positive coefficient, and normalization constraints are imposed. Identifiability further requires independence and symmetry of the latent noise, along with technical assumptions on the support and eigenstructure of the projected covariate design. The explicit use of a special regressor enables semiparametric identification even as the number of items diverges.
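The idea behind a special regressor can be checked numerically in a one-dimensional toy problem. The sketch below uses the standard special-regressor transform from the econometrics literature, $(y - 1\{v > 0\}) / f(v)$, which may differ in detail from the paper's construction. It assumes a unit coefficient on the special regressor $v$, a known uniform density for $v$ with support wide enough to cover the index plus noise, and a hypothetical index value `mu` standing in for $\theta_i - \theta_j$ plus the remaining covariate effects. Under these assumptions the mean of the transformed outcome recovers `mu` without knowledge of the error distribution.

```python
import numpy as np

def special_regressor_pseudo_outcome(y, v, density_v):
    """Special-regressor transform: (y - 1{v > 0}) / f(v)."""
    return (y - (v > 0)) / density_v

rng = np.random.default_rng(0)
n, half_width, mu = 200_000, 10.0, 0.7   # mu is a hypothetical index value

# Special regressor: continuous, independent of the noise, wide support.
v = rng.uniform(-half_width, half_width, size=n)

# Symmetric, mean-zero noise that is neither logistic nor normal.
eps = 1.5 * rng.choice([-1.0, 1.0], size=n) + 0.5 * rng.normal(size=n)

# Binary outcome driven by the latent index v + mu + eps.
y = (v + mu + eps > 0).astype(float)

# Known density of v here; in practice it would have to be estimated.
density_v = np.full(n, 1.0 / (2.0 * half_width))
pseudo = special_regressor_pseudo_outcome(y, v, density_v)

estimate = pseudo.mean()
print("recovered index:", estimate, "target:", mu)
```

The transform works because integrating $1\{v > -\mu - \varepsilon\} - 1\{v > 0\}$ over $v$ gives $\mu + \varepsilon$, and the mean-zero noise averages out. The division by the density is what makes a good density estimate important in the estimation step.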

Estimation Procedure

Estimation proceeds by constructing a kernel-based plug-in estimator for the conditional density of the special regressor and forming a pseudo-outcome correction using importance weighting, analogously to semiparametric methods in network models. The vector of parameters is then estimated by a penalized least squares fit to the projected outcome. The fit is built from a matrix that encodes the pairwise contrasts, its associated Laplacian, and a projection onto the orthogonal complement of the space spanned by the merit contrasts. Kernel estimation of the conditional density is performed using Nadaraya–Watson estimators, with an adaptive bandwidth selected via cross-validation on pseudo-moments.
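The role of the pairwise contrasts and their Laplacian can be sketched with an ordinary least squares fit. This is a simplified stand-in rather than the paper's estimator: it omits the penalty and the projection step, fits merits and covariate coefficients jointly, and takes the pseudo-outcomes as given. The function names, the sum-to-zero normalization, and the simulated inputs are assumptions made for illustration.

```python
import numpy as np
from itertools import combinations

def contrast_matrix(n_items, pairs):
    """Row for pair (i, j) has +1 in column i and -1 in column j."""
    B = np.zeros((len(pairs), n_items))
    for row, (i, j) in enumerate(pairs):
        B[row, i], B[row, j] = 1.0, -1.0
    return B

def fit_merits_and_covariates(pseudo, B, X):
    """Least squares of pseudo-outcomes on [B, X].

    The merits are only defined up to a common shift, so the design is
    rank deficient; lstsq returns the minimum-norm solution, whose merit
    part sums to zero when the comparison graph is connected.
    """
    design = np.hstack([B, X])
    coef, *_ = np.linalg.lstsq(design, pseudo, rcond=None)
    n_items = B.shape[1]
    return coef[:n_items], coef[n_items:]

rng = np.random.default_rng(1)
n_items = 6
pairs = list(combinations(range(n_items), 2))   # complete comparison graph
B = contrast_matrix(n_items, pairs)
laplacian = B.T @ B                             # graph Laplacian of the design

theta_true = rng.normal(size=n_items)
theta_true -= theta_true.mean()                 # assumed normalization
gamma_true = np.array([0.5, -0.3])              # hypothetical coefficients
X = rng.normal(size=(len(pairs), 2))

# Stand-in pseudo-outcomes: linear index plus noise.
pseudo = B @ theta_true + X @ gamma_true + 0.1 * rng.normal(size=len(pairs))
theta_hat, gamma_hat = fit_merits_and_covariates(pseudo, B, X)
print(np.round(theta_hat - theta_true, 2), np.round(gamma_hat, 2))
```

For a complete graph the Laplacian has $n - 1$ on the diagonal and $-1$ elsewhere, so its only null direction is the constant vector, which is the shift invariance that the normalization removes.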

The computational cost is polynomial, which may hinder scalability to much larger item sets; however, the method is practically feasible for the problem sizes typical in sports analytics or journal ranking contexts.
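The kernel step of the estimation procedure can be sketched as a Nadaraya–Watson-type conditional density estimate. This is a minimal illustration under stated assumptions: a single conditioning covariate, Gaussian kernels, and fixed hand-picked bandwidths `h_v` and `h_x` rather than the cross-validated adaptive bandwidth the summary describes. All names are hypothetical.

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def conditional_density(v0, x0, V, X, h_v, h_x):
    """Kernel estimate of f(v0 | x0).

    This is a Nadaraya-Watson regression of K_h(v0 - V_k) on X_k,
    evaluated at x0: a weighted average of kernel bumps in v, with
    weights given by how close each X_k is to x0.
    """
    weights = gaussian_kernel((x0 - X) / h_x)
    bumps = gaussian_kernel((v0 - V) / h_v) / h_v
    return float(np.sum(weights * bumps) / np.sum(weights))

rng = np.random.default_rng(0)
X = rng.normal(size=20_000)
V = X + rng.normal(size=20_000)        # so V | X = x is N(x, 1)

est = conditional_density(0.5, 0.5, V, X, h_v=0.2, h_x=0.2)
print("estimate:", est, "true value:", 1.0 / np.sqrt(2.0 * np.pi))
```

Evaluating the estimate at every observation costs one pass over the sample per evaluation point in this naive form, which gives a sense of why the kernel step contributes to the overall computational burden.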

Theoretical Guarantees

The authors establish strong theoretical properties for the proposed estimators as the number of items tends to infinity:

Consistency: Under mild regularity and smoothness conditions on the covariates, density, and kernel, the estimators of both the merit parameters $\theta_i$ and the covariate coefficients $\gamma$ are uniformly consistent.
Asymptotic Normality: Central limit theorems are provided for linear contrasts of the merit parameters and regression coefficients, with explicit (albeit asymptotic) variance formulas capturing bias/variance trade-offs from kernel smoothing.
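Asymptotic normality of a linear contrast is what licenses the usual Wald-type interval. The sketch below shows that generic construction only. It does not reproduce the paper's variance formulas: the covariance matrix `cov` is a hypothetical input that would have to come from those formulas or another valid estimate, and the numbers are made up for illustration.

```python
import numpy as np
from statistics import NormalDist

def contrast_interval(estimate, cov, c, level=0.95):
    """Wald interval for the linear contrast c' beta.

    estimate: vector of point estimates
    cov:      estimated covariance matrix of `estimate` (assumed given)
    c:        contrast vector, e.g. (1, -1) for a merit difference
    """
    estimate = np.asarray(estimate, dtype=float)
    cov = np.asarray(cov, dtype=float)
    c = np.asarray(c, dtype=float)
    point = float(c @ estimate)
    se = float(np.sqrt(c @ cov @ c))
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return point - z * se, point + z * se

# Hypothetical merits of two items with independent estimation errors.
lo, hi = contrast_interval([1.0, 0.2], 0.01 * np.eye(2), [1.0, -1.0])
print(f"95% interval for the merit difference: ({lo:.3f}, {hi:.3f})")
```

If the interval for a merit difference excludes zero, the two items are distinguishable at the chosen level; the same function applies to a single covariate coefficient by using a contrast vector with a single nonzero entry.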

These results generalize existing parametric analyses by removing the requirement for known latent error distributions; identifiability and inference remain valid under quite general noise generative models. The proofs adapt and extend semiparametric identification arguments from the social network literature.

Numerical Results and Empirical Application

Simulations examine finite-sample properties under various error distributions (normal, logistic, mixtures). The kernel-based semiparametric estimators show negligible bias and appropriate frequentist coverage even when the error distribution is not logistic, and they are substantially less biased than maximum likelihood estimates from misspecified parametric models (e.g., the logistic Bradley–Terry model). For the covariate coefficients, both methods perform comparably.

In an application to 2018–2019 NBA data, the approach finds a quantifiable home-field advantage and a non-significant effect for back-to-back matches, in line with domain knowledge. The estimated merit parameters recover plausible rankings, suggesting the method is useful for evaluating real-world paired competitions.

Implications and Future Directions

The proposed framework addresses an important gap in paired comparison analysis, enabling inference when standard distributional assumptions are unwarranted or suspect. In practice, this strengthens the case for robust merit estimation in settings such as sports analytics, crowd labeling, recommendation systems, and journal rankings.

Theoretically, the work suggests avenues for generalizing paired comparison inference under relaxed assumptions, exploring more complex sparsity patterns (non-complete graphs), and extending to multiway or ordinal outcomes where nonparametric identification is challenging. On the algorithmic side, explicit closed-form covariance formulas facilitate uncertainty quantification and downstream statistical testing.

Potential developments include:

Extending the semiparametric framework to sparse graphs (Erdős–Rényi, degree-heterogeneous models), where invertibility of the information matrix is problematic.
Model averaging or adaptive special regressor selection, though correlation among the resulting estimators complicates theoretical analysis.
Addressing computational challenges for larger numbers of items via randomized algorithms or distributed estimation schemes.

Conclusion

This work provides a rigorous, flexible semiparametric methodology for high-dimensional paired comparison analysis with covariates, avoiding the risks of model misspecification and equipping practitioners with robust tools for merit and covariate effect estimation. The technical framework, theoretical guarantees, and empirical evidence collectively demonstrate its value over parametric alternatives, especially in real-world scenarios where distributional assumptions are unclear or easily violated.
