Opponent-Adjusted Evaluation of NFL Pass Blocking and Pass Rushing Performance

Published 2 Apr 2026 in stat.AP | (2604.01491v1)

Abstract: Evaluating offensive linemen and pass rushers at the player level is difficult because observable outcomes are sparse, opponent-dependent, and strongly shaped by surrounding context. Using 2021 regular-season Hudl tracking data, we construct a blocker-rusher interaction dataset and estimate two ridge-regularized Bradley-Terry paired-comparison models: a binary win/loss model aligned with the 2.5-second pass block win-rate definition and a four-class severity model over loss, win, hit, and sack, with both models incorporating a double-team indicator. The final dataset contains 153,138 interactions across 33,283 pass plays in 266 games. On an ordered 80/20 holdout split (test n = 30,628), both models improve on global baselines and modestly outperform stronger matchup baselines under log-loss evaluation, corresponding to relative log-loss reductions of about 0.24% to 1.21%. Game-level bootstrap resampling indicates that these gains are most stable for the win model and for the severity model relative to the global baseline, while the severity-versus-matchup comparison remains directionally positive but less certain. External comparison to 2021 AP All-Pro selections provides additional face validation on the learned rankings, with the severity model showing the strongest alignment to expert recognition. Overall, ridge-regularized Bradley-Terry models provide an interpretable opponent-adjusted framework for evaluating NFL pass protection and pass rush at the interaction level.


Authors (5)

Summary

The paper introduces ridge-regularized Bradley–Terry models to evaluate NFL pass blocking and pass rushing performance by adjusting for opponent quality.
It uses player-tracking data with binary (win/loss) and four-class severity outcome definitions to derive interpretable rankings for both offensive and defensive linemen.
Results demonstrate improved predictive performance and alignment with expert evaluations, emphasizing the benefit of modeling outcome severity over raw metrics.

Motivation and Context

NFL pass protection and pass rushing have a substantial impact on team success, but robust, interpretable metrics for evaluating individual linemen have proven elusive because outcomes are sparse and heavily context-dependent. Traditional counting metrics (e.g., sacks, hits, pressures) do not separate individual contribution from outcome randomness and surrounding context. Modern player-tracking data enable fine-grained analysis of line engagements, but existing metrics such as pass block win rate (PBWR) and STRAIN remain limited in how well they adjust for opponent and context.

This study introduces a paired-comparison modeling paradigm to address these gaps, leveraging a regularized Bradley–Terry (BT) framework to derive joint, opponent-adjusted ratings for blockers and rushers from 2021 NFL tracking data, capturing both overall win rates and outcome severity.

Dataset and Outcome Construction

The analysis is based on 153,138 blocker–rusher interactions from 33,283 pass plays in 266 games during the 2021 regular season, as recorded by Hudl’s player tracking at 10 Hz. Interactions are defined by labeled engagements, with double-team status included as a key covariate. Outcomes are labeled from the rusher's perspective under two schemes:

Binary Win/Loss: A win is assigned if the rusher is closer to the quarterback than the blocker within 2.5 seconds of the snap.
Multinomial Severity: Four classes in increasing “defensive” severity—loss, win (pressure, no contact), hit, sack—with each engagement assigned the most severe label realized in the interaction.
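The two labeling schemes above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the frame layout `(t, rusher_to_qb, blocker_to_qb)`, the function names, and the reading of "within 2.5 seconds" as "at any tracked frame up to 2.5 seconds after the snap" are all assumptions made for the example.

```python
# Hypothetical labeling helpers; data layout and names are assumptions.
SEVERITY_ORDER = ["loss", "win", "hit", "sack"]  # increasing defensive severity


def binary_win(frames, window_s=2.5):
    """Return True if the rusher is closer to the QB than the blocker
    at any frame no later than `window_s` seconds after the snap.

    frames: iterable of (t_seconds_since_snap, rusher_to_qb, blocker_to_qb).
    """
    return any(t <= window_s and r < b for t, r, b in frames)


def severity_label(events):
    """Return the most severe label observed in one engagement.

    events: iterable of labels drawn from SEVERITY_ORDER.
    An engagement with no recorded event defaults to "loss".
    """
    rank = {name: i for i, name in enumerate(SEVERITY_ORDER)}
    return max(events, key=rank.__getitem__, default="loss")


# The rusher gets inside the blocker at t = 2.0 s, so this counts as a win.
frames = [(0.5, 4.0, 3.0), (2.0, 2.5, 3.0), (3.0, 1.0, 3.0)]
print(binary_win(frames))
print(severity_label(["win", "hit"]))
```

Taking the maximum over an explicit ordering keeps the "most severe label realized" rule in one place, so changing the class ordering only requires editing `SEVERITY_ORDER`.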

Empirical marginal frequencies are heavily skewed: loss (0.73), win (0.25), hit (0.011), sack (0.0063). The four severity classes are mapped to unit-interval weights anchored to EPA values, $w(\text{loss})=0$, $w(\text{win})=0.10$, $w(\text{hit})=0.20$, $w(\text{sack})=1.00$, which are used to compute expected severity.
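As a worked example of the weighting, expected severity is the probability-weighted sum of the class weights. The sketch below applies the stated weights to the stated marginal frequencies; the function name is hypothetical, and in practice the probabilities would come from a fitted model rather than the marginals.

```python
# Class weights as stated in the text.
WEIGHTS = {"loss": 0.0, "win": 0.10, "hit": 0.20, "sack": 1.00}


def expected_severity(probs):
    """Sum of w(class) * P(class) over the four severity classes."""
    return sum(WEIGHTS[k] * p for k, p in probs.items())


# Marginal frequencies from the text, used here as stand-in probabilities.
marginal = {"loss": 0.73, "win": 0.25, "hit": 0.011, "sack": 0.0063}

# 0.10 * 0.25 + 0.20 * 0.011 + 1.00 * 0.0063 = 0.0335
print(round(expected_severity(marginal), 4))
```

Note that wins contribute most of this baseline value (0.025 of 0.0335) because they are far more frequent, even though a sack carries ten times the weight of a win.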

Modeling Framework

Two BT models are estimated:

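Binary Ridge-regularized BT: Models the log-odds of a rusher win as a function of rusher and blocker latent effects with an explicit double-team term, $\text{logit}~P(\text{win}) = \alpha + r - b + \delta D$, where $\alpha$ is an intercept, $r$ and $b$ are the rusher and blocker effects, and $D$ is the double-team indicator with coefficient $\delta$.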
Multinomial Ridge-regularized BT: Extends the model to four outcome classes, fitting class logits for each engagement. Regularization is critical to control variance from uneven exposure and incomplete matchup graphs.
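To make the binary model concrete, here is a minimal sketch of fitting $\text{logit}~P(\text{win}) = \alpha + r - b + \delta D$ by gradient descent. Several choices are assumptions for illustration rather than details from the paper: the optimizer, the learning rate and penalty strength, the row layout, and the decision to penalize only the player effects while leaving $\alpha$ and $\delta$ unpenalized.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_ridge_bt(rows, lam=1.0, lr=0.1, epochs=500):
    """Fit a ridge-penalized binary Bradley-Terry model.

    rows: list of (rusher_id, blocker_id, double_team, win), the last two 0/1.
    Minimizes mean log-loss + lam / (2n) * (sum r^2 + sum b^2).
    Returns (alpha, delta, r, b) with r and b as dicts keyed by player id.
    """
    r = {row[0]: 0.0 for row in rows}
    b = {row[1]: 0.0 for row in rows}
    alpha, delta = 0.0, 0.0
    n = len(rows)
    for _ in range(epochs):
        g_alpha = g_delta = 0.0
        # Start each gradient with the ridge term (player effects only).
        g_r = {k: lam * v / n for k, v in r.items()}
        g_b = {k: lam * v / n for k, v in b.items()}
        for ru, bl, d, y in rows:
            err = sigmoid(alpha + r[ru] - b[bl] + delta * d) - y
            g_alpha += err / n
            g_delta += err * d / n
            g_r[ru] += err / n
            g_b[bl] -= err / n  # blocker effect enters with a minus sign
        alpha -= lr * g_alpha
        delta -= lr * g_delta
        for k in r:
            r[k] -= lr * g_r[k]
        for k in b:
            b[k] -= lr * g_b[k]
    return alpha, delta, r, b


# Toy data: rusher A beats blocker X more often than B does,
# and A wins less often when double-teamed.
rows = (
    [("A", "X", 0, 1)] * 8 + [("A", "X", 0, 0)] * 2
    + [("B", "X", 0, 1)] * 2 + [("B", "X", 0, 0)] * 8
    + [("A", "X", 1, 0)] * 5 + [("A", "X", 1, 1)] * 1
)
alpha, delta, r, b = fit_ridge_bt(rows)
print(r["A"] > r["B"], delta < 0)
```

The ridge penalty shrinks effects for players with few interactions toward zero, which is what keeps the estimates stable when exposure is uneven and many blocker-rusher pairs never meet.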

Training/test splits are deterministic and ordered (80/20, with 30,628 held-out interactions), and all models are benchmarked against two families of baselines:

Global Baselines: Role-agnostic empirical frequencies.
Matchup Baselines: Player-specific, smoothed historical encounter frequencies with empirical Bayes shrinkage; no explicit latent skill estimation.
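The two baselines can be sketched for the binary task as follows. The summary does not give the exact shrinkage estimator, so the pseudo-count form below, which pulls each pair's win rate toward the global rate, is an assumed stand-in; `prior_strength` and the function names are hypothetical.

```python
from collections import defaultdict


def global_rate(rows):
    """Global baseline: overall rusher win frequency in the training rows.

    rows: list of (rusher_id, blocker_id, win) with win in {0, 1}.
    """
    return sum(y for _, _, y in rows) / len(rows)


def matchup_baseline(rows, prior_strength=10.0):
    """Matchup baseline: per-pair win rate shrunk toward the global rate.

    Returns a predictor f(rusher_id, blocker_id) -> probability.
    Unseen pairs fall back to the global rate.
    """
    p0 = global_rate(rows)
    wins = defaultdict(float)
    count = defaultdict(int)
    for ru, bl, y in rows:
        wins[(ru, bl)] += y
        count[(ru, bl)] += 1
    table = {
        pair: (wins[pair] + prior_strength * p0) / (count[pair] + prior_strength)
        for pair in count
    }
    return lambda ru, bl: table.get((ru, bl), p0)


train = [("A", "X", 1)] * 3 + [("B", "Y", 0)]
predict = matchup_baseline(train)
print(predict("A", "X"), predict("Z", "Q"))
```

Unlike the BT model, this baseline shares no information across pairs: a rusher's record against one blocker says nothing about how he will fare against another.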

Model Validation and Empirical Performance

Holdout log-loss quantifies predictive accuracy:

The binary win/loss BT model reduces log-loss by 0.24% over the matchup baseline and 1.21% over the global baseline.
The multinomial severity BT model shows similar improvements, with gains of 0.24% (vs. matchup) and 1.20% (vs. global).
Game-level bootstrap resampling indicates that the improvements are most stable for the win/loss model and for the severity model relative to the global baseline; the severity-versus-matchup gain is directionally positive but less certain.
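The evaluation above combines two ingredients: a relative log-loss reduction and a bootstrap that resamples whole games. The sketch below shows one way to compute both for the binary task. The record layout, the number of replicates, and the function names are assumptions for illustration.

```python
import math
import random
from collections import defaultdict


def log_loss(ys, ps, eps=1e-12):
    """Mean binary log-loss with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(ys, ps):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(ys)


def relative_reduction(ll_model, ll_base):
    """Percent reduction in log-loss relative to a baseline."""
    return 100.0 * (ll_base - ll_model) / ll_base


def game_bootstrap(records, n_boot=200, seed=0):
    """Resample games with replacement; return sorted percent reductions.

    records: list of (game_id, y, p_model, p_base).
    """
    by_game = defaultdict(list)
    for g, y, pm, pb in records:
        by_game[g].append((y, pm, pb))
    games = list(by_game)
    rng = random.Random(seed)
    out = []
    for _ in range(n_boot):
        ys, pms, pbs = [], [], []
        for g in (rng.choice(games) for _ in games):
            for y, pm, pb in by_game[g]:
                ys.append(y)
                pms.append(pm)
                pbs.append(pb)
        out.append(relative_reduction(log_loss(ys, pms), log_loss(ys, pbs)))
    return sorted(out)


records = [
    ("g1", 1, 0.8, 0.5), ("g1", 0, 0.2, 0.5),
    ("g2", 1, 0.8, 0.5), ("g2", 0, 0.2, 0.5),
]
reductions = game_bootstrap(records)
print(reductions[0] > 0)
```

Resampling at the game level rather than the interaction level keeps correlated interactions from the same game together, which avoids overstating how stable the gains are.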

Player Ratings: Distribution and Leaderboards

The BT framework yields an interpretable score for each player. Within each role, the score distributions show how far the models separate blockers and rushers from one another, for both the win/loss and severity tasks.

Figure 1: Distribution of BT scores by model and role, indicating the separation of player ability as estimated by the models for each role.

Top-10 leaderboards (minimum 200 interactions) highlight elite performers by BT score.

Figure 2: Top 10 players by BT score in each model-role panel with bootstrap-derived 50% uncertainty intervals, supporting the empirical separation in role-specific talent.

Scores track interaction-level efficiency, conditional on assigned matchups, and are not designed as all-snap value measures. Consistency between win/loss and severity leaderboards is observed among elite rushers; for blockers, severity emphasizes high-impact prevention.

Alignment with External Awards

Externally, the severity-based BT rankings show strong concordance with 2021 AP All-Pro selections. Discriminative power is quantified via AUC and enrichment@$K$ metrics, which generally outperform empirical-rate-based benchmarks:

Severity model leads AUC and enrichment@$K$ in most role/accolade slices.
Severity-based rankings show the clearest alignment with expert recognition, particularly among blockers.
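The two validation metrics can be sketched as follows. AUC is computed here as the probability that a randomly chosen All-Pro outranks a randomly chosen non-All-Pro. The summary does not define enrichment@$K$, so the version below (All-Pro rate among the top $K$ players divided by the overall All-Pro rate) is an assumed definition, and the toy scores are made up for illustration.

```python
def auc(scores, labels):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly,
    counting ties as half."""
    pos = [s for s, l in zip(scores, labels) if l]
    neg = [s for s, l in zip(scores, labels) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def enrichment_at_k(scores, labels, k):
    """All-Pro rate in the top-k by score, relative to the base rate.

    A value above 1 means the ranking concentrates All-Pros near the top.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_rate = sum(labels[i] for i in order[:k]) / k
    base_rate = sum(labels) / len(labels)
    return top_rate / base_rate


# Toy example: 1 marks a (hypothetical) All-Pro selection.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
print(auc(scores, labels))                  # 0.75
print(enrichment_at_k(scores, labels, 1))   # 2.0
```

Because All-Pro selections are rare, enrichment at small $K$ is often more informative than AUC about whether the very top of a leaderboard matches expert recognition.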

Longitudinal Analysis: Week-by-Week Score Dynamics

Weekly path bootstraps provide uncertainty quantification and temporal evolution of top performers’ scores.

Figure 3: Weekly cumulative BT score paths with bootstrap uncertainty ribbons for the top three players in each model-role panel, illustrating convergence and variance over the season.

Estimates tighten as cumulative exposure grows, and the first weeks of the season show pronounced uncertainty.

Theoretical Implications

Opponent-adjusted paired-comparison methods, via regularized BT models, produce interpretable, role-conditional player value estimates from large-scale tracking data.
Multinomial severity modeling, anchored in outcome value (EPA), delivers improvements in aligning model-based assessment with expert human judgment.
The explicit inclusion of context features (e.g., double-team indicators) and bootstrap-based uncertainty quantification improves model transparency and robustness.

Limitations and Prospects for Extension

Key limitations include the potential crudeness of the distance-based win proxy, incomplete context capture (e.g., only coarse double-team encoding, little quarterback/play-level conditioning), and lack of explicit modeling for teammate effects or assignment structure. Future work could extend the BT framework with hierarchical shrinkage, richer outcome definitions utilizing advanced pocket geometry and QB decision tracking, and multi-season or multi-level pooling to enhance rating stability and interpretability.

Conclusion

Ridge-regularized BT models, fit to high-resolution tracking data, yield opponent-adjusted, interpretable ratings for NFL offensive and defensive linemen and modestly outperform competitive baselines on both the binary and severity tasks. The severity model shows the strongest external validity, aligning most closely with independent All-Pro selections. These findings suggest that integrating outcome granularity and opponent structure into line-play metrics meaningfully advances both the theoretical and practical measurement of pass blocking and pass rushing performance.
