Group Conditional Unbiased Logistic Regression
- GCULR is a method that enforces group fairness by minimizing misclassification disparities across sensitive groups using logistic regression.
- It employs a dynamic Bayesian framework with extended Kalman filtering (EKF) and Monte Carlo sampling, alongside a convex MRF approach with group-norm regularization.
- Empirical evaluations on synthetic and real-world data demonstrate GCULR’s ability to maintain fairness with minimal loss in predictive accuracy.
Group Conditional Unbiased Logistic Regression (GCULR) encompasses a class of constrained logistic regression methodologies designed to yield predictions that are unbiased with respect to predefined group variables. GCULR is principally motivated by mitigating disparate misclassification, particularly in settings where sensitive covariates (such as race or gender) induce differences in error rates. Implementations of GCULR include a fully Bayesian sequential approach for online classification tracking with fairness guarantees (Short et al., 2020) as well as a convex group-norm-regularized estimator for high-dimensional graphical model recovery (Wu et al., 2018).
1. Model Specification and Core Principle
GCULR explicitly enforces group-level fairness constraints in the learning process of logistic regression models, ensuring that prediction error rates (e.g., false positive/negative rates) are approximately equal across sensitive groups.
In the dynamic Bayesian framework (Short et al., 2020), the logistic regression model is formulated as

$$P(y_t = 1 \mid x_t) = \sigma(\theta_t^\top x_t), \qquad \sigma(z) = \frac{1}{1 + e^{-z}},$$

where each parameter vector evolves according to a state-space random walk:

$$\theta_t = \theta_{t-1} + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, Q).$$
Additionally, for each group $g$, the conditional feature distribution is modeled as $x_t \mid g \sim \mathcal{N}(\mu_g, \Sigma_g)$ with a Normal-inverse-Wishart conjugate prior on $(\mu_g, \Sigma_g)$, permitting online updating.
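As an illustration, the generative model above can be simulated directly. This is a minimal sketch with two groups; the dimensions, process noise level, and group means are hypothetical values, not taken from the papers:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Random-walk state evolution: theta_t = theta_{t-1} + eps_t, eps_t ~ N(0, Q)
d, T = 3, 200
Q = 0.01 * np.eye(d)  # illustrative process noise covariance
theta = np.zeros((T, d))
for t in range(1, T):
    theta[t] = theta[t - 1] + rng.multivariate_normal(np.zeros(d), Q)

# Group-conditional Gaussian features: x_t | g ~ N(mu_g, Sigma_g)
mu = {0: np.array([1.0, 0.0, 0.5]), 1: np.array([-1.0, 0.5, 0.0])}
Sigma = np.eye(d)
groups = rng.integers(0, 2, size=T)
X = np.array([rng.multivariate_normal(mu[g], Sigma) for g in groups])

# Labels drawn from the time-varying logistic model
p = sigmoid(np.sum(X * theta, axis=1))
y = (rng.random(T) < p).astype(int)
```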
Alternatively, in the Markov random field (MRF)/structured prediction context (Wu et al., 2018), GCULR consists of solving, for each node and every pair of states, a group-norm-regularized logistic regression:

$$\min_{W} \; \frac{1}{n}\sum_{i=1}^{n} \log\!\left(1 + e^{-y_i \langle W, x_i \rangle}\right) + \lambda \,\|W\|_{2,1}.$$

Here, $\|W\|_{2,1}$ denotes the sum of the $\ell_2$ norms of the row groups of $W$, ensuring group-sparse solutions matched to the graphical structure.
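The group-regularized objective for one node's regression can be sketched as follows. The one-hot feature layout, the shapes, and the function name are illustrative assumptions; only the form "logistic loss plus sum of row-wise $\ell_2$ norms" comes from the text above:

```python
import numpy as np

rng = np.random.default_rng(0)

def group_logistic_objective(W, X, y, lam):
    """Group-norm-regularized logistic loss for one node's regression.
    X: (n, p, k) one-hot features (p neighboring variables, alphabet size k),
    W: (p, k) coefficients with one row group per neighboring variable,
    y: (n,) labels in {-1, +1}."""
    scores = np.einsum("npk,pk->n", X, W)
    loss = np.mean(np.log1p(np.exp(-y * scores)))
    penalty = lam * np.sum(np.linalg.norm(W, axis=1))  # sum of row l2 norms
    return loss + penalty

# Sanity check: at W = 0 the logistic loss is log(2) and the penalty vanishes.
n, p, k = 50, 4, 3
X = rng.integers(0, 2, size=(n, p, k)).astype(float)
y = rng.choice([-1.0, 1.0], size=n)
f0 = group_logistic_objective(np.zeros((p, k)), X, y, lam=0.1)
```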
2. Fairness Constraint Formulation
GCULR enforces fairness by constraining group-conditional misclassification disparities. In the dynamic Bayesian setting, the group-conditional false-negative and false-positive rates are defined as

$$\mathrm{FNR}_g(t) = P(\hat{y}_t = 0 \mid y_t = 1, G = g), \qquad \mathrm{FPR}_g(t) = P(\hat{y}_t = 1 \mid y_t = 0, G = g),$$

with the scalar bias metric

$$\Delta_t = \max\!\big(\,|\mathrm{FPR}_0(t) - \mathrm{FPR}_1(t)|,\; |\mathrm{FNR}_0(t) - \mathrm{FNR}_1(t)|\,\big).$$

GCULR imposes the hard constraint $\Delta_t \le \epsilon$ at every update step, for a user-chosen bias tolerance $\epsilon$. After updating posteriors, candidate parameter draws are Monte Carlo sampled, and only those satisfying both the fairness constraint and a minimum relative accuracy are retained. The resulting fair-constrained posterior is defined by the empirical mean and covariance of the accepted samples.
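The rejection-sampling step can be sketched as follows. The helper names, the tolerance `eps`, and the use of empirical error rates are illustrative assumptions, and the minimum-accuracy check is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def group_error_rates(theta, X, y, groups, g):
    """Empirical FPR/FNR of the classifier sigmoid(theta^T x) >= 0.5 in group g."""
    mask = groups == g
    yhat = (sigmoid(X[mask] @ theta) >= 0.5).astype(int)
    yg = y[mask]
    fpr = np.mean(yhat[yg == 0]) if np.any(yg == 0) else 0.0
    fnr = np.mean(1 - yhat[yg == 1]) if np.any(yg == 1) else 0.0
    return fpr, fnr

def fair_posterior(mean, cov, X, y, groups, eps=0.1, n_draws=2000):
    """Keep posterior draws whose FPR/FNR disparity across groups is <= eps;
    return the empirical mean/covariance of the accepted set."""
    draws = rng.multivariate_normal(mean, cov, size=n_draws)
    accepted = []
    for theta in draws:
        fpr0, fnr0 = group_error_rates(theta, X, y, groups, 0)
        fpr1, fnr1 = group_error_rates(theta, X, y, groups, 1)
        if abs(fpr0 - fpr1) <= eps and abs(fnr0 - fnr1) <= eps:
            accepted.append(theta)
    accepted = np.array(accepted)
    if len(accepted) < 2:  # constraint infeasible at this eps; fall back
        return mean, cov
    return accepted.mean(axis=0), np.cov(accepted.T)
```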
In the MRF structure learning context, “unbiasedness” refers to a property of the underlying distribution ensuring that every conditional probability is bounded below by a function of the model width and alphabet size, a requirement critical for identifiability and generalization (Wu et al., 2018).
3. Algorithmic Workflow and Computational Methods
Bayesian Online Tracking (Dynamic GCULR)
The tracking algorithm involves:
- Bayesian filtering of logistic regression parameters via the extended Kalman filter (EKF): the Bernoulli likelihood is linearized around the predicted mean, with predicted probability $h_t = \sigma(x_t^\top \theta_{t|t-1})$ and Jacobian $H_t = h_t(1 - h_t)\,x_t^\top$, yielding the standard EKF gain, mean, and covariance updates.
- Monte Carlo estimation of group-conditional errors using samples from the predicted covariate distributions for each group.
- Iterative rejection sampling of posterior parameter draws to satisfy fairness and accuracy constraints, followed by propagation of constrained and unconstrained posteriors.
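The filtering step above can be sketched as a single EKF predict/update for a Bernoulli observation. The linearization and the innovation variance used here are standard EKF choices under the stated model, not code from (Short et al., 2020):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ekf_logistic_step(m, P, x, y, Q):
    """One EKF predict/update for y_t ~ Bernoulli(sigmoid(theta_t^T x_t))
    with random-walk dynamics theta_t = theta_{t-1} + N(0, Q)."""
    # Predict: a random walk leaves the mean unchanged, inflates covariance.
    m_pred = m
    P_pred = P + Q
    # Linearize the Bernoulli observation around the predicted mean.
    h = sigmoid(x @ m_pred)             # predicted success probability
    H = h * (1 - h) * x                 # gradient of sigmoid(theta^T x) wrt theta
    S = H @ P_pred @ H + h * (1 - h)    # innovation variance (Bernoulli)
    K = P_pred @ H / S                  # Kalman gain
    m_new = m_pred + K * (y - h)
    P_new = P_pred - np.outer(K, H @ P_pred)
    return m_new, P_new
```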
Convex Group-Norm Optimization (MRF GCULR)
- For each variable and state pair, one-hot encoding is performed on the features of subsetted samples.
- The core convex problem involves penalization by an $\ell_{2,1}$ group norm and is solved using first-order mirror descent with a specially chosen distance-generating function; this ensures efficient optimization at scale.
- For variables with alphabet size $k$, the total computational complexity scales polynomially in $k$ and in the number of variables for fixed problem parameters, representing a significant improvement over prior art.
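The group-sparsity mechanism in the steps above can be illustrated with the $\ell_{2,1}$ norm and its proximal operator (row-wise soft-thresholding), shown here with a proximal step rather than the paper's mirror descent for simplicity; function names are illustrative:

```python
import numpy as np

def group_l21_norm(W):
    """Sum of l2 norms of the rows of W (the group norm in the penalty)."""
    return np.sum(np.linalg.norm(W, axis=1))

def prox_group_l21(W, tau):
    """Proximal operator of tau * ||.||_{2,1}: row-wise soft-thresholding.
    Rows with norm below tau are zeroed out entirely, which is what
    produces group-sparse solutions."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return scale * W
```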
4. Statistical Guarantees and Theoretical Properties
No formal theorem is stated for the Bayesian online algorithm, but standard EKF and random-walk regularity assumptions yield:
- Consistency: the posterior parameter mean converges in probability to the true parameter vector for data generated by a logistic model.
- Fairness: enforcing the bias tolerance at every update step bounds the disparate misclassification at all times.
- Accuracy bound: the minimum group accuracy remains at least a fixed fraction (the relative accuracy threshold) of the unconstrained solution's accuracy (Short et al., 2020).
In the discrete graphical modeling framework:
- The key codependence between prediction risk and parameter error is quantified via the population logistic risk and the Kullback–Leibler divergence. With a number of samples only logarithmic in the number of variables, the estimated parameters are close to the true ones with high probability, enabling exact structure recovery by thresholding the estimated group norms (Wu et al., 2018).
- Unbiasedness is formally associated with a uniform lower bound on every conditional probability of each variable given the remaining ones, guaranteeing nondegeneracy in the conditioning structure.
5. Implementation and Hyperparameter Selection
Key hyperparameters for GCULR include:
| Parameter | Role in GCULR (Short et al., 2020) | Recommendation |
|---|---|---|
| Process noise covariance | Governs parameter drift in the state-space random walk | Set based on system dynamics |
| Bias tolerance | Upper bound on the group-conditional error disparity | Chosen to balance fairness/accuracy |
| Relative accuracy threshold | Minimum accuracy retained relative to the unconstrained fit | Typically set close to the unconstrained accuracy |
| Monte Carlo samples per group | Estimation of group-conditional error rates | Sufficient for integral accuracy |
| Posterior samples for rejection | Candidate draws filtered by the fairness constraint | Large for tight constraints |
| Normal-inverse-Wishart prior | Prior on group-conditional feature distributions | Diffuse (large variance) for tracking |
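The table above might translate into a configuration object like the following; every name and numeric value is an illustrative assumption, not a recommendation from (Short et al., 2020):

```python
import numpy as np

d = 3  # feature dimension (assumed)

# Illustrative hyperparameter settings for the dynamic GCULR tracker.
gculr_config = {
    "Q": 0.01 * np.eye(d),  # process noise covariance for the random walk
    "eps": 0.1,             # bias tolerance on the FPR/FNR disparity
    "alpha": 0.9,           # minimum accuracy relative to the unconstrained fit
    "n_mc": 1000,           # Monte Carlo samples per group for error estimates
    "n_post": 2000,         # posterior draws for rejection sampling
}
```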
All model updates, rejection sampling, and posterior propagation steps are stated explicitly in Algorithm 1 of (Short et al., 2020). For the high-dimensional case, pseudocode follows the outlined sampling, encoding, mirror-descent, and thresholding workflow (Wu et al., 2018).
6. Empirical Evaluation and Comparative Analysis
Static Synthetic Data
GCULR, applied to data from two Gaussian clusters, substantially narrowed the gaps in group-conditional FPR and FNR, with an overall accuracy drop from approximately $0.68$ to $0.62$. The Zafar et al. baseline yielded less balanced error rates (Short et al., 2020).
Dynamic Synthetic Data
When group means swap over time, ordinary logistic regression exhibits fluctuating instantaneous bias, while GCULR maintains the bias constraint at every point, adapting to the evolving fairness boundary in real time.
ProPublica COMPAS Evaluation
On 5,278 criminal justice records, unconstrained GCULR achieved higher accuracy but exhibited substantial group disparity. Imposing the GCULR fairness constraint with a tight bias tolerance sharply reduces the disparity at a modest cost in accuracy. Competing methods (the Zafar baseline) cannot achieve comparably tight fairness without severe accuracy trade-offs or trivial classification (Short et al., 2020).
MRF Graphical Model Recovery
In experiments with grid graphs, GCULR consistently recovered the true structure using fewer samples than the online Sparsitron, attributable to its superior sample complexity relative to earlier approaches (Wu et al., 2018).
7. Extensions and Notable Properties
GCULR with group-norm constraints generalizes binary $\ell_1$-constrained regression (Ising models) to $k$-ary alphabets. The group-norm approach yields statistically and computationally preferable rates, particularly for high-dimensional problems. GCULR permits finite-sample, high-probability performance guarantees, efficient optimization, and certified fairness in dynamic and stationary regimes. In both Bayesian tracking and graphical model settings, the approach is robust to nonstationarities and provides posterior uncertainty quantification at each update (Short et al., 2020; Wu et al., 2018).