- The paper introduces RiskSLIM, a novel approach that formulates risk scoring as a mixed-integer nonlinear program, optimizing calibration, ranking, and sparsity.
- It proposes the Lattice Cutting Plane Algorithm (LCPA), which embeds cutting-plane generation for the convex loss inside a branch-and-bound search over integer coefficients, solving only LP relaxations at each node so it does not stall on this non-convex problem.
- Empirical tests, including an ICU seizure prediction case, demonstrate that RiskSLIM outperforms traditional heuristics in achieving reliable calibration and superior AUC.
This paper introduces a novel machine learning approach, named RiskSLIM (Risk-calibrated Supersparse Linear Integer Model), for building risk scores. Risk scores are simple linear models with sparse, small integer coefficients, widely used in domains like medicine, criminal justice, and finance due to their interpretability and ease of use (e.g., the CHADS2 score for stroke risk). However, creating optimal risk scores is challenging because they must satisfy multiple criteria simultaneously: good calibration (predicted risks match observed risks), high rank accuracy (AUC), sparsity (few features), small integer coefficients, and often, application-specific operational constraints (like monotonicity or feature relationships).
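As a toy illustration (the items and point values below are invented for this sketch, not any published score), using a risk score amounts to adding integer points and reading off a risk estimate through the logistic link that RiskSLIM fits:

```python
import math

# Hypothetical 3-item risk score; points and intercept are illustrative only.
points = {"age_ge_75": 1, "hypertension": 1, "prior_stroke": 2}
intercept = -3  # integer intercept, learned along with the points

patient = {"age_ge_75": 1, "hypertension": 0, "prior_stroke": 1}

total = intercept + sum(points[item] * patient[item] for item in points)
risk = 1.0 / (1.0 + math.exp(-total))  # logistic link: total score -> predicted risk
print(f"total score = {total}, predicted risk = {risk:.1%}")
```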
Traditional methods often rely on heuristics, expert judgment, or post-processing steps like rounding coefficients from logistic regression models. These ad-hoc approaches may lead to suboptimal performance, violate constraints, or lack guarantees about how close the resulting score is to the best possible one.
Problem Formulation:
The paper formulates the task of learning a risk score as a mixed-integer nonlinear program (MINLP), referred to as the Risk Score Problem or RiskSlimMINLP:
min_λ   L(λ; D) + C₀ ‖λ‖₀
s.t.    λ ∈ L
where:
- λ is the vector of coefficients (points) including an intercept.
- L(λ; D) = (1/n) Σᵢ log(1 + exp(−yᵢ λᵀxᵢ)) is the normalized logistic loss over the training data D = {(xᵢ, yᵢ)}ᵢ₌₁ⁿ. Minimizing it promotes calibration and, indirectly, AUC.
- ‖λ‖₀ = Σⱼ₌₁ᵈ 1[λⱼ ≠ 0] is the L0-seminorm, penalizing the number of non-zero coefficients (excluding the intercept) to enforce sparsity.
- C₀ > 0 is a regularization parameter balancing the trade-off between loss and sparsity.
- L ⊂ ℤ^(d+1) is the user-defined feasible set for coefficients, typically restricting them to small integers (e.g., {−5, ..., 5}) and encoding operational constraints.
Solving this MINLP optimally yields a risk score that is demonstrably the best possible model (on the training data) satisfying all specified constraints for the chosen C0.
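As a hedged illustration of the objective above (the function name and toy data are made up for this sketch; labels are assumed to be in {−1, +1} as in the loss), the quantity RiskSLIM minimizes can be evaluated directly for any candidate integer coefficient vector:

```python
import numpy as np

def riskslim_objective(lam, X, y, C0=1e-2):
    """Normalized logistic loss plus L0 penalty on non-intercept coefficients.

    lam : integer coefficient vector, lam[0] is the intercept
    X   : n x (d+1) feature matrix whose first column is all ones
    y   : labels in {-1, +1}
    """
    margins = y * (X @ lam)
    loss = np.mean(np.log1p(np.exp(-margins)))   # L(lam; D)
    l0 = np.count_nonzero(lam[1:])               # ||lam||_0, intercept excluded
    return loss + C0 * l0

# Toy data: 5 examples, intercept column plus 2 binary features.
X = np.array([[1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 0, 0], [1, 1, 0]], dtype=float)
y = np.array([1, -1, 1, -1, 1], dtype=float)
lam = np.array([-1, 2, 0])                       # candidate from the lattice {-5,...,5}^3
print(riskslim_objective(lam, X, y))
```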
Methodology: Lattice Cutting Plane Algorithm (LCPA)
Directly solving the RiskSlimMINLP with standard MINLP solvers is often computationally intractable, especially for larger datasets, due to the non-convex L0 penalty and integer constraints. Traditional cutting plane algorithms (CPA), while effective for convex problems and scalable in sample size n, tend to stall in non-convex settings because they require solving an increasingly complex non-convex surrogate problem (a MIP) to optimality at each iteration.
To overcome this, the paper proposes the Lattice Cutting Plane Algorithm (LCPA):
- Combines Branch-and-Bound (B&B) and Cutting Planes: LCPA embeds the cutting plane generation within a B&B search framework typically used for integer programming.
- Solves LP Relaxations: Instead of solving a full MIP at each iteration like traditional CPA, LCPA solves a linear programming (LP) relaxation (RiskSlimLP) at each node of the B&B tree. The LP uses the current cutting plane approximation of the logistic loss.
- Adds Cuts at Integer Solutions: When the LP solution at a B&B node happens to be integer-feasible (i.e., satisfies λ∈L), a new cutting plane (a supporting hyperplane to the true logistic loss function at this point λ) is generated and added to the approximation used in subsequent LPs.
- Branching: If the LP solution is fractional, standard B&B branching rules are applied (e.g., splitting the feasible region based on a fractional variable).
- Bounds: The lower bound on the optimal MINLP objective is the minimum LP objective value across all active B&B nodes. The upper bound is the objective value of the best integer solution found so far.
- Implementation: LCPA can be efficiently implemented using standard MIP solvers (like CPLEX, Gurobi) that support control callbacks and lazy constraints. Lazy constraints are crucial, as they allow the solver to manage a large number of cuts without evaluating all of them at every LP solve, significantly speeding up the process.
LCPA avoids the stalling issue of CPA because it doesn't require solving the non-convex surrogate MIP to optimality repeatedly. It maintains the benefits of CPA, like linear scaling in sample size n (data is only used to compute cut parameters) and the ability to leverage powerful MIP solver technology.
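A hedged sketch of the key data-dependent computation (function names are illustrative; the solver callback machinery is omitted): at an integer-feasible point λ₀, the loss value and gradient define the supporting hyperplane L̂ ≥ L(λ₀) + ∇L(λ₀)ᵀ(λ − λ₀), which is handed to the MIP solver as a lazy constraint.

```python
import numpy as np

def loss_and_gradient(lam0, X, y):
    """Value and gradient of the normalized logistic loss at lam0 -- the only
    data-dependent quantities needed to form a cutting plane."""
    margins = y * (X @ lam0)
    loss = np.mean(np.log1p(np.exp(-margins)))
    weights = 1.0 / (1.0 + np.exp(margins))           # sigma(-margin)
    grad = -(X * (weights * y)[:, None]).mean(axis=0)
    return loss, grad

def cut_coefficients(lam0, X, y):
    """Return (a, b) so the cut reads: loss_variable >= a @ lam + b."""
    loss, grad = loss_and_gradient(lam0, X, y)
    return grad, loss - grad @ lam0

# Toy usage: form a cut at an integer-feasible point found by the B&B search;
# in LCPA this cut is added through the solver's lazy-constraint callback.
X = np.array([[1., 1., 0.], [1., 0., 1.], [1., 1., 1.]])
y = np.array([1., -1., 1.])
a, b = cut_coefficients(np.array([0, 1, -1]), X, y)
```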
Algorithmic Improvements:
Several techniques are introduced to accelerate LCPA and improve solution quality:
- Discrete Coordinate Descent (DCD): A local search heuristic to "polish" integer solutions found during the search. It iteratively adjusts one coefficient at a time to greedily minimize the RiskSlimMINLP objective, stopping once the solution is 1-opt (cannot be improved by changing any single coefficient); see the sketch after this list.
- SequentialRounding: A heuristic to generate high-quality integer solutions from fractional LP solutions. It iteratively rounds fractional components up or down, choosing the direction that minimizes the RiskSlimMINLP objective at each step. This is used in conjunction with DCD.
- ChainedUpdates: A fast bound-tightening procedure. It uses the current upper and lower bounds on the MINLP objective to derive tighter bounds on the logistic loss (L) and the L0-penalty (R) components, and vice versa; the tightened bounds (Lmin, Lmax, Rmin, Rmax) prune the B&B search space more effectively.
- Initialization: A warm-start procedure runs a relaxed version of CPA (using LPs) initially to generate a good set of starting cuts and initial bounds.
- Fast Loss Evaluation: Uses a lookup table for log(1 + exp(−s)) when the scores s = λᵀx are discrete and bounded, speeding up the loss and gradient computations needed for cuts and heuristics.
- Subsampling: Uses a smaller subset of the data to run computationally intensive heuristics like SequentialRounding more frequently, with theoretical guarantees (using Hoeffding-Serfling inequality) to relate performance on the subset to performance on the full dataset.
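A minimal sketch of the DCD polishing step referenced above (the function name, objective evaluation, and stopping tolerance are assumptions for this sketch; the reference implementation in the risk-slim package differs in its details):

```python
import numpy as np

def dcd_polish(lam, X, y, coef_set, C0=1e-2):
    """Discrete coordinate descent: cycle over coordinates, moving each
    coefficient to the allowed integer value that most reduces the objective,
    until no single-coordinate change helps (a 1-opt solution).

    coef_set[j] is the list of allowed integer values for coefficient j,
    e.g. range(-5, 6)."""
    def objective(l):
        loss = np.mean(np.log1p(np.exp(-y * (X @ l))))
        return loss + C0 * np.count_nonzero(l[1:])

    lam = np.array(lam, dtype=int)
    best = objective(lam)
    improved = True
    while improved:
        improved = False
        for j in range(len(lam)):
            candidates = [lam.copy() for _ in coef_set[j]]
            for trial, v in zip(candidates, coef_set[j]):
                trial[j] = v
            values = [objective(t) for t in candidates]
            k = int(np.argmin(values))
            if values[k] < best - 1e-12:        # accept only strict improvement
                lam, best, improved = candidates[k], values[k], True
    return lam, best

# Toy usage on a 2-feature problem with coefficients restricted to {-5,...,5}.
coef_set = [list(range(-5, 6))] * 3
X = np.array([[1., 1., 0.], [1., 0., 1.], [1., 1., 1.], [1., 0., 0.]])
y = np.array([1., -1., 1., -1.])
polished, obj = dcd_polish(np.array([0, 1, 0]), X, y, coef_set)
```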
Experiments and Findings:
- Performance Comparison: On public datasets, RiskSLIM consistently produced risk scores with better calibration (CAL) and AUC compared to traditional methods (logistic regression + rounding/unit weighting/rescaled rounding) and new pooled heuristic methods (applying heuristics to a pool of models from Elastic Net). RiskSLIM models achieved lower logistic loss, correlating with better performance.
- Pitfalls of Heuristics: Traditional rounding methods often degrade calibration significantly (e.g., rescaling, unit weights) or hurt AUC (rounding small coefficients to zero). These issues are hard to fix with post-processing like Platt scaling. Pooled methods offer improvement but are still outperformed by direct optimization.
- Computation: LCPA with improvements solved most benchmark problems to optimality or near-optimality (small gap) within minutes, demonstrating practical feasibility despite the NP-hardness of the problem. Runtime scales linearly with sample size n.
- Generalization: Simple risk score models generalize well; good performance on training data translates to good test performance.
Case Study: ICU Seizure Prediction:
- Problem: Build a risk score for predicting seizures in ICU patients using clinical and complex cEEG data, requiring high calibration and satisfying numerous operational constraints (model size ≤ 4, monotonicity, complex feature exclusions/interactions).
- RiskSLIM Benefits:
- Handled all operational constraints directly within the MINLP formulation, guaranteeing feasibility.
- Produced a certifiably optimal, interpretable risk score (2HELPS2B score, later refined under new constraints) with superior CAL/AUC compared to baseline methods.
- Baselines struggled: traditional methods often violated constraints; pooled methods required massive computation (nested CV) and still yielded suboptimal models with calibration issues (e.g., non-monotonic risk).
- The optimality gap provided by RiskSLIM allowed clinicians to reliably assess the performance impact of their constraints and make informed trade-offs (e.g., between a 4-feature and a 5-feature model).
- The resulting small integer scores were easily interpretable and could be translated into simple Boolean rules for specific risk thresholds.
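To make the last point concrete, here is a toy illustration (invented items and points, not the published 2HELPS2B score): because predicted risk increases monotonically with the total score, any risk threshold corresponds to a cutoff on the points, which reads as a simple Boolean rule.

```python
import math

# Hypothetical 3-item score; items and points are illustrative only.
points = {"abnormal_eeg_pattern": 2, "prior_seizure": 1, "frequent_discharges": 1}
intercept = -3

def predicted_risk(total_points):
    return 1.0 / (1.0 + math.exp(-(intercept + total_points)))

# "risk >= 25%" is equivalent to "total points >= t" for the smallest such t.
t = min(s for s in range(0, sum(points.values()) + 1) if predicted_risk(s) >= 0.25)
print(f"Rule: flag the patient if total points >= {t} "
      f"(predicted risk {predicted_risk(t):.0%} or higher)")
```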
Contributions:
- A new formulation (RiskSlimMINLP) and machine learning approach (RiskSLIM) to create optimized, interpretable risk scores satisfying operational constraints.
- A novel cutting plane algorithm (LCPA) for non-convex empirical risk minimization that avoids stalling and scales linearly with sample size.
- Specialized algorithmic improvements (DCD, SequentialRounding, ChainedUpdates, etc.) to make LCPA practical.
- Extensive experiments demonstrating RiskSLIM's superior performance over heuristic methods and highlighting pitfalls of traditional approaches.
- A real-world case study showcasing the practical benefits of handling complex constraints and using optimality guarantees in high-stakes domains.
- An open-source Python package (risk-slim) implementing the approach.
Conclusion:
The paper presents a principled and practical method for learning high-performing, interpretable, and customized risk scores by directly solving a tailored optimization problem. The proposed LCPA algorithm makes this computationally feasible for real-world problems, offering significant advantages over ad-hoc heuristic approaches, particularly when operational constraints and performance guarantees are critical.