GOSDT: Optimal Sparse Decision Trees
- GOSDT is a scalable framework for constructing globally optimal decision trees by explicitly controlling tree complexity through sparsity constraints.
- It integrates weighted loss functions and regularization to accommodate cost-sensitive learning and optimal policy design in diverse data settings.
- Its use of branch-and-bound, dynamic programming, and advanced pruning strategies ensures efficient, interpretable model construction with provable optimality.
Generalized Optimal Sparse Decision Trees (GOSDT) provide a comprehensive, scalable framework for constructing globally optimal decision trees subject to explicit sparsity constraints. The approach is central to interpretable machine learning: it produces models that jointly optimize accuracy and interpretability by explicitly controlling tree complexity through regularization. Recent extensions further enable principled handling of weighted data, which is crucial in settings such as optimal policy design or inverse propensity scoring. GOSDT combines advanced combinatorial optimization (dynamic programming, branch-and-bound pruning, and incremental bound refinement) to efficiently search the space of tree structures under a broad array of objective functions and dataset regimes (Lin et al., 2020, Behrouz et al., 2022).
1. Mathematical Formulation and Losses
Given training data $\{(x_i, y_i)\}_{i=1}^{n}$ with $x_i \in \{0,1\}^{M}$ and $y_i \in \{0,1\}$ (or, for weighted settings, per-sample nonnegative weights $w_i$), GOSDT seeks to identify a decision tree $T$ minimizing the regularized risk

$$R(T) = \ell(T, X, y) + \lambda \, H_T,$$

where $\ell$ is a monotonic loss function (e.g., weighted misclassification, balanced error, $F$-score, AUC), $H_T$ is the number of leaves, and $\lambda > 0$ explicitly regularizes model complexity (Lin et al., 2020). The weighted extension generalizes the empirical risk:

$$\ell_w(T, X, y, w) = \frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i \, \mathbb{1}[\hat{y}_i \neq y_i],$$

with optimization objective $R_w(T) = \ell_w(T, X, y, w) + \lambda \, H_T$.
Losses supported include cost-weighted misclassification, balanced error, area under the ROC convex hull (AUC), and partial AUC constraints, consistently extended for weighted settings (Lin et al., 2020, Behrouz et al., 2022).
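The regularized objective can be made concrete with a short sketch; the function name and the tree representation (a vector of predictions plus a leaf count) are illustrative assumptions, not the papers' API:

```python
import numpy as np

def weighted_regularized_risk(y_true, y_pred, weights, n_leaves, lam):
    """Weighted 0-1 loss, normalized by total weight, plus the sparsity
    penalty lambda * (number of leaves)."""
    w = np.asarray(weights, dtype=float)
    errors = (np.asarray(y_true) != np.asarray(y_pred)).astype(float)
    return float(w @ errors) / w.sum() + lam * n_leaves

# Unit weights reduce to the plain GOSDT objective:
# 2 of 5 samples misclassified, 2 leaves, lambda = 0.01 -> 0.4 + 0.02 = 0.42
risk = weighted_regularized_risk([0, 1, 1, 0, 1], [0, 1, 0, 1, 1],
                                 [1, 1, 1, 1, 1], n_leaves=2, lam=0.01)
```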
2. Core Algorithmic Mechanisms
GOSDT’s framework is underpinned by a branch-and-bound architecture that combines dynamic programming over subsets of the training data with priority-queue–driven search and aggressive subproblem pruning (Lin et al., 2020). Each subproblem corresponds to building a subtree for a specific support set of samples, efficiently cached using bit-vector or hash-based representations. Multiple lower and upper bounds—hierarchical objective bounds, one-step lookahead, leaf-permutation (symmetry), equivalent-points, and incremental similar-support bounds—guide the pruning strategy and accelerate convergence.
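The support-set dynamic program can be illustrated with a deliberately minimal, bounds-free sketch (samples encoded as integer bitmasks, one bitmask per binary feature); the real solver adds the lower bounds, priority-queue scheduling, and representation optimizations described above, and all names here are illustrative:

```python
def best_leaf_loss(support, y):
    """Misclassifications if the support set (an int bitmask over the
    n samples) is turned into a single majority-vote leaf."""
    idx = [i for i in range(len(y)) if support >> i & 1]
    ones = sum(y[i] for i in idx)
    return min(ones, len(idx) - ones)

def optimal_tree_cost(support, y, splits, lam, cache=None):
    """DP over support sets: each subproblem is identified by its sample
    bitmask and memoized, mirroring GOSDT's subproblem identification.
    `splits` holds one bitmask per binary feature (samples where it is 1).
    Returns the optimal regularized cost: loss / n + lam per leaf."""
    if cache is None:
        cache = {}
    if support in cache:
        return cache[support]
    n = len(y)
    best = best_leaf_loss(support, y) / n + lam  # option 1: stop at a leaf
    for feat in splits:                          # option 2: split on a feature
        left, right = support & feat, support & ~feat
        if left == 0 or right == 0:
            continue  # degenerate split
        best = min(best,
                   optimal_tree_cost(left, y, splits, lam, cache)
                   + optimal_tree_cost(right, y, splits, lam, cache))
    cache[support] = best
    return best

# Four samples, one binary feature separating the classes perfectly:
cost = optimal_tree_cost(0b1111, [0, 0, 1, 1], [0b0011], lam=0.01)  # -> 0.02
```

The cache keyed by `support` is what turns the exponential tree search into a DP over distinct support sets, the same structural idea the bounds then prune.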
In the context of continuous variables, GOSDT eschews pre-bucketing by constructing binary features for all candidate thresholds, with sub-exponential pruning in practice due to the incremental similar-support bound. For weighted data, GOSDT admits three complementary extensions:
- Direct branch-and-bound with weighted loss: Fully supports weighted objectives but incurs additional per-node cost, since fast bit-vector counting must be replaced by general dot-product operations (Behrouz et al., 2022).
- Integer-weight transformation (data duplication): Approximates real weights with integer scaling and sample duplication, preserving bit-operation speedups and providing provably small approximation error.
- Randomized weighted sampling: Draws a subsample with probabilities proportional to $w_i$, maintaining unweighted GOSDT optimizations while bounding risk deviation via Hoeffding’s inequality (Behrouz et al., 2022).
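The last two extensions can be sketched as follows; the function names, the rounding scheme, and the floor at one copy are illustrative assumptions rather than the papers' exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def duplicate_by_integer_weights(X, y, w, p=100):
    """Integer-weight transformation: scale each real weight by p, round
    to an integer count (floored at one copy), and duplicate samples so
    that an unweighted solver sees an approximately weighted problem."""
    counts = np.maximum(1, np.rint(np.asarray(w) * p)).astype(int)
    return (np.repeat(np.asarray(X), counts, axis=0),
            np.repeat(np.asarray(y), counts), counts)

def weighted_subsample(X, y, w, m):
    """Randomized weighted sampling: draw m samples with probability
    proportional to w; the subsample is then solved as unweighted data."""
    probs = np.asarray(w, dtype=float)
    probs /= probs.sum()
    idx = rng.choice(len(y), size=m, replace=True, p=probs)
    return np.asarray(X)[idx], np.asarray(y)[idx]

# p = 100 turns weights [0.01, 0.02, 0.03] into counts [1, 2, 3]:
Xd, yd, counts = duplicate_by_integer_weights([[0], [1], [0]], [0, 1, 1],
                                              [0.01, 0.02, 0.03], p=100)
```

Larger `p` tracks the real weights more faithfully at the price of a larger duplicated dataset, which is exactly the trade-off the approximation bound quantifies.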
3. Scalability, Complexity, and Implementation
The worst-case time complexity is combinatorial in the number of binary features $M$, but core innovations (memoization via support-set hashing, prefix-sum bitvector computations, lower/upper bound propagation, and parallelization) yield dramatically better empirical performance for moderate-to-large $n$ and $M$ (Lin et al., 2020). Bitwise operations enable sublinear scaling in practice for common settings. The branch-and-bound process is guaranteed to terminate either with an exact optimum or a certified optimality gap.
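GOSDT's treatment of continuous variables, one binary feature per candidate threshold rather than pre-bucketing (as described above), can be sketched as follows; the function name and the midpoint-threshold choice are illustrative assumptions:

```python
import numpy as np

def binarize_thresholds(col):
    """One binary feature per candidate threshold: thresholds are the
    midpoints between consecutive distinct sorted values, and feature j
    is the indicator x <= threshold_j (illustrative encoding)."""
    vals = np.unique(col)                      # sorted distinct values
    thresholds = (vals[:-1] + vals[1:]) / 2.0  # midpoints between neighbors
    binary = (np.asarray(col)[:, None] <= thresholds[None, :]).astype(np.uint8)
    return binary, thresholds

B, t = binarize_thresholds([0.1, 0.5, 0.9])
# two thresholds, one indicator column per threshold
```

Each column of `B` is exactly the kind of bitmaskable binary feature the bitvector machinery above exploits; the similar-support bound then prunes the many near-duplicate thresholds.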
For weighted GOSDT, the integer-weight transformation expands the dataset in proportion to the integer scaling factor, trading dataset size for fidelity to the real weights; practical runtimes are often two orders of magnitude faster than direct weighted optimization. Weighted sampling scales to very large datasets, with an error bound on the empirical risk difference that decays exponentially in sample size (Behrouz et al., 2022).
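The sampling guarantee follows the standard Hoeffding form: for per-sample loss bounded in $[0, c]$, taking $m \ge c^2 \ln(2/\delta) / (2\epsilon^2)$ samples keeps the empirical risk within $\epsilon$ of the weighted risk with probability at least $1-\delta$. A sketch using the generic Hoeffding constant (not necessarily the paper's exact bound):

```python
import math

def hoeffding_sample_size(eps, delta, c=1.0):
    """Smallest m such that 2 * exp(-2 * m * eps**2 / c**2) <= delta,
    i.e. m >= c**2 * ln(2 / delta) / (2 * eps**2)."""
    return math.ceil(c * c * math.log(2.0 / delta) / (2.0 * eps * eps))

# Deviation at most 0.05 with probability 0.99 (loss bounded in [0, 1]):
m = hoeffding_sample_size(0.05, 0.01)  # -> 1060
```

The inverse-quadratic dependence on `eps` is why the subsample stays modest even when the original dataset is very large.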
4. Integration of Diverse Objectives and Policy Design
GOSDT is designed to optimize not just classic misclassification error but any loss monotonic in false positives and false negatives. This includes:
- Weighted accuracy: explicit cost for false negatives via a user-specified cost parameter.
- Balanced accuracy: averaging class-conditional errors.
- $F$-score loss and AUC: via leaf-level statistics and leaf-label assignments.
- Partial AUC: restricting attention to leaves corresponding to low false-positive regions.
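Because all of these losses are functions of leaf-level confusion counts, they can be evaluated from per-leaf statistics alone. A sketch, using an illustrative `(n_pos, n_neg, label)` leaf encoding that is not the papers' data structure:

```python
def leaf_metrics(leaves):
    """Aggregate per-leaf statistics into global confusion counts.

    Each leaf is an illustrative (n_pos, n_neg, label) triple: the number
    of positive and negative training samples it captures, and its
    predicted label. Monotonic losses are then functions of these counts.
    """
    tp = fp = fn = tn = 0
    for n_pos, n_neg, label in leaves:
        if label == 1:
            tp += n_pos  # positives correctly captured by a positive leaf
            fp += n_neg  # negatives wrongly sent to a positive leaf
        else:
            fn += n_pos
            tn += n_neg
    # Balanced error: average of false-negative and false-positive rates
    balanced_error = 0.5 * (fn / (tp + fn) + fp / (fp + tn))
    return tp, fp, fn, tn, balanced_error

# Two leaves: a positive leaf with (8 pos, 2 neg), a negative leaf with (1 pos, 9 neg)
tp, fp, fn, tn, be = leaf_metrics([(8, 2, 1), (1, 9, 0)])
```

Monotonicity in false positives and false negatives is what lets the solver bound these losses from partial trees.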
The explicit incorporation of per-sample weights enables optimal policy design, such as inverse propensity–weighted treatment rules in causal inference and cost-sensitive policy learning (Behrouz et al., 2022). This cannot be realized by traditional discrete-loss, unweighted optimal tree methods.
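As an illustration of such weights, standard inverse-propensity weighting assigns treated samples weight $1/e(x)$ and controls $1/(1-e(x))$; the resulting $w_i$ would then be fed to a weighted solver. The helper below is hypothetical, not part of any released GOSDT API:

```python
import numpy as np

def ipw_weights(treatment, propensity):
    """Standard inverse-propensity weights: treated units get 1/e(x),
    controls get 1/(1 - e(x)). The resulting w_i can serve as the
    per-sample weights of a weighted tree objective."""
    t = np.asarray(treatment, dtype=float)
    e = np.asarray(propensity, dtype=float)
    return t / e + (1.0 - t) / (1.0 - e)

w = ipw_weights([1, 0, 1], [0.8, 0.2, 0.5])  # -> [1.25, 1.25, 2.0]
```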
5. Comparative Performance and Empirical Evaluation
Extensive experiments demonstrate that GOSDT and its weighted extensions attain higher accuracy at fixed sparsity than greedy (CART), mixed-integer programming (MIP-based ODT), and itemset DP (DL8.5, BinOCT) baselines. For example, on datasets such as FICO, the data-duplication and sampling approaches achieve near-optimal solutions (small certified objective gap) in under a minute, versus hours for direct weighted optimization. GOSDT dominates the test accuracy versus complexity Pareto frontier, especially when compared at fixed leaf counts (Behrouz et al., 2022, Lin et al., 2020).
| Algorithm | Objective Support | Handles Weighting | Practical Scaling |
|---|---|---|---|
| GOSDT | monotonic losses | Yes (extensions) | Large $n$ (bitvectors, duplication, sampling) |
| OSDT/DL8.5/BinOCT | 0–1 loss | No | Varies |
| MIP ODT | Flexible | Yes (IP) | Small $n$, small trees |
6. Extensions, Limitations, and Theoretical Guarantees
All weighted GOSDT variants inherit the core framework’s guarantees: convergence to the (possibly approximate) global optimum, certification of optimality, and tree interpretability. Integer-weight duplication yields a proven bound on error relative to the optimal weighted objective, with control via the scaling parameter and empirical weight discrepancy (Behrouz et al., 2022). Weighted-sampling methods guarantee that, for any fixed tree, empirical risk on the sample is close to the true weighted risk with high probability.
Limitations include exponential worst-case scaling in the number of effective binary features and sample equivalence classes; memory usage may also grow with the number of distinct sample support sets encountered during search. Non-monotonic, highly non-additive loss functions are not directly supported without deriving new lower bounds. In practice, GOSDT is most effective for moderate-dimensional, structured tabular data.
7. Relationship to Other Optimal Tree Methods
GOSDT advances over OSDT and integer-programming–based ODTs by combining scalable DP/B&B with tight, leaf-aware lower bounds and efficient data representations. In contrast to "Branches" (AO*-style AND/OR graph search), GOSDT identifies subproblems via data-support sets, imposes user-set maximum depth (or leaf) constraints, and leverages bitwise computation for speed. GOSDT requires discretized (binary) feature representations for continuous variables but efficiently handles vast numbers of thresholds; methods such as Branches can natively support multi-valued features without one-hot encoding, potentially yielding further efficiency in certain regimes (Chaouki et al., 2024, Gunluk et al., 2016).
A plausible implication is that GOSDT’s generalizations via weighting and objective flexibility establish it as a key algorithmic paradigm for interpretable model construction whenever sample importance, treatment regimes, or complex policy objectives are central concerns.