Sparse Decision Tree Learning
- Sparse decision tree learning is a method that imposes explicit sparsity penalties on tree structures to control complexity and enhance interpretability.
- It leverages various algorithmic paradigms, including greedy heuristics, convex regularization, and combinatorial optimization, to balance accuracy and sparsity.
- Empirical evaluations show that sparse trees achieve competitive performance with fewer leaves and features compared to traditional decision tree methods.
Sparse decision tree learning concerns the construction of decision trees with explicit control over tree complexity (e.g., number of leaves, variables used in splits, or depth) to enhance interpretability and generalization—central objectives for interpretable machine learning. The field spans convex and combinatorial optimization, statistical learning theory, and practical algorithm design, motivated by the NP-hardness of finding globally optimal sparse trees. Recent advances have yielded exact, near-exact, and regularized solutions, supported by new algorithmic paradigms, theoretical guarantees, and scalable implementations.
1. Formalization of Sparsity in Decision Tree Learning
Sparse decision tree learning addresses the supervised learning problem by regularizing the complexity of the model. The canonical setup involves a dataset $\{(x_i, y_i)\}_{i=1}^{n}$ with $x_i \in \mathbb{R}^{p}$ (or categorical), and $y_i$ in a finite label set. A decision tree $T$ partitions the feature space into disjoint regions corresponding to leaf nodes, each labeled by a prediction $\hat{y}$.
Sparsity is encoded via penalties on:
- The number of leaves or splits.
- The number of distinct variables appearing in splits (global variable usage).
- The number of variables per split (local sparsity).
- The total number of used features along any root-leaf path (path sparsity).
A typical optimization objective is
$$
R(T) \;=\; \hat{\ell}\big(T;\,\{(x_i, y_i)\}_{i=1}^{n}\big) \;+\; \lambda\,\Omega(T),
$$
minimized over trees $T$, where the complexity term $\Omega(T)$ can be the number of splits or leaves and $\lambda > 0$ governs the accuracy-sparsity trade-off (Hu et al., 2019, Blanquero et al., 2020, Lin et al., 2020, Chaouki et al., 4 Jun 2024, Babbar et al., 21 Feb 2025, Arslan et al., 5 Nov 2025).
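A minimal sketch of evaluating this objective for a fixed tree, assuming scikit-learn's CART as a stand-in for $T$ and an arbitrary penalty value (the exact solvers cited above optimize $R(T)$ globally rather than evaluating it for a greedily grown tree):

```python
# Minimal sketch: evaluate R(T) = empirical 0-1 loss + lam * (number of leaves)
# for a fitted tree. scikit-learn's CART is used only as a stand-in for T; exact
# methods (OSDT, GOSDT, Branches) optimize this objective globally.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 10))        # binary features
y = (X[:, 0] ^ X[:, 1]).astype(int)           # illustrative XOR-style labels

lam = 0.01                                    # per-leaf sparsity penalty (arbitrary here)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

loss = np.mean(tree.predict(X) != y)          # empirical misclassification rate
n_leaves = tree.get_n_leaves()                # Omega(T): number of leaves
print(f"loss={loss:.3f}  leaves={n_leaves}  R(T)={loss + lam * n_leaves:.3f}")
```

Sweeping the penalty and comparing candidate trees under this objective is what produces the accuracy-sparsity trade-off curves discussed in Section 3.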
Alternative formulations, particularly in oblique (linear-combination) trees, extend this framework using groupwise or elementwise sparsity-inducing norms to control local and global variable usage (Blanquero et al., 2020, Hou et al., 2020).
For ensembles or stumps, sparsity targets variable-selection consistency, seeking the true (or nearly true) active set among a large candidate pool (Banihashem et al., 2023).
2. Algorithmic Paradigms for Sparse Tree Construction
Sparse tree induction methods can be classified as:
(a) Greedy Heuristics:
CART, C4.5, and their variants iteratively select the best split by impurity reduction but provide no global optimality or sparsity control. Pruning heuristics are used post hoc.
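As a concrete baseline for this paradigm, a short sketch of greedy growth followed by post-hoc cost-complexity pruning with scikit-learn (the `ccp_alpha` path is the library's built-in pruning heuristic, not the global sparsity control of the exact methods below):

```python
# Sketch: greedy growth, then post-hoc cost-complexity pruning with scikit-learn.
# This controls sparsity heuristically and carries no global optimality guarantee.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Effective alphas at which subtrees of the fully grown tree are pruned away.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

for alpha in path.ccp_alphas[::5]:            # subsample the pruning path for brevity
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves():3d}  "
          f"test_acc={pruned.score(X_te, y_te):.3f}")
```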
(b) Convex Regularization for Oblique/Soft Trees:
Oblique trees such as S-ORCT (Blanquero et al., 2020) and sparse weighted oblique decision trees (SWODT) (Hou et al., 2020) enforce sparsity via convex or polyhedral regularization terms on the split coefficients; the resulting nonconvex nonlinear programs are solved approximately with multistart local solvers or interior-point methods.
(c) Combinatorial Branch-and-Bound, Dynamic Programming, AND/OR Search:
For axis-aligned trees over binary (and in some cases categorical) features, globally optimal trees are learned by dynamic programming and branch-and-bound using analytical bounds; a minimal dynamic-programming sketch follows the list below.
- OSDT: Bit-mask leaf representation, hierarchical and one-step lookahead bounds, permutation pruning (Hu et al., 2019).
- GOSDT: Support-set dynamic programming, with aggressive pruning via equivalent-points and similar-support bounds. Handles class imbalance and continuous-threshold enumeration (Lin et al., 2020).
- AO*-based methods (“Branches”): AND/OR graph search, purification heuristic bounds for best-first search, direct support for multi-way categorical variables (Chaouki et al., 4 Jun 2024).
- SPLIT/LicketySPLIT: Partial optimization via controlled lookahead (full dynamic programming up to a prescribed depth, greedy completion below), yielding exponential speedups over exact DP with minimal accuracy loss (Babbar et al., 21 Feb 2025).
- SORTD: Anytime (best-first) enumeration of the Rashomon set (near-optimal trees), with tight data-caching and pruning for scalability (Arslan et al., 5 Nov 2025).
(d) Decision Stumps for Sparse Recovery:
Theoretical analysis of one-level trees (“stumps”) for variable selection with high-dimensional statistical guarantees (Banihashem et al., 2023).
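To make the dynamic-programming paradigm in (c) concrete, the following minimal sketch memoizes subproblems by their support set and applies the simple per-split penalty bound implied by the objective in Section 1. It is illustrative only: OSDT, GOSDT, and Branches rely on bit-mask representations, much tighter analytical bounds, and careful caching that this sketch omits.

```python
# Minimal sketch of support-set dynamic programming for
#   R(T) = (number misclassified)/n + lam * (number of leaves)
# over binary features. Real solvers add bit-mask supports, far tighter bounds,
# and shared caching; this version is illustrative only.
from functools import lru_cache

def optimal_sparse_tree(X, y, lam, max_depth):
    n, p = len(X), len(X[0])

    @lru_cache(maxsize=None)
    def solve(support, depth):
        """Best (objective contribution, subtree) for the samples in `support`."""
        idx = sorted(support)
        ones = sum(y[i] for i in idx)
        leaf_err = min(ones, len(idx) - ones) / n       # misclassification mass as a leaf
        best_cost = leaf_err + lam                      # leaf option: loss + one leaf penalty
        best_tree = ("leaf", int(2 * ones >= len(idx)))
        # Bound: a split adds at least `lam` in penalty and removes at most `leaf_err`
        # in loss, so a leaf with leaf_err <= lam is never worth splitting.
        if depth == 0 or leaf_err <= lam:
            return best_cost, best_tree
        for j in range(p):
            left = frozenset(i for i in idx if X[i][j] == 0)
            right = support - left
            if not left or not right:
                continue
            lc, lt = solve(left, depth - 1)
            rc, rt = solve(right, depth - 1)
            if lc + rc < best_cost:
                best_cost, best_tree = lc + rc, ("split", j, lt, rt)
        return best_cost, best_tree

    return solve(frozenset(range(n)), max_depth)

# Larger lam yields sparser optimal trees; here the XOR structure needs 4 leaves.
X = [(0, 0), (0, 1), (1, 0), (1, 1)] * 10
y = [a ^ b for a, b in X]
print(optimal_sparse_tree(X, y, lam=0.01, max_depth=2))
```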
3. Explicit Sparsity–Accuracy Trade-Offs and Regularization
Sparsity-inducing penalties enable explicit navigation of accuracy–complexity Pareto frontiers. For axis-aligned trees, varying the penalty $\lambda$ directly traces test-error versus sparsity curves.
- For oblique trees, local-sparsity and global-sparsity regularization parameters can be varied on a logarithmic grid, yielding empirical accuracy-sparsity surfaces (Blanquero et al., 2020).
- The appropriate penalty level can be anticipated a priori via critical values of the regularization parameter beyond which the solution collapses to a trivial, fully sparse tree (Blanquero et al., 2020).
For Rashomon-set enumeration, the regularization parameter $\lambda$ and the Rashomon bound jointly determine the breadth and complexity of the model family (Arslan et al., 5 Nov 2025). Empirically, substantial reductions in model size (far fewer leaves) are possible without compromising generalization (Lin et al., 2020, Babbar et al., 21 Feb 2025, Blanquero et al., 2020).
4. Optimization, Complexity, and Theoretical Guarantees
The sparse decision tree problem is NP-hard (Hyafil & Rivest 1976). Practical tractability follows from:
- Pruning bounds: Hierarchical cost bounds, one-step lookahead bounds, node support bounds, equivalent-points minimum-error bounds, and child-symmetry (permutation) pruning (Hu et al., 2019, Lin et al., 2020); a representative bound is sketched after this list.
- Memoization: Hashing of support sets or clauses to reuse subproblem solutions (Lin et al., 2020, Chaouki et al., 4 Jun 2024).
- Admissible heuristics: Purification upper bounds in AO* (Chaouki et al., 4 Jun 2024).
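As a representative example of how such bounds prune the search (a sketch in the spirit of the OSDT leaf bounds, taking $\Omega(T)$ in the Section 1 objective to be the number of leaves): splitting a single leaf of a tree $T$ into two children increases the penalty by $\lambda$ while reducing the empirical loss by at most that leaf's misclassification mass $\hat{\ell}_{\mathrm{leaf}}$, so

$$
R(T') - R(T) \;=\; \big(\hat{\ell}(T') - \hat{\ell}(T)\big) + \lambda \;\ge\; \lambda - \hat{\ell}_{\mathrm{leaf}},
$$

and any leaf with $\hat{\ell}_{\mathrm{leaf}} \le \lambda$ need never be split; the corresponding subproblems are discarded without being expanded.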
Complexity results include:
- Problem-dependent: The number of subproblems expanded is governed by data size, feature count, the number of candidate splits, and the objective; the largest subproblems are pruned away by the regularization term (Chaouki et al., 4 Jun 2024).
- Runtime scaling: GOSDT and SPLIT achieve sub-exponential or polynomial scaling in practical regimes. LicketySPLIT achieves fully polynomial runtime with controlled accuracy loss (Babbar et al., 21 Feb 2025).
- Certificates of optimality: When bounds tighten to equality, algorithms output globally optimal trees for the specified objective and complexity budget (Hu et al., 2019, Lin et al., 2020, Chaouki et al., 4 Jun 2024).
For stumps, exact recovery of the true active set under linear/additive models is possible with a sample size matching Lasso minimax rates (Banihashem et al., 2023).
5. Empirical Evaluation and Practical Guidelines
Sparse tree algorithms are benchmarked on public datasets (UCI, COMPAS, FICO, etc.) and high-dimensional/imbalanced data.
Key empirical findings:
- GOSDT, OSDT, and Branches provide globally optimal sparse trees with higher accuracy than greedy methods at the same sparsity (Hu et al., 2019, Lin et al., 2020, Chaouki et al., 4 Jun 2024).
- SPLIT and LicketySPLIT yield nearly optimal sparse trees in a small fraction of the time required by exact solvers, with test loss rarely exceeding that of full DP (Babbar et al., 21 Feb 2025).
- SORTD enables Rashomon-set exploration up to two orders of magnitude faster than prior methods, supporting post-hoc variable-importance analysis and multi-objective trade-off evaluation (Arslan et al., 5 Nov 2025).
- For oblique trees, local and global sparsity regularization can eliminate a large fraction of the variables with negligible loss, or even mild improvements, in accuracy (Blanquero et al., 2020).
Guidelines for practitioners:
- Choose regularization parameters via grid search on performance–sparsity plots (a minimal sweep is sketched after this list).
- For interpretability, restrict the tree depth to a small value (e.g., $2$) and tune $\lambda$ for the desired leaf count.
- For mixed or categorical data, leverage algorithms (e.g., Branches) that natively support non-binary features (Chaouki et al., 4 Jun 2024).
- For imbalanced data, encode per-sample weights or select balanced metrics at training (Lin et al., 2020).
- Warm-start with greedy (CART) solutions where supported (Chaouki et al., 4 Jun 2024).
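A compact sketch combining several of these guidelines, assuming scikit-learn's `ccp_alpha` cost-complexity penalty as a stand-in for the per-leaf penalty $\lambda$ (exact solvers expose an analogous regularization parameter) and `class_weight="balanced"` for the imbalance guideline:

```python
# Sketch of the guidelines above: shallow depth for interpretability, a log grid over
# the per-leaf penalty (approximated here by scikit-learn's ccp_alpha), and balanced
# class weights for imbalanced data. Exact solvers expose an analogous lambda parameter.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for lam in np.logspace(-4, -1, 7):                     # logarithmic penalty grid
    clf = DecisionTreeClassifier(max_depth=3,          # shallow, interpretable tree
                                 ccp_alpha=lam,        # stand-in for the per-leaf penalty
                                 class_weight="balanced",
                                 random_state=0).fit(X_tr, y_tr)
    bal_acc = balanced_accuracy_score(y_te, clf.predict(X_te))
    print(f"lambda={lam:.1e}  leaves={clf.get_n_leaves():2d}  balanced_acc={bal_acc:.3f}")
```

The resulting (leaf count, accuracy) pairs form the performance–sparsity plot from which a penalty is chosen.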
6. Extensions: Oblique Trees, Embedded Constraints, and Rashomon Sets
Oblique Sparse Trees:
Oblique splitting (hyperplanes) with joint sparsity-inducing penalties (SWODT) yields rule matrices that are both accurate and highly sparse, and the resulting rules can be embedded directly as constraints in power-system dispatch via Big-M MILP formulations (Hou et al., 2020).
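As a sketch of the Big-M device in generic MILP-modeling terms (not the specific formulation of Hou et al., 2020): a learned oblique branching rule $w^\top x \le b$ at an internal node can be linked to a binary indicator $z \in \{0,1\}$, with $z = 1$ meaning the left branch is taken, via

$$
w^\top x \;\le\; b + M(1 - z), \qquad w^\top x \;\ge\; b + \epsilon - M z,
$$

where $M$ is a sufficiently large constant and $\epsilon > 0$ a small tolerance excluding the boundary case. A sparse $w$ keeps these constraints short and the required $M$ small, which is what makes the embedding practical.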
Enumerating Rashomon Sets:
Enumerating all near-optimal trees enables robust variable-importance analysis, user-driven model selection, and post-hoc filtering for additional objectives (e.g., fairness). SORTD's best-first enumeration finds Rashomon sets in order of objective value, supporting anytime retrieval and efficient memory/data caching (Arslan et al., 5 Nov 2025, Babbar et al., 21 Feb 2025).
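Concretely, writing $R(T)$ for the regularized objective and $R^{*}$ for its optimum, the Rashomon set is, in one common (additive) convention,

$$
\mathcal{R}_{\theta} \;=\; \{\, T \;:\; R(T) \le R^{*} + \theta \,\},
$$

where the Rashomon bound $\theta$ (sometimes specified multiplicatively as $R(T) \le (1 + \theta) R^{*}$) controls how far from optimal a tree may be while still being retained; enumerating trees in order of $R(T)$ is what enables anytime retrieval.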
Stump-Based Sparse Recovery:
Sparse variable selection with stumps achieves minimax sample complexity and generalizes to monotonic non-linear functions and arbitrary sub-Gaussian feature distributions (Banihashem et al., 2023). This framework provides theoretical understanding for the success of tree-based feature selection.
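A minimal sketch of the stump-based selection idea, as an illustrative ranking procedure rather than the exact estimator analyzed in the cited work: for each feature, compute the best variance reduction achievable by a single-threshold stump and keep the top-ranked features; under a sparse additive model the active variables dominate this ranking.

```python
# Illustrative sketch: rank features by the variance reduction of their best single-split
# stump; under a sparse additive model the truly active features dominate the ranking.
import numpy as np

def best_stump_gain(x, y):
    """Largest reduction in squared error achievable by one threshold split on x."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    n = len(ys)
    total_sse = np.sum((ys - ys.mean()) ** 2)
    csum, csum2 = np.cumsum(ys), np.cumsum(ys ** 2)
    best = 0.0
    for i in range(1, n):                               # left = first i sorted points
        if xs[i - 1] == xs[i]:                          # no valid threshold between ties
            continue
        left_sse = csum2[i - 1] - csum[i - 1] ** 2 / i
        rs, rs2 = csum[-1] - csum[i - 1], csum2[-1] - csum2[i - 1]
        right_sse = rs2 - rs ** 2 / (n - i)
        best = max(best, total_sse - left_sse - right_sse)
    return best

rng = np.random.default_rng(0)
n, p, s = 500, 50, 3                                    # s active features among p
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.8 * X[:, 1] - 0.6 * X[:, 2] + 0.3 * rng.normal(size=n)

gains = np.array([best_stump_gain(X[:, j], y) for j in range(p)])
print("top features by stump gain:", np.argsort(gains)[::-1][:s])  # typically {0, 1, 2}
```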
7. Scope, Limitations, and Open Challenges
Notwithstanding the progress, key challenges remain:
- Trade-offs between scalability and (approximate) optimality on high-dimensional datasets;
- Efficient handling of continuous features without discretization or severe feature blow-up;
- The extension of sparsity control and global optimization to deeper trees, multi-class objectives, and regression with structure-inducing penalties;
- Incorporation of domain constraints, fairness, or other post-hoc objectives into the tree-learning pipeline via Rashomon sets or multi-objective optimization.
Contemporary algorithms balance these axes via a mixture of combinatorial, convex, and statistical approaches, state-of-the-art branching heuristics, and parallelization when feasible. The literature documents continual gains in tractability, flexibility, and theoretical assurance, underlining sparse decision tree learning as an active, foundational area at the intersection of interpretability and optimization in machine learning (Blanquero et al., 2020, Lin et al., 2020, Hu et al., 2019, Chaouki et al., 4 Jun 2024, Babbar et al., 21 Feb 2025, Arslan et al., 5 Nov 2025, Banihashem et al., 2023, Hou et al., 2020).