Decision Trees under Partial Observability
- Decision trees under partial observability are models that use observable features to guide sequential decisions when the full state is unknown.
- Key methodologies include table-based sampling with Q-learning and differentiable soft decision trees that yield compact, interpretable policy representations.
- These approaches enhance efficiency and interpretability in various applications, such as industrial logistics and clinical decision support, while addressing unobservable state constraints.
Decision trees under partial observability (PO) are tree-based models for sequential decision making, prediction, or control in which the agent has no access to the true underlying state, only to an observation or feature set. The PO setting changes the representation of state, the policy class, and the computational procedures for constructing optimal or interpretable trees. Research has developed formal frameworks for PO in statistical learning, sequential decision processes, and interpretable model extraction. This article surveys structural foundations, key algorithmic paradigms, formal complexity results, and practical strategies for learning and interpreting decision trees under partial observability, with special attention to recent advances.
1. Formal Foundations: Observations, State Abstraction, and Policy Classes
Partial observability is defined by a mapping $\mathrm{obs}: \mathcal{S} \to \mathcal{O}$ from the underlying state space $\mathcal{S}$ to a set $\mathcal{O}$ of observable features or variables. In general, each observation $o \in \mathcal{O}$ corresponds to an equivalence class of states that are indistinguishable given the observable features. A policy under PO is then a mapping $\pi: \mathcal{O} \to \mathcal{A}$, where $\mathcal{A}$ is the action set.
Two major approaches to policy representation and deployment arise:
- Tabular or table-based memoryless PO: The policy is directly a mapping from the observation set $\mathcal{O}$ to actions, ignoring the unobservable state and the observation history altogether (Budde et al., 2024).
- History-augmented PO: The policy is represented as $\pi(a \mid o_t, h_t)$, where $o_t$ is the current observation and $h_t$ is an embedding of the past observation/action history, typically learned or constructed recursively (Pace et al., 2022).
Partial observability induces nontrivial equivalence classes among states, which leads to a "consistency" requirement: the set of enabled actions must be identical across all underlying states that share the same observation $o$ (Budde et al., 2024). Strategy and model extraction methods must respect this property.
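The consistency requirement can be checked mechanically on a small explicit model. The following is a minimal sketch, not taken from the cited work: the toy state space, the `observe` function, and both `enabled_*` functions are hypothetical and exist only for illustration.

```python
from collections import defaultdict

def check_consistency(states, observe, enabled):
    """Return the observations whose underlying states disagree on enabled actions."""
    by_obs = defaultdict(set)
    for s in states:
        # Group the enabled-action sets by the observation of each state.
        by_obs[observe(s)].add(frozenset(enabled(s)))
    return {o: sets for o, sets in by_obs.items() if len(sets) > 1}

# Hypothetical toy model: state = (position, hidden_fuel); only position is observable.
states = [(p, f) for p in range(3) for f in ("low", "high")]
observe = lambda s: s[0]

# Enabled actions depend only on the observable position: consistent.
enabled_ok = lambda s: {"left", "right"} if s[0] == 1 else {"stay"}
# Enabled actions depend on the hidden fuel level: inconsistent.
enabled_bad = lambda s: {"refuel"} if s[1] == "low" else {"stay"}

violations_ok = check_consistency(states, observe, enabled_ok)
violations_bad = check_consistency(states, observe, enabled_bad)
print(violations_ok)           # no violations
print(sorted(violations_bad))  # every position is a violating observation
```

An empty result means a memoryless policy over observations can always pick an action that is enabled in whichever hidden state the system is actually in.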
2. Methodologies for Strategy Construction under PO
Several algorithmic routes to decision-tree policies under PO have been formalized:
2.1 Sampling and Reinforcement Learning-Based Extraction
One approach (Budde et al., 2024) proceeds as follows:
- Use lightweight strategy sampling (LSS) or table-based Q-learning to learn a memoryless PO policy $\pi$ in a Markov automata model.
- Extract the resulting mapping from simulation trajectories: collect all observed (observation, action) pairs $(o, a)$ and deduplicate.
- Input this tabular policy into a decision tree learner (e.g., the dtControl tool) to obtain an axis-aligned tree that predicts the action $a$ from the observation $o$ only.
The resulting tree is minimal with respect to observable features, highly compact, and interpretable. Empirical evidence on industrial-scale models (e.g., open-pit mining logistics) shows that trees extracted from PO policies are 1–2 orders of magnitude smaller than those from full observability (FO) strategies, rapidly capturing core domain rules (Budde et al., 2024).
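The first two steps of this pipeline can be sketched on a toy model. This is an illustrative sketch under stated assumptions, not the implementation of Budde et al.: the one-truck line world, the `step` and `observe` functions, the action names, and all hyperparameters are hypothetical, and a deterministic toy environment stands in for a Markov automaton. The final hand-off to dtControl is not shown.

```python
import random
from collections import defaultdict

N = 3                                   # positions 0..N-1; shovel at 0, dump at N-1
ACTIONS = ("to_dump", "to_shovel")

def step(state, action):
    """Toy truck model (hypothetical): returns (next_state, reward)."""
    pos, full = state
    if action == "to_dump":
        if pos == N - 1:                # at the dump: unloading pays 1 if loaded
            return ((pos, False), 1.0) if full else ((pos, full), 0.0)
        return (pos + 1, full), 0.0
    if pos == 0:
        return (pos, True), 0.0         # at the shovel: the truck gets loaded
    return (pos - 1, full), 0.0

def observe(state):
    """Only the load flag is observable; the position stays hidden."""
    return state[1]

def q_learning(steps=20000, alpha=0.1, gamma=0.5, eps=0.2, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)              # hash table over visited (obs, action) pairs
    state = (0, False)
    for _ in range(steps):
        o = observe(state)
        if rng.random() < eps:
            a = rng.choice(ACTIONS)     # explore
        else:                           # exploit, breaking ties at random
            a = max(ACTIONS, key=lambda b: (q[(o, b)], rng.random()))
        state, r = step(state, a)
        target = r + gamma * max(q[(observe(state), b)] for b in ACTIONS)
        q[(o, a)] += alpha * (target - q[(o, a)])
    return q

def extract_table(q, steps=50):
    """Simulate the greedy policy and deduplicate the observed (obs, action) pairs."""
    pairs, state = set(), (0, False)
    for _ in range(steps):
        o = observe(state)
        a = max(ACTIONS, key=lambda b: q[(o, b)])
        pairs.add((o, a))
        state, _reward = step(state, a)
    return dict(pairs)

q = q_learning()
policy_table = extract_table(q)
print(policy_table)
```

The resulting table has one row per visited observation rather than one per state, which is why the tree built from it can stay small even when the hidden state space is large.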
2.2 Differentiable and Adaptive Tree Architectures
The POETREE framework (Pace et al., 2022) introduces fully differentiable soft decision trees optimized via imitation learning in partially observable Markov decision processes (POMDPs). A decision policy $\pi$ is modeled by a binary probabilistic tree:
- Every inner node computes a soft gate (logistic function) over the observation or observation-history embedding, allowing differentiable routing.
- Each leaf associates a softmax over actions, parameterizing a (possibly stochastic) policy.
- The tree is grown adaptively via validation-based splitting and pruned for parsimony. A history embedding $h_t$ is updated in the leaves via a learned recurrence. Parameters are tuned by backpropagation to match observed expert behavior.
- This architecture bridges PO constraints with interpretability: the final tree is a transparent, evolving policy depending on observations and encoded history.
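The routing described in these bullets can be illustrated with a forward pass through a small soft tree. This is a minimal sketch and not the POETREE code: the dictionary layout, the convention that the gate value is the probability of going left, and all parameter values are assumptions, and the learned history recurrence, training, and tree growth are omitted.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_tree_policy(node, x):
    """Action distribution: leaf softmaxes mixed by the soft path probabilities."""
    if "logits" in node:                  # leaf: softmax over actions
        return softmax(node["logits"])
    # Inner node: logistic gate over the input features (assumed: P(go left)).
    gate = sigmoid(sum(w * xi for w, xi in zip(node["w"], x)) + node["b"])
    left = soft_tree_policy(node["left"], x)
    right = soft_tree_policy(node["right"], x)
    return [gate * l + (1.0 - gate) * r for l, r in zip(left, right)]

# Hypothetical depth-1 tree over a 2-dimensional observation, two actions.
tree = {
    "w": [4.0, 0.0], "b": 0.0,
    "left": {"logits": [2.0, 0.0]},       # leaf preferring action 0
    "right": {"logits": [0.0, 2.0]},      # leaf preferring action 1
}

probs_pos = soft_tree_policy(tree, [1.0, 0.0])
probs_neg = soft_tree_policy(tree, [-1.0, 0.0])
print(probs_pos, probs_neg)
```

Because every operation is smooth in the weights, biases, and leaf logits, the output distribution can be fitted to expert actions by gradient descent, which is the property the framework relies on.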
3. Policy Extraction, Tree Representation, and Interpretability under PO
PO introduces essential representational and interpretability advantages and challenges:
- Feature Selection and Model Size: Restricting trees to observable features yields dramatically more compact, interpretable decision rules. Under FO, irrelevant or unobservable variables needlessly increase tree width and depth (Budde et al., 2024).
- Axis-Aligned Trees via dtControl and Related Tools: Extracted trees operate only on observable features, facilitating rule-based policy diagrams like "if full=true then to_dump_1 else towards_shovel_0". Domain experts can directly validate or deploy these rules.
- Differentiability and Policy Flexibility: Fully-differentiable trees allow continuous optimization and automatic architecture selection under imitation constraints in POETREE (Pace et al., 2022).
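To make the table-to-rule step concrete, the sketch below fits a small axis-aligned tree to a tabular policy over boolean observable features and renders it as a rule. It is a simplified stand-in written for illustration, not dtControl and not its algorithm: the greedy error-count splitting criterion, the feature `at_dump`, and the policy table are assumptions, while the action names are borrowed from the rule quoted above.

```python
from collections import Counter

def majority(rows):
    """Most frequent action among (observation, action) rows."""
    return Counter(a for _, a in rows).most_common(1)[0][0]

def split_errors(rows, feature):
    """Rows misclassified if each side of the split predicts its majority action."""
    errors = 0
    for value in (True, False):
        side = [(o, a) for o, a in rows if o[feature] == value]
        if side:
            best_action = majority(side)
            errors += sum(1 for _, a in side if a != best_action)
    return errors

def build_tree(rows, features):
    """Return a leaf (action name) or an inner node (feature, if_true, if_false)."""
    if len({a for _, a in rows}) == 1 or not features:
        return majority(rows)
    best = min(features, key=lambda f: split_errors(rows, f))
    rest = [f for f in features if f != best]
    yes = [(o, a) for o, a in rows if o[best]]
    no = [(o, a) for o, a in rows if not o[best]]
    if not yes or not no:                 # feature does not split the rows: skip it
        return build_tree(rows, rest)
    return (best, build_tree(yes, rest), build_tree(no, rest))

def render(tree):
    if isinstance(tree, str):
        return tree
    feature, yes, no = tree
    return f"if {feature}=true then {render(yes)} else {render(no)}"

# Hypothetical policy table over two observable features.
table = [
    ({"full": full, "at_dump": at_dump},
     "to_dump_1" if full else "towards_shovel_0")
    for full in (True, False) for at_dump in (True, False)
]

tree = build_tree(table, ["at_dump", "full"])
rule = render(tree)
print(rule)
```

The learner ignores `at_dump` because it carries no information about the action, which mirrors the point made above that irrelevant variables only add width and depth.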
In medical decision-support applications under PO, POETREE achieves superior performance and interpretability metrics compared to recurrent neural network (RNN)-based baselines and other competing methods, with clinician-rated clarity scores approaching those of classical decision trees (Pace et al., 2022).
4. Theoretical Properties: Consistency, Complexity, and Limitations
PO trees and associated extraction methods admit the following structural and computational guarantees:
- Policy Consistency: All extracted trees must respect action-set consistency within each observable equivalence class. The strategy sampling and Q-learning algorithms enforce this at the data collection and mapping stages (Budde et al., 2024).
- Learning and Evaluation: Lightweight strategy sampling leverages deterministic hashing and empirical evaluations, efficiently searching the policy space under PO. Q-learning on observations requires only a hash table over visited (observation, action) pairs.
- Model Size and Generalizability: Trees under PO are provably smaller and avoid overfitting to unobservable or irrelevant state-space dimensions. Empirical results show strong scalability and interpretability even with hundreds of thousands of possible system states (Budde et al., 2024).
- Limitations: No method can recover unmeasured latent state variables. Offline-only data and lack of interventions limit guarantees on outcome optimality. In high-dimensional observation spaces, additional regularization may be required for human readability (Pace et al., 2022).
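The role of deterministic hashing in strategy sampling can be sketched as follows. This is an assumption-laden illustration rather than the published algorithm: it supposes that a candidate strategy is identified by an integer and that its action for an observation is selected by hashing the pair, and it reuses a hypothetical one-truck line world. Because the toy model is deterministic, a single run per candidate suffices here, whereas a stochastic model would need statistical evaluation over many runs.

```python
import hashlib

N = 3                                     # positions 0..N-1; shovel at 0, dump at N-1
ACTIONS = ("to_dump", "to_shovel")

def step(state, action):
    """Toy truck model (hypothetical): returns (next_state, reward)."""
    pos, full = state
    if action == "to_dump":
        if pos == N - 1:
            return ((pos, False), 1.0) if full else ((pos, full), 0.0)
        return (pos + 1, full), 0.0
    if pos == 0:
        return (pos, True), 0.0
    return (pos - 1, full), 0.0

def lss_action(strategy_id, observation, enabled):
    """Pick an action by hashing (strategy id, observation): no policy table is stored."""
    key = f"{strategy_id}|{observation!r}".encode()
    digest = hashlib.sha256(key).digest()
    return enabled[int.from_bytes(digest[:8], "big") % len(enabled)]

def evaluate(strategy_id, horizon=60):
    """Total reward of one candidate strategy; it only ever sees the load flag."""
    state, total = (0, False), 0.0
    for _ in range(horizon):
        obs = state[1]
        state, reward = step(state, lss_action(strategy_id, obs, ACTIONS))
        total += reward
    return total

# Sample candidate strategy identifiers and keep the empirically best one.
scores = {sid: evaluate(sid) for sid in range(64)}
best_id = max(scores, key=scores.get)
print(best_id, scores[best_id])
```

Since the hash depends only on the observation, every sampled strategy is memoryless and consistent across indistinguishable states by construction.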
5. Practical Applications and Empirical Evidence
Recent studies document practical deployment and validation of PO trees in several domains:
- Industrial Logistics: Open-pit mine scheduling demonstrates the scaling of PO trees to models with 80+ trucks and large grids, with empirically tight confidence intervals for maximized/minimized objectives (Budde et al., 2024).
- Clinical Policy Modeling: POETREE yields interpretable, history-adaptive treatment or intervention policies from large EHR or registry datasets. Metrics such as AUROC, AUPRC, Brier score, and interpretability ratings demonstrate effective transfer of physician decision rules under PO structures (Pace et al., 2022).
- Reinforcement Learning and Control: Decision trees over observable variables serve as transparent, memoryless controllers in Markov automata with partial observability, outperforming naïve or uniform strategies (Budde et al., 2024).
6. Extensions: Robustness, Alternative Tree Representations, and Variable Importance
Advanced work has considered the implications of PO for variable importance, predictive equivalence, and robustness to missing data:
- Rewriting Trees via Boolean Logical Forms: Decision boundaries can be represented canonically as Disjunctive Normal Forms (DNFs) over observable features, eliminating predictive equivalence among different trees yielding the same behavior, and enabling robust prediction in the presence of missing features via logical entailment (McTavish et al., 2025).
- Variable Importance under PO: DNF-based global importance metrics, counting the presence of variables in prime implicants across the DNF, yield missingness-robust, canonical importances irrespective of evaluation order (McTavish et al., 2025).
- Cost-Sensitive Feature Acquisition: Dynamic programming and Q-learning methods on partial assignments to observable features enable cost-efficient querying until a robust (i.e., logically entailed) prediction can be made under the PO tree (McTavish et al., 2025).
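The idea of a logically entailed prediction under missing features can be illustrated directly on a tree, without constructing the DNF. The sketch below is not the procedure of McTavish et al.: it enumerates the leaves reachable under every completion of a partial assignment, and the tuple-based tree encoding and the example tree are hypothetical.

```python
def reachable_leaves(tree, partial):
    """Labels of all leaves reachable under some completion of the partial assignment."""
    if isinstance(tree, str):             # leaf: a class label
        return {tree}
    feature, if_true, if_false = tree
    if feature in partial:
        branch = if_true if partial[feature] else if_false
        return reachable_leaves(branch, partial)
    # Missing feature: follow both branches, remembering the assumed value so
    # that repeated tests of the same feature stay mutually consistent.
    return (reachable_leaves(if_true, {**partial, feature: True})
            | reachable_leaves(if_false, {**partial, feature: False}))

def robust_predict(tree, partial):
    """Return the label if every completion agrees on it, otherwise None."""
    leaves = reachable_leaves(tree, partial)
    return next(iter(leaves)) if len(leaves) == 1 else None

# Hypothetical tree computing "a OR b": the root tests a, the false branch tests b.
tree = ("a", "pos", ("b", "pos", "neg"))

pred_b_known = robust_predict(tree, {"b": True})    # a is missing, yet "pos" is entailed
pred_b_false = robust_predict(tree, {"b": False})   # outcome still depends on a
pred_nothing = robust_predict(tree, {})
print(pred_b_known, pred_b_false, pred_nothing)
```

A naive top-down traversal would stop at the root when `a` is missing, whereas the entailment view returns "pos" as soon as `b` is known to be true. A cost-sensitive acquisition loop would keep querying features until `robust_predict` stops returning `None`.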
7. Summary Table: Representative PO Decision Tree Frameworks
| Framework | Policy Representation | Extraction Method | Interpretability |
|---|---|---|---|
| Q-Learning + dtControl (Budde et al., 2024) | Axis-aligned tree over observable features | Simulation-driven, tabular → tree | High |
| POETREE (Pace et al., 2022) | Differentiable soft tree | Backpropagation, imitation learning | High (after pruning/adaptation) |
| DNF-Tree (McTavish et al., 2025) | Minimal DNF over observable features | Symbolic simplification | Canonical/logical |
In conclusion, decision trees under partial observability define a principled, technically robust, and empirically validated pathway for extracting compact, interpretable, and effective policies in a wide spectrum of sequential, causal, and control domains. Their foundations, methods of training and extraction, and robust representations position them as a central tool for interpretable policy and control learning under incomplete information.