Interactive RL with Decision Tree Feedback
- Interactive Reinforcement Learning with Decision Tree Feedback is a framework that integrates structured, interpretable decision trees with RL to boost sample efficiency and performance.
- It employs iterative loops where RL agents expose weaknesses in tree policies while expert or LLM feedback refines these structures.
- Applications include game AI, reward and feature learning, and biometric identification by leveraging transparent decision logic and adaptive learning.
Interactive Reinforcement Learning with Decision Tree Feedback refers to a class of methodologies that integrate decision tree models into the interactive loop of reinforcement learning (RL). This paradigm exploits the structured, interpretable nature of decision trees to inform RL agent learning, to shape reward or policy spaces, to inject domain knowledge, or to enable human-in-the-loop corrections. The feedback cycles can be fully automated, human-guided, or organized as iterative hybrid loops where decision trees supply structure used by RL processes, and RL in turn identifies weaknesses or optimizes tree-based decision logic. The approach addresses interpretability, sample efficiency, and alignment challenges present in pure neural or black-box RL systems, and has been instantiated in settings spanning game AI, reward learning, feature selection, and biometric identification.
1. Formalizations and Problem Domains
Across instantiations, interactive RL with decision tree feedback is commonly formulated via Markov Decision Processes (MDPs) or multi-agent frameworks, with decision trees appearing as policy representations, reward models, or feature hierarchy encoders.
- In the RL-LLM-DT framework for two-player zero-sum games, the state space contains full environment configurations, the action space encodes executable moves (e.g., curling throws), and transitions model the environment physics. Decision-tree policies specify complete if–then strategies (Lin et al., 2024).
- In feature selection, each RL agent selects features given a state constructed from graph and decision-tree representations; tree-derived importance scores and splits inform both the state encoding and the per-agent rewards (Fan et al., 2020).
- In interactive reward learning, Differentiable Decision Trees (DDT) parameterize the reward model used by the RL agent, trained from human preferences via end-to-end backpropagation (Kalra et al., 2023).
- In interactive ID recognition, the evolving state includes statistics over each node of potentially multiple tree models, which are incrementally refined by RL-guided edits, rewarded based on expert corrections (Li et al., 2021).
2. Interactive Loops: RL, Tree Feedback, and Automation
The interaction between RL and decision trees is realized via tightly coupled feedback cycles where tree evaluation and RL policy optimization inform each other. The prominent RL-LLM-DT methodology operates by:
- Fixing a candidate decision tree as policy,
- Training an RL agent (typically PPO-based) to discover counter-strategies or adversarial trajectories against the tree,
- Recording failure traces where RL surpasses the tree (>50% win-rate),
- Using an LLM Critic to analyze traces, suggest concrete tree modifications in natural language,
- Automating the code synthesis of improved trees via an LLM Coder,
- Iterating until RL cannot reliably exploit the tree or no further constructive tree refinements are produced (Lin et al., 2024).
The process is formalized as an iterative algorithm where decision trees and RL agents alternately refine and probe each other, with LLMs acting as policy improvement or repair agents. In alternative frameworks, human experts may play the role of tree critic, offering data-labeled corrections that drive RL-driven structural tree edits (Li et al., 2021).
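The alternating loop described above can be sketched in a few lines. This is a toy illustration under loud assumptions, not the implementation of Lin et al.: the "tree" is reduced to a set of opponent openings it can counter, and `train_challenger`, `critic`, and `coder` are hypothetical stand-ins for PPO training, the LLM Critic, and the LLM Coder.

```python
from typing import Optional, Set

Tree = Set[int]  # toy "tree": the set of opponent openings it has a counter for


def train_challenger(tree: Tree, openings: range) -> Optional[int]:
    """Stand-in for RL training: return an opening the tree cannot counter, or None."""
    for opening in openings:
        if opening not in tree:
            return opening
    return None


def critic(trace: int) -> str:
    """Stand-in for the LLM Critic: turn a failure trace into a natural-language fix."""
    return f"add a branch that counters opening {trace}"


def coder(tree: Tree, suggestion: str) -> Tree:
    """Stand-in for the LLM Coder: compile the suggestion into a new tree."""
    return tree | {int(suggestion.rsplit(" ", 1)[1])}


def refine(tree: Tree, openings: range, max_iters: int = 20) -> Tree:
    """Alternate between probing the tree and repairing it."""
    for _ in range(max_iters):
        exploit = train_challenger(tree, openings)
        if exploit is None:       # the challenger can no longer beat the tree
            break
        new_tree = coder(tree, critic(exploit))
        if new_tree == tree:      # no semantically different improvement
            break
        tree = new_tree
    return tree


final_tree = refine({0}, range(4))
```

The two `break` conditions mirror the stopping criteria in the list above: either the challenger finds no exploit, or the proposed revision does not change the tree.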
3. Methodological Instantiations
3.1 Automated Decision Tree Generation via RL and LLM
The RL-LLM-DT framework implements an automatic, closed-loop improvement of game AIs:
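- RL Evaluation Step: The challenger policy $\pi_\theta$ is optimized against the fixed tree policy $\pi_{\mathrm{DT}}$ by maximizing its expected discounted return,
  $$J(\theta) = \mathbb{E}_{\tau \sim (\pi_\theta,\, \pi_{\mathrm{DT}})}\left[\sum_{t} \gamma^{t} r_t\right],$$
  using distributed PPO.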
- LLM Enhancement: After RL exposes weaknesses, an LLM Critic analyzes match traces and produces improved tree specifications which are auto-compiled to functional code by an LLM Coder.
- Termination: If no RL agent can beat the latest tree, or if the LLM fails to generate a semantically different improvement, the process halts.
Experiments in the Jidi curling simulator demonstrate that this iterative RL-tree-LLM loop can produce strategies that outperform human-crafted trees, with final agents attaining top leaderboard ranks (Lin et al., 2024).
3.2 Differentiable Reward Trees from Human Feedback
Reward learning with DDTs constructs interpretable, continuous reward functions of the form
$$r_\phi(s) = \sum_{\ell \in \mathcal{L}} P_\ell(s)\, r_\ell,$$
where $P_\ell(s)$ are the soft path probabilities of reaching leaf $\ell$ through the tree and $r_\ell$ is the reward value stored at that leaf. Tree parameters are optimized to fit human trajectory preferences. Plugging the DDT reward into RL makes the agent's motivation traceable, with a clear tradeoff: soft-output DDTs yield the best RL performance, while hard, path-deterministic DDTs maximize interpretability (Kalra et al., 2023).
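A soft decision-tree reward of the kind described above can be sketched as follows. Each internal node routes the input left with a sigmoid gate, the product of gates along a path gives that leaf's soft path probability, and the reward is the probability-weighted sum of leaf values. The heap-ordered node layout, the function names, and the argmax-leaf "hard" variant are assumptions made for illustration, not the exact parameterization of Kalra et al.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def leaf_probabilities(x: Sequence[float], weights: List[List[float]],
                       biases: List[float], depth: int) -> List[float]:
    """Soft path probabilities for a complete binary tree (nodes in heap order)."""
    probs = [1.0]
    for level in range(depth):
        first_node = 2 ** level - 1  # heap index of the first node on this level
        next_probs = []
        for i, p in enumerate(probs):
            node = first_node + i
            gate = sigmoid(sum(w * xi for w, xi in zip(weights[node], x)) + biases[node])
            next_probs += [p * gate, p * (1.0 - gate)]  # left child, right child
        probs = next_probs
    return probs


def soft_reward(x, weights, biases, leaf_rewards, depth) -> float:
    """Probability-weighted sum of leaf rewards (the 'soft' output)."""
    probs = leaf_probabilities(x, weights, biases, depth)
    return sum(p * r for p, r in zip(probs, leaf_rewards))


def hard_reward(x, weights, biases, leaf_rewards, depth) -> float:
    """One simple hardening (assumed here): return the most probable leaf's reward."""
    probs = leaf_probabilities(x, weights, biases, depth)
    best = max(range(len(probs)), key=probs.__getitem__)
    return leaf_rewards[best]


# Depth-1 tree with an uninformative gate: both leaves get probability 0.5.
r = soft_reward([5.0], [[0.0]], [0.0], [1.0, 3.0], depth=1)
```

The soft output blends all leaves and is smooth in the input, whereas the hard output commits to a single leaf and can be read off as one rule path, which is the tradeoff noted above.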
3.3 RL-Guided Tree Editing with Human Feedback
Interactive person identification via RLTIR encodes current model structure as a tree ensemble state. Actions (e.g., expanding, collapsing, or density-tuning nodes) are selected by a Q-network. Rewards are derived from expert-provided corrections. This iterative loop incrementally steers the tree ensemble toward improved accuracy and robustness while adaptively minimizing user labeling effort (Li et al., 2021).
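The action-selection step can be illustrated with a small sketch. It replaces the Q-network with a plain lookup table and uses a standard epsilon-greedy rule with a one-step Q-learning update; the action names, the table representation, and the hyperparameters are assumptions for illustration and are not taken from Li et al.

```python
import random
from typing import Dict

# Edit actions named in the text; the string labels themselves are illustrative.
ACTIONS = ("expand", "collapse", "increase_density", "decrease_density")


def select_action(q_values: Dict[str, float], epsilon: float,
                  rng: random.Random) -> str:
    """Epsilon-greedy choice over tree-edit actions."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_values.get(a, 0.0))


def q_update(q_values: Dict[str, float], action: str, reward: float,
             next_max: float, alpha: float = 0.5, gamma: float = 0.9) -> None:
    """One-step Q-learning update; `reward` would come from an expert correction."""
    old = q_values.get(action, 0.0)
    q_values[action] = old + alpha * (reward + gamma * next_max - old)


q: Dict[str, float] = {}
q_update(q, "expand", reward=1.0, next_max=0.0)   # expert confirmed the edit helped
chosen = select_action(q, epsilon=0.0, rng=random.Random(0))
```

With `epsilon=0.0` the selection is purely greedy, so the action that just received positive expert feedback is chosen again.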
3.4 Feature Selection as Multi-Agent RL with Tree Feedback
In feature selection, agents select feature subsets. Decision tree feedback enriches both state—by injecting splitting-hierarchy structure into graph convolutional state embeddings—and reward, via personalized accuracy and feature importance scores. The loop permits faster and more accurate feature discovery than either RL or classical methods alone (Fan et al., 2020).
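One way to picture a tree-informed, per-agent reward is to split the downstream accuracy among the selected features in proportion to their decision-tree importance. The sketch below does exactly that; the splitting rule and the function name `agent_rewards` are assumptions chosen for illustration and are not the reward definition of Fan et al.

```python
from typing import List, Sequence


def agent_rewards(accuracy: float, importances: Sequence[float],
                  selected: Sequence[bool]) -> List[float]:
    """Share `accuracy` among selected features by relative tree importance.

    Agents whose feature is not selected receive zero reward.
    """
    total = sum(imp for imp, sel in zip(importances, selected) if sel)
    if total == 0:
        return [0.0] * len(importances)
    return [accuracy * imp / total if sel else 0.0
            for imp, sel in zip(importances, selected)]


# Three agents; features 0 and 2 are selected, the downstream accuracy is 0.8.
rewards = agent_rewards(0.8, [0.6, 0.2, 0.2], [True, False, True])
```

Under this rule the rewards of the selected agents sum to the downstream accuracy, so an agent whose feature the tree relies on more heavily receives a larger share.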
4. Representative Algorithms and Prompting Paradigms
Prompt engineering and automated code generation with LLMs enable efficient automation of the loop.
- LLM Prompts:
  - Coder Prompt: Requests explicit code implementations of tactics, given formal rules and a tree specification.
  - Critic Prompt: Supplies rules, prior tree logic/code, and failure traces, soliciting a revised, more robust tree in natural language.
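- RL Training (PPO Loss): PPO's clipped surrogate objective, in its standard form,
  $$L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(\rho_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(\rho_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],$$
  where $\rho_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the probability ratio and $\hat{A}_t$ is the advantage estimate.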
- Decision Tree Incremental Update Actions: Tree modifications (increase/decrease density, expand/collapse node) are mapped to local structural changes in response to RL agent or human signals (Li et al., 2021).
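For the PPO loss item in the list above, the clipped surrogate objective that PPO commonly optimizes can be computed as in the sketch below. This is a minimal plain-Python illustration, not the distributed training code of the cited work; the function name `ppo_clip_objective` and the default `eps=0.2` are assumptions made for the example.

```python
from typing import Sequence


def ppo_clip_objective(ratios: Sequence[float], advantages: Sequence[float],
                       eps: float = 0.2) -> float:
    """Mean clipped surrogate objective over a batch (to be maximized).

    ratios[i]     = pi_new(a_i | s_i) / pi_old(a_i | s_i)
    advantages[i] = advantage estimate for sample i
    """
    terms = []
    for ratio, adv in zip(ratios, advantages):
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        # Taking the minimum makes the objective a pessimistic bound on the update.
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)


# A ratio of 1.5 with positive advantage is capped at 1 + eps.
value = ppo_clip_objective([1.5], [1.0])
```

Clipping removes the incentive to move the policy far from the one that collected the data, which helps when the challenger is repeatedly retrained against a changing tree opponent.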
5. Experimental Findings and Comparative Performance
Reported results span game AI, reward learning, biometric identification, and feature selection.
Game AI (RL-LLM-DT/Jidi Curling)
- LLM-enhanced trees produced by iterative RL-LLM refinement top the leaderboard:
  - Tree III (final auto-refined tree): score 0.93, rank 1/34 agents
  - Human-designed tree: score 0.80, rank 3/34
- Training time per RL iteration increases as trees grow stronger, which is consistent with the refined trees being harder for the challenger to exploit (Lin et al., 2024).
Reward Learning
- DDT-based reward models, when used with PPO, frequently match or surpass black-box reward nets, particularly for well-shaped (soft) reward settings. For instance, CRL-DDT(soft) outperforms a neural net baseline on CartPole, and depth-2 DDTs nearly match optimal policy performance on gridworlds and Atari benchmarks (Kalra et al., 2023).
Interactive ID Recognition
- RLTIR outperforms conventional baselines on gait and keystroke datasets, with moderate (<35%) human feedback per instance and steady improvements in AUC and F1-score over static models (Li et al., 2021).
Feature Selection
- Interactive RL with decision tree feedback provides 2–3× faster convergence and up to 5-point accuracy boosts over classical selectors. Incorporating dual-graph state encoding and tree-importance-based rewards consistently improves returns (Fan et al., 2020).
| Method | Task Domain | Key Result |
|---|---|---|
| RL-LLM-DT (Lin et al., 2024) | Game AI/Curling | Rank 1/34, score 0.93 |
| DDT-RLHF (Kalra et al., 2023) | RLHF/Atari, Gridworld | DDT matches/surpasses reward net baselines |
| RLTIR (Li et al., 2021) | Biometric ID | AUC +1.5 pp (gait), +0.4 pp (keystroke) |
| IRLFS + DTF (Fan et al., 2020) | Feature selection | 2–3× faster convergence, up to +5 pts accuracy vs classical selectors |
6. Interpretability, Automation, and Limitations
The recurring motivation is interpretability: tree policies, reward structures, and feature importances are human-traceable, enabling audit, diagnosis, and real-time intervention by an expert or an LLM. The loop can be fully automated when an LLM takes over the expert's role in refining strategy (Lin et al., 2024).
Several limitations are apparent:
- LLM refinement quality depends on rule knowledge and domain specificity; after several iterations, feedback quality may decay, occasionally resulting in trivial or redundant tree edits (Lin et al., 2024).
- For high-dimensional or complex domains (e.g., Go, StarCraft), scalability may require fine-tuned LLMs or additional methodological innovation (such as Monte Carlo Tree Search in the critique phase).
- There exists a fundamental tradeoff between reward shaping/capacity (soft DDTs) and interpretability (hard, path-deterministic DDTs) (Kalra et al., 2023).
- Nontrivial computational overhead is introduced by real-time tree encoding, graph construction, and iterative RL in large state/action spaces (Fan et al., 2020).
- General convergence guarantees for fully automated hybrid graph/tree-RL architectures remain an open problem.
7. Extensions and Future Research Directions
Emerging research explores the generalization of interactive RL/tree feedback frameworks to:
- Automated agent self-play involving multiple competing tree policies, enabling discovery of edge cases and robustness improvements (Lin et al., 2024).
- Incorporation of richer tree-based feedback (e.g., random forests, gradient boosting) as downstream trainers (Fan et al., 2020).
- Generalizing feature selection approaches to non-decision-tree downstream models and integrating continuous-action relaxations (Fan et al., 2020).
- Human-in-the-loop hybridization in online settings, to combine explanation, transparency, and autonomous RL-driven discovery (Li et al., 2021).
A plausible implication is that the fusion of deep RL, interpretable tree models, and automated language-based suggestion systems will enable increasingly robust, diagnostically transparent, and sample-efficient learning across a wide array of sequential decision domains.