Codified Decision Trees for Agent Behavior
- CDT is a hierarchical, explicitly interpretable decision structure that models narrative agent behavior through scene-conditioned rules.
- It is induced from scene-action pairs using clustering, LLM-driven hypothesis generation, and rigorous NLI-based validation.
- Empirical results show CDT and CDT-Lite outperform hand-authored profiles and prior data-driven baselines on NLI-based consistency benchmarks while remaining deterministic, transparent, and robust across contexts.
A Codified Decision Tree (CDT) is an explicitly interpretable, executable decision structure for encoding behavioral profiles of agents, particularly in narrative or role-playing (RP) environments. Unlike traditional, static hand-authored profiles, CDT is constructed via a data-driven induction process over (scene, action) pairs, yielding a hierarchical tree whose branches are labeled by validated scene-conditioned predicates and whose leaves comprise grounded behavioral statements. This framework supports deterministic inference, rigorous validation, and transparent inspection, resulting in robust agent consistency across diverse contexts (Peng et al., 15 Jan 2026).
1. Formal Definition and Structure
A CDT for a character is a rooted tree whose nodes hold two kinds of content:
- A (possibly empty) set of behavioral statements $B(v) \subseteq \mathcal{A}$, where $\mathcal{A}$ is the set of grounded action statements.
- A (possibly empty) set of outgoing edges to child nodes, each labeled by a predicate-question $q$ on scene descriptions.
Given the space $\mathcal{S}$ of all textual scenes and a binary discriminator function $D(s, q) \in \{0, 1\}$ that decides whether predicate $q$ holds for scene $s$, the execution (inference) semantics for a scene $s \in \mathcal{S}$ are:
- Initialize the grounding set $G \leftarrow B(\mathrm{root})$.
- At the current node, for each outgoing edge to a child $c$ labeled by $q$: if $D(s, q) = 1$, update $G \leftarrow G \cup B(c)$ and recurse on $c$.
- The output is the union of all $B(v)$ over nodes $v$ whose path from the root satisfies every traversed predicate $q$.
Each edge predicate $q$ formalizes a rule antecedent ("if $q$ holds for $s$, then..."), and each behavioral statement $a \in B(v)$ is a rule consequent ("...then $a$"). A rule is thus the pair $(q, a)$, where $q$ is a predicate on scenes and $a$ is drawn from $\mathcal{A}$.
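A minimal sketch of this structure and its deterministic inference, in Python; the `CDTNode` class, `infer` function, and abstract discriminator callable are illustrative assumptions rather than the paper's implementation:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Discriminator D: (scene, predicate-question) -> bool.
# In practice this would be an LLM or NLI judge; here it is left abstract.
Discriminator = Callable[[str, str], bool]

@dataclass
class CDTNode:
    statements: List[str] = field(default_factory=list)                   # behavioral statements B(v)
    children: List[Tuple[str, "CDTNode"]] = field(default_factory=list)   # (predicate q, child node)

def infer(node: CDTNode, scene: str, D: Discriminator) -> List[str]:
    """Deterministically collect all statements whose root-to-node path
    satisfies every traversed predicate on the given scene."""
    grounded: List[str] = list(node.statements)
    for predicate, child in node.children:
        if D(scene, predicate):            # descend only when D(s, q) = 1
            grounded.extend(infer(child, scene, D))
    return grounded
```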
2. Learning Algorithm and Induction Process
CDT is induced from a dataset $\mathcal{D} = \{(s_i, a_i)\}$ of (scene, action) pairs using the following recursive algorithm:
- Clustering: Similar $(s_i, a_i)$ pairs are grouped (e.g., by embedding them with instruction-following embeddings and clustering in the embedding space).
- Hypothesis Generation: For each cluster, an LLM is prompted to propose candidate $(q, a)$ pairs, where $q$ is a predicate applicable to the cluster's scenes and $a$ is a behavioral action.
- Validation: Each hypothesis $(q, a)$ is evaluated on the cluster's $(s_i, a_i)$ pairs using NLI-style statistics:
- $n_{\mathrm{ent}}$: number of applicable pairs (those with $D(s_i, q) = 1$) where the NLI relation between $a$ and $a_i$ is entail
- $n_{\mathrm{con}}$: number of applicable pairs where the NLI relation is contradict
- $n_{\mathrm{ent}} / (n_{\mathrm{ent}} + n_{\mathrm{con}})$ (entail-accuracy)
- fraction of the cluster's pairs for which $q$ holds (applicability)
- Acceptance/Rejection/Refinement:
- If entail-accuracy $\geq \tau_{\mathrm{accept}}$, accept $(q, a)$ as a rule;
- If entail-accuracy $\leq \tau_{\mathrm{reject}}$ or applicability is small, reject;
- If entail-accuracy is intermediate and depth $< d_{\max}$, recurse on the covered subset for further specialization.
- Termination Criteria: The process stops when no further refinement is warranted.
Key hyperparameters include:
- $\tau_{\mathrm{accept}}$ (acceptance threshold, e.g., 0.75)
- $\tau_{\mathrm{reject}}$ (rejection threshold, e.g., 0.50)
- $\tau_{\mathrm{app}}$ (applicability filter, e.g., 0.75)
- $d_{\max}$ (maximum tree depth)
- minimum subset size for recursion (e.g., 16)
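The validation step can be sketched as follows. This is a schematic reading of the statistics and thresholds above, not the paper's implementation; the `holds` and `nli` callables (standing in for the discriminator $D$ and the NLI model) and the threshold names are assumptions:

```python
from typing import Callable, List, Tuple

def validate_hypothesis(
    q: str,
    a: str,
    cluster: List[Tuple[str, str]],              # (scene, action) pairs in the cluster
    holds: Callable[[str, str], bool],           # plays the role of D(s, q)
    nli: Callable[[str, str], str],              # returns "entail" / "neutral" / "contradict"
    tau_accept: float = 0.75,
    tau_reject: float = 0.50,
    tau_app: float = 0.75,
) -> str:
    """Return 'accept', 'reject', or 'refine' for a candidate rule (q, a)."""
    applicable = [(s, act) for s, act in cluster if holds(s, q)]
    if not applicable or len(applicable) / len(cluster) < tau_app:
        return "reject"                          # predicate rarely applies: filter out
    labels = [nli(a, act) for _, act in applicable]
    n_ent = labels.count("entail")
    n_con = labels.count("contradict")
    if n_ent + n_con == 0:
        return "reject"
    accuracy = n_ent / (n_ent + n_con)           # entail-accuracy
    if accuracy >= tau_accept:
        return "accept"
    if accuracy <= tau_reject:
        return "reject"
    return "refine"                              # recurse on the covered subset
```

On a "refine" outcome, the induction procedure would re-cluster the applicable subset and recurse, subject to the maximum-depth and minimum-size limits.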
3. Executability and Interpretability
CDT nodes store explicit, human-readable behavioral statements, and every branch predicate is labeled with a linguistically interpretable question. Retrieval is deterministic because the discriminator $D$ is a deterministic Boolean test (with an Unknown→False policy), so repeated queries on the same scene yield identical traversals and identical triggered behavioral actions.
Termination and decidability are ensured by constraints on both the maximum tree depth and the size of the dataset passed to each recursive call. The construction guarantees that for any finite dataset, the induced CDT is finite and construction halts after finitely many steps (Peng et al., 15 Jan 2026).
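One way to realize the Unknown→False policy is to map any non-affirmative judge answer to False; a minimal sketch, assuming a hypothetical `ask_judge` callable that answers "yes", "no", or "unknown":

```python
from typing import Callable

def make_discriminator(ask_judge: Callable[[str, str], str]) -> Callable[[str, str], bool]:
    """Wrap a (scene, predicate) -> 'yes'/'no'/'unknown' judge into the Boolean test D,
    treating 'unknown' (and any other non-'yes' answer) as False."""
    def D(scene: str, predicate: str) -> bool:
        return ask_judge(scene, predicate).strip().lower() == "yes"
    return D
```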
4. Empirical Results and Benchmarks
CDT and its variant CDT-Lite were evaluated on several benchmarks:
- Datasets:
- Fine-grained Fandom: 8 artifacts, 45 characters, 20,778 pairs.
- Bandori Conversational: 8 bands, 40 characters, 7,866 pairs.
- Bandori Events: 77,182 pairs (scaling study).
- Metric: Natural language inference (NLI) score. Given a predicted action $\hat{a}$ and a reference action $a^{*}$, the score is $100$ if the pair is judged entail, $50$ if neutral, and $0$ if contradict; the average over all test pairs is reported (a small scoring sketch follows at the end of this section).
- Key Results (NLI Score Average):
| System | Fandom Avg | Bandori Avg |
|---|---|---|
| Vanilla | 55.6 | 65.5 |
| Fine-tune | 45.7 | 62.9 |
| RICL | 56.0 | 68.9 |
| ETA | 56.9 | 72.3 |
| Human | 58.3 | 71.3 |
| Codified-Human | 59.3 | 71.9 |
| CDT | 60.8 | 77.7 |
| CDT-Lite | 61.0 | 79.0 |
Removal of clustering, instruction-following embeddings, or validation degrades performance by 1–2 points. Performance scales monotonically with dataset size (Peng et al., 15 Jan 2026).
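A small sketch of the scoring rule, assuming a hypothetical `nli_label` callable in place of the actual NLI model:

```python
from typing import Callable, List, Tuple

def nli_score(
    predictions: List[Tuple[str, str]],            # (predicted action, reference action)
    nli_label: Callable[[str, str], str],          # returns "entail" / "neutral" / "contradict"
) -> float:
    """Average NLI score: 100 for entail, 50 for neutral, 0 for contradict."""
    points = {"entail": 100.0, "neutral": 50.0, "contradict": 0.0}
    scores = [points[nli_label(pred, ref)] for pred, ref in predictions]
    return sum(scores) / len(scores)
```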
5. Example Construction
Consider the following illustrative dataset for a "Hero":
- "Dark tunnel ahead..." "Hero lights torch."
- "Walls glint in darkness..." "Hero lights torch."
- "Monster roar nearby..." "Hero draws sword."
- Cluster $C_1$ (the first two pairs): the LLM hypothesizes $q_1$ = "Does the scene mention darkness?", $a_1$ = "Hero lights torch." Accepted as rule $(q_1, a_1)$.
- Cluster $C_2$ (the third pair): the LLM hypothesizes $q_2$ = "Does the scene indicate presence of a hostile creature?", $a_2$ = "Hero draws sword." Accepted as rule $(q_2, a_2)$.
The final CDT is a root with two outgoing edges: one labeled $q_1$ leading to a leaf containing $a_1$, and one labeled $q_2$ leading to a leaf containing $a_2$.
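A runnable toy version of this example, kept self-contained with a purely illustrative keyword-based discriminator standing in for an LLM or NLI judge:

```python
# Toy CDT for the "Hero" example: a root with two predicate-labeled edges to leaf statements.
hero_cdt = {
    "statements": [],
    "children": [
        ("Does the scene mention darkness?",
         {"statements": ["Hero lights torch."], "children": []}),
        ("Does the scene indicate presence of a hostile creature?",
         {"statements": ["Hero draws sword."], "children": []}),
    ],
}

# Illustrative keyword matching; a real system would query an LLM or NLI judge.
KEYWORDS = {
    "Does the scene mention darkness?": ("dark", "darkness"),
    "Does the scene indicate presence of a hostile creature?": ("monster", "roar", "beast"),
}

def D(scene: str, predicate: str) -> bool:
    return any(kw in scene.lower() for kw in KEYWORDS.get(predicate, ()))

def infer(node: dict, scene: str) -> list:
    grounded = list(node["statements"])
    for predicate, child in node["children"]:
        if D(scene, predicate):
            grounded += infer(child, scene)
    return grounded

print(infer(hero_cdt, "Dark tunnel ahead..."))    # -> ['Hero lights torch.']
print(infer(hero_cdt, "Monster roar nearby...")) # -> ['Hero draws sword.']
```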
6. Comparison to Related Methods and Representations
CDT offers improvements over both hand-authored codified human profiles and other induction methods. For Fandom, CDT-Lite outperforms Codified Human by +1.7 points (61.0 vs 59.3 NLI avg); for Bandori, by +7.1 points (79.0 vs 71.9). Overall, CDTs show relative improvements of 3–10% over the strongest human and prior data-driven baselines (Peng et al., 15 Jan 2026).
While CDT leverages a tree structure reminiscent of classic decision trees, the construction and inference are semantically adapted to natural language scene affordances and behavioral logic, not feature-threshold predicates. By contrast, computational graph representations of traditional binary and oblique decision trees have been formalized via parallel predicate evaluation and bitvector arithmetic over structured inputs, supporting soft traversals and hybridization with differentiable models (Zhang, 2021). CDTs focus distinctly on context-conditional action logic derived from narrative data rather than numerical features.
7. Limitations and Future Developments
Current CDT methodology is restricted to offline (non-continual) construction and induction solely from narrative storyline data, without leveraging canonical trait priors or multimodal context (e.g., game state). Future directions include:
- Joint CDT induction for multiple interacting characters.
- Online refinement and continual learning from live agent interaction.
- Multimodal CDT expansion incorporating event logs and real-time state signals.
These directions address domains where principled, interpretable, and efficiently updatable behavioral logic is required for robust agent grounding under complex, evolving contexts (Peng et al., 15 Jan 2026).