Codified Decision Trees for Agent Behavior
- CDT is a hierarchical, explicitly interpretable decision structure that models narrative agent behavior through scene-conditioned rules.
- It is induced from scene-action pairs using clustering, LLM-driven hypothesis generation, and rigorous NLI-based validation.
- Empirical results show CDT and CDT-Lite outperform hand-authored profiles and prior data-driven baselines on NLI-based consistency benchmarks while remaining deterministic, transparent, and robust across contexts.
A Codified Decision Tree (CDT) is an explicitly interpretable, executable decision structure for encoding behavioral profiles of agents, particularly in narrative or role-playing (RP) environments. Unlike traditional, static hand-authored profiles, CDT is constructed via a data-driven induction process over (scene, action) pairs, yielding a hierarchical tree whose branches are labeled by validated scene-conditioned predicates and whose leaves comprise grounded behavioral statements. This framework supports deterministic inference, rigorous validation, and transparent inspection, resulting in robust agent consistency across diverse contexts (Peng et al., 15 Jan 2026).
1. Formal Definition and Structure
A CDT for a character is a rooted tree whose nodes hold two kinds of content:
- A (possibly empty) set of behavioral statements $B(v) \subseteq \mathcal{A}$, where $\mathcal{A}$ is the set of grounded action statements.
- A (possibly empty) set of outgoing edges to child nodes, each labeled by a predicate-question $q$ on scene descriptions.
Given the space $\mathcal{S}$ of all textual scenes and a binary discriminator function $D(s, q) \in \{0, 1\}$ that decides whether predicate $q$ holds for scene $s$, the execution (inference) semantics for a scene $s \in \mathcal{S}$ are:
- Initialize the grounding set $G \leftarrow B(\mathrm{root})$.
- At the current node, for each outgoing edge to a child $c$ labeled by $q$: if $D(s, q) = 1$, update $G \leftarrow G \cup B(c)$ and recurse on $c$.
- The output is the union of all $B(v)$ over nodes $v$ whose path from the root satisfies every traversed predicate $q$.
Each edge predicate $q$ formalizes a rule antecedent ("if $q$ holds for $s$, then..."), and each behavioral statement $a \in B(v)$ is a rule consequent ("...then $a$"). A rule is thus the pair $(q, a)$, where $q$ is a predicate on scenes and $a$ is drawn from $\mathcal{A}$.
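A minimal sketch of this structure and its deterministic inference, in Python; the `CDTNode` class, `infer` function, and abstract discriminator callable are illustrative assumptions rather than the paper's implementation:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Discriminator D: (scene, predicate-question) -> bool.
# In practice this would be an LLM or NLI judge; here it is left abstract.
Discriminator = Callable[[str, str], bool]

@dataclass
class CDTNode:
    statements: List[str] = field(default_factory=list)                   # behavioral statements B(v)
    children: List[Tuple[str, "CDTNode"]] = field(default_factory=list)   # (predicate q, child node)

def infer(node: CDTNode, scene: str, D: Discriminator) -> List[str]:
    """Deterministically collect all statements whose root-to-node path
    satisfies every traversed predicate on the given scene."""
    grounded: List[str] = list(node.statements)
    for predicate, child in node.children:
        if D(scene, predicate):            # descend only when D(s, q) = 1
            grounded.extend(infer(child, scene, D))
    return grounded
```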
2. Learning Algorithm and Induction Process
CDT is induced from a dataset $\mathcal{D} = \{(s_i, a_i)\}$ of (scene, action) pairs using the following recursive algorithm:
- Clustering: Similar $(s_i, a_i)$ pairs are grouped (e.g., by embedding them with instruction-following embeddings and clustering in the embedding space).
- Hypothesis Generation: For each cluster, an LLM is prompted to propose candidate $(q, a)$ pairs, where $q$ is a predicate applicable to the cluster's scenes and $a$ is a behavioral action.
- Validation: Each hypothesis $(q, a)$ is evaluated on the cluster's $(s_i, a_i)$ pairs using NLI-style statistics:
- $n_{\mathrm{ent}}$: number of applicable pairs (those with $D(s_i, q) = 1$) where the NLI relation between $a$ and $a_i$ is entail
- $n_{\mathrm{con}}$: number of applicable pairs where the NLI relation is contradict
- $n_{\mathrm{ent}} / (n_{\mathrm{ent}} + n_{\mathrm{con}})$ (entail-accuracy)
- fraction of the cluster's pairs for which $q$ holds (applicability)
- Acceptance/Rejection/Refinement:
- If entail-accuracy $\geq \tau_{\mathrm{accept}}$, accept $(q, a)$ as a rule;
- If entail-accuracy $\leq \tau_{\mathrm{reject}}$ or applicability is small, reject;
- If entail-accuracy is intermediate and depth $< d_{\max}$, recurse on the covered subset for further specialization.
- Termination Criteria: The process stops when no further refinement is warranted.
Key hyperparameters include:
- $\tau_{\mathrm{accept}}$ (acceptance threshold, e.g., 0.75)
- $\tau_{\mathrm{reject}}$ (rejection threshold, e.g., 0.50)
- $\tau_{\mathrm{app}}$ (applicability filter, e.g., 0.75)
- $d_{\max}$ (maximum tree depth)
- minimum subset size for recursion (e.g., 16)
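The validation step can be sketched as follows. This is a schematic reading of the statistics and thresholds above, not the paper's implementation; the `holds` and `nli` callables (standing in for the discriminator $D$ and the NLI model) and the threshold names are assumptions:

```python
from typing import Callable, List, Tuple

def validate_hypothesis(
    q: str,
    a: str,
    cluster: List[Tuple[str, str]],              # (scene, action) pairs in the cluster
    holds: Callable[[str, str], bool],           # plays the role of D(s, q)
    nli: Callable[[str, str], str],              # returns "entail" / "neutral" / "contradict"
    tau_accept: float = 0.75,
    tau_reject: float = 0.50,
    tau_app: float = 0.75,
) -> str:
    """Return 'accept', 'reject', or 'refine' for a candidate rule (q, a)."""
    applicable = [(s, act) for s, act in cluster if holds(s, q)]
    if not applicable or len(applicable) / len(cluster) < tau_app:
        return "reject"                          # predicate rarely applies: filter out
    labels = [nli(a, act) for _, act in applicable]
    n_ent = labels.count("entail")
    n_con = labels.count("contradict")
    if n_ent + n_con == 0:
        return "reject"
    accuracy = n_ent / (n_ent + n_con)           # entail-accuracy
    if accuracy >= tau_accept:
        return "accept"
    if accuracy <= tau_reject:
        return "reject"
    return "refine"                              # recurse on the covered subset
```

On a "refine" outcome, the induction procedure would re-cluster the applicable subset and recurse, subject to the maximum-depth and minimum-size limits.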
3. Executability and Interpretability
CDT nodes store explicit, human-readable behavioral statements, and every branch predicate is labeled with a linguistically interpretable question. Retrieval is deterministic because the discriminator $D$ is a deterministic Boolean test (with an Unknown→False policy), so repeated queries on the same scene yield identical traversals and identical triggered behavioral actions.
Termination and decidability are ensured by constraints on both the maximum tree depth and the size of the dataset passed to each recursive call. The construction guarantees that for any finite dataset, the induced CDT is finite and construction halts after finitely many steps (Peng et al., 15 Jan 2026).
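One way to realize the Unknown→False policy is to map any non-affirmative judge answer to False; a minimal sketch, assuming a hypothetical `ask_judge` callable that answers "yes", "no", or "unknown":

```python
from typing import Callable

def make_discriminator(ask_judge: Callable[[str, str], str]) -> Callable[[str, str], bool]:
    """Wrap a (scene, predicate) -> 'yes'/'no'/'unknown' judge into the Boolean test D,
    treating 'unknown' (and any other non-'yes' answer) as False."""
    def D(scene: str, predicate: str) -> bool:
        return ask_judge(scene, predicate).strip().lower() == "yes"
    return D
```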
4. Empirical Results and Benchmarks
CDT and its variant CDT-Lite were evaluated on several benchmarks:
- Datasets:
- Fine-grained Fandom: 8 artifacts, 45 characters, 20,778 pairs.
- Bandori Conversational: 8 bands, 40 characters, 7,866 pairs.
- Bandori Events: 77,182 pairs (scaling study).
- Metric: Natural language inference (NLI) score. Given a predicted action $\hat{a}$ and a reference action $a^{*}$, the score is $100$ if the pair is judged entail, $50$ if neutral, and $0$ if contradict; the average over all test pairs is reported (a small scoring sketch follows at the end of this section).
- Key Results (NLI Score Average):
| System | Fandom Avg | Bandori Avg |
|---|---|---|
| Vanilla | 55.6 | 65.5 |
| Fine-tune | 45.7 | 62.9 |
| RICL | 56.0 | 68.9 |
| ETA | 56.9 | 72.3 |
| Human | 58.3 | 71.3 |
| Codified-Human | 59.3 | 71.9 |
| CDT | 60.8 | 77.7 |
| CDT-Lite | 61.0 | 79.0 |
Removal of clustering, instruction-following embeddings, or validation degrades performance by 1–2 points. Performance scales monotonically with dataset size (Peng et al., 15 Jan 2026).
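A small sketch of the scoring rule, assuming a hypothetical `nli_label` callable in place of the actual NLI model:

```python
from typing import Callable, List, Tuple

def nli_score(
    predictions: List[Tuple[str, str]],            # (predicted action, reference action)
    nli_label: Callable[[str, str], str],          # returns "entail" / "neutral" / "contradict"
) -> float:
    """Average NLI score: 100 for entail, 50 for neutral, 0 for contradict."""
    points = {"entail": 100.0, "neutral": 50.0, "contradict": 0.0}
    scores = [points[nli_label(pred, ref)] for pred, ref in predictions]
    return sum(scores) / len(scores)
```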
5. Example Construction
Consider the following illustrative dataset for a "Hero":
- "Dark tunnel ahead..." "Hero lights torch."
- "Walls glint in darkness..." "Hero lights torch."
- "Monster roar nearby..." "Hero draws sword."
- Cluster $C_1$ (the first two pairs): the LLM hypothesizes $q_1$ = "Does the scene mention darkness?", $a_1$ = "Hero lights torch." Accepted as rule $(q_1, a_1)$.
- Cluster $C_2$ (the third pair): the LLM hypothesizes $q_2$ = "Does the scene indicate presence of a hostile creature?", $a_2$ = "Hero draws sword." Accepted as rule $(q_2, a_2)$.
The final CDT is a root with two outgoing edges: one labeled $q_1$ leading to a leaf containing $a_1$, and one labeled $q_2$ leading to a leaf containing $a_2$.
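A runnable toy version of this example, kept self-contained with a purely illustrative keyword-based discriminator standing in for an LLM or NLI judge:

```python
# Toy CDT for the "Hero" example: a root with two predicate-labeled edges to leaf statements.
hero_cdt = {
    "statements": [],
    "children": [
        ("Does the scene mention darkness?",
         {"statements": ["Hero lights torch."], "children": []}),
        ("Does the scene indicate presence of a hostile creature?",
         {"statements": ["Hero draws sword."], "children": []}),
    ],
}

# Illustrative keyword matching; a real system would query an LLM or NLI judge.
KEYWORDS = {
    "Does the scene mention darkness?": ("dark", "darkness"),
    "Does the scene indicate presence of a hostile creature?": ("monster", "roar", "beast"),
}

def D(scene: str, predicate: str) -> bool:
    return any(kw in scene.lower() for kw in KEYWORDS.get(predicate, ()))

def infer(node: dict, scene: str) -> list:
    grounded = list(node["statements"])
    for predicate, child in node["children"]:
        if D(scene, predicate):
            grounded += infer(child, scene)
    return grounded

print(infer(hero_cdt, "Dark tunnel ahead..."))    # -> ['Hero lights torch.']
print(infer(hero_cdt, "Monster roar nearby...")) # -> ['Hero draws sword.']
```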
6. Comparison to Related Methods and Representations
CDT offers improvements over both hand-authored codified human profiles and other induction methods. For Fandom, CDT-Lite outperforms Codified Human by +1.7 points (61.0 vs 59.3 NLI avg); for Bandori, by +7.1 points (79.0 vs 71.9). Overall, CDTs show relative improvements of 3–10% over the strongest human and prior data-driven baselines (Peng et al., 15 Jan 2026).
While CDT leverages a tree structure reminiscent of classic decision trees, the construction and inference are semantically adapted to natural language scene affordances and behavioral logic, not feature-threshold predicates. By contrast, computational graph representations of traditional binary and oblique decision trees have been formalized via parallel predicate evaluation and bitvector arithmetic over structured inputs, supporting soft traversals and hybridization with differentiable models (Zhang, 2021). CDTs focus distinctly on context-conditional action logic derived from narrative data rather than numerical features.
7. Limitations and Future Developments
Current CDT methodology is restricted to offline (non-continual) construction and induction solely from narrative storyline data, without leveraging canonical trait priors or multimodal context (e.g., game state). Future directions include:
- Joint CDT induction for multiple interacting characters.
- Online refinement and continual learning from live agent interaction.
- Multimodal CDT expansion incorporating event logs and real-time state signals.
These directions address domains where principled, interpretable, and efficiently updatable behavioral logic is required for robust agent grounding under complex, evolving contexts (Peng et al., 15 Jan 2026).