Incremental TD Networks
- The incremental TD network framework extends traditional TD learning by predicting interdependent future events using eligibility traces and action-conditional features.
- It employs a directed acyclic question network paired with an answer network to support online learning in both discrete and continuous domains.
- Empirical studies show rapid convergence and low prediction error in modeling noisy, partially observable dynamical systems with non-Markovian characteristics.
An incremental TD (temporal-difference) network is a generalization of temporal-difference learning to a network of interdependent scalar predictions. Rather than relying on a single prediction updated from its own value at a later time, incremental TD networks (TDNs) maintain a set of predictions, each of which may depend on future values of other predictions as well as on observations, enabling a broad class of inter-predictive relationships. They are formulated for both discrete and continuous domains, including those with continuous actions and observations, and support fully incremental online updating with eligibility traces and action-conditional representations. The algorithm can learn structured predictive models of partially observable dynamical systems, including predictive state representations. Such models extend beyond the representational capacity of traditional value-function methods and allow predictive modeling of non-Markovian, partially observed, or continuous systems (Vigorito, 2012; Sutton et al., 2015).
1. Formal Framework and Definitions
An incremental TD network operates on a controlled dynamical system with observations $o_t$ and actions $a_t$, governed by an unknown stochastic kernel. The agent maintains a history $h_t$, updated after each action is applied and each observation is received. Rather than a single value function, the model employs a set of $n$ scalar predictions $y_t = (y_t^1, \dots, y_t^n)$ representing answers to a user-defined set of "questions" (predictions about future observations or events).
The network is defined by two main structures:
- Question Network: A directed acyclic graph of $n$ nodes (predictions $y^i$), each with a parent that is either an observation feature function $\phi_j$ or another prediction node $y^k$. Dependencies may be conditioned on the similarity between the previous action and a center, measured via a gating function $c^i$.
- Answer Network: Parameterized by a weight matrix $W$ and a feature mapping that integrates the previous predictions, the last action, and the current observation into a feature input $x_t$, generating $y_t = \sigma(W x_t)$, where $\sigma$ is an activation function such as the identity or a sigmoid (Vigorito, 2012; Sutton et al., 2015).
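The two structures can be written down as plain data. The following is a minimal sketch, not the authors' implementation: the names `Node`, `is_acyclic`, and `answer` are hypothetical, each node is assumed to have exactly one parent, and the weight matrix is represented as a list of rows.

```python
import math
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    """One question-network node (hypothetical layout, for illustration)."""
    parent_obs: Optional[int] = None     # index of an observation feature, or
    parent_node: Optional[int] = None    # index of another prediction node
    gate_center: Optional[float] = None  # action center; None = unconditioned

def is_acyclic(nodes: List[Node]) -> bool:
    """Follow parent links from every node; revisiting a node means a cycle."""
    for start in range(len(nodes)):
        seen, k = set(), start
        while k is not None:
            if k in seen:
                return False
            seen.add(k)
            k = nodes[k].parent_node
    return True

def answer(W, x, activation="identity"):
    """Answer network: one prediction per row of W, y = sigma(W x)."""
    out = []
    for row in W:
        s = sum(w * xi for w, xi in zip(row, x))
        out.append(s if activation == "identity" else 1.0 / (1.0 + math.exp(-s)))
    return out

# A two-node chain: node 0 predicts observation feature 0 one step ahead,
# node 1 predicts node 0's next value (a two-step prediction).
chain = [Node(parent_obs=0, gate_center=0.5), Node(parent_node=0, gate_center=0.5)]
```

Keeping the question network as data separate from the weights mirrors the split described above: the graph fixes what is predicted, while the weight matrix determines how the predictions are computed.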
2. Network Architecture and Action Conditioning
Architecturally, each root node corresponds to an observation feature function, such as a Gaussian RBF $\phi_j$ centered at $\mu_j$. Chains (or more general DAGs) of prediction nodes are grown from these roots, with each node possibly conditioned on an action-gating function $c^i$, which is also typically a Gaussian RBF over the action space. The prediction at node $i$ at time $t$ may thus reflect, for example, the expected value of $\phi_j$ one (or multiple) steps ahead if an action $a_t$ similar to the gate's center is taken (Vigorito, 2012).
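As an illustration, the observation features and the action gate can share one Gaussian RBF helper. This is a sketch under assumptions: the function names, the centers, and the width of 0.25 are chosen for the example and are not taken from the cited work.

```python
import math

def gaussian_rbf(v, center, width):
    """Unnormalised Gaussian bump: 1.0 at the center, decaying toward 0."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(v, center))
    return math.exp(-sq_dist / (2.0 * width ** 2))

def rbf_features(v, centers, width):
    """Evaluate one RBF per center, giving a feature vector."""
    return [gaussian_rbf(v, c, width) for c in centers]

# Three observation features over a 1D observation in [0, 1] (illustrative).
obs_centers = [[0.0], [0.5], [1.0]]
phi = rbf_features([0.5], obs_centers, width=0.25)

# Gate value for a node conditioned on actions near 1.0, given action 0.9.
gate = gaussian_rbf([0.9], [1.0], width=0.25)
```

The observation sits exactly on the middle center, so that feature is 1.0 and its neighbours are small. The gate is close to, but below, 1 because the action is near the node's center without matching it.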
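The feature vector at each time step is constructed as
$$x_t = \big[\, y_{t-1}^{\top},\ \psi(a_{t-1})^{\top},\ \phi(o_t)^{\top} \,\big]^{\top} \in \mathbb{R}^{d},$$
where $n$ denotes the number of prediction nodes, $m_\phi$ and $m_\psi$ are the numbers of observation and action features used, and $d = n + m_\phi + m_\psi$ is the input dimension.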
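A short sketch of the input construction follows. The concatenation order (previous predictions, then action features, then observation features) and the helper name `build_input` are assumptions made for illustration; any fixed order works as long as it is used consistently.

```python
def build_input(prev_predictions, action_features, obs_features):
    """Concatenate previous predictions, action features and observation
    features into a single input vector for the answer network."""
    return list(prev_predictions) + list(action_features) + list(obs_features)

# Illustrative sizes: 2 prediction nodes, 2 action features, 3 observation features.
n_nodes, n_action_feats, n_obs_feats = 2, 2, 3
x = build_input([0.1, 0.4], [0.9, 0.2], [0.1, 1.0, 0.1])
input_dim = len(x)  # equals n_nodes + n_action_feats + n_obs_feats
```

Feeding the previous predictions back in as inputs is what lets the network carry information about the history forward from one step to the next.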
The action gating function $c^i$ provides a continuous responsibility signal $c_t^i \in [0, 1]$ for whether the TD error on $y^i$ is incurred at time $t$. This allows for smooth (rather than binary) conditioning and trace decay (Vigorito, 2012).
3. Incremental Update Procedure and Learning Rules
For each prediction node $i$ at time $t$, the incremental update is defined via:
- Target Definition ($z_t^i$):
  - If the parent is an observation feature $\phi_j$: $z_t^i = \phi_j(o_{t+1})$
  - If the parent is another prediction $y^k$: $z_t^i = y_{t+1}^k$
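- TD Error ($\delta_t^i$):
$$\delta_t^i = c_t^i \,\big(z_t^i - y_t^i\big)$$
where $z_t^i$ is the node's target, $y_t^i$ its current prediction, and $c_t^i$ the value of its action gate.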
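- Eligibility Traces ($e_t^i$):
$$e_t^i = \lambda\, c_t^i\, e_{t-1}^i + \nabla_{w^i}\, y_t^i$$
Each trace is maintained per output row $w^i$ of the weight matrix; $\lambda$ is the trace decay parameter. With an identity activation, the gradient term is simply the input vector $x_t$.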
- Weight Update:
$$w^i \leftarrow w^i + \alpha\, \delta_t^i\, e_t^i$$
with step size $\alpha$.
The general step-wise procedure, applicable to both the real-valued continuous case and the finite case, involves building $x_t$, computing the current predictions $y_t$, performing the action and observing $o_{t+1}$, computing targets and TD errors per node, updating traces and weights incrementally, and shifting state. This routine supports fully incremental, online learning (Vigorito, 2012; Sutton et al., 2015).
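The per-node update can be sketched as a single function. This is one plausible reading rather than the authors' code: it assumes an identity activation (so the gradient of each prediction is the input vector), assumes the gate both scales the TD error and decays the trace, and uses the hypothetical name `td_network_step`.

```python
def td_network_step(W, E, x, y, targets, gates, alpha, lam):
    """One incremental update for all nodes (identity activation assumed).

    W, E    : n x d weights and eligibility traces, updated in place
    x       : current input features (length d)
    y       : current predictions (length n)
    targets : per-node targets, read off after the next observation (length n)
    gates   : per-node gate values in [0, 1] for the action just taken
    Returns the list of gated TD errors.
    """
    deltas = []
    for i in range(len(W)):
        delta = gates[i] * (targets[i] - y[i])         # gated TD error
        for j in range(len(x)):
            E[i][j] = lam * gates[i] * E[i][j] + x[j]  # decay, then accumulate
            W[i][j] += alpha * delta * E[i][j]         # step along the trace
        deltas.append(delta)
    return deltas

# One node, two input features, starting from zero weights and traces.
W = [[0.0, 0.0]]
E = [[0.0, 0.0]]
first_deltas = td_network_step(W, E, x=[1.0, 0.5], y=[0.0], targets=[1.0],
                               gates=[1.0], alpha=0.1, lam=0.0)
```

Every quantity in the loop is local to one node and one time step, so nothing from earlier steps needs to be stored beyond the weights and traces themselves.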
4. Extensions from Discrete to Continuous Domains
Incremental TD networks extend naturally to continuous spaces by:
- Replacing discrete observation and action symbols with (typically Gaussian RBF) features $\phi$ and $\psi$ over continuous vector spaces $\mathcal{O}$ and $\mathcal{A}$;
- Using real-valued feature vectors and gating signals, permitting partial eligibility and smooth conditioning, critical for high-dimensional or noisy environments;
- Concatenating previous predictions into the feature vector, enabling representation of history and information flow, thereby facilitating learning in non-Markovian or partially observable systems (Vigorito, 2012).
These extensions ensure that eligibility traces no longer die "instantly" upon action mismatch, and they tie the generalization ability of the network closely to the selection, width, and tiling of the chosen $\phi$ and $\psi$ basis functions.
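The difference between discrete and continuous conditioning can be seen by comparing how much of a trace survives one step under each kind of gate. The sketch below assumes the gate multiplies the trace decay; the function names, the width of 0.2, and the decay of 0.9 are illustrative.

```python
import math

def hard_gate(action, center):
    """Discrete-style conditioning: 1 on an exact match, otherwise 0."""
    return 1.0 if action == center else 0.0

def soft_gate(action, center, width=0.2):
    """Continuous-style conditioning via a Gaussian RBF on the action."""
    return math.exp(-((action - center) ** 2) / (2.0 * width ** 2))

def carried_trace(trace, gate, lam=0.9):
    """Portion of the old trace kept before new features are added."""
    return lam * gate * trace

# The action is close to, but not exactly at, the node's center.
action, center = 0.45, 0.5
hard = carried_trace(1.0, hard_gate(action, center))
soft = carried_trace(1.0, soft_gate(action, center))
```

With the hard gate the near miss wipes the trace out entirely, whereas the soft gate keeps most of it. How quickly the soft gate falls off is set by the RBF width, which is why the choice of basis functions matters for generalization.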
5. Algorithmic Pseudocode and Workflow
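A concise algorithmic outline for incremental TD network learning, capturing the above elements, is as follows (for $n$ nodes, feature functions $\phi$ and $\psi$, weight matrix $W$, eligibility traces $e$, step size $\alpha$, and trace decay $\lambda$). At each time step $t$:
- Build the input $x_t$ from the previous predictions, the last action, and the current observation.
- Compute the current predictions $y_t$.
- Perform the action and observe $o_{t+1}$.
- For each node, compute its target and TD error, then update its trace and weight row.
- Shift state to time $t+1$.
(Vigorito, 2012; Sutton et al., 2015)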
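As a runnable companion to this outline, the sketch below runs the whole loop on a noisy, uncontrolled sine wave. It is a toy under stated assumptions, not a reproduction of the cited experiments: there are no actions (so every gate is 1 and there are no action features), a bias input is appended, the activation is a sigmoid, and the RBF centers, width, step size, trace decay, and noise level are all chosen for illustration.

```python
import math
import random

def rbf(v, c, width=0.5):
    return math.exp(-((v - c) ** 2) / (2.0 * width ** 2))

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def run(steps=4000, depth=2, alpha=0.2, lam=0.3, noise=0.05, seed=0):
    rng = random.Random(seed)
    centers = [-1.0, -0.5, 0.0, 0.5, 1.0]
    m = len(centers)
    n = m * depth              # node i: feature i % m, chain level i // m
    d = n + m + 1              # previous predictions + features + bias
    W = [[0.0] * d for _ in range(n)]
    E = [[0.0] * d for _ in range(n)]

    def observe(t):
        return math.sin(2.0 * math.pi * t / 20.0) + rng.gauss(0.0, noise)

    def features(y_prev, o):
        return list(y_prev) + [rbf(o, c) for c in centers] + [1.0]

    def predict(x):
        return [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W]

    x = features([0.0] * n, observe(0))
    y = predict(x)
    errors = []
    for t in range(steps):
        o_next = observe(t + 1)
        x_next = features(y, o_next)
        y_next = predict(x_next)       # next predictions, before the update
        phi_next = x_next[n:n + m]
        for i in range(n):
            # Level-0 nodes target the next observation feature; deeper
            # nodes target the next prediction of their parent node.
            z = phi_next[i] if i < m else y_next[i - m]
            delta = z - y[i]               # uncontrolled system: gate = 1
            grad = y[i] * (1.0 - y[i])     # derivative of the sigmoid
            for j in range(d):
                E[i][j] = lam * E[i][j] + grad * x[j]
                W[i][j] += alpha * delta * E[i][j]
        # Mean absolute one-step error of the level-0 nodes at this step.
        errors.append(sum(abs(phi_next[i] - y[i]) for i in range(m)) / m)
        x, y = x_next, predict(x_next)     # shift state with updated weights
    return errors

errors = run()
early = sum(errors[:100]) / 100
late = sum(errors[-100:]) / 100
print(f"mean one-step error: first 100 steps {early:.3f}, last 100 steps {late:.3f}")
```

The sigmoid keeps every prediction in (0, 1), so the predictions fed back as inputs stay bounded. Comparing the early and late error averages gives a rough check that learning is making progress.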
6. Convergence, Computational Complexity, and Empirical Performance
Under standard linear TD learning assumptions (bounded features, small enough $\alpha$, and proper gating functions), incremental TD networks converge in mean to the fixed point of the projected Bellman operator defined by the question network (Vigorito, 2012). The soft gates $c^i$ do not affect convergence provided they are bounded away from zero and one as necessary.
Per-step computational cost is $O(nd)$, driven by feature construction, prediction calculation, and $n \times d$ weight and trace updates. The principal memory requirement is for the $n \times d$ weight storage and the $n \times d$ eligibility traces (Vigorito, 2012).
Empirical studies on five domains highlighted rapid convergence and robustness under partial observability and noise:
- On controlled and uncontrolled 1D square and sine waves (continuous domains), root mean squared prediction errors converged to levels comparable to the observation noise (≈0.05).
- In partially observable mountain-car, velocity was recovered via multi-step prediction chains, achieving RMSE ≈0.04.
- Network depth traded off model expressiveness against stability; accuracy and convergence rates were robust even with a small number of φ and ψ basis functions per dimension.
These results demonstrate that fully incremental TD networks can reliably learn accurate, robust predictive models in continuous, noisy, partially observable dynamical systems (Vigorito, 2012).
7. Relation to Conventional TD Learning and Broader Implications
Incremental TD networks strictly generalize classical TD(0) learning. With a single node, identity feature encoding, and appropriate definitions of the target $z$ and the gate $c$, the framework collapses to standard value-function TD learning. General TD networks, however, allow each prediction $y^i$ to target arbitrary functions of both future observations and other nodes' predictions, facilitating "learn a guess from arbitrary other guesses" (Sutton et al., 2015).
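The reduction to TD(0) can be checked on a tiny deterministic chain. In this sketch the single node's target is taken to be the reward plus the discounted next prediction, the gate is fixed at 1, and no trace decay is used; these choices are an assumed reading of the "appropriate definitions", and the three-state chain and function name are illustrative.

```python
def td0_as_single_node(episodes=500, alpha=0.1, gamma=0.9):
    """Tabular TD(0) written as a one-node network with one-hot inputs."""
    n_states = 3
    w = [0.0] * n_states                  # the single node's weight row
    for _ in range(episodes):
        for s in range(n_states):         # deterministic walk 0 -> 1 -> 2 -> end
            x = [1.0 if j == s else 0.0 for j in range(n_states)]  # one-hot
            y = sum(wj * xj for wj, xj in zip(w, x))  # current prediction
            terminal = s == n_states - 1
            reward = 1.0 if terminal else 0.0
            y_next = 0.0 if terminal else w[s + 1]
            z = reward + gamma * y_next   # target: reward + discounted next value
            delta = 1.0 * (z - y)         # gate fixed at 1
            for j in range(n_states):
                w[j] += alpha * delta * x[j]  # no trace decay: trace equals x
    return w

values = td0_as_single_node()
```

On this chain the weights settle at the discounted returns 0.81, 0.9, and 1.0, which are the values tabular TD(0) converges to for a reward of 1 on the final transition.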
The conditioning vector $c_t$ allows for selective or action-conditional credit assignment, supporting structured, multi-step, and action-dependent predictions in a unified update mechanism. This enables simultaneous learning of many interrelated, action-conditional predictions using local TD error signals, including predictive state representations and fixed-interval predictions that are not possible with conventional TD methods.
This broadens the applicability of the TD learning paradigm to a rich class of inter-predictive world models—enabling application to non-Markov problems, highly structured dynamical systems, and the construction of end-to-end predictive representations of environment dynamics (Sutton et al., 2015, Vigorito, 2012).