
Incremental TD Networks

Updated 26 April 2026
  • The incremental TD network framework extends traditional TD learning by predicting interdependent future events using eligibility traces and action-conditional features.
  • It employs a directed acyclic question network paired with an answer network to support online learning in both discrete and continuous domains.
  • Empirical studies show rapid convergence and low prediction error in modeling noisy, partially observable dynamical systems with non-Markovian characteristics.

An incremental TD (temporal-difference) network generalizes temporal-difference learning to a network of interdependent scalar predictions. Where conventional TD learning updates a single prediction toward its own value at a later time, incremental TD networks (TDNs) maintain a set of predictions, each of which may depend on future values of other predictions as well as on observations, enabling a broad class of inter-predictive relationships. They are formulated for both discrete and continuous domains, including those with continuous actions and observations, and support fully incremental online updating with eligibility traces and action-conditional representations. The algorithm learns structured predictive models of partially observable dynamical systems, including predictive state representations. Such models exceed the representational capacity of traditional value-function methods and support predictive modeling of non-Markovian, partially observed, or continuous systems (Vigorito, 2012, Sutton et al., 2015).

1. Formal Framework and Definitions

An incremental TD network operates on a controlled dynamical system with observations $o_t \in \mathbb{R}^o$ and actions $a_t \in \mathbb{R}^a$, governed by an unknown stochastic kernel. The agent maintains a history $h_t = (o_0, a_0, o_1, a_1, \ldots, o_t)$, updating after each action is applied and each observation is received. Rather than a single value function, the model employs a set of $n$ scalar predictions $\mathbf{y}_t \in \mathbb{R}^n$ representing answers to a user-defined set of "questions" (predictions about future observations or events).

The network is defined by two main structures:

  • Question Network ($Q$): A directed acyclic graph of $n$ nodes (predictions $y^i$), each of whose parent is either an observation feature function $\phi_j$ or another prediction node $y^k$. Dependencies may be conditioned on the similarity between the previous action and a center, measured via a gating function $\psi$.
  • Answer Network: Parameterized by a weight matrix $W$ and a feature mapping $x(\cdot)$ that integrates previous predictions, last action, and current observation into a feature input $\mathbf{x}_t$, generating $\mathbf{y}_t = \sigma(W \mathbf{x}_t)$, where the activation $\sigma$ may be the identity or a nonlinearity such as a sigmoid (Vigorito, 2012, Sutton et al., 2015).
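The answer network described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `answer_network`, the ordering of the concatenated inputs, and the zero-initialized weights are all assumptions made for the example.

```python
import numpy as np

def sigmoid(v):
    """Logistic squashing function, one possible output activation."""
    return 1.0 / (1.0 + np.exp(-v))

def answer_network(W, y_prev, action_feats, obs_feats, activation=None):
    """Return (x_t, y_t) for an answer network with an optional activation.

    The concatenation order below is an assumption; any fixed order works
    as long as W has one column per input feature.
    """
    x_t = np.concatenate([y_prev, obs_feats, action_feats])
    pre = W @ x_t
    y_t = pre if activation is None else activation(pre)  # None means identity
    return x_t, y_t

# Hypothetical sizes: 3 prediction nodes, 4 observation features, 2 action features.
n, n_obs, n_act = 3, 4, 2
W = np.zeros((n, n + n_obs + n_act))
x_t, y_t = answer_network(W, np.zeros(n), np.ones(n_act), np.ones(n_obs), sigmoid)
```

Feeding the previous predictions back in as inputs is what lets the network carry information about history forward from step to step.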

2. Network Architecture and Action Conditioning

Architecturally, each root node corresponds to an observation feature function, such as a Gaussian RBF $\phi_j(o)$ centered at $\mu_j$. Chains (or more general DAGs) of prediction nodes are grown from these roots, with each node possibly conditioned on an action-gating function $\psi_k(a)$, also typically a Gaussian RBF over the action space. The prediction at node $i$ at time $t$ may thus reflect, for example, the expected value of $\phi_j$ one (or multiple) steps ahead if an action $a_t$ similar to the gate center $\nu_k$ is taken (Vigorito, 2012).

The feature vector at each time step is constructed as

$\mathbf{x}_t = \big[\mathbf{y}_{t-1};\ \phi(o_t);\ \psi(a_{t-1})\big] \in \mathbb{R}^{m}, \quad m = n + n_\phi + n_\psi$

where $n$ denotes the number of prediction nodes, $n_\phi$ and $n_\psi$ are the number of observation and action features used, and $m$ is the input dimension.

The action gating function $\psi$ provides a continuous responsibility signal $c_t^i \in [0, 1]$ for whether the TD error on $y^i$ is incurred at time $t$. This allows for smooth (rather than binary) conditioning and trace decay (Vigorito, 2012).
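As an illustration of Gaussian RBF features and soft action gating, the sketch below evaluates a small set of observation features and two action gates. The helper name `gaussian_rbf`, the centers, and the width of 0.5 are hypothetical choices made for this example, not values taken from the source.

```python
import numpy as np

def gaussian_rbf(v, centers, width):
    """Gaussian RBF activation of v against each row of `centers`."""
    v = np.atleast_1d(np.asarray(v, dtype=float))
    d2 = np.sum((centers - v) ** 2, axis=1)  # squared distance to each center
    return np.exp(-d2 / (2.0 * width ** 2))

# Hypothetical 1-D observation and action spaces.
obs_centers = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)
act_centers = np.array([[-1.0], [1.0]])

phi = gaussian_rbf(0.0, obs_centers, width=0.5)  # observation features
psi = gaussian_rbf(0.9, act_centers, width=0.5)  # soft gates for the action 0.9
```

Here the action 0.9 is close to the gate centered at 1.0 and far from the gate centered at -1.0, so the first gate is nearly open while the second is nearly closed. A node conditioned on the second gate would receive almost no credit on this step.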

3. Incremental Update Procedure and Learning Rules

For each prediction node $i$ at time $t$, the incremental update is defined via:

  • Target Definition ($z_t^i$):
    • If the parent is an observation feature $\phi_j$: $z_t^i = \phi_j(o_{t+1})$
    • If the parent is another prediction $y^k$: $z_t^i = y_{t+1}^k$
  • TD Error ($\delta_t^i$):

$\delta_t^i = c_t^i \, (z_t^i - y_t^i)$

  • Eligibility Traces ($\mathbf{e}_t^i$):

$\mathbf{e}_t^i = \lambda \, c_t^i \, \mathbf{e}_{t-1}^i + \nabla_{W^i} y_t^i$

Each trace is maintained per output row $W^i$; $\lambda$ is the trace decay parameter and $c_t^i$ is the action-gating signal for node $i$.

  • Weight Update:

$W^i \leftarrow W^i + \alpha \, \delta_t^i \, \mathbf{e}_t^i$

with step size $\alpha$.
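The per-node rules above (target, TD error, eligibility trace, weight update) can be vectorized over all nodes. The sketch below assumes a linear answer network, so the gradient of each prediction with respect to its weight row is simply the feature vector. It also assumes that the gate multiplies both the TD error and the trace decay; that placement is one reading of the description, not a confirmed detail of the published algorithm. The function name `td_network_step` is hypothetical.

```python
import numpy as np

def td_network_step(W, E, x_t, y_t, z_t, c_t, alpha, lam):
    """One incremental update for all nodes of a linear answer network.

    W, E : (n, m) weights and per-row eligibility traces
    x_t  : (m,) feature vector;  y_t, z_t, c_t : (n,) predictions, targets, gates
    """
    delta = c_t * (z_t - y_t)                  # gated TD error per node (assumed placement)
    E = lam * c_t[:, None] * E + x_t[None, :]  # gated trace decay plus linear gradient
    W = W + alpha * delta[:, None] * E         # row-wise weight update
    return W, E, delta

# Tiny worked example: 2 nodes, 3 input features.
W = np.zeros((2, 3))
E = np.zeros((2, 3))
x = np.array([1.0, 0.0, 1.0])
y = W @ x                                      # both predictions start at 0
z = np.array([1.0, 0.0])                       # targets
c = np.array([1.0, 0.0])                       # node 0 fully gated in, node 1 gated out
W, E, delta = td_network_step(W, E, x, y, z, c, alpha=0.5, lam=0.9)
```

In this example only the first row of `W` changes, because the second node's gate is zero.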

The general step-wise procedure, applicable to both the real-valued continuous case and the finite case, involves building $\mathbf{x}_t$, computing current predictions $\mathbf{y}_t$, performing the action and observing $o_{t+1}$, computing targets and TD errors per node, updating traces and weights incrementally, and shifting state. This routine supports fully incremental, online learning (Vigorito, 2012, Sutton et al., 2015).

4. Extensions from Discrete to Continuous Domains

Incremental TD networks extend naturally to continuous spaces by:

  • Replacing discrete observation and action symbols with (typically Gaussian RBF) features $\phi_j$ and $\psi_k$ over the continuous vector spaces $\mathbb{R}^o$ and $\mathbb{R}^a$;
  • Using real-valued feature vectors and gating signals, permitting partial eligibility and smooth conditioning, critical for high-dimensional or noisy environments;
  • Concatenating previous predictions into the feature vector, enabling representation of history and information flow, thereby facilitating learning in non-Markovian or partially observable systems (Vigorito, 2012).

These extensions ensure that eligibility traces no longer die "instantly" upon action mismatch, and they tie the generalization ability of the network closely to the selection, width, and tiling of the chosen $\phi$ and $\psi$ basis functions.
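The difference between binary and smooth conditioning can be seen in a small numeric sketch. It assumes that the gate value multiplies the decayed trace, and the names `hard_gate` and `soft_gate`, the gate width, and the numbers used are all illustrative.

```python
import numpy as np

def hard_gate(a, center, tol=1e-9):
    """Binary condition: 1 only when the action matches the center exactly."""
    return 1.0 if abs(a - center) <= tol else 0.0

def soft_gate(a, center, width):
    """Gaussian RBF condition: decays smoothly with distance from the center."""
    return float(np.exp(-(a - center) ** 2 / (2.0 * width ** 2)))

lam, trace = 0.9, 1.0
a, center = 0.8, 1.0  # the action is close to, but not exactly at, the center

hard_trace = lam * hard_gate(a, center) * trace       # trace is wiped out
soft_trace = lam * soft_gate(a, center, 0.5) * trace  # trace is only attenuated
```

With the binary gate, a near miss discards all accumulated credit. With the Gaussian gate, most of it survives, and how much depends on the chosen width.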

5. Algorithmic Pseudocode and Workflow

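A concise algorithmic outline for incremental TD network learning, capturing all the above elements (for $n$ nodes, feature functions $\phi$ and $\psi$, weight matrix $W$, eligibility traces $\mathbf{e}^i$, step size $\alpha$, and trace decay $\lambda$), repeats the following at each time step:

  • Build the feature vector $\mathbf{x}_t$ from the previous predictions, the last action, and the current observation.
  • Compute the current predictions $\mathbf{y}_t$.
  • Perform the action and observe $o_{t+1}$.
  • Compute the target and TD error for each node.
  • Update the eligibility traces and weights.
  • Shift state to the next time step.

(Vigorito, 2012, Sutton et al., 2015)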
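To make the workflow concrete, here is a runnable sketch of the online loop on an uncontrolled, noisy sine wave. It is an illustration under stated assumptions, not the published pseudocode: the question network is a simple chain (node 0 predicts the next raw observation, node $i$ predicts node $i-1$ one step later), the answer network is linear, all gates are fixed at 1 because there are no actions, and the function name `run_td_network`, the RBF centers, the noise level, and the hyperparameters are hypothetical.

```python
import numpy as np

def rbf(v, centers, width):
    """Gaussian RBF features of a scalar v (1-D centers)."""
    return np.exp(-(centers - v) ** 2 / (2.0 * width ** 2))

def run_td_network(steps=5000, n=2, alpha=0.05, lam=0.0, seed=0):
    rng = np.random.default_rng(seed)
    centers = np.linspace(-1.0, 1.0, 5)
    m = n + centers.size + 1             # previous predictions + RBFs + bias
    W = np.zeros((n, m))
    E = np.zeros((n, m))
    c = np.ones(n)                       # uncontrolled system: gates always open

    def observe(t):
        return np.sin(0.2 * t) + 0.05 * rng.standard_normal()

    def features(y_prev, o):
        return np.concatenate([y_prev, rbf(o, centers, 0.5), [1.0]])

    x = features(np.zeros(n), observe(0))
    errors = np.empty(steps)
    for t in range(steps):
        y = W @ x                                    # current predictions
        o_next = observe(t + 1)                      # step the system
        x_next = features(y, o_next)
        y_next = W @ x_next                          # next-step predictions
        z = np.concatenate([[o_next], y_next[:-1]])  # chain targets
        delta = c * (z - y)                          # gated TD errors
        E = lam * c[:, None] * E + x[None, :]        # update traces
        W = W + alpha * delta[:, None] * E           # update weights
        errors[t] = abs(delta[0])
        x = x_next                                   # shift state
    return W, errors

W, errors = run_td_network()
```

`errors` tracks the absolute one-step prediction error of node 0, so its trend over time gives a quick check on whether learning is progressing.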

6. Convergence, Computational Complexity, and Empirical Performance

Under standard linear TD learning assumptions (bounded features, a sufficiently small step size $\alpha$, and proper gating functions), incremental TD networks converge in mean to the fixed point of the projected Bellman operator defined by the question network (Vigorito, 2012). The soft gates $c_t^i$ do not affect convergence provided they are bounded away from zero and one as necessary.

Per-step computational cost is $O(nm)$, where $n$ is the number of prediction nodes and $m$ is the input dimension, driven by feature construction, prediction calculation, and $O(nm)$ weight and trace updates. The principal memory requirement is for $n \times m$ weight storage and $n \times m$ eligibility traces (Vigorito, 2012).

Empirical studies on five domains highlighted rapid convergence and robustness under partial observability and noise:

  • On controlled and uncontrolled 1D square and sine waves (continuous domains), root mean squared prediction errors converged to levels comparable to observation noise (≈0.05 in […] steps with […] φ functions and depth […]).
  • In partially observable mountain-car, velocity was recovered via multi-step prediction chains, achieving RMSE ≈0.04.
  • Depth […] generally provided a strong trade-off between model expressiveness and stability; accuracy and convergence rates were robust even with a small number of φ and ψ basis functions (e.g., […] per dimension).

These results demonstrate that fully incremental TD networks can reliably learn accurate, robust predictive models in continuous, noisy, partially observable dynamical systems (Vigorito, 2012).

7. Relation to Conventional TD Learning and Broader Implications

Incremental TD networks strictly generalize classical TD(0) learning. With a single node, identity feature encoding, and appropriate definition of the target $z$ and the condition $c$, the framework collapses to standard value-function TD learning. General TD networks, however, allow each prediction $y^i$ to target arbitrary functions of both future observations and other nodes' predictions, so that the network can "learn a guess from arbitrary other guesses" (Sutton et al., 2015).
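As a worked illustration of this reduction, take a single prediction $y_t = w^\top x_t$ with the gate always open. The reward $r_{t+1}$ and discount $\gamma$ are symbols introduced here for the example; they come from the standard TD(0) formulation rather than from the text above.

```latex
z_t = r_{t+1} + \gamma\, y_{t+1}, \qquad c_t = 1
\quad\Longrightarrow\quad
\delta_t = r_{t+1} + \gamma\, y_{t+1} - y_t, \qquad
w \leftarrow w + \alpha\, \delta_t\, x_t
```

With no trace accumulation, the trace is just $x_t$, and the update is the familiar linear TD(0) rule.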

The conditioning vector $\mathbf{c}_t$ allows for selective or action-conditional credit assignment, supporting structured, multi-step, and action-dependent predictions in a unified update mechanism. This enables simultaneous learning of many interrelated, action-conditional predictions using local TD error signals, including predictive state representations and fixed-interval predictions that are not possible with conventional TD methods.

This broadens the applicability of the TD learning paradigm to a rich class of inter-predictive world models—enabling application to non-Markov problems, highly structured dynamical systems, and the construction of end-to-end predictive representations of environment dynamics (Sutton et al., 2015, Vigorito, 2012).
