General Value Functions

Updated 27 July 2025
  • General Value Functions (GVFs) are defined as temporally extended predictions about arbitrary sensorimotor signals under specific policies.
  • The framework enables scalable parallel learning, exemplified by the Horde architecture that updates hundreds to thousands of GVFs in real time using shared features and GTD(λ).
  • Stable off-policy learning is achieved with gradient-TD algorithms and monitored through online MSPBE estimators, demonstrating practical feasibility in real-world robotic systems.

General Value Functions (GVFs) are a formalism in reinforcement learning for representing temporally extended predictions about arbitrary signals of interest under specific policies. Unlike traditional value functions, which focus solely on expected cumulative rewards, GVFs allow agents to make structured, policy-contingent predictions about any sensorimotor or environmental signal—a capacity that supports continual, scalable, and adaptive learning of rich world knowledge.

1. Formal Definition and Predictive Role

A General Value Function for the $i$th predictive question is defined by

$$v^{(i)}(s) = \mathbb{E}_{\pi^{(i)}} \left[ \sum_{k=0}^{\infty} \left( \gamma^{(i)} \right)^k r_{t+k+1}^{(i)} \mid s_t = s \right]$$

where:

  • $r^{(i)}$ is the target signal (cumulant, e.g., a specific sensor reading),
  • $\gamma^{(i)}$ is the discount factor (determining the time span or pseudo-termination of the prediction),
  • $\pi^{(i)}$ is the target policy (the hypothetical behavior for which the prediction is made).

GVFs extend classical value functions by supporting arbitrary cumulants, discounting, and policies. Thus, GVFs answer general “predictive questions” about the unfolding data stream, such as “What will the average pressure reading be if the agent follows policy $\pi$?” or “How long until the battery depletes if the agent acts according to policy $\pi$?” This expressiveness enables an agent to accumulate diverse knowledge about immediate or long-term outcomes tied directly to its experience, crucial for life-long autonomy (White et al., 2012).
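
The defining triple of a predictive question (cumulant, discount, and target policy) can be captured in a small data structure. The sketch below is a minimal illustration of this idea; the names GVFQuestion, cumulant, gamma, and target_policy, as well as the example observation layout, are assumptions for exposition rather than an API from the paper.

```python
from dataclasses import dataclass
from typing import Callable
import numpy as np

@dataclass
class GVFQuestion:
    """One GVF 'predictive question': what it predicts, over what horizon, under which policy."""
    cumulant: Callable[[np.ndarray], float]             # r^(i): signal of interest extracted from an observation
    gamma: float                                         # gamma^(i): discount / pseudo-termination (horizon ~ 1/(1-gamma))
    target_policy: Callable[[np.ndarray, int], float]    # pi^(i)(a | phi): probability of action a under the target policy

# Example: "what will the front bump sensor read over a roughly 10-step horizon
# if the robot always drives forward?" (sensor index and action coding are assumed)
front_bump_gvf = GVFQuestion(
    cumulant=lambda obs: obs[0],                          # assume obs[0] is the bump sensor reading
    gamma=0.9,                                            # roughly a 10-step prediction horizon
    target_policy=lambda phi, a: 1.0 if a == 0 else 0.0,  # deterministic "drive forward" policy
)
```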

2. Scalable Parallel Learning: The Horde Architecture

The Horde architecture is a real-time, scalable system designed to acquire hundreds or thousands of GVFs in parallel. It features:

  • Parallel, independent GVF learners, each associated with a unique predictive question.
  • Shared feature representations (typically via tile coding) across learners to amortize computation and promote data efficiency.
  • Use of off-policy learning algorithms (specifically GTD($\lambda$), a gradient temporal-difference method) to allow learning about many target policies while behaving under a single data-generating behavior policy.

Horde realizes the GVF formalism by updating all predictions continuously and simultaneously, with computational cost that scales linearly in the number of predictions per step. On physical robots, this enables accurate learning of hundreds (and up to 1000) on-policy and off-policy GVFs within a strict real-time compute budget (e.g., within an 85–100 ms cycle window) (White et al., 2012).
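
As a rough illustration of this design, the sketch below maintains a list of independent linear GVF learners that all share one feature vector per time step. For brevity it uses a simplified importance-weighted linear TD(0) update rather than the full GTD($\lambda$) rule sketched in Section 3; the class and function names (GVFLearner, horde_step) and argument layout are assumptions, not the paper's implementation.

```python
import numpy as np

class GVFLearner:
    """One 'demon': a linear predictor for a single GVF (simplified; Horde uses GTD(lambda), see Section 3)."""
    def __init__(self, n_features, cumulant, gamma, target_policy, alpha=0.1):
        self.theta = np.zeros(n_features)          # primary prediction weights
        self.cumulant = cumulant                   # r^(i): signal this GVF predicts
        self.gamma = gamma                         # gamma^(i): prediction horizon
        self.target_policy = target_policy         # pi^(i)(a | phi)
        self.alpha = alpha                         # step size

    def update(self, phi, a, b_prob, obs_next, phi_next):
        rho = self.target_policy(phi, a) / b_prob  # importance-sampling ratio pi/b
        r = self.cumulant(obs_next)                # cumulant observed on this transition
        delta = r + self.gamma * (self.theta @ phi_next) - self.theta @ phi
        self.theta += self.alpha * rho * delta * phi

def horde_step(learners, phi, a, b_prob, obs_next, phi_next):
    # The shared feature vector phi is computed once and reused by every learner,
    # so total cost per time step is linear in the number of predictions.
    for learner in learners:
        learner.update(phi, a, b_prob, obs_next, phi_next)
```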

3. Off-Policy Learning and Stability

Core to GVF scalability is off-policy learning, in which the target policy $\pi$ for a given GVF is distinct from the behavior policy $b$ executed by the agent. This allows the system to learn predictions for “what-if” behaviors (e.g., “What sensor readings would I see if I always turned left, even as my actual behavior is random?”). The GTD($\lambda$) algorithm addresses the stability issues that plague standard off-policy TD learning with function approximation. GTD($\lambda$) uses:

  • Primary weights $\theta$ and secondary weights $w$ for correction.
  • Importance sampling ratios $\rho_t = \pi(a_t \mid \phi_t) / b(a_t \mid \phi_t)$ to reweight updates.
  • Eligibility traces and specialized update rules for both $\theta$ and $w$:

$$\begin{align*}
\delta_t &= r_{t+1} + \gamma\, \theta_t^T \phi_{t+1} - \theta_t^T \phi_t \\
e_t &= \rho_t \left[ \phi_t + \gamma \lambda\, e_{t-1} \right] \\
\theta_{t+1} &= \theta_t + \alpha_v \left[ \delta_t e_t - \gamma (1-\lambda) (e_t^T w_t)\, \phi_{t+1} \right] \\
w_{t+1} &= w_t + \alpha_w \left[ \delta_t e_t - (\phi_t^T w_t)\, \phi_t \right]
\end{align*}$$

This approach is proven to be convergent under off-policy sampling and linear function approximation, which is essential for robust, large-scale deployment of GVFs (White et al., 2012).
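
A minimal code sketch of these updates, assuming linear features stored as NumPy vectors; the function name gtd_lambda_step and its argument names are illustrative rather than taken from a published implementation.

```python
import numpy as np

def gtd_lambda_step(theta, w, e, phi, phi_next, r, rho, gamma, lam, alpha_v, alpha_w):
    """One GTD(lambda) update, following the four equations above."""
    delta = r + gamma * (theta @ phi_next) - theta @ phi            # TD error
    e = rho * (phi + gamma * lam * e)                               # importance-weighted eligibility trace
    theta = theta + alpha_v * (delta * e - gamma * (1 - lam) * (e @ w) * phi_next)
    w = w + alpha_w * (delta * e - (phi @ w) * phi)                 # secondary (correction) weights
    return theta, w, e, delta
```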

4. Progress Evaluation and Online MSPBE Estimators

Evaluating prediction accuracy for thousands of GVFs in scalable, embedded settings is nontrivial, especially as on-policy tests become infeasible at scale or for long horizons. The paper introduces two online estimators of the Mean Squared Projected Bellman Error (MSPBE), the objective minimized by GTD($\lambda$):

$$\text{MSPBE}(\theta) \approx (\overline{\delta e})^T w_t$$

where $\overline{\delta e}$ denotes an exponentially weighted average of the product of the TD error $\delta_t$ and the eligibility trace $e_t$. Two estimators are implemented:

  • Vector-based: $(\overline{\delta e})^T w_t$
  • Scalar-based: $\overline{\delta e^T w}$

These can be incrementally updated and provide real-time measures of learning progress against the MSPBE objective. Experiments validate that these online estimators closely track the true MSPBE, even during abrupt environmental or parameter shifts. The scalar version offers especially low memory overhead, which is critical for resource-constrained robotic systems (White et al., 2012).
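
Both estimators can be maintained with simple exponential averages of quantities the GTD($\lambda$) update already computes. The sketch below assumes an averaging step size beta and the per-step quantities $\delta_t$, $e_t$, and $w_t$ from Section 3; the class and method names are illustrative.

```python
import numpy as np

class OnlineMSPBE:
    """Incremental vector-based and scalar-based MSPBE estimates (illustrative sketch)."""
    def __init__(self, n_features, beta=0.01):
        self.beta = beta                          # exponential-averaging step size (assumed value)
        self.avg_delta_e = np.zeros(n_features)   # running average of delta_t * e_t (vector-based)
        self.avg_scalar = 0.0                     # running average of delta_t * (e_t^T w_t) (scalar-based)

    def update(self, delta, e, w):
        self.avg_delta_e += self.beta * (delta * e - self.avg_delta_e)
        self.avg_scalar += self.beta * (delta * (e @ w) - self.avg_scalar)

    def vector_estimate(self, w):
        return self.avg_delta_e @ w               # (avg delta*e)^T w_t

    def scalar_estimate(self):
        return self.avg_scalar                    # needs only one extra scalar of memory per GVF
```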

5. Real-Time Deployment and Empirical Results

On a physical robotic platform equipped with a diverse sensor suite, the authors demonstrate:

  • Real-time updating of 795 GVFs (across sensor signals, timescales, and policies) using shared tile-coded features and GTD($\lambda$).
  • Scaling to 1000 parallel GVFs by expanding the target policy space using a combination of linearly parameterized Gibbs policies and randomized discount factors (e.g., $\gamma = 0.95$ for longer horizons). The scalar MSPBE estimator is used exclusively, as it allows learning to scale further without expensive on-policy tests.

These results establish the practical feasibility of GVFs as a substrate for lifelong, real-time, off-policy knowledge acquisition in real-world robotics. The system robustly tracks progress and adapts as required, a key requirement for autonomous, long-lived intelligent agents (White et al., 2012).

6. Scaling Strategies and Implementation Considerations

Efficient large-scale learning of GVFs requires:

  • Structural modularity: All learners are independent except for sharing features and input data streams.
  • Sparse encoding via tile coding: Promotes fast, scalable, and decorrelated feature representations suitable for both incremental learning and generalization across GVFs.
  • Use of GTD($\lambda$) and its online error estimators: Ensures numerical stability, convergence, and the ability to monitor learning across thousands of signals.
  • Emphasis on off-policy learning: Empowers the system to answer a wide array of predictive “what-if” questions without behavioral interruption or undesirable on-policy data collection.

Real-world resource constraints influence architectural choices: linear computational scaling (in the number of predictions), low-latency updates ($O(1)$ per prediction per time step), and memory-efficient error tracking are central to deployment on embedded and robotic hardware.
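
To illustrate how sparse tile-coded features keep per-prediction updates cheap, the sketch below codes a single bounded scalar into a binary feature vector with exactly one active tile per tiling, so a linear update touches only a handful of weights. Real deployments hash many sensor dimensions into much larger sparse vectors; the function name and parameter values here are assumptions for the example.

```python
import numpy as np

def tile_code(x, n_tilings=8, tiles_per_dim=10, low=0.0, high=1.0):
    """Minimal 1-D tile coder: returns a sparse binary vector with n_tilings active features."""
    phi = np.zeros(n_tilings * tiles_per_dim)
    scaled = (x - low) / (high - low) * tiles_per_dim      # position measured in tile widths
    for t in range(n_tilings):
        offset = t / n_tilings                             # each tiling is shifted by a fraction of a tile
        idx = int(np.clip(scaled + offset, 0, tiles_per_dim - 1))
        phi[t * tiles_per_dim + idx] = 1.0                 # exactly one active tile per tiling
    return phi

# With only n_tilings non-zero entries, a linear GVF update costs O(n_tilings)
# per prediction, independent of the total feature-vector length.
```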

7. Research Significance and Future Directions

The paradigm of GVFs as a representational substrate unlocks a new model for knowledge acquisition in reinforcement learning agents: agents need no longer be limited to reward-centric value functions but can accumulate and interpret an arbitrarily wide range of predictive world knowledge. The demonstrated scaling to thousands of parallel off-policy predictions in real-time on physical robots (White et al., 2012) constitutes a concrete step toward scalable artificial agents capable of lifelong, adaptive learning. This advances the field toward autonomous systems able to make sense of rich, high-dimensional sensorimotor experiences, continually update their knowledge, and adapt in open-ended environments with little to no direct supervision.

A plausible implication is that extending GVF-based architectures to non-linear function approximation, structured state construction, and compositional prediction frameworks will further enhance the representational richness and adaptability of lifelong learning agents.

References (1)

  • White, A., Modayil, J., & Sutton, R. S. (2012). Scaling life-long off-policy learning. In Proceedings of the IEEE International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob).