
Capability-Oriented Value Functions

Updated 10 February 2026
  • Capability-oriented value functions explicitly represent the full range of achievable outcomes and policies, enabling efficient multitask learning and zero-shot skill composition.
  • They transform reward-centric value estimation into goal-indexed maps that facilitate rapid policy adaptation and combinatorial skill composition for lifelong learning.
  • Empirical and theoretical findings demonstrate their robustness in planning, sample efficiency, and adaptive training in both reinforcement learning and large language model systems.

A capability-oriented value function is a class of value function in reinforcement learning and sequential decision-making that explicitly represents or parameterizes the full set of outcomes an agent can achieve, as well as the pathways or policies needed to reach each outcome. Unlike standard, task-specific value functions—which encode only how to optimize a fixed reward or complete a pre-specified objective—capability-oriented value functions enumerate, quantify, and generalize the agent’s abilities across a broad set of internal or externally specified goals. They serve as a general-purpose, goal-indexed map, supporting efficient multitask performance, zero-shot transfer, skill composition, and recasting of value estimation into the context of evolving model or agent capabilities.

1. Formal Definitions and Instantiations

The canonical instantiation of the capability-oriented value function is the World Value Function (WVF), as defined by Nangue Tasse et al. (Tasse et al., 2022, Tasse et al., 2022). In this framework, consider an MDP $M = (\mathcal{S}, \mathcal{A}, P, R)$, possibly with deterministic $P$; the agent constructs an internal goal space

$$G = \{\, s \in \mathcal{S} \mid s \text{ is observed as a terminal (absorbing) state} \,\}.$$

The agent defines, for each internal goal $g \in G$, a goal-conditioned reward by reshaping the environment's reward and assigning a large penalty $C_{\neg g}$ for terminating at any goal $g' \ne g$:

$$\bar{R}(s, g, a, s') = \begin{cases} C_{\neg g} & \text{if } s' \ne g \text{ and } s' \in G, \\ R(s, a, s') & \text{otherwise.} \end{cases}$$

The capability-oriented World Q-function is then

$$\bar{Q}(s, g, a) = \mathbb{E}_{s'} \left[ \bar{R}(s, g, a, s') + \max_{a'} \bar{Q}(s', g, a') \right].$$

$\bar{Q}$ parameterizes, for each $(s, g, a)$, how well the agent can reach any goal $g$ and with what return or cost. The associated value $\bar{V}(s,g) = \max_a \bar{Q}(s,g,a)$ encodes the best value starting from $s$ while aiming for $g$ (Tasse et al., 2022).
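The goal-conditioned reward reshaping above can be sketched concretely as follows. This is a minimal illustration, not the authors' implementation: the string-based state encoding, the penalty value, and the function name are all assumptions.

```python
def wvf_reward(s_next, g, r, goal_states, penalty=-10.0):
    """Goal-conditioned reward R_bar(s, g, a, s') for a World Value Function.

    Terminating at any absorbing state other than the intended goal g
    incurs a large penalty C_neg_g; otherwise the environment reward
    R(s, a, s') passes through unchanged.
    """
    if s_next in goal_states and s_next != g:
        return penalty  # C_neg_g: terminated at the wrong goal
    return r            # R(s, a, s') otherwise
```

Given two absorbing states, aiming for one and landing in the other triggers the penalty, while all other transitions keep the environment reward.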

Further generalizations adopt capability-oriented value estimation in domains beyond traditional RL. In LLM training, $V_0$ is a generalist value model that predicts, from a representation of policy capability (as a context of instruction–success pairs), the likely performance on a new prompt, reifying capability into an explicit input alongside the task (Zhang et al., 3 Feb 2026).

In multi-agent and game-theoretic settings, the “capability transfer function” maps each agent’s strategy-space size (capability) to its equilibrium payoff, enabling explicit analysis of how increased action or planning flexibility translates into achievable value (Jia et al., 2022).

2. Learning Algorithms and Convergence Properties

Capability-oriented value functions are trained by propagating learning signals across all achievable goals or targets simultaneously. In the WVF paradigm, every observed transition updates $\bar{Q}(s, g, a)$ for all $g \in G$ using a unified empirical stream, in contrast to separate, task-specific runs.

For each episode:

  • After observing start state $s$ and intended goal $g$, the agent acts according to an $\epsilon$-greedy policy with respect to $\bar{Q}(s, g, a)$.
  • Upon transitioning to $s'$, $s'$ is added to $G$ if newly absorbing.
  • For all $g' \in G$, the agent computes

$$\delta = \bar{R}(s, g', a, s') + \max_{a'} \bar{Q}(s', g', a') - \bar{Q}(s, g', a)$$

and updates each $\bar{Q}(s, g', a)$ accordingly.
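The per-transition update across all internal goals can be sketched in tabular form. This is a minimal illustration under stated assumptions: the dictionary-keyed Q-table, the step size, and the penalty constant are choices made here for concreteness, not specifics from the cited papers.

```python
from collections import defaultdict

def wvf_update(Q, s, a, s_next, r, goals, actions, alpha=0.1, penalty=-10.0):
    """Propagate one observed transition (s, a, s', r) to every goal g' in G.

    Q maps (state, goal, action) -> value. For each goal, the reward is
    reshaped (penalty for terminating at the wrong goal), and a standard
    Q-learning step is applied with the goal-conditioned TD error delta.
    """
    terminal = s_next in goals
    for g in goals:
        # Goal-conditioned reward: penalise terminating at any goal other than g.
        r_bar = penalty if (terminal and s_next != g) else r
        # Bootstrap target is zero once an absorbing state is reached.
        target = 0.0 if terminal else max(Q[(s_next, g, a2)] for a2 in actions)
        delta = r_bar + target - Q[(s, g, a)]
        Q[(s, g, a)] += alpha * delta
    return Q
```

A single environment step thus yields learning signal for every goal in $G$ at once, which is the source of the sample-efficiency benefit described above.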

As formalized in (Tasse et al., 2022), standard assumptions on the step size (e.g., Robbins–Monro conditions on $\alpha$) guarantee convergence of each goal-conditioned Q-learning process to its fixed point $\bar{Q}^*$. The parallelization across goals ensures the acquisition of a full mastery map.

In sample-efficient or deep-RL settings, practitioners may parameterize capability-oriented value functions as neural networks taking both state (or prompt) and goal (or policy context) as input—learning a general mapping from state-goal (or context-task) pairs to expected return.

3. Theoretical Properties and Knowledge Representation

A defining strength of capability-oriented value functions lies in their dual function as knowledge bases:

  • Reachability and reward for all goals: $\bar{Q}(s,g,a)$ encodes not just how to achieve the current task, but all tasks definable within the internal goal space. This supports fine-grained transfer and flexible task adaptation (Tasse et al., 2022, Tasse et al., 2022).
  • Implicit system identification: When $G = \mathcal{S}$, the Bellman optimality equations over $\{\bar{Q}^*(s,g,a)\}_{g \in \mathcal{S}}$ uniquely identify the transition kernel $P$; solving the linear system allows recovery of environment dynamics for planning (Tasse et al., 2022).
  • Zero-shot policy and value recovery: For any external terminal reward $R_\mathrm{new}$, action-value functions can be instantly adapted by reweighting the existing WVF—eliminating retraining and enabling lifelong learning.
  • Combinatorial skill composition: Logical operations on learned WVFs (e.g., AND, OR, NOT) produce value functions for composite skills or goals, vastly expanding the agent's skill set without additional environment interaction.

Capability-oriented value functions thus characterize, in precise mathematical terms, the full capability set—the affordances—of an agent in its environment, and support upstream knowledge-based planning.
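The logical skill composition described above reduces, in the compositional-RL literature, to elementwise operations on learned value tables. A minimal sketch follows; treating AND as an elementwise minimum and OR as an elementwise maximum is the common approximation in that literature, and the array layout is an assumption made here.

```python
import numpy as np

def compose_and(Q1, Q2):
    """AND: value of achieving both goals, approximated by the elementwise
    minimum of the two learned WVF tables."""
    return np.minimum(Q1, Q2)

def compose_or(Q1, Q2):
    """OR: value of achieving either goal, given by the elementwise
    maximum of the two learned WVF tables."""
    return np.maximum(Q1, Q2)
```

Because composition is a pure array operation on already-learned tables, the composite skill requires no additional environment interaction.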

4. Practical Algorithms in Scalable RL and LLM Systems

In practical large-scale learning, capability-oriented value functions are central for efficient training, sample allocation, and model routing:

  • Generalist Value Models ($V_0$): $V_0(x, \mathcal{C}_\pi)$ predicts, for prompt $x$ and context $\mathcal{C}_\pi$ summarizing model $\pi$'s recent performance, the success rate on $x$. Unlike traditional critics, $V_0$ requires no coupled retraining—new capabilities are profiled via context, not parameters (Zhang et al., 3 Feb 2026).
  • Capability-Oriented Budget Allocation: CoBA-RL introduces a closed-form function $V_\mathrm{cap}(B_i, p_i)$ combining task pass rate $p_i$, model failure rate, and diminishing-returns saturation, to optimize rollout budgets adaptively for RL with verifiable rewards in LLMs. A heap-based greedy allocation maximizes expected total value in $O(B_\mathrm{total} \log M)$ complexity and yields consistent empirical performance gains (Yao et al., 3 Feb 2026).

Such algorithms reify capacity as both an explicit function argument and as a learned knowledge structure, separating trainable representations from evolving agent or policy parameters.
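The heap-based greedy allocation pattern can be sketched as follows. The exact form of $V_\mathrm{cap}$ is not reproduced here; the code accepts any diminishing-returns marginal-gain function, and the concave `gain` used in the usage example is an illustrative stand-in, not the CoBA-RL formula.

```python
import heapq

def greedy_allocate(total_budget, pass_rates, gain):
    """Assign rollout budget units one at a time to the task with the
    largest marginal gain, via a max-heap (heapq is a min-heap, so
    gains are negated).

    gain(b, p) is the marginal value of the (b+1)-th rollout for a task
    with pass rate p; any concave (diminishing-returns) gain function
    preserves the greedy optimality argument.
    """
    budgets = [0] * len(pass_rates)
    heap = [(-gain(0, p), i) for i, p in enumerate(pass_rates)]
    heapq.heapify(heap)
    for _ in range(total_budget):
        _, i = heapq.heappop(heap)          # task with largest marginal gain
        budgets[i] += 1
        heapq.heappush(heap, (-gain(budgets[i], pass_rates[i]), i))
    return budgets
```

With a gain such as `lambda b, p: p * (1 - p) / (b + 1)`, budget concentrates on mid-difficulty tasks (pass rate near 0.5), where each rollout is most informative, while near-solved and near-impossible tasks receive little.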

5. Extensions: Skill-Centric Abstractions and Affordances

The concept of capability-oriented value function undergirds several adjacent ideas:

  • Value Function Spaces (VFS): Embeds states as $k$-dimensional vectors of skill values $\phi(s) = [V^{\pi_1}(s), \dots, V^{\pi_k}(s)]$, mapping directly to the set of what can be done from each state, robustly ignoring distractors uncorrelated with skill success. These embeddings support both model-free and model-based higher-level reasoning (Shah et al., 2021).
  • General Value Functions and Affordances: Treating each skill (option) as a policy $\tau$ with its own general value function $V_c^\tau(s)$, the entire suite of learned GVFs constitutes a map of affordances—state-dependent answers to “what can I do, and how well?”—with practical implications in robotic control, perception, and hierarchical planning (Graves et al., 2020).

This generalizes the paradigm: by stacking or parameterizing across skills, options, or policies, value functions transition from reward-specific predictors to comprehensive capability maps.
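The VFS embedding is straightforward to express in code. A minimal sketch, assuming each learned skill critic is available as a callable from state to value (the toy callables in the usage example are placeholders for trained value functions):

```python
import numpy as np

def vfs_embedding(state, skill_value_fns):
    """Value Function Space embedding: phi(s) = [V^pi_1(s), ..., V^pi_k(s)].

    Each coordinate answers "how well can skill i succeed from this state?",
    so features uncorrelated with any skill's success do not enter the
    representation.
    """
    return np.array([v(state) for v in skill_value_fns])
```

A higher-level policy or planner can then operate directly on these $k$-dimensional embeddings instead of raw observations.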

6. Applications and Empirical Findings

Capability-oriented value functions have demonstrated utility in a range of domains:

  • Lifelong and multitask RL: Agents equipped with WVFs achieve combinatorial skill composition, zero-shot task adaptation, and rapid transfer across new goals (Tasse et al., 2022, Tasse et al., 2022).
  • Efficient LLM post-training: $V_0$ enables cost-performance routing and efficient sampling budget allocation, achieving Pareto-optimal trade-offs that outperform fixed-budget and heuristic baselines in large-scale model selection and training (Zhang et al., 3 Feb 2026).
  • Robust abstraction and planning: VFS and GVF-based approaches support robust long-horizon planning and reasoning under systematic distractors, yielding superior generalization and sample efficiency (Shah et al., 2021, Graves et al., 2020).
  • Game theory and multi-agent systems: Capability transfer functions provide explicit, closed-form characterization of how strategy space expansion translates into equilibrium payoffs in mixed capability games (Jia et al., 2022).

Empirically, these approaches deliver substantial gains in sample efficiency, generalization, zero-shot performance, and robustness to nonstationarity and distribution shift.

7. Theoretical Significance and Future Directions

Capability-oriented value functions recast the core objectives of value prediction and RL from fixed-task optimization to capability maximization and knowledge representation. By supplying a structured, learnable, and compositional inventory of agent affordances, they support advances across:

  • Zero-shot and continual learning, where skills are transferred, composed, and adapted with minimal re-exploration.
  • Hierarchical abstraction, with value functions serving as the basis for high-level policy selection and dynamic option chaining.
  • Scalable model selection and adaptive training in foundation models, where evolving capabilities are profiled and utilized without per-policy retraining or synchronization.

Ongoing directions include joint optimization of skills and their value function representation, unsupervised capability discovery leading to unsupervised abstraction, and extension to partially observable and multi-agent regimes (Tasse et al., 2022, Zhang et al., 3 Feb 2026, Shah et al., 2021). The capability-oriented value function paradigm thus continues to underpin unified advances in adaptive intelligence, knowledge-based planning, and efficient, generalizable learning.
