Unsupervised Reward Functions
- Unsupervised reward functions are mechanisms that derive intrinsic rewards in RL by exploiting agent experiences and environmental cues, reducing reliance on manual reward specification.
- They integrate semi-supervised frameworks and IRL-inspired methods, leveraging limited labeled data and the agent's own demonstration trajectories to improve policy generalization.
- Applications in high-dimensional domains like robotics and visual control demonstrate that these methods improve cumulative rewards and mitigate distributional shifts.
Unsupervised reward functions are mechanisms by which reinforcement learning (RL) agents identify, construct, or infer task objectives in the absence of explicit, extrinsic reward signals. Instead of relying on manually specified rewards or supervised annotation, these functions typically exploit the agent's prior experiences, environmental structure, or intrinsic metrics such as diversity, predictability, or information gain to shape behavior. This paradigm is motivated by challenges in real-world domains, such as robotics or high-dimensional perception, where explicit reward engineering is costly, error-prone, or infeasible. Unsupervised reward function design underpins a variety of approaches, including semi-supervised reinforcement learning, skill discovery, inverse RL from self-demonstrations, and self-supervised exploration in unlabeled Markov decision processes (MDPs).
1. Problem Formalization and Semi-Supervised Frameworks
A central formulation for unsupervised reward function design is the semi-supervised RL framework, as presented in "Generalizing Skills with Semi-Supervised Reinforcement Learning" (Finn et al., 2016). In this setting, environment instances are partitioned into two sets: a labeled set of MDPs $\mathcal{M}_L$, in which the reward function $r(s, a)$ is available, and an unlabeled set $\mathcal{M}_U$, where no explicit reward is specified. The agent’s objective is to learn a policy $\pi$ that maximizes the average return over the full distribution of tasks:
$$\max_{\pi} \; \mathbb{E}_{M \sim p(M)} \, \mathbb{E}_{\tau \sim \pi,\, M}\!\left[\sum_{t} \gamma^{t} r(s_t, a_t)\right],$$
where the task distribution $p(M)$ spans both $\mathcal{M}_L$ and $\mathcal{M}_U$. The challenge is to generalize policy improvement from settings with an explicit reward signal to those without, effectively leveraging rewards in $\mathcal{M}_L$ as "demonstrations" to infer or reconstruct an objective in $\mathcal{M}_U$.
In practical terms, the learning process comprises:
- Direct policy optimization in labeled MDPs using the known reward $r(s, a)$,
- Transfer of experience to unlabeled MDPs, where inferred or proxy rewards must guide learning in the absence of $r$,
- Joint training (e.g., of a neural policy) on merged data to enhance generalization across task variation.
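The following minimal sketch illustrates this labeled/unlabeled training loop under simplifying assumptions: a toy chain MDP, a single-parameter stochastic policy, and a hand-specified proxy reward standing in for the inferred signal of Section 2. None of these components come from the cited method; they only show how returns from $\mathcal{M}_L$ (true reward) and $\mathcal{M}_U$ (proxy reward) can be pooled into one objective.

```python
import random
from typing import Callable, List, Tuple

# Minimal sketch of the semi-supervised loop described above. The toy chain
# MDP, the one-parameter policy, and the hand-written proxy reward are all
# illustrative assumptions, not components of the cited method.

Rollout = List[Tuple[int, int]]          # list of (state, action) pairs
RewardFn = Callable[[int, int], float]

GOAL = 10  # rightmost state of the toy chain


def rollout(p_right: float, horizon: int = 25) -> Rollout:
    """Roll out a one-parameter stochastic policy in a toy chain MDP."""
    state, traj = 0, []
    for _ in range(horizon):
        action = 1 if random.random() < p_right else 0
        traj.append((state, action))
        state = max(0, min(GOAL, state + (1 if action == 1 else -1)))
    return traj


def disc_return(traj: Rollout, r: RewardFn, gamma: float = 0.99) -> float:
    """Discounted return of a trajectory under a given reward function."""
    return sum(gamma ** t * r(s, a) for t, (s, a) in enumerate(traj))


# Labeled MDPs expose the true reward; unlabeled MDPs only yield transitions.
true_reward: RewardFn = lambda s, a: 1.0 if s == GOAL else 0.0
# Proxy reward inferred from labeled experience (stand-in for the IRL-style
# inference described in Section 2).
proxy_reward: RewardFn = lambda s, a: 1.0 if s >= GOAL - 2 else 0.0


def train(iters: int = 200) -> float:
    """Hill-climb a single policy parameter on returns pooled across labeled
    MDPs (true reward) and unlabeled MDPs (inferred proxy reward)."""
    p = 0.5
    for _ in range(iters):
        candidates = [min(1.0, p + 0.05), max(0.0, p - 0.05), p]

        def score(q: float) -> float:
            labeled = disc_return(rollout(q), true_reward)      # term from M_L
            unlabeled = disc_return(rollout(q), proxy_reward)   # term from M_U
            return labeled + unlabeled                          # joint objective

        p = max(candidates, key=score)
    return p


if __name__ == "__main__":
    print("trained P(move right):", round(train(), 2))
```

In practice the hill-climbing step would be replaced by gradient-based optimization of a neural network policy, but the pooling of labeled and unlabeled returns into a single objective is the essential structure.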
2. IRL-Inspired Inference of Reward Functions
The bootstrapping of reward functions in unlabeled settings leverages an algorithmic structure similar to inverse reinforcement learning (IRL). After initial supervised training on $\mathcal{M}_L$, the agent’s behavior embodies information about the task objective. An IRL-like procedure treats the agent’s own rollouts in labeled settings as "optimal" demonstrations. By comparing the trajectories encountered in the unlabeled setting with these, it infers a reward signal that, if optimized, would reproduce the behaviors observed in $\mathcal{M}_L$:
- The rollouts generated by the agent’s historical policy act as a set of demonstrations,
- The IRL objective constructs a reward model that matches the distribution over trajectories in labeled MDPs,
- This inferred reward is used to provide policy optimization signals in $\mathcal{M}_U$, aligning exploration towards behaviors judged optimal under the (now proxy) objective.
Mathematically, the reward inference in unlabeled MDPs often minimizes a discrepancy function such as:
$$\min_{\tilde r} \; \mathbb{E}_{\tau \sim \pi_{\text{demo}}}\!\left[\sum_{t} \big(\tilde r(s_t, a_t) - r(s_t, a_t)\big)^{2}\right],$$
where $\tilde r$ is the proxy/inferred reward and the expectation is over trajectories generated by the temporary or demonstration policy $\pi_{\text{demo}}$.
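As a hedged illustration of this discrepancy-minimization step, the sketch below fits a linear proxy reward $\tilde r(s, a) = w^{\top}\phi(s, a)$ by least squares against rewards observed along labeled demonstration rollouts. The feature map $\phi$, the ridge regularizer, and the synthetic data are assumptions made for the example, not the model used in the cited work.

```python
import numpy as np

# Hedged sketch of the discrepancy-minimization step above: fit a linear proxy
# reward w . phi(s, a) to rewards observed along demonstration rollouts from
# the labeled MDPs. The feature map, ridge term, and synthetic data below are
# illustrative assumptions, not the cited algorithm's model.


def features(states: np.ndarray, actions: np.ndarray) -> np.ndarray:
    """Simple hand-picked feature map phi(s, a) = [s, a, s*a, 1]."""
    return np.stack([states, actions, states * actions, np.ones_like(states)], axis=1)


def fit_proxy_reward(states, actions, rewards, ridge: float = 1e-3) -> np.ndarray:
    """Minimize the empirical squared discrepancy between proxy and true reward
    over demonstration transitions via ridge-regularized least squares."""
    phi = features(states, actions)
    A = phi.T @ phi + ridge * np.eye(phi.shape[1])
    return np.linalg.solve(A, phi.T @ rewards)


def proxy_reward(w: np.ndarray, states: np.ndarray, actions: np.ndarray) -> np.ndarray:
    """Evaluate the inferred reward, e.g. in unlabeled MDPs where r is unknown."""
    return features(states, actions) @ w


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic (state, action, reward) transitions from labeled rollouts.
    s = rng.integers(0, 11, size=500).astype(float)
    a = rng.integers(0, 2, size=500).astype(float)
    r = (s >= 10).astype(float)  # true reward, observable only in labeled MDPs
    w = fit_proxy_reward(s, a, r)
    print("inferred reward at goal state:", proxy_reward(w, np.array([10.0]), np.array([1.0])))
```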
3. Generalization to High-Dimensional and Visual Domains
A major impetus for unsupervised reward learning is the need for generalization in high-dimensional domains, especially those involving raw sensory input (e.g., images). The semi-supervised RL strategy demonstrates robustness in this context by:
- Training deep neural network policies on both labeled and unlabeled data,
- Employing inferred reward signals in $\mathcal{M}_U$ as regularizers, guiding the network to robust behavior (see the sketch below),
- Demonstrating success in visual control tasks such as robotic manipulation or locomotion, maintaining performance despite domain shift or background variation.
Unsupervised reward functions thus serve a dual purpose: supplementing sparse extrinsic rewards, and regularizing learning in the presence of high variability and partial supervision.
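A minimal sketch of the regularization pattern referenced above is given below: a REINFORCE-style surrogate loss on labeled data is pooled with a $\lambda$-weighted analogue on unlabeled data scored by the inferred reward. The loss form and the coefficient name `lam` are illustrative assumptions rather than the cited architecture.

```python
import numpy as np

# Sketch of the regularization pattern above: pool a policy-gradient surrogate
# loss from labeled MDPs with a lam-weighted analogue from unlabeled MDPs whose
# returns are computed under the inferred reward. The loss form and the
# coefficient name `lam` are assumptions for illustration only.


def combined_policy_loss(logp_labeled: np.ndarray, returns_labeled: np.ndarray,
                         logp_unlabeled: np.ndarray, proxy_returns: np.ndarray,
                         lam: float = 0.5) -> float:
    """REINFORCE-style surrogate: -E[log pi * R] on labeled data, plus a
    lam-weighted term on unlabeled data scored by the inferred reward."""
    labeled_term = -np.mean(logp_labeled * returns_labeled)
    unlabeled_term = -np.mean(logp_unlabeled * proxy_returns)
    return float(labeled_term + lam * unlabeled_term)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    loss = combined_policy_loss(
        logp_labeled=rng.normal(-1.0, 0.1, 64), returns_labeled=rng.random(64),
        logp_unlabeled=rng.normal(-1.0, 0.1, 64), proxy_returns=rng.random(64),
    )
    print("combined surrogate loss:", round(loss, 3))
```

The coefficient `lam` controls how strongly the proxy-reward term regularizes learning; in a visual-control setting both terms would be computed from the same image-conditioned policy network.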
4. Comparative Empirical Evaluation
Empirical studies confirm that incorporating unsupervised (or semi-supervised) reward functions leads to improved policy generalization. Key findings include:
- Policies trained via this approach exhibit superior average cumulative rewards, often matching or outperforming fully supervised baselines,
- Integration of inferred reward functions narrows the performance gap between policies trained on fully labeled and partially labeled environments,
- Quantitative evaluations use standard RL objectives (e.g., maximization of discounted returns), as well as direct comparison of inferred and "ground truth" rewards,
- Performance improvements are especially pronounced in domains where direct reward learning suffers from limited coverage or overfitting.
5. Connections, Limitations, and Extensions
The semi-supervised reward learning paradigm provides a template for broader unsupervised reward discovery. Key points include:
- The approach leverages self-supervised signals ("agent as its own demonstrator") as an alternative to external annotation,
- It partially addresses the reward ambiguity problem in IRL by anchoring inference to regions of state space where the reward is specified,
- It mitigates distribution shift between labeled and unlabeled domains by leveraging the generalization properties of deep function approximators and explicit regularization,
- The method naturally extends to settings beyond supervised labeling, supporting autonomous reward signal construction and intrinsic motivation schemes.
Remaining challenges include:
- Fully unsupervised settings (no access to a labeled set $\mathcal{M}_L$), where reward ambiguity and alignment remain significant open problems,
- Inference in highly nonstationary or multi-task environments, where the temporal transfer of reward signal may be inconsistent,
- Scaling inference techniques (e.g., efficient IRL) to high-dimensional or online domains.
6. Theoretical Formulation and Policy Learning Guarantees
The underlying mathematical models provide guarantees and insights:
- RL objective formulation: $\max_{\pi} \; \mathbb{E}_{M \sim p(M)} \, \mathbb{E}_{\tau \sim \pi,\, M}\!\left[\sum_{t} \gamma^{t} r(s_t, a_t)\right]$,
- Empirical error metrics: $\mathbb{E}_{\tau \sim \pi_{\text{demo}}}\!\left[\sum_{t} \big(\tilde r(s_t, a_t) - r(s_t, a_t)\big)^{2}\right]$,
- Policy optimality in unlabeled domains is bounded by the reward generalization error: as the approximation error of $\tilde r$ diminishes, policy performance approaches that of policies trained under fully labeled supervision.
These models ensure that performance is at least as good as that attainable by learning directly from the available supervised reward alone, and often strictly better when unlabeled experience is leveraged.
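One standard way to make this guarantee concrete is a simulation-lemma-style bound; the uniform reward-error assumption below is an illustrative choice, not a condition stated above. If the inferred reward satisfies $\sup_{s,a} |\tilde r(s,a) - r(s,a)| \le \epsilon$, then for every policy $\pi$ and state $s$,
$$\big|V^{\pi}_{r}(s) - V^{\pi}_{\tilde r}(s)\big| \le \frac{\epsilon}{1-\gamma}, \qquad V^{\pi^{*}_{r}}_{r}(s) - V^{\pi^{*}_{\tilde r}}_{r}(s) \le \frac{2\epsilon}{1-\gamma},$$
where $\pi^{*}_{r}$ and $\pi^{*}_{\tilde r}$ denote policies optimal under the true and inferred rewards, respectively. As $\epsilon \to 0$, optimizing the proxy reward in $\mathcal{M}_U$ recovers the performance of fully supervised training, consistent with the statement above.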
7. Implications for Practical Unsupervised Reward Function Design
The semi-supervised RL conceptual framework offers actionable strategies for practical implementation:
- Leveraging agent self-demonstrations as reward-inference data in the absence of explicit supervisory signals,
- Utilizing IRL-style algorithms to reconstruct or extrapolate rewards in unlabeled environments, mitigating the need for exhaustive reward engineering,
- Structuring policy optimization to exploit partial supervision, increasing robustness and generalizability in high-variability applications,
- Addressing challenges of reward ambiguity and distributional shift via architectural regularization and shared representation learning across $\mathcal{M}_L$ and $\mathcal{M}_U$.
The methodology provides a pathway for agents to autonomously acquire objectives in realistically complex settings, especially where explicit, dense reward specification is infeasible or incomplete. Extensions of these ideas underpin a wide range of modern approaches in unsupervised skill acquisition, control from observation, and scalable autonomous system design.