Direct Reward Finetuning for Task-Aligned Models

Updated 16 October 2025
  • Direct Reward Finetuning adapts pre-trained models by directly optimizing their behavior against a reward function, bridging unsupervised pretraining and task-aware optimization.
  • The approach avoids costly environment rollouts by repurposing pretraining artifacts such as skill discriminators as intrinsic reward models, which are matched to task rewards using invariant metrics such as the EPIC loss.
  • Empirical results in robotics and language tasks show that these techniques improve sample efficiency and enable near-zero-shot policy adaptation.

Direct Reward Finetuning is a family of post-training techniques wherein a pre-trained model—often in vision, language, or control domains—is adapted by directly optimizing its behavior with respect to a user-defined reward function. Unlike conventional approaches that rely exclusively on imitation (supervised fine-tuning, SFT) or expensive environment interaction (standard reinforcement learning, RL), Direct Reward Finetuning leverages reward-driven objectives, often with minimal or no additional environment samples, to efficiently steer models towards task-aligned or human-preferred outcomes. Recent research demonstrates that these methods can bridge the gap between unsupervised pretraining and task-aware optimization, improve sample efficiency, and yield performance gains in areas ranging from robotics and vision to natural language and agent alignment.

1. Methodological Foundations

Direct Reward Finetuning encompasses a range of methodologies that inject reward-based optimization into the model adaptation pipeline. A canonical example is Intrinsic Reward Matching (IRM) (Adeniji et al., 2022), which unifies unsupervised skill discovery and task-specific adaptation in control settings. During unsupervised pretraining, a skill discriminator $q_\phi(\tau, z)$ (mapping trajectory-skill pairs to density ratios or discriminative scores) is trained alongside skill-conditioned policies. Standard practice discards this discriminator after pretraining; IRM instead reinterprets it as an intrinsic reward model:

$$R^{\text{int}}(\tau, z) := q_\phi(\tau, z)$$

For each extrinsic task reward $R^{\text{ext}}(\tau)$, IRM identifies the skill latent $z^*$ that minimizes a reward similarity metric (the EPIC loss) between the intrinsic and extrinsic reward functions:

$$z^* = \arg\min_z D_{\text{EPIC}}\left( R^{\text{int}}(\tau, z),\, R^{\text{ext}}(\tau) \right)$$

This matching occurs entirely offline, using precomputed trajectory-reward tuples and without requiring additional environment rollouts.
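
As a concrete illustration, the following Python sketch shows how such offline skill selection could be set up. It is a minimal hypothetical example, not the paper's implementation: `skill_discriminator`, `extrinsic_reward`, the transition buffers, and the `epic_distance` helper (sketched in Section 2 below) are all assumed interfaces, and the random candidate search stands in for the continuous optimization discussed in Section 3.

```python
import numpy as np

def select_skill(skill_discriminator, extrinsic_reward, epic_distance,
                 states, next_states, canon_states,
                 num_candidates=256, latent_dim=8, seed=0):
    """Pick the skill latent whose intrinsic reward best matches the task reward.

    The search is purely offline: every candidate is scored on precomputed
    (state, next_state) pairs, with no additional environment rollouts.
    """
    rng = np.random.default_rng(seed)
    best_z, best_dist = None, np.inf
    for _ in range(num_candidates):
        z = rng.normal(size=latent_dim)  # candidate skill latent
        # Reuse the pretraining discriminator as the intrinsic reward for this z.
        r_int = lambda s, s_next, z=z: skill_discriminator(s, s_next, z)
        dist = epic_distance(r_int, extrinsic_reward,
                             states, next_states, canon_states)
        if dist < best_dist:
            best_z, best_dist = z, dist
    return best_z, best_dist
```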

This methodological motif—repurposing pretraining artifacts (discriminators, policy logits, or implicit reward statistics) as reward functions for downstream finetuning—recurs in other modalities. Unlike RL pipelines that compute gradients via sampled rollouts or rely on separately trained reward models, Direct Reward Finetuning often exploits reward signals derived from existing pretraining components or dense, differentiable surrogates, thereby improving efficiency and consistency.

2. EPIC Loss and Reward Matching

A distinctive feature of IRM is the use of the EPIC loss to compare reward functions. The EPIC loss is an invariant metric defined as:

$$D_{\text{EPIC}}(R_A, R_B) = \mathbb{E}_{s_p, s'_p \sim D_p,\; S_C, S'_C \sim D_C}\left[ D_\rho\!\left( C(R_A)(s_p, s'_p, S_C, S'_C),\ C(R_B)(s_p, s'_p, S_C, S'_C) \right) \right]$$

Here, $C(\cdot)$ denotes a canonicalization operator and $D_\rho$ the Pearson distance; together they make the metric invariant to potential-based shaping and positive affine rescaling of the rewards. By minimizing this invariant metric over the continuous skill space, IRM identifies the policy primitives most semantically aligned with the extrinsic task. This approach generalizes beyond discrete skill matching, enabling fine-grained, continuous optimization without environment interaction.
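
A minimal NumPy sketch of such a distance is given below, under simplifying assumptions that are not taken from the paper: rewards are treated as functions of batched (state, next state) pairs only (no actions), and the canonicalizing expectations are approximated with a small set of canonical state samples.

```python
import numpy as np

def pearson_distance(x, y):
    """D_rho(x, y) = sqrt((1 - corr(x, y)) / 2), bounded in [0, 1]."""
    rho = np.corrcoef(x, y)[0, 1]
    return np.sqrt(max(0.0, 1.0 - rho) / 2.0)

def canonicalize(reward_fn, s, s_next, s_canon, gamma=0.99):
    """EPIC-style canonicalization for state-based rewards (simplified).

    Adds shaping-correction terms estimated from canonical state samples so the
    result is insensitive to potential-based shaping and constant shifts.
    """
    m = len(s_canon)
    r = reward_fn(s, s_next)
    mean_from_next = np.zeros(len(s))
    mean_from_s = np.zeros(len(s))
    for c in s_canon:
        c_batch = np.broadcast_to(c, s.shape)
        mean_from_next += reward_fn(s_next, c_batch) / m   # E[R(s', S'_C)]
        mean_from_s += reward_fn(s, c_batch) / m           # E[R(s, S'_C)]
    mean_canon = 0.0
    for c1 in s_canon:
        for c2 in s_canon:
            mean_canon += reward_fn(c1[None, :], c2[None, :])[0] / (m * m)
    return r + gamma * mean_from_next - mean_from_s - gamma * mean_canon

def epic_distance(reward_a, reward_b, s, s_next, s_canon, gamma=0.99):
    """Pearson distance between the two canonicalized rewards on shared samples."""
    return pearson_distance(
        canonicalize(reward_a, s, s_next, s_canon, gamma),
        canonicalize(reward_b, s, s_next, s_canon, gamma),
    )
```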

The reward matching paradigm is generalizable: related works—such as direct RLHF variants in language modeling—compare intrinsic (implicit) and extrinsic (task or human feedback–provided) rewards to identify policies or skills with maximum functional similarity, often by aligning metrics at the level of distributions or mutual information.

3. Comparison with Conventional Finetuning

Direct Reward Finetuning departs from standard policy finetuning workflows that rely on costly sampling or grid search. In classic settings, pretrained skills are empirically evaluated—typically via environment rollouts or grid search in discrete/continuous latent spaces—to select the most effective skill for a downstream task. These approaches are hindered by the curse of dimensionality and high interaction costs.

IRM and related methods circumvent these problems by:

  • Avoiding environment rollouts during skill selection.
  • Enabling zero-shot or near-zero-shot adaptation via offline reward matching.
  • Allowing the use of a continuous (potentially uncountable) skill space for optimization, rather than the coarse discretization required by rollout-based search (see the sketch below).

Numerical evidence shows that IRM achieves rewards higher than or comparable to those of rollout-based and grid-search baselines, with markedly greater sample efficiency.
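
To make the contrast concrete, the sketch below replaces the candidate enumeration of Section 1 with a gradient-free search directly over the continuous latent space. It is hypothetical code built on the same assumed interfaces (`skill_discriminator`, `extrinsic_reward`, `epic_distance`); the actual optimizer used in practice may differ, e.g. gradient-based optimization when the discriminator is differentiable.

```python
import numpy as np
from scipy.optimize import minimize

def select_skill_continuous(skill_discriminator, extrinsic_reward, epic_distance,
                            states, next_states, canon_states,
                            latent_dim=8, restarts=8, seed=0):
    """Optimize the skill latent directly in the continuous space.

    Every objective evaluation reuses the same offline transition buffer, so
    refining z is decoupled from environment interaction entirely.
    """
    rng = np.random.default_rng(seed)

    def objective(z):
        r_int = lambda s, s_next: skill_discriminator(s, s_next, z)
        return epic_distance(r_int, extrinsic_reward,
                             states, next_states, canon_states)

    best = None
    for _ in range(restarts):  # multi-start to reduce sensitivity to local minima
        z0 = rng.normal(size=latent_dim)
        result = minimize(objective, z0, method="Nelder-Mead")
        if best is None or result.fun < best.fun:
            best = result
    return best.x, best.fun
```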

4. Sequential Skill Composition for Long-Horizon Tasks

IRM generalizes naturally to the orchestration of skill sequences in long-horizon or complex tasks. The overall task horizon $H$ is partitioned into $N$ segments (e.g., corresponding to subtasks). In each segment, the intrinsic-to-extrinsic reward matching (via EPIC minimization) selects a skill tailored to the reward structure of that interval. The resulting composite policy:

$$\text{Policy} = \text{Concat}\left( \pi(\cdot \mid s, z_1), \pi(\cdot \mid s, z_2), \dotsc, \pi(\cdot \mid s, z_N) \right)$$

delivers a sequence of pretrained skills, each chosen via reward functional alignment for the associated part of the task. This modularity supports hierarchical RL and extends the utility of pretrained skill repertoires for complex manipulation and multi-stage control.
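
A schematic of this composition, under the same hypothetical interfaces as the earlier sketches (a skill-conditioned `policy(state, z)`, per-segment transition buffers, and a `select_skill_fn` such as the matching routines above), might look as follows; equal-length segments are assumed purely for simplicity.

```python
import numpy as np

def compose_skills(select_skill_fn, policy, segment_buffers, horizon):
    """Match one skill per subtask segment, then play them back in sequence.

    segment_buffers: list of (states, next_states, extrinsic_reward_fn) tuples,
    one per segment of the task horizon.
    """
    # Per-segment reward matching (offline, no rollouts needed).
    latents = [select_skill_fn(r_ext, states, next_states)[0]
               for states, next_states, r_ext in segment_buffers]

    segment_len = int(np.ceil(horizon / len(latents)))  # equal-length segments (assumption)

    def composite_policy(state, t):
        # Switch to the matched skill for the segment containing timestep t.
        z = latents[min(t // segment_len, len(latents) - 1)]
        return policy(state, z)

    return composite_policy
```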

5. Empirical Performance and Application Domains

IRM has been empirically validated on robotic manipulation benchmarks. In Fetch Tabletop tasks (e.g., Reach Target, Block Push), IRM-selected skills achieved high zero-shot returns without additional policy finetuning. In the more challenging Franka Kitchen environment (including cabinet manipulation and light switch tasks), IRM frequently enabled task completion even prior to any downstream gradient updates.

Across these settings, IRM:

  • Outperformed rollout-based and grid search approaches in task reward, especially in continuous skill spaces where exhaustive or even coarse sampling is prohibitive.
  • Offered immediate deployability for real-world and sample-constrained tasks due to its interaction-free optimization.
  • Demonstrated superior generalization in sequential skill selection, leading to robust long-horizon behavior in multi-stage control.

These findings position Direct Reward Finetuning not only as a way to improve efficiency, but as a principled method to translate unsupervised behavior primitives into task-specialized policies in robotic, vision, and language agent pipelines.

6. Theoretical Insights and Limitations

The mathematical formulation of IRM—rooted in variational mutual information estimation and invariant reward function matching—yields several insights:

  • The mutual information lower-bound maximized during unsupervised skill discovery is inherently connected to the expressiveness and functional diversity of the intrinsic reward (the standard variational bound is restated after this list).
  • Canonicalization of rewards (via EPIC loss) is essential to prevent spurious alignment due to scaling or shifting, ensuring the optimization seeks functional correspondence rather than magnitude matching.
  • Skill discriminators, typically discarded during transfer, encode task-agnostic abstractions that can be repurposed for rapid task adaptation.
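
For reference, the variational lower bound commonly maximized in such skill-discovery methods (e.g., DIAYN-style objectives) can be written as follows; the notation here is generic and may differ from the paper's.

$$I(\tau; z) \;\ge\; \mathbb{E}_{z \sim p(z),\, \tau \sim \pi(\cdot \mid z)}\left[ \log q_\phi(z \mid \tau) - \log p(z) \right]$$

Maximizing the right-hand side is what trains the discriminator $q_\phi$ that IRM later reuses as the intrinsic reward.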

However, limitations remain. The approach assumes that the intrinsic reward functions learned during pretraining have sufficient coverage to model the downstream reward structures encountered in transfer. The expressiveness and granularity of skill discriminators may also limit transfer performance in domains where the reward structure is significantly misaligned with the pretraining manifold.

7. Broader Impact and Future Directions

Direct Reward Finetuning, as implemented via IRM (Adeniji et al., 2022) and related frameworks, offers an efficient, theoretically grounded alternative to environment-intensive adaptation in reinforcement learning and skill transfer. By exploiting intrinsic reward models and invariant function matching, these methods deliver improved sample efficiency, strong zero-shot and few-shot adaptation, and scalable sequential skill composition.

Future work may investigate:

  • Generalizing the approach to richer sensory domains (e.g., multi-modal input, language-guided reward functions).
  • Extending the reward matching framework to settings with partial or nonstationary reward specification.
  • Integrating uncertainty quantification into reward model matching to further support robust long-horizon composition and out-of-distribution task adaptation.

These directions suggest a continued convergence of unsupervised pretraining, intrinsic reward modeling, and reward-based direct finetuning as a unified strategy for efficient and generalizable agent and policy adaptation.
