
Demo-Guided Reward Calibration

Updated 3 February 2026
  • The paper demonstrates demonstration-guided reward calibration as a method to refine reward functions using expert trajectories and narrated episodes.
  • It employs techniques like contrastive ranking, self-supervised regression, and successor representations to boost sample efficiency and policy alignment.
  • The approach enhances task performance in robotics, continuous control, and language domains while reducing reliance on hand-specified rewards.

Demonstration-guided reward calibration refers to a class of methods that leverage expert or user demonstrations to systematically refine and calibrate reward functions for agents, with the aim of aligning behavioral policies to desired outcomes. Rather than relying solely on hand-specified rewards, these techniques use data from demonstrations—such as state–action trajectories or narrated visual episodes—to extract informative signals about task objectives, often resolving ambiguities and improving sample efficiency in reinforcement learning (RL), inverse reinforcement learning (IRL), and LLM alignment. This article surveys foundational concepts, mathematical frameworks, representative methodologies, and empirical results in demonstration-guided reward calibration across robotics, continuous control, and language domains.

1. Mathematical Foundations and Problem Formulation

Demonstration-guided reward calibration is rooted in the general reinforcement learning framework, where an agent interacts with an environment modeled as a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, \mathcal{T}, \gamma)$ and seeks to maximize cumulative rewards. Classic reward specification is replaced or augmented by a reward function $r_\theta$ inferred or calibrated from demonstration data $\mathcal{D}$, often in the form of expert trajectories $\tau = (s_0, a_0, \ldots, s_T, a_T)$ or paired language–visual inputs.

In the IRL paradigm, the reward function is typically optimized such that an (implicitly or explicitly) induced optimal policy $\pi^*_{r_\theta}$ matches the demonstration distribution:

$$\mathcal{L}_\text{IRL}(\theta) = -\sum_{i} \log p(\tau_i \mid r_\theta)$$

where $p(\tau \mid r_\theta)$ denotes the maximum-entropy trajectory probability under reward $r_\theta$. Extensions incorporate regularization for policy divergence (e.g., KL to a reference policy), constraints from heterogeneous demonstrators, and use of compositional language conditioning (Tung et al., 2018, Chen et al., 2020, Hwang et al., 18 Nov 2025, Zeng et al., 15 Mar 2025).
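As a concrete illustration, the maximum-entropy IRL loss can be evaluated exactly when the set of candidate trajectories is small enough to enumerate, since the trajectory distribution is then just a softmax over trajectory returns. The linear feature-count parametrization and all names below are illustrative, not drawn from the cited papers:

```python
import numpy as np

def maxent_irl_loss(theta, demo_feats, all_traj_feats):
    """Negative log-likelihood of demonstrations under a maximum-entropy
    trajectory distribution p(tau | r_theta) ∝ exp(theta · phi(tau)).

    demo_feats:     (N, d) feature counts of demonstrated trajectories
    all_traj_feats: (M, d) feature counts of all candidate trajectories
                    (enumerable here only because the toy problem is tiny)
    """
    returns = all_traj_feats @ theta          # return of each candidate trajectory
    log_Z = np.logaddexp.reduce(returns)      # log partition function over trajectories
    demo_returns = demo_feats @ theta
    return -np.sum(demo_returns - log_Z)      # -sum_i log p(tau_i | r_theta)

# toy setup: 4 candidate trajectories described by 2 reward features
rng = np.random.default_rng(0)
all_feats = rng.normal(size=(4, 2))
demos = all_feats[:2]                         # the expert demonstrated the first two
loss = maxent_irl_loss(np.array([1.0, -0.5]), demos, all_feats)
```

Because $\log Z$ upper-bounds every trajectory return, each summand is non-positive and the loss is non-negative; minimizing it shifts probability mass toward the demonstrated trajectories.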

In continuous and high-dimensional domains, additional structural constraints (e.g., modularity, object-factorized states, successor representations) are imposed to promote generalization and robustness (Tung et al., 2018, Azad et al., 4 Jan 2025, Hwang et al., 18 Nov 2025).

2. Strategies for Demonstration-Guided Reward Calibration

2.1 Hard Negative Mining and Contrastive Calibration

Visual and robotics-oriented pipelines, such as "Reward Learning from Narrated Demonstrations" (Tung et al., 2018), exploit the temporal dynamics of demonstration videos. Frames immediately before goal attainment are labeled as hard negatives, while goal-achieved frames serve as positives. Optimization is performed with a contrastive ranking loss to calibrate a reward detector that is sensitive to the demonstrated relation (e.g., spatial relationship between objects):

$$\mathcal{L}_\text{rank} = \sum_{k \in X^{+},\, m \in X^{-}} \max\{0,\; m + S_\theta(\cdot)_{\text{neg}} - S_\theta(\cdot)_{\text{pos}}\}$$

Phrase-specific thresholds are concurrently learned via cross-entropy, yielding a binary calibrated reward $R_\phi(l, I) \in \{0, 1\}$. This structure extends naturally to language-grounded visual rewards by factorizing state representations and mirroring the syntactic decomposition of instructions.
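The ranking loss above is a standard pairwise margin hinge over all (positive, hard-negative) frame pairs. A minimal numeric sketch, with scalar detector scores standing in for the paper's learned detector $S_\theta$:

```python
import numpy as np

def ranking_loss(scores_pos, scores_neg, margin=1.0):
    """Margin ranking loss over all (positive, hard-negative) pairs:
    every goal-achieved frame should outscore every pre-goal frame
    by at least `margin`. Scalar scores here are a stand-in for a
    learned visual detector."""
    # pairwise hinge: max(0, margin + S(neg) - S(pos))
    diffs = margin + scores_neg[None, :] - scores_pos[:, None]
    return np.maximum(0.0, diffs).sum()

pos = np.array([2.0, 1.5])   # detector scores on goal-achieved frames
neg = np.array([0.2, 1.2])   # scores on frames just before goal attainment
loss = ranking_loss(pos, neg)  # 0.9: two of the four pairs violate the margin
```

Frames immediately before success are the hardest negatives precisely because their scores sit closest to the positives, so they dominate the hinge terms that remain active.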

2.2 Self-Supervised and Regret-Minimizing Calibration

In domains with suboptimal or noisy demonstrations, approaches like Self-Supervised Reward Regression (SSRR) (Chen et al., 2020) synthesize trajectories parameterized by controllable noise levels, fit a sigmoidal noise–performance curve, and optimize a reward regressor to match this idealized performance profile. This method yields reward correlation to ground truth as high as $\rho \sim 0.95$, outperforming prior methods dependent on strict pairwise preference assumptions. The synthesized reward is directly employed in policy optimization, e.g., via soft actor-critic.
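The core curve-fitting step can be sketched as follows. A decreasing sigmoid models performance as a function of injected noise, and a simple grid search recovers its parameters from noisy measurements; the grid-search fit and all names are illustrative simplifications of SSRR's actual regression pipeline:

```python
import numpy as np

def sigmoid(x, k, x0):
    # performance falls off sigmoidally as injected noise grows
    return 1.0 / (1.0 + np.exp(k * (x - x0)))

def fit_noise_performance(noise, perf, ks, x0s):
    """Grid-search fit of an SSRR-style sigmoidal noise-performance
    curve (illustrative; SSRR fits this curve, then regresses a reward
    network onto the idealized profile)."""
    best, best_err = None, np.inf
    for k in ks:
        for x0 in x0s:
            err = np.mean((sigmoid(noise, k, x0) - perf) ** 2)
            if err < best_err:
                best, best_err = (k, x0), err
    return best, best_err

# synthetic measurements: true curve (k=8, x0=0.5) plus small noise
noise = np.linspace(0, 1, 20)
perf = sigmoid(noise, 8.0, 0.5) + 0.01 * np.random.default_rng(1).normal(size=20)
(k_fit, x0_fit), err = fit_noise_performance(
    noise, perf, np.linspace(1, 12, 23), np.linspace(0, 1, 21))
```

In SSRR proper, the fitted curve supplies per-noise-level performance targets that the reward regressor is trained to reproduce, which is what makes the method tolerant of suboptimal demonstrations.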

2.3 Successor Representation-Based Calibration

Methods such as SR-Reward (Azad et al., 4 Jan 2025) build reward functions from the successor representation under the expert policy. The calibrated reward is given by the norm of the SR vector:

$$R_\theta(s, a) = \| M_\theta(s, a) \|_2$$

Negative sampling penalizes out-of-distribution $(s, a)$ pairs, ensuring that rewards are high only on states traversed by demonstrators, which naturally calibrates against distributional drift and overestimation.
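A minimal sketch of the resulting reward shape, using a toy tabular successor representation in place of the learned network $M_\theta$ (the table, dimensions, and names are illustrative assumptions):

```python
import numpy as np

def sr_reward(M, s, a):
    """SR-Reward sketch: the reward is the L2 norm of the
    successor-representation vector M(s, a) learned under the expert
    policy. Here M is a toy lookup table of shape
    (n_states, n_actions, n_features); the paper uses a network."""
    return np.linalg.norm(M[s, a])

rng = np.random.default_rng(0)
M = rng.normal(size=(5, 2, 8))   # toy SR table for 5 states, 2 actions
M[3, 1] = 0.0                    # (s=3, a=1) never visited by the expert;
                                 # negative sampling drives its SR vector
                                 # (and hence its reward) toward zero
r_in  = sr_reward(M, 0, 0)       # in-distribution pair: positive reward
r_ood = sr_reward(M, 3, 1)       # out-of-distribution pair: zero reward
```

The norm is large only where the expert's occupancy places mass, which is exactly the conservatism property the negative-sampling term enforces.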

2.4 Potential-Based Shaping from Demonstration Tubes

Dense dynamics-aware reward synthesis (Koprulu et al., 2024) leverages both prior (task-agnostic) experience and a small number of expert demonstrations to construct dense, dynamics-aware potentials. At every encountered state, the agent's reward is shaped via:

$$\bar r(s_t, a_t) = r(s_t, a_t) + \gamma \Phi(s_{t+1}) - \Phi(s_t)$$

where $\Phi(s)$ is a max-over-demos potential function anchoring to the closest reachable demonstration. This approach provably preserves policy invariance and empirically accelerates learning, especially under sparse extrinsic reward conditions.
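A hedged sketch of the shaping rule, with the potential taken as the negative distance to the nearest demonstrated state (the paper's potential additionally accounts for reachability under the dynamics, which is omitted here):

```python
import numpy as np

def shaped_reward(r, s, s_next, demo_states, gamma=0.99):
    """Potential-based shaping with a max-over-demonstrations potential:
    Phi(s) = max_d -||s - s_d||, pulling the agent toward the nearest
    demonstrated 'tube'. Simplified: the real potential is dynamics-aware."""
    phi = lambda x: max(-np.linalg.norm(x - d) for d in demo_states)
    return r + gamma * phi(s_next) - phi(s)

demo_states = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
# a step that moves toward a demonstrated state earns a shaping bonus
# even when the extrinsic reward r is zero (sparse-reward regime):
r_bar = shaped_reward(0.0, np.array([0.5, 0.5]), np.array([0.9, 0.9]), demo_states)
```

Because the shaping term is a telescoping potential difference, the optimal policy set is unchanged (the standard PBRS guarantee), while the dense signal accelerates learning when $r$ is sparse.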

3. Demonstration-Guided Reward Calibration in Language and LLMs

LLM alignment has adapted demonstration-guided reward calibration as an alternative to preference-based RLHF (Reinforcement Learning from Human Feedback).

  • Reward Calibration from Demonstration (RCfD) (Rita et al., 2024) defines a calibrated reward as the negative squared distance between reward model scores for model outputs and demonstration continuations:

$$R_\text{CfD}(x, y) = - \| R_\text{RM}(x, y) - R_\text{RM}(x, y_d) \|_2^2$$

This shifts the optimization focus from maximizing absolute reward to matching the reward distribution of human demonstrations, mitigating reward over-optimization and obviating complex KL regularization tuning.
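The key property is symmetry around the demonstration's score: overshooting the reward model is penalized exactly like undershooting it, which is what suppresses reward over-optimization. A one-function sketch with scalar reward-model scores:

```python
def rcfd_reward(rm_score_model, rm_score_demo):
    """RCfD-style calibrated reward: negative squared distance between
    the reward model's score of the model output and of the human
    demonstration. Maximized by matching the demo's reward, not by
    maximizing absolute reward. (Scalar sketch of the scheme.)"""
    return -(rm_score_model - rm_score_demo) ** 2

# overshooting the demonstration's score is penalized like undershooting:
r_over  = rcfd_reward(5.0, 3.0)   # -4.0
r_under = rcfd_reward(1.0, 3.0)   # -4.0
r_match = rcfd_reward(3.0, 3.0)   #  0.0
```

Since no output can exceed a calibrated reward of zero, the policy has no incentive to exploit reward-model pathologies far from the demonstration distribution.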

  • Self-Rewarding PPO (Zhang et al., 24 Oct 2025) derives a demonstration-grounded reward from the log-likelihood ratio between the supervised fine-tuned policy and the base model:

$$R(x, y) = \log \pi_\text{SFT}(y \mid x) - \log \pi_\text{base}(y \mid x)$$

This potential-based approach enables coherent on-policy updates, preserves the optimal policy set, and achieves superior alignment in low-data and out-of-domain scenarios compared to off-policy methods.

  • IRL Alignment from Demonstrations (Zeng et al., 15 Mar 2025) frames the LLM reward model as a bi-level optimization problem solved purely from demonstration data, using synthetic preference generation (Bradley–Terry loss) to calibrate rewards. Empirical results show comparable or superior MT-Bench and leaderboard scores to RLHF-based pipelines, despite not requiring explicit preference data.
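The synthetic-preference calibration step reduces to the standard Bradley–Terry loss on each generated pair. A minimal numeric sketch (the bi-level structure and all names here are illustrative simplifications):

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """Bradley-Terry preference loss on one synthetic pair, used to
    calibrate the reward model purely from demonstration data:
    -log sigma(r_chosen - r_rejected). (Sketch; the paper wraps this
    inside a bi-level optimization.)"""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# scoring the demonstration continuation above a sampled model
# continuation yields low loss; the reverse ordering is penalized:
loss_respected = bradley_terry_loss(2.0, 0.0)   # ~0.127
loss_violated  = bradley_terry_loss(0.0, 2.0)   # ~2.127
```

Minimizing this loss over many synthetic pairs pushes the reward model to rank demonstration-like continuations highest, substituting for explicit human preference labels.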

4. Algorithmic Structure and Pseudocode

Common demonstration-guided reward calibration pipelines follow a staged approach:

  1. Data extraction: Parse demonstration data into structured format (e.g., positive/hard negative frames, state–action trajectories, narrated clips).
  2. Reward function parametrization: Select network architecture for rθr_\theta, potentially modular or factored, and initialize with prior knowledge or pretraining.
  3. Label generation and loss computation: Assign positive/negative labels, generate synthetic preferences or calibration targets, sample OOD negative examples for conservative calibration.
  4. Optimization: Use contrastive, regression, or adversarial losses (possibly with regularization) to fit rθr_\theta.
  5. Policy training: Integrate the calibrated reward into standard RL solvers (e.g., Q-learning, PPO, SAC, DPO), often with reward shaping or joint state–language input.
  6. Evaluation: Assess reward–ground truth correlation, sample efficiency, generalization to novel objects/configurations, alignment with human judgment, or end-task success.
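The staged pipeline above can be sketched end-to-end on a toy problem. Here a linear reward $r_\theta(x) = \theta \cdot x$ is fit with a logistic contrastive loss separating demonstrated states from sampled out-of-distribution negatives; the loss choice and every name are illustrative stand-ins for the method-specific losses of Sections 2–3:

```python
import numpy as np

def calibrate_reward_from_demos(demos, sample_ood, n_steps=200, lr=0.1):
    """Skeleton of the staged pipeline on a toy linear reward
    r_theta(x) = theta · x (illustrative, not any one paper's method)."""
    d = demos.shape[1]
    theta = np.zeros(d)                          # step 2: parametrize r_theta
    for _ in range(n_steps):
        neg = sample_ood()                       # step 3: sample OOD negatives
        X = np.vstack([demos, neg])
        y = np.concatenate([np.ones(len(demos)), np.zeros(len(neg))])
        logits = np.clip(X @ theta, -30.0, 30.0)
        p = 1.0 / (1.0 + np.exp(-logits))        # sigmoid of the reward
        theta += lr * X.T @ (y - p) / len(y)     # step 4: logistic-loss ascent
    return theta                                 # step 5 would hand r_theta to an RL solver

rng = np.random.default_rng(0)
demos = rng.normal(loc=2.0, size=(50, 2))        # step 1: parsed demo states
theta = calibrate_reward_from_demos(
    demos, lambda: rng.normal(loc=-2.0, size=(50, 2)))
```

After calibration, demonstrated states receive higher reward than the sampled negatives, which is the minimal correctness check step 6 would start from.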

5. Empirical Evaluation and Impact

Demonstration-guided calibration methods have been validated across diverse domains and tasks.

  • Robotics and control: Agents trained with calibrated visual detectors and object-factorized policies generalize to novel objects and achieve robust performance with few demonstrations (Tung et al., 2018). SR-Reward matches offline RL baselines obtained with true reward access in D4RL and Adroit manipulation tasks (Azad et al., 4 Jan 2025); dynamics-aware shaping yields $4$–$10\times$ gains in sample efficiency and success rates of $\sim 99\%$ with $1$–$2$ demonstrations (Koprulu et al., 2024).
  • LLMs: RCfD reduces reward over-optimization (KL divergence to demo reward distributions $\sim 0.04$ vs $1.07$ for unregularized PPO), maintains naturalness without extensive hyperparameter tuning, and directly targets multi-objective tradeoffs defined by demonstrations (Rita et al., 2024). Self-Rewarding PPO achieves average win-rate improvements of $2$–$4$ points over SFT, especially in regimes with limited demonstrations (Zhang et al., 24 Oct 2025). IRL-based alignment matches or exceeds SFT and preference-based RLHF on benchmarks and crowdsourced model preference datasets (Zeng et al., 15 Mar 2025).
  • Theoretical robustness: Several frameworks guarantee policy invariance (potential-based shaping), exploit compositional grounding for language generalization, and are robust to noisy or suboptimal demonstrations (via self-supervision and negative sampling). Fitting the human rationality parameter in probabilistic models empirically reduces posterior regret and improves reward accuracy in user studies (Ghosal et al., 2022).
  • Data efficiency: Explicit mechanisms such as mask inference in Masked IRL permit up to $4.7\times$ reductions in demonstration count for a fixed win rate, and the use of binary invariance regularization yields lower regret and reward variance under distributional shift (Hwang et al., 18 Nov 2025).

6. Challenges, Limitations, and Open Directions

Key limitations and areas for further investigation include:

  • Assumptions on demonstration quality: Methods presupposing accurate temporal alignment, object-factorization, or coverage may underperform when demonstrations are extremely noisy or misaligned with actual task goals. SSRR and related approaches are sensitive to the adequacy of the initial IRL reward estimator and noise–performance curve modeling (Chen et al., 2020).
  • Negative sampling calibration: Domain-specific tuning of perturbation amplitude and decay for negative samples is often required. Over-conservatism can limit exploration or fail to extrapolate beyond demonstration coverage (Azad et al., 4 Jan 2025).
  • Sparse or high-dimensional observations: Extending calibration methods from state features to raw images or high-dimensional language remains an area of active research, with representation learning pipelines and Masked IRL–style invariance regularization representing plausible approaches (Hwang et al., 18 Nov 2025).
  • Handling ambiguous or heterogeneous demonstrations: Disambiguation via language and state-relevance masks or modeling per-strategy reward corrections (as in multistyle reward distillation) directly addresses reward ambiguity and demonstrator heterogeneity, but computational complexity grows with the number of strategies or ambiguity resolution steps (Chen et al., 2020, Hwang et al., 18 Nov 2025).
  • Computational cost and active query selection: Query generation and joint optimization are often non-convex, data- and compute-intensive. Scalability to real-time or human-in-the-loop settings and further automating calibration procedures (e.g., rationality fitting, mask inference) remain open problems.

7. Relationship to Auxiliary Methodologies

Demonstration-guided reward calibration interfaces with a range of auxiliary methodologies:

  • Preference-based reward learning with demonstration-grounded priors and active query design increases sample efficiency by using demonstrations both for initialization and for constraining query selection (Palan et al., 2019).
  • Potential-based reward shaping (PBRS) offers strong theoretical guarantees for reward function augmentation, preserving the set of optimal policies and providing a flexible mechanism for integrating demonstration-informed potentials (Koprulu et al., 2024, Zhang et al., 24 Oct 2025).
  • Curiosity-driven reward augmentation and hybrid intrinsic–extrinsic reward blending have demonstrated improved exploration with minimal demonstration data (Chen et al., 2020).
  • Language-conditioned and modular reward detectors facilitate compositional policy generalization, enabling policies to handle previously unseen objects, relations, or instructions (Tung et al., 2018, Hwang et al., 18 Nov 2025).

Demonstration-guided reward calibration thus forms a foundational pillar in scalable, aligned, and sample-efficient RL and LLM alignment, providing both theoretical robustness and strong empirical utility across modalities and application domains.
