
Context-Aware Reward Models (CARM)

Updated 23 March 2026
  • Context-Aware Reward Models (CARM) are a modular approach that separates invariant human preferences from context-dependent saliency, enhancing data efficiency.
  • The methodology uses a two-stage learning protocol with targeted contextual queries and cross-entropy loss to optimize both saliency functions and preference weights.
  • Empirical tests in robotics reveal that CARM reduces query requirements by up to 10x while improving reward alignment and personalization in both simulated and user studies.

Context-Aware Reward Models (CARM) formalize the principle that agent behavior optimality depends not only on fixed human preferences but also on the task context, which modulates the saliency or relevance of different reward features. The fundamental innovation is to explicitly decompose reward functions into (1) context-invariant latent preference parameters and (2) context-dependent saliency functions over base features. This modular design enables more efficient generalization and sample-efficient reward learning by capturing the structure inherent in how human objectives adapt across varying task environments. The CARM paradigm, as instantiated in robotics applications, achieves order-of-magnitude reductions in data requirements and improved personalization without sacrificing transparency or interpretability (Forsey-Smerek et al., 17 Jun 2025).

1. Mathematical Formalization and Structural Decomposition

CARM specifies the full agent reward as a composition of interpretable base features, context-dependent saliency weights, and context-invariant high-level preferences:

  • State $s$: full agent state vector, $s \in \mathcal{S} \subseteq \mathbb{R}^d$ (e.g., joint angles, object positions).
  • Context $c$: subvector of $s$ containing the variables that modulate feature saliency (e.g., stove heat, liquid fullness).
  • Base Features $\phi(s)$: vector of interpretable features, normalized to $[0,1]^M$.
  • Saliency $w(c)$: context-dependent, positive weight vector, $w(c) \in \mathbb{R}^M$, typically parameterized by a small neural network mapping context to feature importances.
  • Calibrated Features $\psi(s, c)$: elementwise product,

$$\psi(s, c) = w(c) \odot \phi(s)\,.$$

  • Preference Parameters $\theta$: context-invariant scalar weights, $\theta \in \mathbb{R}^M$, encoding the user's trade-offs (e.g., safety vs. efficiency).
  • Final Reward:

$$R(s, c) = \theta \cdot \psi(s, c) = \theta \cdot \left[ w(c) \odot \phi(s) \right]\,.$$

This decomposition enables CARM to modularize adaptation: $w(c)$ encodes environmental contingencies, while $\theta$ remains constant, encoding human value priorities across all environments (Forsey-Smerek et al., 17 Jun 2025).
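
As a concrete illustration, the decomposition above can be sketched in a few lines of NumPy. The feature map, network sizes, and parameter values here are illustrative assumptions, not taken from the paper:

```python
import numpy as np

M = 3  # number of base features (illustrative)

def phi(s):
    """Interpretable base features, normalized to [0, 1]^M (toy choice)."""
    return np.clip(s[:M], 0.0, 1.0)

def saliency(c, W1, b1, W2, b2):
    """Small MLP mapping context c to positive saliencies w(c) in R^M."""
    h = np.tanh(c @ W1 + b1)
    return np.exp(h @ W2 + b2)  # exponential keeps every weight positive

def reward(s, c, theta, params):
    w = saliency(c, *params)   # context-dependent saliency w(c)
    psi = w * phi(s)           # calibrated features psi(s, c) = w(c) * phi(s)
    return float(theta @ psi)  # context-invariant preferences theta

rng = np.random.default_rng(0)
params = (rng.normal(size=(2, 8)), np.zeros(8),
          0.1 * rng.normal(size=(8, M)), np.zeros(M))
theta = np.array([1.0, -0.5, 0.2])   # e.g., safety vs. efficiency trade-offs
s = np.array([0.7, 0.2, 0.9, 0.1])   # full state; first M entries feed phi
c = np.array([0.3, 0.8])             # context subvector (e.g., stove heat)
print(reward(s, c, theta, params))
```

Because only the saliency network depends on $c$, a new environment changes $w(c)$ while $\theta$ can be reused unchanged.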

2. Learning Procedures: Saliency Isolation and Preference Identification

CARM employs a two-stage learning protocol to maximize sample efficiency:

Stage 1: Saliency Learning

  • For each feature $i$, collect paired comparison data $\mathcal{D}_{\phi_i}$, where human query responses indicate which state is preferred with respect to that feature alone ($y \in \{0, 0.5, 1\}$, with $0.5$ denoting equivalence).
  • Fit $w_i(c)$ using a Bradley–Terry model on single-feature comparisons. For each pair $(s_1, s_2)$, the context-calibrated feature $\psi_i(s, c) \equiv w_i(c)\,\phi_i(s)$ is compared across the two states.
  • Train with a cross-entropy loss, with equivalence responses ($y = 0.5$) downweighted or regularized, and a small $\ell_2$ penalty to stabilize the logits.
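
A minimal sketch of this Stage-1 objective, assuming a logistic Bradley–Terry link on the gap between calibrated feature values (the paper's exact loss weighting may differ):

```python
import numpy as np

def bt_loss(f1, f2, y, l2=1e-3, param_norm_sq=0.0):
    """Bradley-Terry cross-entropy for one single-feature comparison.

    f1, f2: calibrated values w_i(c) * phi_i(s) for states s1 and s2.
    y: 1 if s1 is preferred, 0 if s2 is preferred, 0.5 for equivalence.
    param_norm_sq: squared norm of the saliency-net parameters (l2 term).
    """
    p = 1.0 / (1.0 + np.exp(-(f1 - f2)))  # P(s1 preferred)
    eps = 1e-12                           # guards log(0)
    ce = -(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))
    return ce + l2 * param_norm_sq        # small l2 stabilizes the logits
```

The saliency-network parameters for feature $i$ are then trained by minimizing this loss over all responses in $\mathcal{D}_{\phi_i}$.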

Stage 2: Preference Weight Learning

  • After fixing all $w_i(c)$, collect ordinary reward-preference queries $\mathcal{D}_\theta$, asking for full trajectory comparisons.
  • Fit $\theta$ by maximizing the likelihood over the fixed, context-calibrated feature representation.
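
Stage 2 can be sketched as gradient ascent on the Bradley–Terry log-likelihood with the calibrated features held fixed; the trajectory featurization, learning rate, and step count below are illustrative assumptions:

```python
import numpy as np

def fit_theta(psi_pairs, labels, lr=0.1, steps=500):
    """Fit context-invariant preference weights theta with psi fixed.

    psi_pairs: list of (Psi1, Psi2), the summed calibrated features of the
               two trajectories in each query.
    labels:    y in {0, 0.5, 1}; y = 1 means trajectory 1 was preferred.
    """
    theta = np.zeros(len(psi_pairs[0][0]))
    for _ in range(steps):
        grad = np.zeros_like(theta)
        for (p1, p2), y in zip(psi_pairs, labels):
            d = p1 - p2
            p = 1.0 / (1.0 + np.exp(-(theta @ d)))
            grad += (y - p) * d        # gradient of the BT log-likelihood
        theta += lr * grad / len(labels)
    return theta

# Toy usage: labels generated from a known preference vector.
rng = np.random.default_rng(1)
true_theta = np.array([1.0, -1.0])
pairs = [(rng.random(2), rng.random(2)) for _ in range(30)]
ys = [1.0 if true_theta @ (p1 - p2) > 0 else 0.0 for p1, p2 in pairs]
theta_hat = fit_theta(pairs, ys)
```

On such separable toy data the fitted $\theta$ recovers the sign pattern of the generating preferences.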

Algorithmic Outline:

  1. Randomly initialize the calibrated-feature networks $\psi_i$.
  2. For each feature: collect $K_i$ contextual queries and optimize $\psi_i$.
  3. Freeze the $\psi_i$; initialize $\theta$.
  4. Collect $N$ reward-preference queries and optimize $\theta$.
  5. The agent uses $R(s, c) = \theta \cdot [w(c) \odot \phi(s)]$ for reward evaluation.

Sample Complexity: Each $\psi_i$ (per-feature saliency) trains on low-dimensional data, with $K_i \approx 50$ queries sufficient per feature; final $\theta$ learning typically requires only $N = 5$–$20$ queries. CARM reaches comparable reward accuracy with up to 10x fewer queries than joint, monolithic IRL approaches (Forsey-Smerek et al., 17 Jun 2025).

3. Empirical Performance and Validation

CARM has been empirically validated in simulated and real-user robotic settings:

Simulated User Experiments:

  • Environments: PyBullet tabletop manipulation (Weighted Block, Cup, Utensil).
  • Performance: CARM needs ∼10x fewer queries than JointPref-style multi-task IRL to reach matched reward accuracy.
  • In low-data regimes (5–10 preference queries), CARM improves test-reward win rate by up to +15% over baselines.
  • Nonlinear, context–feature dependencies are accurately recovered (visualized in Fig. 3 of (Forsey-Smerek et al., 17 Jun 2025)).

User Study (N=12):

  • Domain: Personalized teaching of context preferences in the Utensil task.
  • Metrics: Alignment with user expectations (Likert 1–7), model ranking among baseline and CARM variants.
  • Outcomes: Strong query-count effect on perceived alignment ($F(3,33) = 19.49$, $p < 0.001$); even 25–50 queries produced large subjective gains. Substantial inter-participant diversity in the inferred reward structures confirmed that efficient personalization of the kind CARM provides is necessary.

4. Design Guidelines and Practical Implementation

Feature and Context Selection:

  • Base features ($\phi_i$) must be interpretable and relevant; they can be hand-crafted or generated via automated (e.g., LLM-based) approaches.
  • Context ($c$) should capture the state variables hypothesized a priori to modulate feature saliency. If these are unknown, the saliency MLP can instead be defined over the full state $s$.

Query Engineering:

  • Use single-feature contextual queries with explicit equivalence (0.5) option to calibrate feature irrelevance under certain contexts.
  • Allow equivalence in reward-preference queries to avoid overfitting to noisy human labels.
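
To see why an explicit equivalence option calibrates irrelevance: under $y = 0.5$ the cross-entropy is minimized exactly when the calibrated-feature gap is zero, which the saliency network can realize by driving $w_i(c)$ toward zero in that context. A quick numeric check with toy values:

```python
import numpy as np

def ce(gap, y):
    """Bradley-Terry cross-entropy as a function of the calibrated-feature gap."""
    p = 1.0 / (1.0 + np.exp(-gap))
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

gaps = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
losses = [ce(g, 0.5) for g in gaps]  # symmetric in the gap, minimized at 0
```

So equivalence responses actively push the model toward $w_i(c) = 0$ rather than merely providing no signal.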

Architecture and Hyperparameters:

  • Shallow MLPs (2–3 layers, 32–64 units), learning rates $10^{-3}$–$10^{-2}$, and $\ell_2$ regularization on the loss.
  • Active selection of query pairs is possible by monitoring gradient informativeness.
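
One simple way to monitor gradient informativeness is to score each candidate pair by the expected gradient magnitude of the Bradley–Terry loss under the model's own predictive distribution; this particular criterion is an illustrative assumption, not the paper's prescribed rule:

```python
import numpy as np

def informativeness(theta, d):
    """Expected BT gradient magnitude for a candidate pair with feature gap d.

    With y ~ Bernoulli(p), E|y - p| = 2 p (1 - p), so the score favors pairs
    the model is uncertain about (p near 0.5) that also have large gaps.
    """
    p = 1.0 / (1.0 + np.exp(-(theta @ d)))
    return 2.0 * p * (1.0 - p) * np.linalg.norm(d)

theta = np.array([1.0, -0.5])
candidates = [np.array([0.1, 0.0]),   # small gap: little signal
              np.array([2.0, -2.0]),  # large gap, but outcome nearly certain
              np.array([0.5, 0.5])]   # moderate gap, uncertain outcome
best = max(range(len(candidates)),
           key=lambda i: informativeness(theta, candidates[i]))
```

The highest-scoring pair is then posed to the user, trading off gap size against predictive uncertainty.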

Sample Allocation:

  • Approximately 50–100 queries per feature for saliency learning; 5–20 for preference weights suffices for most practical applications (Forsey-Smerek et al., 17 Jun 2025).

5. Comparison with Related Approaches

Conventional multi-task or meta-IRL schemes confound context-dependent and context-independent variation, forcing the agent either to "relearn" rewards for each context or to implicitly discover decompositions through data-hungry, high-dimensional optimization (Forsey-Smerek et al., 17 Jun 2025). CARM, by explicit modularization, reduces the effective parameter space and the sample requirements by orders of magnitude.

Prior context-aware approaches in contextual MDPs (CMDPs) focus on modeling context-dependent transition and reward dynamics, but typically assume context is unobserved or fixed per episode (Hallak et al., 2015). CARM’s explicit distinction between saliency and invariant preference brings fine-grained control to intra-task adaptation unaddressed in these frameworks. Likewise, approaches such as BCR-DRL integrate context sensitivity at the reward-weighting stage but do not modularize preference and saliency (Hao et al., 2024).

6. Theoretical and Practical Implications

CARM represents a paradigm shift toward sample-efficient, interpretable, and modular reward learning in applications where agent objectives must adapt to rapidly changing contexts but preserve core human policy objectives. Empirical validation demonstrates robust gains in both learning speed and subjective alignment with truly personalized, context-aware objectives. The separation of context-dependent saliency from invariant preference is central to the success, supporting both theoretical insight into data efficiency and practical effectiveness in robotic and interactive learning settings (Forsey-Smerek et al., 17 Jun 2025).
