
Preference Oracle & Reward Function

Updated 19 September 2025
  • The preference oracle and reward function framework uses pairwise comparisons to infer and refine incomplete reward specifications in reinforcement learning.
  • It leverages targeted query strategies like Halve Largest Gap and Current Solution to minimize regret by focusing on the most impactful reward uncertainties.
  • Empirical results indicate that this approach achieves near-optimal policies with minimal queries, reducing the need for full reward function specification.

A preference oracle in the context of reinforcement learning and sequential decision making is an entity (often a human or interactive system) that provides comparative information or feedback—typically pairwise preferences—about which of two (or more) outcomes, actions, or trajectories is more desirable. The reward function is the mathematical object that quantifies the desirability of each state–action (or trajectory) pair and is central to optimal policy computation. In settings where the reward is cognitively complex or ill-defined, preference oracles serve as a tractable mechanism for eliciting reward information, supporting the synthesis of robust decision-making strategies in the face of reward function uncertainty or incompleteness. This article details the integration and algorithmic use of preference oracles for reward specification and policy computation, focusing particularly on the regret-based, bounded-query approach for Markov decision processes (MDPs) (Regan et al., 2012).

1. Preference Elicitation and Reward Function Uncertainty

Defining a reward function in practical applications—especially for large or complex MDPs—is often challenging since users may not be able to directly and precisely specify a numerical value for every (state, action) pair. Preference elicitation reframes the problem: instead of requiring a fully specified reward function $r(s, a)$, users provide partial information—often as interval bounds—such that the true reward lies within a feasible set $R$, typically a convex polytope defined by constraints of the form $\underline{r} \leq r(s, a) \leq \overline{r}$ for each $(s, a)$.

In this framework:

  • The reward function is only partially specified, with uncertainty region $R$.
  • The goal is to make policies robust to this uncertainty by actively eliciting additional reward information only where needed, rather than requiring all entries of the reward table to be specified a priori.
  • Preference oracles interactively tighten the bounds of the most crucial parameters through queries during policy design.
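
To make the interval representation and bound tightening concrete, here is a minimal sketch in Python with NumPy. The class and method names (`RewardBounds`, `gap`, `tighten`) are illustrative, not from the paper; the sketch only shows how a box-shaped feasible reward set shrinks as oracle answers arrive.

```python
import numpy as np

# Minimal sketch of an interval-based partial reward specification.
# Class and method names are illustrative, not taken from the paper.

class RewardBounds:
    """Feasible reward set R as a box: lower[s, a] <= r(s, a) <= upper[s, a]."""

    def __init__(self, lower: np.ndarray, upper: np.ndarray):
        assert lower.shape == upper.shape and np.all(lower <= upper)
        self.lower = lower.astype(float)
        self.upper = upper.astype(float)

    def gap(self) -> np.ndarray:
        """Width of the uncertainty interval for each (s, a) pair."""
        return self.upper - self.lower

    def tighten(self, s: int, a: int, threshold: float, answer_is_geq: bool) -> None:
        """Incorporate the oracle's answer to the query 'Is r(s, a) >= threshold?'."""
        if answer_is_geq:
            self.lower[s, a] = max(self.lower[s, a], threshold)
        else:
            self.upper[s, a] = min(self.upper[s, a], threshold)

# Example: 3 states, 2 actions, every reward initially known only to lie in [0, 1].
bounds = RewardBounds(np.zeros((3, 2)), np.ones((3, 2)))
bounds.tighten(s=0, a=1, threshold=0.5, answer_is_geq=True)  # oracle: r(0, 1) >= 0.5
print(bounds.gap())
```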

2. Minimax Regret: Robust Policy Computation under Partial Rewards

To optimize under reward uncertainty, the minimax regret criterion is used. The regret of policy $f$ (expressed in visitation frequencies) with respect to a reward $r \in R$ is defined as

$$R(f, r) = \max_{g \in F} \left( r \cdot g - r \cdot f \right),$$

where $F$ is the set of valid visitation-frequency vectors corresponding to feasible policies and $g$ is the adversarial (alternative) policy. The maximum regret of policy $f$ over rewards in $R$ is

$$MR(f, R) = \max_{r \in R} R(f, r),$$

and the minimax regret over all feasible policies is

$$MMR(R) = \min_{f \in F} MR(f, R).$$

This approach enables robust policy deliberation with incomplete reward knowledge: one seeks a policy whose worst-case loss, relative to the optimal policy for any possible realization of $r \in R$, is minimized.

The methodological and practical importance of this is twofold:

  • It dramatically reduces the need for full reward function specification.
  • It ensures a computed policy is robustly near-optimal for any reward in the feasible set, hedging against the true but unknown underlying reward structure.
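
To make these definitions concrete, the sketch below evaluates $MR(f, R)$ and $MMR(R)$ by brute force over a small finite set of candidate visitation-frequency vectors and a box-shaped reward set; because regret is linear in $r$, the inner maximization over the box has a closed form. In the actual method $F$ is the full polytope of occupancy measures and the optimization relies on linear and mixed-integer programming, so this finite-candidate version is an illustrative simplification, not the paper's algorithm.

```python
import numpy as np

# Illustrative, not the paper's algorithm: minimax regret over a finite set of
# candidate visitation-frequency vectors (rows of `candidates`) and a box-shaped
# reward set [lower, upper], exploiting linearity of regret in r.

def max_regret(f, candidates, lower, upper):
    """MR(f, R) = max over g and r of r.g - r.f, with r ranging over the box."""
    diffs = candidates - f                       # one row per adversarial policy g
    worst_r = np.where(diffs > 0, upper, lower)  # componentwise maximizer of r.(g - f)
    return np.max(np.sum(worst_r * diffs, axis=1))

def minimax_regret(candidates, lower, upper):
    """MMR over the candidate set: the f whose worst-case regret is smallest."""
    regrets = [max_regret(f, candidates, lower, upper) for f in candidates]
    best = int(np.argmin(regrets))
    return candidates[best], regrets[best]

# Toy example: three candidate frequency vectors over four (s, a) pairs.
candidates = np.array([[0.6, 0.4, 0.0, 0.0],
                       [0.0, 0.0, 0.7, 0.3],
                       [0.3, 0.2, 0.3, 0.2]])
lower, upper = np.zeros(4), np.array([1.0, 0.2, 0.8, 0.1])
f_star, mmr = minimax_regret(candidates, lower, upper)
print("minimax-regret policy:", f_star, "MMR:", mmr)
```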

3. Query Strategies: Efficient Reward Elicitation via Bound Queries

Since the regret for a policy is only sensitive to certain parameters, policies can be made robust by intelligently querying the preference oracle. Two heuristics exploit this:

  • Halve Largest Gap (HLG): Query the $(s, a)$ pair with the largest width in its uncertainty interval. The query has the form "Is $r(s, a) \geq b$?", where $b$ is the interval midpoint. Formally, the gap is

$$A(s, a) = \max \{ r(s, a) \} - \min \{ r(s, a) \},$$

where the maximum and minimum are taken over the feasible set $R$, and the next query is $(s, a) = \arg\max_{(s, a)} A(s, a)$.

  • Current Solution (CS): Weight each interval gap by its importance in either the current minimax regret solution $f$ or in the adversarial witness $g$. The next query is selected as

$$(s^*, a^*) = \arg\max_{(s, a)} \Big[ \max \big( f(s, a)\, A(s, a),\; g(s, a)\, A(s, a) \big) \Big].$$

Because each query is posed at the interval midpoint, either answer from the oracle halves the corresponding uncertainty interval, concentrating information gain where it will most rapidly reduce the policy's maximum regret.
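
A minimal sketch of the two heuristics is given below (illustrative NumPy code; the function names are not from the paper). Each selector returns the pair to query together with the midpoint threshold, and `apply_answer` shows how either oracle answer halves the queried interval.

```python
import numpy as np

# Illustrative sketch of the HLG and CS query-selection heuristics.
# A(s, a) is the current interval width; f and g hold the visitation frequencies
# of the current minimax-regret policy and its adversarial witness.

def halve_largest_gap(lower, upper):
    """HLG: pick the (s, a) pair with the widest uncertainty interval."""
    gap = upper - lower
    s, a = np.unravel_index(np.argmax(gap), gap.shape)
    midpoint = 0.5 * (lower[s, a] + upper[s, a])  # query: "Is r(s, a) >= midpoint?"
    return (s, a), midpoint

def current_solution(lower, upper, f, g):
    """CS: weight each gap by its occupancy in the current solution f or witness g."""
    gap = upper - lower
    score = np.maximum(f * gap, g * gap)
    s, a = np.unravel_index(np.argmax(score), score.shape)
    midpoint = 0.5 * (lower[s, a] + upper[s, a])
    return (s, a), midpoint

def apply_answer(lower, upper, s, a, midpoint, answer_is_geq):
    """Either answer to the midpoint query halves the interval for the queried pair."""
    if answer_is_geq:
        lower[s, a] = midpoint
    else:
        upper[s, a] = midpoint

# Example with 3 states and 2 actions, all intervals initially [0, 1].
lower, upper = np.zeros((3, 2)), np.ones((3, 2))
f = np.array([[0.5, 0.0], [0.0, 0.3], [0.2, 0.0]])
g = np.array([[0.0, 0.4], [0.4, 0.0], [0.0, 0.2]])
(s, a), b = current_solution(lower, upper, f, g)
apply_answer(lower, upper, s, a, b, answer_is_geq=False)  # oracle: r(s, a) < b
print("queried", (s, a), "new interval:", (lower[s, a], upper[s, a]))
```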

This algorithmic process is "anytime": early queries quickly decrease regret, and the elicitation can stop whenever the user or designer is satisfied with the regret bound.

4. Mathematical Formulation and Optimization Architecture

The overall workflow consists of two nested optimization problems:

  • Master problem: Optimize the policy frequencies ff to minimize maximum regret, subject to policy feasibility and accumulated constraints from observed or elicited reward bounds.

$$\min_{f \in F} \ \epsilon \quad \text{subject to} \quad r \cdot g - r \cdot f \leq \epsilon \quad \forall (g, r) \text{ in the constraint set}$$

  • Subproblem: Identify the most violated constraint (the policy/reward pair that demonstrates maximal regret for the current ff) via a mixed-integer program (which, optionally, can be relaxed for improved computational scaling).

Key constraints include the reward polytope, the MDP’s stochastic dynamics, visitation frequencies, and reward-bound compliance.
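
The alternation between master problem and subproblem can be sketched as a constraint-generation loop. The toy version below brute-forces both problems over a finite candidate policy set and a box-shaped reward set (the actual method solves a linear-programming master and a mixed-integer subproblem over the full occupancy-measure polytope), so it illustrates the control flow rather than the paper's implementation.

```python
import numpy as np

# Illustrative constraint-generation loop over a finite candidate policy set.
# The master and the subproblem are brute-forced here; in the real method the
# master is an LP over all occupancy measures and the subproblem is a MIP.

def worst_witness(f, candidates, lower, upper):
    """Subproblem: the most violated constraint, i.e. the (g, r) maximizing r.g - r.f."""
    diffs = np.asarray(candidates) - f
    worst_r = np.where(diffs > 0, upper, lower)
    regrets = np.sum(worst_r * diffs, axis=1)
    k = int(np.argmax(regrets))
    return candidates[k], worst_r[k], regrets[k]

def constraint_generation(candidates, lower, upper, tol=1e-9):
    constraints = []  # accumulated (g, r) pairs
    while True:
        # Master: among candidates, pick the f minimizing the worst accumulated violation.
        def master_value(f):
            return max((r @ g - r @ f for g, r in constraints), default=0.0)
        f = min(candidates, key=master_value)
        # Subproblem: find the constraint the current f violates most.
        g, r, regret = worst_witness(f, candidates, lower, upper)
        if regret <= master_value(f) + tol:
            return f, regret  # no newly violated constraint: stop
        constraints.append((g, r))

candidates = np.array([[0.6, 0.4, 0.0, 0.0],
                       [0.0, 0.0, 0.7, 0.3],
                       [0.3, 0.2, 0.3, 0.2]])
lower, upper = np.zeros(4), np.array([1.0, 0.2, 0.8, 0.1])
f, mmr = constraint_generation(candidates, lower, upper)
print("policy:", f, "minimax regret:", mmr)
```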

5. Empirical Results and Efficiency

Empirical studies—both on synthetic MDPs and in an autonomic computing use case—demonstrate:

  • The minimax regret method (especially with the CS query heuristic) closes the regret gap substantially faster—requiring fewer than two queries per reward parameter—even when the total number of parameters is large.
  • Only a small fraction of state–action pairs needs to be queried (e.g., less than 12% in an autonomic system setting) to reach near-optimality, indicating high sample efficiency.
  • Constraint generation shows strong early-stage improvement: even before the final regret is driven to zero, most of the "hard" improvement is achieved with a small batch of judicious queries.

The table below summarizes empirical behavior:

| Query Strategy | Queries per Parameter | Minimax Regret Reduction | State–Action Pairs Queried (%) |
|---|---|---|---|
| MMR-CS (proposed) | < 2 | Rapid reduction to zero | < 12 |
| Maximin-based | Higher | Slower | Larger fraction |

6. Practical Implications and Limitations

User burden: The bounded, preference-based queries minimize demands on domain experts by isolating only high-impact uncertainties. Instead of precisely specifying all rewards, the designer need only answer straightforward yes/no questions about the most influential intervals.

Robustness: The computed policy offers bounded loss against an omniscient policy, ensuring reliability even under partial reward information.

Scalability: While the constraint generation and MIP-based subproblems can become computationally intensive for very large MDPs, linear relaxations and early stopping enable practical deployment for mid-sized, structured domains.

Anytime property: The method provides rapidly improving suboptimality bounds, allowing early termination with actionable policies if further query or computational budget is exhausted.

Domain scope: Especially valuable in distributed/complex environments (e.g., autonomic resource management) where exact numerical specification is infeasible, and where reward uncertainties are often local or modular and can be refined incrementally.

7. Summary

Framing reward specification as a preference elicitation problem in MDPs offers a systematic, efficient, and robust means to policy design under reward function uncertainty. By leveraging minimax regret as an optimization criterion and using preference oracles to answer targeted bound queries, one achieves near-optimal policies with strictly limited, task-focused user input. The process is mathematically underpinned by robust optimization and convex programming, and empirical evidence attests to significant reductions in both required queries and regret compared to conventional approaches. This paradigm is broadly applicable in any domain where precise reward specification is burdensome and can serve as a blueprint for interactive, preference-driven reinforcement learning system design (Regan et al., 2012).
