Psychological Regret Model (PRM)
- The Psychological Regret Model is a formal framework that quantifies counterfactual regret, extending expected utility theory to both human and AI decision systems.
- It provides a tractable mathematical definition and computable regret signals that enable effective reward shaping and faster learning in reinforcement learning environments.
- PRM bridges human risk attitudes with algorithmic policy optimization, with practical applications in behavioral economics, robotics, and high-stakes decision-making.
The Psychological Regret Model (PRM) is a formal framework for quantifying and utilizing regret, the psychological signal associated with counterfactual comparisons between an agent's actual action and hypothetical optimal actions, in both human and artificial decision-making systems. PRM defines regret precisely for both stochastic choice settings (lotteries, games) and sequential decision processes (reinforcement learning), capturing the adjustments to preference and learning induced by considering "what might have been." Major theoretical advances include tractable mathematical definitions, axiomatic foundations, computable regret signals for artificial agents, and bridging human risk attitudes with algorithmic policy optimization. PRM systematically extends expected utility theory to account for regret, rationalizes empirical anomalies, and functions as the core of recent methods for accelerating reinforcement learning on sparse signals.
1. Formal Definition and Mathematical Framework
At the heart of PRM is the notion of a regret signal quantifying the deviation between the value of the action taken and that of the best available alternative under some reference criterion. In reinforcement learning (RL), for state $s$ and taken action $a$, the regret signal is given by

$$\rho(s, a) = \max_{a'} Q^*(s, a') - Q^*(s, a),$$

where $Q^*$ is the optimal action-value function. In practice, $Q^*$ is replaced by a pre-trained teacher network $Q_T$, yielding

$$\rho(s, a) = \max_{a'} Q_T(s, a') - Q_T(s, a).$$

This quantifies, in value units, the distance from the chosen action to the teacher's best action (Xu, 3 Feb 2026).
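The teacher-based regret signal can be sketched concretely. In this minimal example the teacher is a toy tabular action-value array rather than a trained network, and the function name is illustrative:

```python
import numpy as np

def regret_signal(q_teacher: np.ndarray, state: int, action: int) -> float:
    """Regret of `action` in `state` under a teacher action-value table:
    rho(s, a) = max_a' Q_T(s, a') - Q_T(s, a)."""
    q_row = q_teacher[state]
    return float(np.max(q_row) - q_row[action])

# Toy teacher table: 2 states x 3 actions.
Q_T = np.array([[1.0, 0.5, 2.0],
                [0.0, 0.3, 0.1]])

print(regret_signal(Q_T, state=0, action=1))  # 2.0 - 0.5 = 1.5
print(regret_signal(Q_T, state=0, action=2))  # teacher's best action -> 0.0
```

The signal is zero exactly when the agent matches the teacher's greedy choice, and grows with the value gap otherwise.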
In risk-based, non-sequential decisions, PRM compares two prospects (lotteries) $L_1$ and $L_2$ via counterfactual regret:

$$L_1 \succsim L_2 \iff \mathbb{E}\big[\psi\big(u(X_1) - u(X_2)\big)\big] \ge 0,$$

with $\psi$ increasing and antisymmetric and $u$ the utility function (Bardakhchyan et al., 2023, Aleksanyan et al., 2023).
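A minimal sketch of this comparison for discrete lotteries. The utility $\sqrt{\cdot}$ and the increasing, antisymmetric kernel $\sinh$ are illustrative choices, not the specific functions of the cited papers:

```python
import numpy as np

def regret_preference(p, x, q, y, u=np.sqrt, psi=np.sinh):
    """Counterfactual-regret score E[psi(u(X1) - u(X2))] for two
    independent discrete lotteries L1 = (p over x) and L2 = (q over y).
    A positive score means L1 is preferred; u and psi are illustrative."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    joint = np.outer(p, q)                   # independence: P(X1=x_i, X2=y_j)
    diff = u(x)[:, None] - u(y)[None, :]     # utility gaps over all outcome pairs
    return float(np.sum(joint * psi(diff)))

# Sure $10 vs. a 50/50 gamble over $4 and $16:
score = regret_preference([1.0], [10.0], [0.5, 0.5], [4.0, 16.0])
```

Antisymmetry of $\psi$ guarantees that swapping the two lotteries flips the sign of the score.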
Key properties include:
- Transitivity: For independent lotteries, transitivity is guaranteed if $\psi$ is exponential, i.e., $\psi(t) \propto e^{\alpha t} - e^{-\alpha t}$, leading to the scalar representation $f(L) = \mathbb{E}[e^{\alpha u}] / \mathbb{E}[e^{-\alpha u}]$ with $L_1 \succsim L_2 \iff f(L_1) \ge f(L_2)$.
- Super-additivity: Super-additivity of $\psi$ on gains, $\psi(t_1 + t_2) \ge \psi(t_1) + \psi(t_2)$ for $t_1, t_2 \ge 0$, is necessary to resolve empirical paradoxes such as the Allais paradox.
- Extension to the unknown: With unknown states, PRM employs a fear-discount factor $\delta$ applied to utilities, $u \mapsto \delta u$, with $0 < \delta \le 1$ and $\delta = 1$ recovering the fully known case, and sets the utility of unobserved outcomes to zero, modifying regret evaluations under deep uncertainty (Liu, 2021).
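The transitivity property can be checked numerically: with an antisymmetric exponential kernel such as $\psi(t) = \sinh(\alpha t)$, the expectation over independent lotteries factors, so the pairwise criterion reduces to comparing one scalar index per lottery and no preference cycles can occur. A sketch with illustrative choices of $u$ and $\alpha$:

```python
import numpy as np

ALPHA = 0.7  # illustrative regret-intensity parameter

def u(x):
    """Illustrative concave utility."""
    return np.log1p(np.asarray(x, dtype=float))

def pairwise_sign(pL, xL, pM, xM):
    """Sign of E[sinh(ALPHA * (u(X_L) - u(X_M)))] for independent lotteries."""
    joint = np.outer(pL, pM)
    diff = u(xL)[:, None] - u(xM)[None, :]
    return np.sign(np.sum(joint * np.sinh(ALPHA * diff)))

def scalar_index(p, x):
    """f(L) = E[e^{ALPHA u}] / E[e^{-ALPHA u}], the numeric representation
    implied by the sinh kernel; ordering lotteries by f is transitive."""
    eu = ALPHA * u(x)
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * np.exp(eu)) / np.sum(p * np.exp(-eu)))

# Three toy lotteries as (probabilities, outcomes) pairs.
lotteries = [([1.0], [5.0]),
             ([0.5, 0.5], [1.0, 12.0]),
             ([0.2, 0.8], [0.5, 8.0])]
```

Every pairwise regret comparison between distinct lotteries agrees with the ordering induced by `scalar_index`, which is what rules out intransitive cycles for this kernel.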
2. Reward Shaping and Dense Feedback in Reinforcement Learning
In classical RL with sparse or delayed rewards, PRM provides a mechanism for reward shaping:

$$\tilde{r}_t = r_t - \lambda\, \rho(s_t, a_t),$$

where $\lambda > 0$ is a scaling hyperparameter (Xu, 3 Feb 2026).
This shaping may be derived from the potential-based method with potential $\Phi(s) = \max_a Q_T(s, a)$:

$$F(s_t, s_{t+1}) = \gamma\, \Phi(s_{t+1}) - \Phi(s_t),$$

which ensures optimality preservation due to the telescoping nature of the adjustment. PRM thus augments episodic, sparse-reward environments with dense, step-wise feedback, improving convergence speed and credit assignment in RL.
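A sketch of potential-based shaping under the assumption that the potential is the teacher's greedy value, $\Phi(s) = \max_a Q_T(s, a)$, including a numerical check of the telescoping property:

```python
import numpy as np

GAMMA = 0.99

def potential(q_teacher, state):
    """Assumed potential Phi(s) = max_a Q_T(s, a)."""
    return float(np.max(q_teacher[state]))

def shaping_term(q_teacher, s, s_next):
    """Potential-based adjustment F(s, s') = gamma * Phi(s') - Phi(s)."""
    return GAMMA * potential(q_teacher, s_next) - potential(q_teacher, s)

# Toy teacher table and a two-step trajectory of (s, s') pairs.
Q_T = np.array([[1.0, 0.5], [0.2, 0.8], [0.0, 0.0]])
steps = [(0, 1), (1, 2)]

# Telescoping: the discounted sum of shaping terms collapses to
# gamma^T * Phi(s_T) - Phi(s_0), independent of the actions taken,
# which is why adding F preserves the optimal policy.
shaping_sum = sum(GAMMA**t * shaping_term(Q_T, s, sn)
                  for t, (s, sn) in enumerate(steps))
boundary = GAMMA**len(steps) * potential(Q_T, 2) - potential(Q_T, 0)
```

Because the interior $\Phi$ terms cancel pairwise, the shaping contributes only a policy-independent boundary term to every trajectory's return.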
3. Algorithmic Implementation: StepScorer and Policy Optimization
PRM integrates seamlessly with contemporary RL policy optimization pipelines (e.g., PPO), as instantiated in the StepScorer algorithm. The pseudocode consists of:
- Sampling actions and observing transitions.
- Computing the regret $\rho(s_t, a_t)$ from the teacher $Q_T$.
- Modifying the observed reward: $\tilde{r}_t = r_t - \lambda\, \rho(s_t, a_t)$.
- Storing transitions and using the shaped rewards $\tilde{r}_t$ in Generalized Advantage Estimation (GAE).
- Applying standard PPO-type policy and value updates using the regret-shaped advantages (Xu, 3 Feb 2026).
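The reward-modification and advantage steps above can be sketched as follows; the function and hyperparameter names are illustrative, not those of the StepScorer implementation:

```python
import numpy as np

LAMBDA = 0.1                    # regret-shaping scale (hyperparameter)
GAMMA, GAE_LAMBDA = 0.99, 0.95  # discount and GAE smoothing factors

def regret(q_teacher, s, a):
    """rho(s, a) = max_a' Q_T(s, a') - Q_T(s, a)."""
    return float(np.max(q_teacher[s]) - q_teacher[s, a])

def shape_rewards(q_teacher, transitions):
    """Replace each observed r_t by r_t - LAMBDA * rho(s_t, a_t)."""
    return [r - LAMBDA * regret(q_teacher, s, a) for (s, a, r) in transitions]

def gae_advantages(rewards, values, last_value):
    """Generalized Advantage Estimation over the (shaped) rewards."""
    adv = np.zeros(len(rewards))
    running = 0.0
    values = list(values) + [last_value]
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + GAMMA * values[t + 1] - values[t]
        running = delta + GAMMA * GAE_LAMBDA * running
        adv[t] = running
    return adv
```

The shaped advantages then feed the standard clipped PPO policy and value losses unchanged; only the reward stream is modified.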
Empirically, PPO + PRM solved LunarLander-v3 approximately 36% faster than standard PPO (350 vs. >550 episodes to reach the threshold average reward), with the final mean episode reward roughly doubling (300±20 vs. 140±15) (Xu, 3 Feb 2026). These results demonstrate that regret-based shaping accelerates learning in environments with sparse feedback.
4. Axiomatic, Behavioral, and Game-Theoretic Foundations
PRM is derived from and generalizes classical regret theory (Loomes & Sugden, Bell). The axiomatic basis includes:
- Completeness, D-transitivity, strong monotonicity, continuity, and trade-off consistency for regret-augmented preferences (Liu, 2021).
- For canonical choice problems and games (the ultimatum game, the Allais paradox, Savage's omelet), transitive and super-additive regret forms recover human empirical patterns unexplained by expected utility theory (Bardakhchyan et al., 2023, Aleksanyan et al., 2023).
In strategic multi-agent contexts (e.g., Nash equilibrium of the "regret game"), regret externalities and informational dependencies create coordination games with multiple equilibria driven by anticipated regret and observability of counterfactuals (Cerrone et al., 2021).
PRM also provides rational, economically grounded explanations for phenomena including:
- Punishment in mini-ultimatum games: a responder rejects an offer iff her regret from rejecting is less than the proposer's anticipated regret from not offering the best option (Aleksanyan et al., 2023).
- Paradoxical reversals under unknown risk: the introduction of "unknown" outcomes can strengthen, dampen, or reverse preferences in consistent, predictable ways (Liu, 2021).
5. Quantitative Elicitation and Human-Computer Interfaces
Eliciting the psychological parameters underlying PRM for human subjects requires careful experimental protocols. Approaches include:
- Presenting subjects with calibrated sequences of choice problems with controlled probabilities and outcome differences.
- Modeling preference intensities as fuzzy membership values over linguistic preference labels.
- Using adaptive question sequences to resolve indifference points and back out probability-weighting and regret functions specific to individuals (Jiang et al., 2018).
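One way such adaptive questioning can be realized is a bisection on a gamble's win probability, with each query standing in for a single binary choice posed to the subject. This is a hypothetical protocol sketch, not the cited elicitation procedure:

```python
def find_indifference(prefers_gamble, lo=0.0, hi=1.0, tol=1e-3):
    """Bisect on the gamble's win probability until the subject is
    indifferent between the gamble and a fixed sure amount.
    `prefers_gamble(p)` stands in for one binary choice question."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if prefers_gamble(mid):
            hi = mid   # gamble preferred at mid: indifference lies below
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Simulated subject whose true indifference point is p = 0.62:
p_star = find_indifference(lambda p: p >= 0.62)
```

Repeating this over a grid of sure amounts yields the indifference curve from which individual probability-weighting and regret functions can be backed out.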
Validation studies found that model prediction accuracy (76.7%) was on par with subjects' own revisit consistency (71.7%), and accuracy on consistently answered items exceeded 90%, establishing that PRM is quantitatively predictive at the individual level (Jiang et al., 2018).
6. Applications in Artificial and Human Decision Systems
The PRM framework finds utility across multiple domains:
- Reinforcement Learning: Faster policy improvement in environments with sparse or delayed rewards, especially in robotics, finance, and education systems requiring rapid adaptation (Xu, 3 Feb 2026).
- Human-Robot Teams: Modeling regret-sensitive human delegation improves risk-sharing and workload balancing; PRM-based queue ordering in multi-robot systems delivers emotion-aware optimizations (Jiang et al., 2019).
- Behavioral Economics and Games: Resolves paradoxes (Allais), rationalizes non-classical behaviors (ultimatum rejections, coordination), and isolates features distinguishing regret-based from fairness-based and risk-only models (Aleksanyan et al., 2023, Bardakhchyan et al., 2023, Cerrone et al., 2021).
- Medical Decisions and Deep Uncertainty: Models fear of the unknown and its effect on critical, high-stakes decisions (e.g., cancer treatment risk evaluation) (Liu, 2021).
7. Theoretical and Empirical Impact
PRM extends expected utility theory by (a) explicitly encoding counterfactuals and regret, (b) supporting axiomatic transitivity and stochastic dominance, (c) accommodating complex dependencies and unknowns, and (d) providing a dense, computable feedback for both human and artificial agents. It delivers quantitative predictive power at the individual and population levels and recovers classical models as limiting cases.
By formalizing human-inspired regret in a computationally tractable way, PRM constitutes a robust framework at the interface of behavioral science and machine learning, substantially broadening the range of explainable and optimizable decision problems (Xu, 3 Feb 2026, Jiang et al., 2018, Jiang et al., 2019, Liu, 2021, Aleksanyan et al., 2023, Cerrone et al., 2021, Bardakhchyan et al., 2023).