
Error-Related Potentials as Scalar Rewards

Updated 25 December 2025
  • The paper establishes a conversion pipeline that maps EEG error-detection outputs into continuous evaluative feedback for RL agent optimization.
  • Methodologies involve robust signal processing, deep learning classifiers, and mathematical mappings to translate ErrP probabilities into scalar rewards.
  • Empirical findings highlight accelerated learning and improved performance across diverse domains, supporting scalable human-centered RL frameworks.

Error-related potentials (ErrPs) are event-related EEG signatures elicited when a human observer perceives another agent’s (human or artificial) error. Recent advances have established ErrPs as a source of implicit feedback in reinforcement learning (RL) and adaptive brain-computer interface (BCI) systems. Critically, ErrPs can be decoded into scalar reward signals—here, “ErrP scalar rewards”—providing a dense, implicit link between human evaluative judgment and artificial agent optimization. This article comprehensively surveys the core methodologies, conversion pipelines, mathematical mappings, and empirical outcomes underpinning the use of ErrPs as scalar rewards in interactive RL and BCI systems.

1. Decoding ErrPs: Signal Processing and Classification Architectures

Converting EEG-recorded ErrPs into a machine-usable reward signal requires robust detection and quantification pipelines. Most frameworks begin with band-pass filtering (e.g., 1–20 Hz), downsampling (e.g., 256 Hz or lower), segmentation into time epochs aligned to agent actions (e.g., 0–2 s post-action), and spatial re-referencing. Feature extraction spans from discrete wavelet transforms (DWTs) and spatial filters (xDAWN, Fisher projections) (Spinnato et al., 2015, Xu et al., 2020) to end-to-end deep learning via EEGNet architectures (Kim et al., 17 Jul 2025, Kim et al., 24 Nov 2025, Akinola et al., 2019).

Classifier architectures fall into two main families: classical pipelines that pair handcrafted features (DWT coefficients, spatially filtered epochs) with linear discriminants or other shallow classifiers, and end-to-end convolutional networks such as EEGNet trained directly on preprocessed epochs.

Training typically uses leave-one-subject-out (LOSO) cross-validation across subjects and epochs (200–2,000 trials per subject), with mean balanced accuracies of 70–90% in recent works (Kim et al., 17 Jul 2025, Kim et al., 24 Nov 2025). Even decoders only marginally better than chance (∼60%) produce feedback sufficient for policy shaping (Kim et al., 17 Jul 2025). Calibration procedures, including Platt scaling on held-out sets, ensure that posterior-probability outputs match empirical trialwise error rates (Xu et al., 2020, Spinnato et al., 2015).
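The preprocessing and calibrated-classification pipeline above can be sketched as follows; this is a minimal illustration on synthetic data, with channel counts, epoch windows, signal shape, and classifier choice chosen for the example rather than taken from any cited study.

```python
# Sketch of an ErrP decoding pipeline: band-pass filter, epoching,
# classification, and Platt-style probability calibration.
# All data here is synthetic; the "ErrP" is a crude Gaussian bump.
import numpy as np
from scipy.signal import butter, sosfiltfilt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.calibration import CalibratedClassifierCV

rng = np.random.default_rng(0)
FS = 256                      # sampling rate after downsampling (Hz)
N_CH, N_TRIALS = 8, 400

def bandpass(x, lo=1.0, hi=20.0, fs=FS, order=4):
    """Zero-phase 1-20 Hz band-pass, applied along the time axis."""
    sos = butter(order, [lo, hi], btype="band", fs=fs, output="sos")
    return sosfiltfilt(sos, x, axis=-1)

# Synthetic epochs: 0-2 s post-action windows, one label per trial
# (1 = perceived error). Error trials get an "ErrP-like" bump near 350 ms.
t = np.arange(0, 2.0, 1.0 / FS)
labels = rng.integers(0, 2, N_TRIALS)
epochs = rng.normal(size=(N_TRIALS, N_CH, t.size))
epochs[labels == 1] += 2.0 * np.exp(-((t - 0.35) ** 2) / 0.005)

epochs = bandpass(epochs)
X = epochs.reshape(N_TRIALS, -1)          # flatten channels x time

# LDA with sigmoid (Platt) calibration on cross-validation folds, so
# predict_proba approximates trialwise error rates.
clf = CalibratedClassifierCV(LinearDiscriminantAnalysis(),
                             method="sigmoid", cv=5)
clf.fit(X[:300], labels[:300])
p_errp = clf.predict_proba(X[300:])[:, 1]  # P(error | epoch), per trial
```

The calibrated probabilities `p_errp` are what the mappings in the next section consume; in the cited pipelines the classifier is trained offline and then frozen for online use.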

2. Mathematical Mapping: From Classifier Output to Scalar Reward

The conversion from ErrP-detection output to scalar reward follows explicit mappings, designed to provide the RL agent with a continuous evaluative signal reflecting human-perceived correctness. The most widespread mapping is affine or linear:

$r_t^{\text{ErrP}} = 1 - p_{\text{ErrP},t}$

where $p_{\text{ErrP},t}$ is the classifier-derived probability that an error was perceived by the human at timestep $t$ (Kim et al., 17 Jul 2025). High $p_{\text{ErrP},t}$ (confident error detection) suppresses reward; low $p_{\text{ErrP},t}$ provides maximal evaluative reward.

Other mappings include:

  • Posterior-based reward: $r(y) = P(C=1 \mid y)$, where $C=1$ denotes error; for symmetric ranges, $r(y) = 2P(C=1 \mid y) - 1 \in [-1, 1]$ (Spinnato et al., 2015).
  • Log-odds shaping: $r(y) = \tanh(\alpha\,\delta(y))$, where $\delta(y)$ is the log-likelihood ratio and $\alpha$ is a scaling parameter.
  • Thresholded or discrete mapping: $r^{\text{ErrP}}_t = -1$ if $p_t$ exceeds a threshold; otherwise $0$ (Xu et al., 2020).

The reward output can be further conditioned by calibration and normalization steps to align predicted “error likelihood” to the actual detection frequency (Spinnato et al., 2015, Xu et al., 2020).
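The mappings above are straightforward to implement; a hedged sketch follows, where `alpha` and the threshold are illustrative hyperparameters rather than values reported in the cited studies.

```python
# The four ErrP-to-reward mappings discussed above, as numpy functions.
import numpy as np

def affine_reward(p_errp):
    """Affine mapping r = 1 - P(error): confident error detection suppresses reward."""
    return 1.0 - p_errp

def symmetric_posterior_reward(p_error):
    """Posterior mapping on a symmetric range: r = 2 P(C=1|y) - 1 in [-1, 1]."""
    return 2.0 * p_error - 1.0

def log_odds_reward(p_error, alpha=1.0, eps=1e-6):
    """Log-odds shaping r = tanh(alpha * delta(y)), with delta the log-likelihood ratio."""
    p = np.clip(p_error, eps, 1.0 - eps)
    delta = np.log(p / (1.0 - p))
    return np.tanh(alpha * delta)

def thresholded_reward(p_error, thresh=0.5):
    """Discrete mapping: r = -1 if P(error) exceeds the threshold, else 0."""
    return np.where(p_error > thresh, -1.0, 0.0)
```

All four accept either a scalar probability or an array of per-trial probabilities, so they can be applied to a whole session of decoded epochs at once.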

3. Reward Fusion: Integration into RL Objective Functions

Scalar ErrP reward can be injected directly or in regularized mixtures with environmental rewards, depending on application demands and stability considerations:

$r_t = r^{\mathrm{env}}_t + \lambda\, r_t^{\mathrm{ErrP}}$

with $\lambda \in (0,1]$ controlling the weighting between implicit human feedback and external, task-based reward (Kim et al., 24 Nov 2025, Kim et al., 17 Jul 2025, Xu et al., 2020).
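In code, this additive fusion is a one-liner; the default weight here is only a placeholder, since the cited works tune $\lambda$ per task.

```python
# Additive per-timestep fusion of environmental and ErrP-derived reward,
# following r_t = r_env_t + lambda * r_ErrP_t with the affine mapping
# r_ErrP = 1 - p_errp. The lam default is illustrative.
def fused_reward(r_env, p_errp, lam=0.3):
    """Composite scalar reward combining task reward and implicit human feedback."""
    return r_env + lam * (1.0 - p_errp)
```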

In some frameworks, more elaborate shaping strategies are used to mitigate the effects of classifier noise or to embody human evaluative priors:

  • Auxiliary Q-functions $Q_h(s, a)$: Human-labeled demonstrations, replay buffers, and maximum-entropy inverse RL (IRL) fitting to form potential-based shaping rewards (Xu et al., 2020).
  • Contextual bandits: Reward is defined per trial as $r_t = 1 - \mathrm{ErrP}_t$, with bandit algorithms (LinUCB, NeuralUCB) updating on every trial (Fidêncio et al., 25 Feb 2025).
  • Actor-critic and policy-mixing: Sampling from policies shaped by human feedback for guided exploration, but estimating advantage and policy gradients exclusively on environmental rewards (Akinola et al., 2019).
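To make the contextual-bandit variant concrete, here is a minimal LinUCB sketch using the per-trial reward $r_t = 1 - \mathrm{ErrP}_t$; the feature construction, arm setup, and exploration parameter are illustrative assumptions, not the cited study's exact configuration.

```python
# Minimal LinUCB with ErrP-derived per-trial rewards r = 1 - ErrP.
# One ridge-regression model per arm; selection adds a UCB exploration bonus.
import numpy as np

class LinUCB:
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # per-arm Gram matrix
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # per-arm reward vector

    def select(self, x):
        """Pick the arm maximizing estimated reward plus confidence bonus."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, r):
        """Rank-one update of the chosen arm's statistics."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += r * x

# Toy run: arm 0 often elicits ErrPs, arm 1 rarely does, so rewards
# r = 1 - ErrP favor arm 1 and the bandit should lock onto it.
rng = np.random.default_rng(0)
bandit = LinUCB(n_arms=2, dim=3)
errp_prob = [0.9, 0.1]                        # hypothetical per-arm error rates
choices = []
for _ in range(300):
    x = np.append(rng.normal(size=2), 1.0)    # random context + bias term
    arm = bandit.select(x)
    r = 1.0 - float(rng.random() < errp_prob[arm])  # r_t = 1 - ErrP_t
    bandit.update(arm, x, r)
    choices.append(arm)
```

Because the update happens on every trial, this regime needs no environmental reward at all, which is what makes it attractive for adaptive BCI calibration.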

No smoothing, additional normalization, or time-averaging is typically applied; the fusion is performed additively and per time step (Kim et al., 17 Jul 2025, Kim et al., 24 Nov 2025).

4. RL Algorithms and Domain-Specific Pipelines

ErrP scalar reward integration requires RL algorithms robust to stochastic, noisy, and nonstationary reward signals. Across the cited works, these include value-based deep RL with shaped auxiliary Q-functions, actor-critic policy-gradient methods, and contextual bandit learners (LinUCB, NeuralUCB).

Task settings range from MuJoCo-based robotic pick-and-place and high-DOF arm manipulation (Kinova Gen2/Gen3; 7-DoF) (Kim et al., 17 Jul 2025, Kim et al., 24 Nov 2025), through grid-world navigation, to adaptive MI/BCI closed-loop control (Xu et al., 2020, Fidêncio et al., 25 Feb 2025). Standard experimental regimes involve offline training of the ErrP decoder, subsequent fixed decoding for scalar reward computation, and online RL with composite reward signals.

5. Empirical Performance, Sample Efficiency, and Cross-Domain Robustness

Across published studies, ErrP-derived scalar reward has achieved:

  • Dense baseline parity: RL agents using implicit ErrP rewards consistently match agents trained with high-information dense rewards, substantially outperforming those trained on sparse, environmental rewards alone (Kim et al., 17 Jul 2025, Kim et al., 24 Nov 2025, Akinola et al., 2019).
  • Accelerated learning: Sample efficiency increases by 1.8–2.2× in navigation/grid-game domains, and by ≈40% in high-DOF manipulation scenarios, when compared to sparse-only RL (Xu et al., 2020, Kim et al., 24 Nov 2025).
  • Robustness to decoder accuracy: Performance remains above the sparse baseline even when decoders achieve only 60–70% accuracy (Kim et al., 17 Jul 2025, Akinola et al., 2019). Fine-tuned weighting ($\lambda$, $w_{hf}$) can further stabilize and optimize learning (Kim et al., 24 Nov 2025).
  • Zero-shot transfer: ErrP decoders trained on one domain generalize to other MDPs and maintain >80% of AUC/feedback efficacy (Xu et al., 2020).
  • Statistical significance: In manipulation tasks, improvements in episodic return, success rate, and path efficiency were confirmed across multiple subjects and random seeds ($p<0.05$, $p<0.01$) (Kim et al., 24 Nov 2025).

A summary excerpt:

| Study | Scenario | Learning Acceleration | Baseline Parity? | Robustness to ErrP Noise |
|---|---|---|---|---|
| (Kim et al., 17 Jul 2025) | Robotic Pick&Place | >10% over sparse | Yes | Yes (70–90%) |
| (Kim et al., 24 Nov 2025) | 7-DoF Manipulation | ≈40% fewer samples | Yes | Yes (down to 60%) |
| (Xu et al., 2020) | 2D Grid Games | 1.8–2.2× faster | Yes | Yes (full-access fails, shaping stable) |
| (Fidêncio et al., 25 Feb 2025) | Adaptive BCI (bandit) | Significant (Wilcoxon) | N/A (MI-limited) | Adaptivity confirmed |
| (Akinola et al., 2019) | Mobile Navigation | Matches dense reward | Yes | Useful at 0.60–0.67 acc. |

6. Limitations and Practical Considerations

  • ErrP classifier accuracy: The weakest decoders or subjects without clear ErrP signatures may not sustain RL improvements. Domain-specific subject adaptation and feature engineering can partly mitigate this (Kim et al., 17 Jul 2025, Akinola et al., 2019).
  • Credit assignment: ErrP reward gives real-time, action-aligned evaluative signals but does not address credit assignment for delayed outcomes; fusion with sparse terminal rewards is essential (Kim et al., 17 Jul 2025).
  • Task complexity and cognitive load: In fast-paced MI/BCI or high-DOF tasks, cognitive and attentional demands can suppress ErrP yield or MI separability, limiting reward signal informativeness (Fidêncio et al., 25 Feb 2025).
  • Integration scope: Many frameworks require a distinct offline decoder training phase and may not immediately support end-to-end closed-loop deployment (Kim et al., 17 Jul 2025, Spinnato et al., 2015).
  • Reward aggregation: Overweighting the ErrP term ($\lambda,\, w_{hf} \geq 0.4$) can lead to cautious or conservative policies; moderate weighting optimizes learning speed and outcome (Kim et al., 24 Nov 2025, Kim et al., 17 Jul 2025).

7. Implications and Future Directions

ErrP signals offer a scalable, implicit, and dense feedback mechanism for RL in domains where reward design is difficult or true rewards are uninformative/sparse. Dense ErrP-based reward shaping efficiently bridges the gap between human intuitive evaluation and algorithmic learning, unlocking sample efficiency gains, cross-domain transfer, and robust convergence in high-dimensional control. Promising future directions include:

  • Online co-adaptation: Closing the loop with adaptive ErrP classifiers and nonstationarity tracking in real time (Fidêncio et al., 25 Feb 2025).
  • Fine-grained and multi-dimensional feedback: Moving beyond binary error perception to finer or multi-class evaluative signals (Spinnato et al., 2015).
  • Active label acquisition: Reducing calibration burden via active learning strategies and improved BCI hardware (Akinola et al., 2019).
  • Broader application domains: Extension to assistive robots, prosthetics, and explainable RL agents in human-centric environments.

ErrP scalar reward is thus emerging as a key component of “reinforcement learning from implicit human feedback” frameworks, supporting efficient, interpretable, and human-aligned policy learning across the interface of brains and machines (Kim et al., 17 Jul 2025, Kim et al., 24 Nov 2025, Xu et al., 2020, Akinola et al., 2019, Fidêncio et al., 25 Feb 2025, Spinnato et al., 2015).
