
Reinforcement Learning from Human Brain (RLHB)

Updated 24 January 2026
  • Reinforcement Learning from Human Brain (RLHB) is a framework that integrates neural signals from fMRI and EEG into the reward mechanism for training AI models.
  • It leverages time-resolved neural feedback across perception, valuation, execution, and integration levels to capture latent cognitive variables.
  • The approach enhances model alignment and robustness by directly mapping biological signals to reward functions while addressing ethical and scalability challenges.

Reinforcement Learning from Human Brain (RLHB) denotes a strategy and associated framework for integrating direct neural feedback, as measured by neuroimaging modalities such as fMRI and EEG, into the reward pipeline for training modern foundation models. Unlike traditional Reinforcement Learning from Human Feedback (RLHF), which utilizes post hoc behavioral ratings or explicit preference labels derived from observable actions, RLHB leverages time-resolved neural signals acquired as human evaluators process or appraise model-generated outputs. The paradigm is grounded in the hypothesis that brain-derived data captures latent cognitive variables—such as value, intention, or confidence—not explicitly accessible via conventional action-based feedback, thereby providing a direct window into otherwise inaccessible aspects of human evaluation and cognition (Donoso, 17 Jan 2026).

1. Theoretical Motivation and Cognitive Framework

The principal motivation for RLHB arises from the generative brain principle articulated as A(t) \subset B^{*}(t) \subset B(t), where A(t) denotes observable human actions (e.g., text, spoken responses), B(t) the total neural generative process underlying behavior, and B^{*}(t) the neuroimaging-accessible, time-resolved neural activity. Current foundation models exclusively exploit A(t), which only partially projects the richness of internal cognitive evaluation. RLHB proposes leveraging B^{*}(t) to bridge the gap between observable outcomes and the latent variables underpinning valuation, perception, executive function, and conceptual integration.

A four-level taxonomy links model limitations, neuroanatomical substrates, and relevant neuroimaging signals:

  • Perception Level: Models' out-of-distribution (OOD) fragility is mapped to primary sensory cortices. Neural signals include fMRI BOLD in visual cortex and event-related potentials (e.g., N170 for faces), usable as pattern-decoding regularizers to align intermediate representations.
  • Valuation Level: Alignment gaps in RLHF relate to valuation circuits (e.g., ventral striatum, vmPFC, OFC), with relevant signals such as fMRI reward-prediction error or EEG feedback-related negativity (FRN). RLHB substitutes or augments r_{human} in RLHF with real-time neural feedback r_{neural}.
  • Execution Level: Deficits in working memory and planning are associated with frontoparietal networks (DLPFC, ACC, FPC), where fMRI or EEG captures rule-tracking and task-switching signals.
  • Integration Level: Hallucinations and shallow grounding are linked to distributed language and semantic areas (IFG, TPJ, mPFC), with high-gamma ECoG or fMRI semantic signatures informing deep conceptual embedding alignment.
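
For concreteness, this taxonomy can be held as a small lookup table. The sketch below merely encodes the four levels listed above in Python; the field names are illustrative, not an interface from the cited paper.

```python
# The four-level RLHB taxonomy as a lookup table. Entries restate
# the list above; field names are illustrative only.
RLHB_TAXONOMY = {
    "perception": {
        "model_limitation": "out-of-distribution (OOD) fragility",
        "substrates": ["primary sensory cortices"],
        "signals": ["visual-cortex fMRI BOLD", "N170 ERP"],
        "use": "pattern-decoding regularizers for intermediate representations",
    },
    "valuation": {
        "model_limitation": "RLHF alignment gaps",
        "substrates": ["ventral striatum", "vmPFC", "OFC"],
        "signals": ["fMRI reward-prediction error", "EEG FRN"],
        "use": "substitute or augment r_human with r_neural",
    },
    "execution": {
        "model_limitation": "working-memory and planning deficits",
        "substrates": ["DLPFC", "ACC", "FPC"],
        "signals": ["rule-tracking / task-switching fMRI or EEG"],
        "use": "supervise multi-step control",
    },
    "integration": {
        "model_limitation": "hallucinations, shallow grounding",
        "substrates": ["IFG", "TPJ", "mPFC"],
        "signals": ["high-gamma ECoG", "fMRI semantic signatures"],
        "use": "align deep conceptual embeddings",
    },
}
```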

2. RLHB: Core Algorithmic Construct

The RLHB method generalizes standard reinforcement learning (RL) by incorporating real-time neural signals into the reward computation:

r_t = \alpha \cdot r_{task}(y_t) + \beta \cdot f(n_t)

Here, r_{task} is the explicit task-based reward (e.g., correctness of a response), n_t is the neural measurement (e.g., vmPFC BOLD or EEG FRN) associated with the model's output y_t, and f(n) is a neural decoding function transforming raw neural data into a normalized reward scalar. Model parameters \theta (or policy \pi_\theta) are updated via policy gradient methods such as Proximal Policy Optimization (PPO) to maximize the expected total reward:

\theta \leftarrow \theta + \eta \nabla_\theta \mathbb{E}\left[\sum_t r_t\right]
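
As a concrete illustration, here is a minimal sketch of the reward computation in Python. The linear-plus-tanh form of the decoder f is an assumption (the framework leaves f unspecified), as are the default weights alpha and beta.

```python
import numpy as np

def decode_neural_reward(n_t: np.ndarray, w: np.ndarray, b: float = 0.0) -> float:
    """Illustrative decoder f(n): maps a neural feature vector (e.g.,
    vmPFC BOLD betas or an EEG FRN amplitude window) to a scalar
    squashed into [-1, 1]. The linear-plus-tanh form is an assumption,
    not specified by the framework."""
    return float(np.tanh(w @ n_t + b))

def rlhb_reward(r_task: float, n_t: np.ndarray, w: np.ndarray,
                alpha: float = 1.0, beta: float = 0.5) -> float:
    """Composite reward r_t = alpha * r_task(y_t) + beta * f(n_t)."""
    return alpha * r_task + beta * decode_neural_reward(n_t, w)
```

The resulting scalar r_t then feeds a standard policy-gradient update such as PPO, exactly as in the update rule above.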

The practical RLHB loop comprises:

  1. The model generates an output y_t for input x_t.
  2. A human subject evaluates y_t under controlled neuroimaging; n_t is recorded.
  3. A scalar reward r_t is computed, integrating model performance and neural feedback.
  4. Model parameters are updated to maximize \sum_t r_t (see the sketch after this list).
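
A minimal sketch of this loop follows; `model`, `neuroimaging`, `task_reward`, and `ppo_update` are hypothetical stand-ins for the policy, the recording/decoding pipeline, the task scorer, and a PPO step, not an API from the cited paper.

```python
def rlhb_training_loop(model, inputs, neuroimaging, task_reward,
                       ppo_update, alpha=1.0, beta=0.5):
    """One pass of the four-step RLHB loop; all collaborator objects
    are hypothetical stand-ins."""
    trajectories = []
    for x_t in inputs:
        y_t = model.generate(x_t)                    # 1. generate output
        n_t = neuroimaging.record(stimulus=y_t)      # 2. record neural data
        r_t = (alpha * task_reward(x_t, y_t)
               + beta * neuroimaging.decode(n_t))    # 3. composite reward
        trajectories.append((x_t, y_t, r_t))
    ppo_update(model, trajectories)                  # 4. maximize sum_t r_t
    return trajectories
```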

Sample-efficiency can be enhanced by:

  • Active learning: prioritizing neuroimaging for high-uncertainty, high-variance outputs.
  • Offline training of the neural decoder f(\cdot), then extrapolating synthetic neural labels to scale up feedback.
  • Allocating the neuroimaging budget with multi-armed bandit methods to select maximally informative outputs (Donoso, 17 Jan 2026); an uncertainty-based selection sketch follows.
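
One plausible form of the selection step, assuming a reward model that exposes a hypothetical `sample_scores` method (e.g., MC-dropout or ensemble draws):

```python
import numpy as np

def select_for_scanning(candidates, reward_model, budget, n_draws=16):
    """Spend the limited neuroimaging budget on the outputs whose
    predicted reward is most uncertain (highest variance across
    stochastic draws). `reward_model.sample_scores` is a hypothetical
    interface returning n_draws reward estimates per output."""
    uncertainty = np.array([
        np.var(reward_model.sample_scores(y, n=n_draws))
        for y in candidates
    ])
    top = np.argsort(uncertainty)[::-1][:budget]   # most uncertain first
    return [candidates[i] for i in top]
```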

3. Neural Feedback as Reward Signal: Modalities, Decoding, and Practical Aspects

RLHB operationalizes the valuation level of cognition, tightly coupling evaluation regions' neural signatures to machine learning reward shaping. Common implementations include:

  • fMRI-Based Feedback: Decoding reward-prediction errors from ventral striatum or value/confidence signals from vmPFC BOLD; these can be linearly or nonlinearly mapped to reward increments.
  • EEG-Based Feedback: FRN amplitude serves as an immediate, trial-by-trial proxy for negative reward. This may be combined with temporal markers of confidence or error processing.
  • Combined Modalities: Multi-modal fusions (e.g., EEG+fMRI or EEG+ECoG) can enhance robustness and sample throughput, with cross-modal neural translation to maximize data efficiency.

A neural-decoding model f(n) is often pretrained in a supervised manner, mapping raw neurophysiological signals to ground-truth reward trace labels prior to integration in RLHB.
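
A minimal sketch of such pretraining, using ridge regression as one plausible (assumed) choice of decoder, with random placeholders standing in for real neural recordings and calibration labels:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# Placeholder data: 400 trials x 128 neural features with noisy linear
# structure, standing in for real recordings and reward labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((400, 128))
y = X @ rng.standard_normal(128) * 0.1 + 0.5 * rng.standard_normal(400)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
f = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X_tr, y_tr)  # decoder f(n)
print("held-out R^2:", round(f.score(X_te, y_te), 3))       # decode-quality check
```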

4. Comparison with RLHF and Topological Impact

Traditional RLHF relies on explicit, consciously reported (or crowdsourced) preference labels, which may be sparse, noisy, or subject to social desirability biases. RLHB, in contrast, accesses the rapid, low-latency computations of reward circuitry, potentially bypassing conscious self-censorship and unlocking richer, sub-behavioral value information.

  • RLHB can, in principle, expose hidden misalignment or value drift undetectable via action-based feedback.
  • By integrating neural and classical feedback—either additively or as model selection criteria—RLHB facilitates the learning of value functions that are both behaviorally anchored and neurobiologically constrained.
  • Empirical correlation metrics, such as brain score (the correlation between held-out predicted and actual neural signals), and downstream robustness generalization (\Delta Acc) are proposed for evaluating the cognitive fidelity and sample efficiency of RLHB-trained policies (Donoso, 17 Jan 2026).
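
A sketch of the brain-score computation under one common convention (per-channel Pearson correlation on held-out data, averaged across channels); the paper's exact aggregation may differ.

```python
import numpy as np

def brain_score(pred: np.ndarray, actual: np.ndarray) -> float:
    """Mean Pearson r between predicted and actual held-out neural
    signals, both of shape (trials, channels). One plausible
    convention, not necessarily the paper's exact definition."""
    p = pred - pred.mean(axis=0)
    a = actual - actual.mean(axis=0)
    denom = np.sqrt((p ** 2).sum(axis=0) * (a ** 2).sum(axis=0))
    r = (p * a).sum(axis=0) / np.where(denom == 0, np.inf, denom)  # guard zero variance
    return float(r.mean())
```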

5. Extensions: CoTHB, Multi-Objective Integration, and Broader Implications

The RLHB paradigm is closely related to Chain of Thought from Human Brain (CoTHB). Whereas RLHB supplies summary score signals per episode, CoTHB leverages fine-grained, temporally localized neural markers of reasoning steps (e.g., EEG markers in DLPFC during multi-step arithmetic) to guide latent model trajectories. Integration of RLHB with classical cross-entropy tasks yields composite pretraining objectives:

\mathcal{L}_{total} = \mathcal{L}_{task} + \lambda \mathcal{L}_{neural}

\mathcal{L}_{neural} = \mathbb{E}_{B^*(t)}\left[\|h_\theta(x) - z_{neural}(B^*(t))\|^2\right]

This supports concurrent alignment with both explicit targets and neural representations, trading off interpretability and performance via the hyperparameter \lambda (Donoso, 17 Jan 2026).
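
A minimal PyTorch sketch of this composite objective; projecting the model's hidden state to the neural embedding's dimensionality and the value lam=0.1 are assumptions.

```python
import torch.nn.functional as F

def composite_loss(task_logits, task_targets, h_theta, z_neural, lam=0.1):
    """L_total = L_task + lambda * L_neural, where L_neural is the mean
    squared distance between the model representation h_theta(x) and the
    neural embedding z_neural(B*(t)). lam=0.1 is an assumption."""
    l_task = F.cross_entropy(task_logits, task_targets)  # explicit target term
    l_neural = F.mse_loss(h_theta, z_neural)             # neural alignment term
    return l_task + lam * l_neural
```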

From an AGI/ASI perspective, RLHB is positioned as a route to "functional AGI"—systems whose perception, valuation, and reasoning more closely reflect the latent processes of human cognition. Moreover, the preservation of neural similarity metrics may provide a path to controlling alignment even in systems that strongly generalize beyond the statistical envelope of language or vision benchmarks.

6. Challenges, Limitations, and Future Research Directions

Key open problems and constraints for RLHB include:

  • Privacy and Ethics: All forms of neural data are considered highly sensitive; robust anonymization, informed consent, and optionality must be enforced. Concerns about "freedom of thought" and unintended mind-reading are prominent.
  • Signal Variability and Alignment: Neural data is idiosyncratic across individuals; cross-subject hyperalignment and domain adaptation are nontrivial. Risks of overfitting to noise or adversarial brain states remain significant.
  • Sample Efficiency and Scalability: Neuroimaging is costly and low-throughput; optimizing the information gain per scan and leveraging pretraining or transfer methodologies is essential.
  • Demographic Bias and Generalizability: Most neuroimaging corpora are demographically imbalanced, raising risks for bias amplification and inequitable model behaviors.

Proposed future directions include real-time RLHB closed-loop training via wearable EEG, synthetic neural labels via cross-modal mapping, incorporation of diverse imaging modalities (MEG, ECoG, fNIRS), and continual brain-supervised learning protocols (Donoso, 17 Jan 2026).

7. Significance and Outlook

RLHB extends the landscape of foundation model alignment by infusing training protocols with principled, biologically grounded value signals, moving beyond the inherent constraints of action-based or explicit preference learning. This avenue holds the potential to remedy alignment failures rooted in shallow behavioral proxies, improve robustness to adversarial distribution shifts, and enable foundation models whose behavioral and cognitive dynamics more faithfully reflect the underlying structure of human thought and evaluation (Donoso, 17 Jan 2026).
