History-Passing Reinforcement (HPR)

Updated 4 July 2026

History-Passing Reinforcement is a method where information from earlier interactions is explicitly passed forward to shape later decisions and state updates.
It employs diverse carriers such as explicit trajectory states, recurrent summaries, and counterfactual baselines to integrate historical data into model predictions.
HPR has been applied in majority dynamics, dialogue systems, and control tasks, demonstrating improved credit assignment and adaptive performance across RL applications.

History-Passing Reinforcement (HPR) denotes reinforcement-learning-style methods in which information generated at earlier interaction steps is explicitly passed forward and allowed to influence later decisions, latent-state updates, or temporal credit assignment. In the literature considered here, the label is used explicitly for a trajectory-level message-passing algorithm that searches for rare initial conditions in majority dynamics on random regular graphs (Jankola et al., 17 Dec 2025). Closely related mechanisms appear under other names in visual dialog, partially observed control, retrieval-augmented generation, black-box jailbreaking, inverse reinforcement learning, and in-context reinforcement learning, where prior answers, observation histories, retrieval traces, latent summaries, or filtered learning histories are carried into later inference or learning (Yang et al., 2019, Fotias et al., 4 May 2026, Zhang et al., 3 Feb 2026, Yoon et al., 6 Feb 2026, Patil et al., 2022, Ke et al., 22 Jan 2025, Chen et al., 21 May 2025).

1. Scope and defining properties

HPR-like methods are distinguished by the treatment of history as an operational variable rather than as incidental context. In these methods, earlier events are not merely observed and forgotten; they are reintroduced into later computation as part of the effective state, as a recurrent summary, as a counterfactual baseline, or as a structured trajectory object. The shared design pattern is that downstream behavior depends on how prior information is passed forward.

Within this scope, the literature is heterogeneous. Some works pass forward explicit histories in the state, such as retrieval histories in multi-hop RAG or rolling well-observation buffers in CO$_2$ storage control. Others pass forward compressed summaries, such as recursively updated information states, logistic context summaries, or predecessor-feature vectors. A different subgroup uses counterfactual history passing: an earlier action is deliberately inserted into future history, and the resulting downstream degradation or preservation is used as the reinforcement signal. These variations are mechanistically aligned, but they do not form a single standardized algorithmic family.

Two important boundaries appear repeatedly. First, some history-aware methods are only HPR-adjacent because they use history at a meta-level rather than inside the policy state. The framework for history-aware hyperparameter optimisation in reinforcement learning uses execution history to adapt hyperparameters such as $\gamma$ during a single training lifetime, but it does not augment the agent’s state or policy with passed-forward history (Parra-Ullauri et al., 2023). Second, the acronym “HPR” is overloaded: in the Wasserstein barycenter literature it denotes Halpern-Peaceman-Rachford, an operator-splitting algorithm unrelated to reinforcement learning or history-passing mechanisms (Zhang et al., 2022).

2. Mechanisms of passing history

The papers instantiate history passing through several distinct carriers. In some cases, history is passed forward explicitly as a textual or trajectory state. In others, it is transformed into a recurrent or expectation-based summary that preserves only the information needed for prediction or credit assignment. In still others, the training signal is defined by comparing gold and tampered histories.

Mechanism	Representative formulation	History carrier
Counterfactual future-history evaluation	HAST computes a history advantage from gold versus tampered dialog histories	Wrong answer inserted into future dialog history
Recursively updated summary state	AIS uses $Z_t=\phi_t(S_{1:t},A_{1:t-1})$; logistic DCMDPs use $\sigma_{h+1}=\alpha \sigma_h + f_h(s_h,a_h,x_h)$	Learned or model-specified recurrent summary
Expected predecessor summary	Predecessor Features use $\boldsymbol{z}(s)=\mathbb{E}\left[\sum_{n=0}^{\infty}(\lambda\gamma)^n \mathbf{x}(S_{t-n})\mid S_t=s\right]$	Expected discounted predecessor features
Explicit history in state	HARR uses $s_t=(\mathcal H_{t-1},q_t)$; TrailBlazer uses $\hat s^{(t)}=[\phi(p^{(t)})\|\{h^{(t-i)}\}_{i=1}^K]$; CO$_2$ control uses $H_t=[y_{t-L+1}^{\text{well}},\ldots,y_t^{\text{well}}]$; SWIRL uses $s_t^L$	Retrieval history, vulnerability trace, observation window, finite state history
Dataset-level history selection	LHF resamples whole learning histories according to improvement and stability	Filtered source learning histories

The counterfactual variant is exemplified by History-Advantage Sequence Training, where a wrong answer at a given round is intentionally passed into future dialog history and evaluated by its downstream effect on later rounds (Yang et al., 2019). The recurrent-summary variant appears in the theory of history-based policies, where $Z_t=\phi_t(S_{1:t},A_{1:t-1})$ defines a recursively updatable history abstraction, and in Dynamic Contextual MDPs, where the full influence of history on latent context is compressed into a low-dimensional statistic $\sigma_h$ (Patil et al., 2022, Tennenholtz et al., 2023). The expectation-based backward view is represented by Predecessor Features, which replace sampled eligibility traces by expected discounted predecessor occupancies (Bailey et al., 2022).

The explicit-state variant is especially prominent in recent applications. HARR makes retrieval history part of the retriever state in multi-hop RAG; TrailBlazer stores prompt embeddings, response features, rewards, and mutator identities from prior jailbreak attempts; history-conditioned CO$_2$ control passes the full well-observation history of the episode; and SWIRL augments state with the previous $L$ states so that both reward and policy are history-dependent (Zhang et al., 3 Feb 2026, Yoon et al., 6 Feb 2026, Fotias et al., 4 May 2026, Ke et al., 22 Jan 2025). A different but related intervention appears in in-context RL, where the issue is not how to encode history at deployment, but which source learning histories should be retained for pretraining; LHF addresses this by resampling whole histories according to improvement and stability scores (Chen et al., 21 May 2025).

3. Foundations in credit assignment and non-Markov control

A major precursor to HPR is the visual-dialog method History-Advantage Sequence Training. At each dialog round, the model receives an image, the dialog history accumulated up to that round, the current question, and 100 candidate answers. HAST evaluates the effect of passing either the correct answer or a wrong answer into future dialog history, and defines the history advantage by comparing the outcomes under the gold history with those under the tampered history. The corresponding update is scaled by the average downstream advantage over future turns. The method is actor-critic-inspired but uses a constructed adverse critic rather than a separately parameterized value function, and in practice approximates the adverse expectation with the top-5 negative answers (Yang et al., 2019). Mechanistically, this is a direct instance of reinforcement according to what happens after an answer is passed into ongoing history.

A second foundation is state augmentation for history-dependent rewards. In cooperative-competitive multi-agent reinforcement learning with history-dependent rewards, the reward at each time step includes a one-step lagged carry-over term, so the problem is not Markovian in the physical state alone. The paper restores Markovianity by augmenting the state with a reward-memory variable that carries the lagged term forward.

This is an explicit example of history being passed forward as a scalar memory state (He et al., 2020).
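The state-augmentation idea can be illustrated with a minimal sketch. The exact reward form used by He et al. is not reproduced here; the code assumes, purely for illustration, that the reward is a base reward plus `beta` times the previous step's base reward. The names `augmented_step`, `transition`, `base_reward`, and `beta` are hypothetical.

```python
from typing import Callable, Tuple

State = int
AugmentedState = Tuple[State, float]  # (physical state, reward memory)


def augmented_step(
    aug_state: AugmentedState,
    action: int,
    transition: Callable[[State, int], State],
    base_reward: Callable[[State, int], float],
    beta: float,
) -> Tuple[AugmentedState, float]:
    """One step of a process whose reward carries over a lagged term.

    Assumed reward form (illustrative only):
        r_t = base_reward(s_t, a_t) + beta * base_reward(s_{t-1}, a_{t-1})
    The lagged term is stored in the memory slot of the augmented state,
    so the augmented process is Markov even though the reward is not
    Markov in the physical state alone.
    """
    state, memory = aug_state
    base = base_reward(state, action)
    reward = base + beta * memory
    next_state = transition(state, action)
    # The memory slot now holds what the next step needs from the past.
    return (next_state, base), reward


# Toy dynamics: the state accumulates actions, the base reward is the action.
aug = (0, 0.0)
aug, r1 = augmented_step(aug, 1, lambda s, a: s + a, lambda s, a: float(a), 0.5)
aug, r2 = augmented_step(aug, 2, lambda s, a: s + a, lambda s, a: float(a), 0.5)
```

Because the memory slot is part of the state, any standard Markov RL algorithm can be run on the augmented pair without modification.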

Theoretical formalization appears in work on history-based policies for MDP control. There, history-based abstractions are functions

$Z_t=\phi_t(S_{1:t},A_{1:t-1})$

that can be updated recursively as new states and actions arrive. The abstraction is an Approximate Information State if reward and next-state distribution are approximately predictable from it. The resulting value-gap bound formalizes when passed-forward history can compensate for the non-Markov effects induced by lossy feature abstraction (Patil et al., 2022).

A closely related formal model is the Dynamic Contextual MDP. In logistic DCMDPs, the latent context distribution is history-dependent through the recurrent summary

$\sigma_{h+1}=\alpha \sigma_h + f_h(s_h,a_h,x_h)$

and the context probabilities are given by a softmax of this statistic. The result is a structured non-Markov model in which long histories are compressed into a low-dimensional passed-forward summary rather than stored explicitly (Tennenholtz et al., 2023).
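The recurrence $\sigma_{h+1}=\alpha \sigma_h + f_h(s_h,a_h,x_h)$ followed by a softmax can be sketched as below. This is a minimal illustration rather than the authors' implementation: the feature increment $f_h$ is passed in as a plain vector, and the function names `update_summary` and `context_probabilities` are hypothetical.

```python
import math
from typing import List, Sequence


def update_summary(
    sigma: Sequence[float], increment: Sequence[float], alpha: float
) -> List[float]:
    """Apply sigma_{h+1} = alpha * sigma_h + f_h, with f_h given as `increment`."""
    return [alpha * s + f for s, f in zip(sigma, increment)]


def context_probabilities(sigma: Sequence[float]) -> List[float]:
    """Softmax over the summary statistic (shifted by the max for stability)."""
    m = max(sigma)
    exps = [math.exp(s - m) for s in sigma]
    total = sum(exps)
    return [e / total for e in exps]


# Two latent contexts; an early event favors context 0, then decays.
sigma = [0.0, 0.0]
sigma = update_summary(sigma, [1.0, 0.0], alpha=0.5)  # -> [1.0, 0.0]
sigma = update_summary(sigma, [0.0, 0.0], alpha=0.5)  # -> [0.5, 0.0]
probs = context_probabilities(sigma)
```

The discount `alpha` controls how quickly old increments fade, so the entire history enters the context distribution only through the fixed-size vector `sigma`.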

Backward credit assignment supplies another strand. Predecessor Features define

$\boldsymbol{z}(s)=\mathbb{E}\left[\sum_{n=0}^{\infty}(\lambda\gamma)^n \mathbf{x}(S_{t-n})\mid S_t=s\right]$

and then propagate TD errors through $\boldsymbol{z}(s)$ instead of a sampled eligibility trace. This does not preserve full trajectory order, but it does pass an expectation over viable predecessors forward in the current representation and then use that representation to send value updates backward (Bailey et al., 2022).
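For intuition, the expectation $\boldsymbol{z}(s)=\mathbb{E}\left[\sum_{n\ge 0}(\lambda\gamma)^n \mathbf{x}(S_{t-n})\mid S_t=s\right]$ can be computed exactly in a small tabular case. The sketch below is not the learning rule of Bailey et al.; it assumes one-hot features, a known transition matrix `P`, and a known stationary distribution `mu`, and then sums the discounted powers of the time-reversed chain. All function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def reverse_kernel(P: Matrix, mu: List[float]) -> Matrix:
    """Time-reversed chain: R[s][s'] = Pr(S_{t-1} = s' | S_t = s) under stationarity."""
    n = len(P)
    return [[mu[j] * P[j][i] / mu[i] for j in range(n)] for i in range(n)]


def predecessor_features(
    P: Matrix, mu: List[float], lam: float, gamma: float, iters: int = 200
) -> Matrix:
    """Row s approximates z(s) for one-hot features.

    Uses the fixed point Z = I + (lam * gamma) * R @ Z, whose solution is
    the series sum_n (lam * gamma)^n R^n. Requires lam * gamma < 1.
    """
    n = len(P)
    c = lam * gamma
    R = reverse_kernel(P, mu)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    Z = [row[:] for row in identity]
    for _ in range(iters):
        Z = [
            [identity[i][j] + c * sum(R[i][k] * Z[k][j] for k in range(n))
             for j in range(n)]
            for i in range(n)
        ]
    return Z


# Deterministic 3-cycle 0 -> 1 -> 2 -> 0 with a uniform stationary distribution.
P_cycle = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
Z_cycle = predecessor_features(P_cycle, [1 / 3, 1 / 3, 1 / 3], lam=0.5, gamma=1.0)
```

In the cycle example, the row for state 0 puts the most weight on state 0 itself, then on its immediate predecessor (state 2), then on state 1, which is the ordering a decaying eligibility trace would produce on average.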

4. The explicit HPR algorithm in majority dynamics

The only paper in this set that names the method “History-Passing Reinforcement” introduces HPR for synchronous deterministic majority dynamics on large random $d$-regular graphs. Each node carries a binary spin, its local field is the sum of its neighbors' spins, and at every step all nodes simultaneously align with the sign of their local field.

The optimization target is the minimal initial magnetization $\sigma_{h+1}=\alpha \sigma_h + f_h(s_h,a_h,x_h)$ 9 required to reach all- $\boldsymbol{z}(s)=\mathbb{E}\left[\sum_{n=0}^{\infty}(\lambda\gamma)^n \mathbf{x}(S_{t-n})\mid S_t=s\right]$ 0 consensus within $\boldsymbol{z}(s)=\mathbb{E}\left[\sum_{n=0}^{\infty}(\lambda\gamma)^n \mathbf{x}(S_{t-n})\mid S_t=s\right]$ 1 steps. Backtracking dynamical cavity method (BDCM) predicts that for $\boldsymbol{z}(s)=\mathbb{E}\left[\sum_{n=0}^{\infty}(\lambda\gamma)^n \mathbf{x}(S_{t-n})\mid S_t=s\right]$ 2 with $\boldsymbol{z}(s)=\mathbb{E}\left[\sum_{n=0}^{\infty}(\lambda\gamma)^n \mathbf{x}(S_{t-n})\mid S_t=s\right]$ 3, and for $\boldsymbol{z}(s)=\mathbb{E}\left[\sum_{n=0}^{\infty}(\lambda\gamma)^n \mathbf{x}(S_{t-n})\mid S_t=s\right]$ 4 with $\boldsymbol{z}(s)=\mathbb{E}\left[\sum_{n=0}^{\infty}(\lambda\gamma)^n \mathbf{x}(S_{t-n})\mid S_t=s\right]$ 5, a global initial minority of $\boldsymbol{z}(s)=\mathbb{E}\left[\sum_{n=0}^{\infty}(\lambda\gamma)^n \mathbf{x}(S_{t-n})\mid S_t=s\right]$ 6 nodes should suffice. Representative replica-symmetric predictions are $\boldsymbol{z}(s)=\mathbb{E}\left[\sum_{n=0}^{\infty}(\lambda\gamma)^n \mathbf{x}(S_{t-n})\mid S_t=s\right]$ 7 for $\boldsymbol{z}(s)=\mathbb{E}\left[\sum_{n=0}^{\infty}(\lambda\gamma)^n \mathbf{x}(S_{t-n})\mid S_t=s\right]$ 8, $\boldsymbol{z}(s)=\mathbb{E}\left[\sum_{n=0}^{\infty}(\lambda\gamma)^n \mathbf{x}(S_{t-n})\mid S_t=s\right]$ 9 for $s_t=(\mathcal H_{t-1},q_t)$ 0, $s_t=(\mathcal H_{t-1},q_t)$ 1 for $s_t=(\mathcal H_{t-1},q_t)$ 2, and $s_t=(\mathcal H_{t-1},q_t)$ 3 for $s_t=(\mathcal H_{t-1},q_t)$ 4 (Jankola et al., 17 Dec 2025).
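The synchronous majority update itself is simple to simulate, which is what makes candidate initializations easy to verify once found. The sketch below is illustrative only: it assumes spins in {-1, +1}, keeps a node's current spin on a tie (the paper's tie-breaking rule is not stated here), and uses the complete graph on four nodes as a toy 3-regular graph instead of a large random regular graph. The function names are hypothetical.

```python
from typing import List

Adjacency = List[List[int]]


def majority_step(adj: Adjacency, spins: List[int]) -> List[int]:
    """One synchronous update: every node takes the sign of its local field."""
    new_spins = []
    for i, neighbors in enumerate(adj):
        field = sum(spins[j] for j in neighbors)
        if field == 0:
            new_spins.append(spins[i])  # assumed tie rule: keep current spin
        else:
            new_spins.append(1 if field > 0 else -1)
    return new_spins


def magnetization(spins: List[int]) -> float:
    """Average spin of a configuration."""
    return sum(spins) / len(spins)


def reaches_plus_consensus(adj: Adjacency, spins: List[int], steps: int) -> bool:
    """Check whether all spins equal +1 after `steps` synchronous updates."""
    for _ in range(steps):
        spins = majority_step(adj, spins)
    return all(s == 1 for s in spins)


# Complete graph K4 is 3-regular, so local fields are never zero.
k4 = [[1, 2, 3], [0, 2, 3], [0, 1, 3], [0, 1, 2]]
ok = reaches_plus_consensus(k4, [1, 1, 1, -1], steps=1)    # one dissenter flips
stuck = reaches_plus_consensus(k4, [1, 1, -1, -1], steps=5)  # balanced start oscillates
```

A search procedure such as HPR would propose the initial spin vector; a checker of this kind only confirms whether a proposed initialization reaches consensus within the allotted number of steps.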

Algorithmically, HPR reinforces BDCM messages over full node trajectories. The core message update carries a bias term that acts on the initial spin of the neighboring trajectory. Node marginals over the initial spin are estimated from the messages, and the biases are updated according to a threshold comparison, with a tunable reinforcement strength. Bias updates are applied stochastically, message updates are damped, and a trial initialization is then extracted.

For majority dynamics, the paper exploits permutation symmetry of the neighbors and rewrites the local constraint in terms of summed neighbor trajectories, reducing the per-iteration complexity from $\hat s^{(t)}=[\phi(p^{(t)})\|\{h^{(t-i)}\}_{i=1}^K]$ 4 to $\hat s^{(t)}=[\phi(p^{(t)})\|\{h^{(t-i)}\}_{i=1}^K]$ 5. Empirically, HPR finds explicit minority-takeover initializations for $\hat s^{(t)}=[\phi(p^{(t)})\|\{h^{(t-i)}\}_{i=1}^K]$ 6 and $\hat s^{(t)}=[\phi(p^{(t)})\|\{h^{(t-i)}\}_{i=1}^K]$ 7: $\hat s^{(t)}=[\phi(p^{(t)})\|\{h^{(t-i)}\}_{i=1}^K]$ 8 for $\hat s^{(t)}=[\phi(p^{(t)})\|\{h^{(t-i)}\}_{i=1}^K]$ 9, $_2$ 0 for $_2$ 1, and $_2$ 2 for $_2$ 3. A simulated-annealing baseline remains positive in those cases. HPR nonetheless does not reach the lower densities predicted by replica-symmetric BDCM, and its best performance lies near the onset of a dynamical one-step replica symmetry breaking phase, which the paper identifies as the likely practical barrier (Jankola et al., 17 Dec 2025).

5. Application domains and empirical behavior

In retrieval-augmented generation, HARR formulates retriever optimization as an MDP with history-aware state

$s_t=(\mathcal H_{t-1},q_t)$

where $\mathcal H_{t-1}$ is the retrieval history and $q_t$ is the current sub-query. Actions are ordered document lists sampled with a Plackett–Luce policy, rewards are sparse terminal token-level F1 scores, and optimization uses Group Relative Policy Optimization. The history component is directly ablated: on the ReAct pipeline, removing history degrades 9 of 10 metrics relative to full HARR, and removing RL degrades all 10 metrics. For Qwen3-Embedding-4B on HotpotQA, full HARR reaches 32.34 EM / 41.55 F1, versus 31.56 / 40.74 without history and 31.42 / 40.63 without RL (Zhang et al., 3 Feb 2026).
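Two ingredients named above, Plackett–Luce sampling of an ordered list and group-relative reward normalization, can be sketched in a few lines. This is a generic illustration under stated assumptions, not HARR's code: the scores stand in for retriever logits computed from the history-aware state, the advantage uses the commonly cited mean-and-standard-deviation normalization within a group of rollouts, and the function names are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def sample_plackett_luce(
    scores: Sequence[float], k: int, rng: random.Random
) -> Tuple[List[int], float]:
    """Sample an ordered list of k distinct items and return its log-probability.

    Items are drawn one at a time, without replacement, with probability
    proportional to exp(score) among the items still available.
    """
    remaining = list(range(len(scores)))
    ranking: List[int] = []
    log_prob = 0.0
    for _ in range(k):
        top = max(scores[i] for i in remaining)
        weights = [math.exp(scores[i] - top) for i in remaining]
        pick = rng.choices(range(len(remaining)), weights=weights)[0]
        log_prob += math.log(weights[pick] / sum(weights))
        ranking.append(remaining.pop(pick))
    return ranking, log_prob


def group_relative_advantages(rewards: Sequence[float]) -> List[float]:
    """Normalize terminal rewards within one group of rollouts."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + 1e-8) for r in rewards]


rng = random.Random(0)
ranking, log_prob = sample_plackett_luce([2.0, 0.5, -1.0, 0.0], k=3, rng=rng)
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

In a policy-gradient update, each sampled ranking's log-probability would be weighted by its rollout's advantage, so rankings that led to better final answers become more likely.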

In black-box LLM jailbreaking, TrailBlazer converts prompt mutation into a sequential RL problem in which each historical record $h^{(t)}$ stores the prompt embedding, response features, reward, and mutator identity of one attempt, with the response features comprising refusal flag, perplexity, normalized length, and toxicity. The raw history variant uses

$\hat s^{(t)}=[\phi(p^{(t)})\|\{h^{(t-i)}\}_{i=1}^K]$

and the attention-based variant replaces the window by an attended summary. Empirically, the jump from the memoryless baseline to history-aware RL is large: on LLaMA 3.2-11B, baseline ASR is 37.18%, HRL is 60.25%, and AHRL is 95.51%; on GPT-oss-20B, baseline ASR is 4.48%, HRL is 74.84%, and AHRL is 85.30% (Yoon et al., 6 Feb 2026).
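The raw-history state $\hat s^{(t)}=[\phi(p^{(t)})\|\{h^{(t-i)}\}_{i=1}^K]$ is a concatenation of the current embedding with the $K$ most recent records, newest first. A minimal sketch of that construction follows. Zero-padding when fewer than $K$ records exist is an assumption made here for illustration, and the name `build_state` and the flat-list record layout are hypothetical.

```python
from typing import List, Sequence


def build_state(
    prompt_embedding: Sequence[float],
    history: Sequence[Sequence[float]],
    k: int,
    record_dim: int,
) -> List[float]:
    """Concatenate the current embedding with the last k records, newest first.

    `history` is ordered oldest to newest. Missing records are zero-padded
    (an assumption of this sketch) so the state has a fixed length of
    len(prompt_embedding) + k * record_dim.
    """
    state = list(prompt_embedding)
    for i in range(1, k + 1):
        if i <= len(history):
            record = list(history[-i])
            if len(record) != record_dim:
                raise ValueError("record has unexpected dimension")
        else:
            record = [0.0] * record_dim
        state.extend(record)
    return state


# Two earlier attempts are available, but the window asks for three.
state = build_state([0.1, 0.2], [[1.0, 1.0], [2.0, 2.0]], k=3, record_dim=2)
```

A fixed-length state of this kind can be fed to an ordinary feed-forward policy, whereas an attention-based variant would instead pool the records into a single summary vector.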

In partially observed reservoir control, history-conditioned CO$_2$ storage policies use a full episode-level buffer

$H_t=[y_{t-L+1}^{\text{well}},\ldots,y_t^{\text{well}}]$

and a conv-plus-gated-transformer encoder to summarize well observations. This policy uses only deployable well-level information, not privileged simulator fields. The reported final return is 19.622 for the history-conditioned policy, compared with 19.604 for the privileged-state benchmark and 18.345 for the well-only baseline, leading the paper to conclude that temporal context is the dominant ingredient for overcoming partial observability in this setting (Fotias et al., 4 May 2026).

In inverse reinforcement learning for animal behavior, SWIRL introduces hidden modes, state-dependent mode transitions, and history-dependent reward and policy maps defined over the history-augmented state $s_t^L$.

The full S-2 model combines decision-level state dependence in the hidden transition with action-level history dependence in reward and policy. In the simulated gridworld, only S-2 accurately recovers the true reward maps; in the water-restricted labyrinth, state dependency in hidden-mode switching and history dependency in reward both improve held-out test log-likelihood and yield interpretable water, home, and explore modes (Ke et al., 22 Jan 2025).

At the pretraining-data level, filtering histories rather than passing all of them indiscriminately also changes in-context RL behavior. LHF assigns each source learning history a score based on improvement and stability, converts this score into a retention probability, and resamples whole histories before transformer pretraining. On Darkroom-type tasks, average relative enhancement is 8.8% for AD, 9.1% for DICP, and 11.9% for DPT; with noisy data, the gains become larger, including 90.7% for AD on Darkroom. The main implication is that passed history can transmit source suboptimality as well as useful adaptation behavior (Chen et al., 21 May 2025).

6. Limitations, boundaries, and terminological ambiguity

Across the literature, HPR-style methods inherit several recurring limitations. Counterfactual dialog training still fills future rounds after the tampered turn with ground-truth QA pairs rather than fully free-running conversation, and its adverse critic is approximated by top-5 negatives rather than the full candidate set (Yang et al., 2019). Explicit history-in-state methods often rely on sparse terminal rewards, candidate-pool restrictions, or short fixed windows; for example, HARR explores only within a top-$k$ candidate set, and TrailBlazer uses a fixed history window over coarse vulnerability features (Zhang et al., 3 Feb 2026, Yoon et al., 6 Feb 2026). History-conditioned CO$_2$ control does not include a history-length ablation or a direct comparison with recurrent policies carrying hidden state across steps (Fotias et al., 4 May 2026).

Compression is another common tradeoff. Predecessor Features intentionally replace exact trajectories by expected discounted predecessor occupancies, which can blur rare but causally decisive events (Bailey et al., 2022). Logistic DCMDPs restrict non-Markov dependence to additive discounted summaries of local feature increments, which excludes arbitrary order-sensitive history effects (Tennenholtz et al., 2023). SWIRL uses state augmentation with finite history length $L$, leading to tabular costs in its soft-Q inner loop that grow with $L$ (Ke et al., 22 Jan 2025). The explicit HPR algorithm for majority dynamics is linear in graph size after its dynamic-programming speedup, but still exponential in trajectory length, which confines it to short-horizon dynamical optimization (Jankola et al., 17 Dec 2025).

The boundary with adjacent work remains important. The history-aware hyperparameter optimisation framework uses Complex Event Processing and Temporal Models to adapt $\gamma$ online from reward-window statistics, but history is used to control hyperparameters rather than passed into the agent’s policy or value representation (Parra-Ullauri et al., 2023). Conversely, the acronym HPR can denote an entirely different object: the Halpern-Peaceman-Rachford algorithm for the Wasserstein barycenter problem (Zhang et al., 2022). Taken together, this suggests that “History-Passing Reinforcement” is presently best understood as a mechanistic label spanning counterfactual history rollouts, explicit history-conditioned state representations, recursively updated summaries, predecessor-based backward credit assignment, and history-aware data curation, rather than as a single standardized Bellman framework.