Reinforcement Learning Guidance Overview

Updated 4 July 2026

Reinforcement Learning Guidance (RLG) is a set of methods that augment traditional RL with auxiliary signals such as evaluative feedback, language advice, and logic constraints.
Implementations intervene in data collection, policy updates, and action admissibility through learned dense rewards, teacher policies, and action masking.
RLG has been applied in robotics, aerospace, and diffusion models, with reported improvements in exploration efficiency, safety, and task performance.

Reinforcement Learning Guidance (RLG) denotes a family of methods that steer reinforcement learning with structured signals beyond, or in addition to, the native environment reward. In the literature, those signals include evaluative feedback, preferences, high-level goals, state-only observation, and attention from humans; analytically derived dense guidance rewards from trajectory returns; teacher policies and suboptimal controllers; natural-language or large-language-model advice; formal constraints and temporal-logic specifications; and guidance extracted from prior or self-generated trajectories (Zhang et al., 2019, Gangwani et al., 2020, Shenfeld et al., 2023, Spieker, 2021, Hasanbeig et al., 2019, Wang et al., 2024). The common purpose is to improve credit assignment, exploration, coordination, safety, or alignment by altering the learning signal, the policy update, the admissible action set, or the data-collection process.

1. Conceptual scope and historical framing

A 2019 survey organized human guidance beyond conventional step-by-step action demonstrations into five frameworks: learning from evaluative feedback, learning from human preference, hierarchical imitation, imitation from observation, and learning attention from humans (Zhang et al., 2019). That framing is important because it already treats guidance as broader than imitation learning: the human need not execute the policy, and the signal need not be an action label. Instead, the signal may be a judgment, a comparison, a high-level option, a state-only trajectory, or an attention pattern.

Subsequent work broadens that scope further. Guidance appears as a learned dense reward derived from returns over trajectories containing a state-action pair; as a teacher policy whose influence is adjusted by comparison to a reward-only counterfactual; as state-conditional language advice that shapes action choice; as uncertainty-aware LLM action advice; as constraint-based intervention at the observation, action, or internal action-selection interface; as temporal-logic progress shaping via automata; and as trajectory-space regularization toward self-generated memories (Gangwani et al., 2020, Shenfeld et al., 2023, Tasrin et al., 2021, Shoaeinaeini et al., 2024, Spieker, 2021, Hasanbeig et al., 2019, Wang et al., 2024). This suggests that RLG is best understood as an umbrella concept for auxiliary structure injected into RL, rather than a single algorithmic family.

A recurrent misconception is that guidance is synonymous with reward shaping. The literature does not support that reduction. Some methods change only the reward signal; others alter policy selection, exploration support, action admissibility, representation learning, or the interaction loop itself. RLG therefore spans reward-side, policy-side, action-side, and data-collection-side interventions.

2. Guidance as learned rewards and surrogate objectives

A particularly explicit reward-side formulation appears in "Learning Guidance Rewards with Trajectory-space Smoothing" (Gangwani et al., 2020). In an infinite-horizon discounted MDP $(\mathcal{S}, \mathcal{A}, r, p, \gamma)$ with return

$R(\tau) = \sum_{t=0}^{\infty} \gamma^t r(s_t,a_t),$

the paper replaces the original objective with a trajectory-smoothed surrogate and derives a learned guidance reward

$r_g(s,a;\beta) = \mathbb{E}_{\tau \sim p_\beta(\tau; s,a)}[R(\tau)].$

This reward is the expected trajectory return under a behavior-induced distribution of trajectories that contain $(s,a)$. The appendix makes the empirical interpretation explicit: given a dataset of trajectories generated by $\beta$, $r_g(s,a)$ is the average return of previously seen trajectories containing that state-action pair. The authors interpret this as a simple form of uniform return redistribution.

The practical algorithm, Iterative Relative Credit Refinement (IRCR), does not train an auxiliary reward model. In tabular Q-learning it stores episodic returns in a buffer $\mathcal{B}(s,a)$ for each visited state-action pair and uses the average min-max normalized return as the reward replacing the environment reward. In SAC, TD3, and a C51-style distributional variant, each transition from an episode is relabeled with the episode return, normalized to $[0,1]$, and that value is used in the critic target. The stated advantages are delay invariance, dense per-step supervision, simple integration with existing pipelines, no auxiliary networks, and applicability across single-agent, multi-agent, actor-critic, and distributional RL.
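The tabular variant can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the class name `GuidanceRewardBuffer`, the choice to count each state-action pair once per episode, the running minimum and maximum used for normalization, and the zero default for unseen pairs are all assumptions made for this example.

```python
from collections import defaultdict


class GuidanceRewardBuffer:
    """Per-(state, action) store of episodic returns, playing the role of B(s, a)."""

    def __init__(self):
        self.returns = defaultdict(list)  # (s, a) -> returns of episodes containing it
        self.r_min = float("inf")         # smallest episode return seen so far
        self.r_max = float("-inf")        # largest episode return seen so far

    def add_episode(self, transitions, episode_return):
        """Credit the episode return to every distinct (s, a) pair in the episode."""
        self.r_min = min(self.r_min, episode_return)
        self.r_max = max(self.r_max, episode_return)
        for key in set(transitions):      # assumption: one entry per pair per episode
            self.returns[key].append(episode_return)

    def guidance_reward(self, s, a):
        """Average min-max normalized return of stored episodes containing (s, a)."""
        stored = self.returns.get((s, a))
        if not stored:
            return 0.0                    # assumption: unseen pairs get zero
        span = self.r_max - self.r_min
        if span == 0.0:
            return 0.0                    # assumption: no relative signal yet
        return sum((g - self.r_min) / span for g in stored) / len(stored)


buf = GuidanceRewardBuffer()
buf.add_episode([("s0", "left"), ("s1", "left")], episode_return=0.0)
buf.add_episode([("s0", "right"), ("s1", "left")], episode_return=10.0)

print(buf.guidance_reward("s0", "left"))   # 0.0
print(buf.guidance_reward("s0", "right"))  # 1.0
print(buf.guidance_reward("s1", "left"))   # 0.5
```

In a Q-learning loop under these assumptions, `guidance_reward(s, a)` would stand in for the environment reward in the temporal-difference target, which is what makes the signal dense even when the native reward arrives only at episode end.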

The same paper is explicit about the limits of this formulation. The reward is not potential-based, and the method optimizes the smoothed surrogate objective $\eta_{\text{smooth}}(\pi_\theta)$, not the original $\eta(\pi)$. There is no theorem of policy invariance relative to the native reward, the learned reward depends on the behavior distribution $\beta$, and poor or adversarial behavior distributions can produce deceptive guidance rewards and local optima. The paper also distinguishes credit assignment from exploration: IRCR assumes some reward is available at least at the end of each episode and is not designed for hard-exploration settings in which all episodes return zero.

3. Guidance as teacher-, human-, and language-conditioned policy shaping

Teacher-guided formulations make the trust problem explicit: guidance can accelerate learning, but it can also trap the learner if the teacher is suboptimal or privileged. "TGRL: An Algorithm for Teacher Guided Reinforcement Learning" formulates this as

[…]

where the main policy is trained with reward plus imitation and the auxiliary policy […] is trained only from reward (Shenfeld et al., 2023). Dualizing the constraint yields the effective imitation weight

[…]

so teacher influence is increased only when the guided learner outperforms the reward-only counterfactual. The same paper emphasizes that this handles privileged teachers and suboptimal teachers, and empirically reports that in Shadow Hand tactile manipulation, RL alone achieves […] success, the teacher […], pure teacher-student learning […], and TGRL […].
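The dualization step can be sketched as follows. The notation is assumed for illustration and is not taken from the paper: $J_R(\pi)$ is the expected environment return, $H(\pi)$ is an imitation penalty toward the teacher (for example a cross-entropy), $\alpha$ is a base imitation coefficient, $\pi_R$ is the reward-only auxiliary policy, $\lambda \ge 0$ is the Lagrange multiplier, and $\mu$ is a step size.

```latex
\begin{aligned}
&\max_{\pi}\; J_R(\pi) - \alpha H(\pi)
  \quad \text{s.t.} \quad J_R(\pi) \ge J_R(\pi_R), \\[4pt]
&L(\pi,\lambda)
  = J_R(\pi) - \alpha H(\pi) + \lambda\bigl(J_R(\pi) - J_R(\pi_R)\bigr) \\
&\phantom{L(\pi,\lambda)}
  = (1+\lambda)\Bigl[J_R(\pi) - \tfrac{\alpha}{1+\lambda}\,H(\pi)\Bigr] - \lambda J_R(\pi_R), \\[4pt]
&\lambda \leftarrow \max\bigl(0,\; \lambda - \mu\,[\,J_R(\pi) - J_R(\pi_R)\,]\bigr).
\end{aligned}
```

Under these assumptions the imitation term is weighted by $\alpha/(1+\lambda)$. When the guided policy earns more reward than the reward-only policy, the multiplier update shrinks $\lambda$ and the weight grows; when it earns less, $\lambda$ grows and the teacher is discounted.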

A related but more localized trust mechanism appears in "Accelerating Reinforcement Learning with Suboptimal Guidance" (Bøhn et al., 2019). There, a suboptimal controller provides actions, but imitation is applied only when a Q-filter judges the guide better than the learner in the sampled state: […]. The behavior cloning loss is thus state-dependent, and the paper argues that using the guide’s own static value function […] is more adaptive than evaluating both actions with the learner’s immature critic.
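A filtered behavior-cloning loss of this kind might look like the sketch below. The function name, the dictionary keys, the squared-error form of the cloning loss, and the normalization by full batch size are assumptions for illustration; the two scores stand for whichever value estimates are compared for the guide and the learner in each sampled state.

```python
def q_filtered_bc_loss(batch):
    """Squared-error behavior cloning, applied only where the guide is judged better.

    Each item is a dict with hypothetical keys:
      'guide_action', 'learner_action': sequences of floats
      'guide_score', 'learner_score':   scalar value estimates for the sampled state
    """
    if not batch:
        return 0.0
    total = 0.0
    for item in batch:
        # The filter: imitate only in states where the guide scores higher.
        if item["guide_score"] > item["learner_score"]:
            total += sum(
                (g - l) ** 2
                for g, l in zip(item["guide_action"], item["learner_action"])
            )
    # Dividing by the full batch size means filtered-out states contribute zero.
    return total / len(batch)


batch = [
    {"guide_action": [1.0, 0.0], "learner_action": [0.0, 0.0],
     "guide_score": 2.0, "learner_score": 1.0},   # guide better: imitated
    {"guide_action": [1.0, 1.0], "learner_action": [0.0, 0.0],
     "guide_score": 0.5, "learner_score": 1.0},   # learner better: ignored
]
loss = q_filtered_bc_loss(batch)
print(loss)  # 0.5
```

Because the filter is evaluated per state, the imitation pressure disappears automatically in regions where the learner has overtaken the guide.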

Human guidance is not restricted to teachers or demonstrations. "Opinion-Guided Reinforcement Learning" models uncertain state evaluations as subjective-logic opinions […], fuses them with policy probabilities via Belief Constraint Fusion, and uses the resulting shaped policy as the initial policy for subsequent RL (Dagenais et al., 2024). The paper reports that opinions, even if uncertain, improve performance, and gives concrete cumulative-reward examples in Frozen Lake, including an unadvised mean cumulative reward of […] and oracle-guided performance of […] at […].

Natural language and LLMs have extended policy shaping further. "Influencing Reinforcement Learning through Natural Language Guidance" introduces Automated Advice Aided Policy Shaping, with an Experience Driven Agent, an Advice Generator, and an Advice Driven Agent combined by

[…]

with […] decayed over training (Tasrin et al., 2021). "Guiding Reinforcement Learning Using Uncertainty-Aware LLMs" uses a fine-tuned BERT advisor calibrated by Monte Carlo Dropout, then mixes the LLM and PPO policies according to entropy-based uncertainty: […]. On Minigrid Unlock Pickup, the reported AUC values are […] for the uncertainty-aware method, […] for uncalibrated LLM guidance, […] for calibrated linear-decay shaping, and […] for unguided RL (Shoaeinaeini et al., 2024).
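Entropy-weighted policy mixing can be illustrated with a small sketch. The specific rule used here, a convex combination whose weight is the advisor's normalized entropy, is an assumption chosen for clarity and is not claimed to be the paper's formula; the function names are likewise hypothetical, and the advisor distribution is taken as given rather than computed from Monte Carlo Dropout samples.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy divided by log(num_actions), clamped to [0, 1].

    Assumes at least two actions so the normalizer is nonzero.
    """
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return min(1.0, max(0.0, h / math.log(len(probs))))


def mix_policies(advisor_probs, agent_probs):
    """Trust the advisor when it is confident, the RL agent when it is not."""
    u = normalized_entropy(advisor_probs)  # 0 = confident advisor, 1 = uninformative
    return [(1.0 - u) * pa + u * pg for pa, pg in zip(advisor_probs, agent_probs)]


agent = [0.2, 0.3, 0.5]
confident = mix_policies([1.0, 0.0, 0.0], agent)      # follows the advisor
uninformed = mix_policies([1 / 3, 1 / 3, 1 / 3], agent)  # falls back to the agent
```

The design point is that a miscalibrated advisor reports low entropy even when it is wrong, which is why the calibration step described above matters before any such weighting is applied.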

4. Guidance as constraints, logic, and safe transfer

Constraint-guided work changes the interaction loop rather than the scalar reward. "Constraint-Guided Reinforcement Learning: Augmenting the Agent-Environment-Interaction" defines three interfaces for guidance: observation augmentation […], post-hoc action replacement […], and internal action masking […] (Spieker, 2021). In the paper’s card-game and grid-world case studies, these interventions yield more reliable and safer behavior and faster training, with action masking strongest when invalid actions are common and observation masking strongest asymptotically in the grid-world task with combinatorial subgoals.
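Two of these interfaces are easy to illustrate. The sketch below shows internal action masking as a softmax restricted to valid actions, and post-hoc replacement as a fallback applied after the agent has chosen. The function names, the boolean validity list, and the fixed fallback action are assumptions for this example, not the paper's interface.

```python
import math


def masked_softmax(logits, valid):
    """Softmax over valid actions only; invalid actions get probability zero."""
    if not any(valid):
        raise ValueError("at least one action must be valid")
    top = max(l for l, ok in zip(logits, valid) if ok)  # for numerical stability
    exps = [math.exp(l - top) if ok else 0.0 for l, ok in zip(logits, valid)]
    norm = sum(exps)
    return [e / norm for e in exps]


def replace_if_invalid(action, valid, fallback):
    """Post-hoc replacement: keep the chosen action unless it violates a constraint."""
    return action if valid[action] else fallback


valid = [True, True, False]
probs = masked_softmax([0.0, 0.0, 5.0], valid)
print(probs)                               # [0.5, 0.5, 0.0]
print(replace_if_invalid(2, valid, 0))     # 0
print(replace_if_invalid(1, valid, 0))     # 1
```

Masking changes the distribution the agent samples from, so invalid actions are never explored, whereas replacement leaves the policy untouched and corrects only the executed action.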

Logic-guided work replaces hand-designed shaping with automaton progress. "Certified Reinforcement Learning with Logic Guidance" translates an LTL specification into a Limit-Deterministic Generalised Büchi Automaton, synchronizes it with the MDP, tracks an accepting frontier, and gives positive reward when the next automaton state enters a currently required accepting set (Hasanbeig et al., 2019). Under stated assumptions and sufficiently large discount, the optimal policy for the shaped product process also maximizes the probability of satisfying the LTL formula. The paper’s contribution is therefore not only practical reward design but a certification link between return maximization and formal task satisfaction.
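The accepting-frontier idea can be sketched without the automaton translation itself. In the example below, the accepting sets are given directly, the frontier holds the indices of sets still to be visited, and a positive reward is issued when the next automaton state falls in a set on the frontier. The function name, the reward value `r_p`, and the rule of resetting to all sets once the frontier empties are simplifying assumptions and may differ from the paper's exact definition.

```python
def frontier_step(next_q, frontier, accepting_sets, r_p=1.0):
    """Return (reward, new_frontier) after the automaton moves to next_q.

    accepting_sets: list of frozensets of automaton states.
    frontier:       set of indices into accepting_sets that are still required.
    """
    hit = {j for j in frontier if next_q in accepting_sets[j]}
    if not hit:
        return 0.0, set(frontier)           # no progress, frontier unchanged
    remaining = set(frontier) - hit
    if not remaining:
        # Simplified reset (assumption): start a new round over all sets.
        remaining = set(range(len(accepting_sets)))
    return r_p, remaining


accepting_sets = [frozenset({"q1"}), frozenset({"q2"})]
frontier = {0, 1}
r1, frontier = frontier_step("q1", frontier, accepting_sets)  # progress on set 0
r2, frontier = frontier_step("q1", frontier, accepting_sets)  # set 0 no longer required
r3, frontier = frontier_step("q2", frontier, accepting_sets)  # round complete, reset
```

Revisiting an already satisfied set earns nothing until the round completes, which is what pushes the agent to visit every accepting set repeatedly rather than loop on one.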

Safe-transfer guidance introduces a separate guide policy. "Reinforcement Learning by Guided Safe Exploration" trains a reward-free safe guide in a controlled source task, then uses that guide in the target task through KL-style policy regularization and composite behavior-policy sampling (Yang et al., 2023). The regularization weight is set to the safety Lagrange multiplier, […], so guidance is strongest when the student is unsafe and fades as safety improves. The paper reports that control-switch, which hands control to the guide after the first unsafe event in a trajectory, is better than linear decay, and proves a safety-transfer result for the guide under a cost-preserving abstraction […].
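The coupling between the multiplier and the regularization weight can be sketched as follows. This is an illustration under assumptions: a projected dual-ascent update with a fixed step size, a scalar cost limit, and a student loss that simply adds the weighted KL term. The names `update_multiplier`, `student_loss`, `cost_limit`, and `lr` are hypothetical.

```python
def update_multiplier(lam, episode_cost, cost_limit, lr=0.1):
    """Projected dual ascent: the multiplier grows while cost exceeds the limit."""
    return max(0.0, lam + lr * (episode_cost - cost_limit))


def student_loss(task_loss, kl_to_guide, lam):
    """Task loss plus a KL-style pull toward the guide, weighted by the multiplier."""
    return task_loss + lam * kl_to_guide


lam_unsafe = update_multiplier(0.0, episode_cost=5.0, cost_limit=1.0)  # rises
lam_safe = update_multiplier(0.05, episode_cost=0.0, cost_limit=1.0)   # clipped at zero

loss_unsafe = student_loss(task_loss=1.0, kl_to_guide=2.0, lam=lam_unsafe)
loss_safe = student_loss(task_loss=1.0, kl_to_guide=2.0, lam=lam_safe)
```

Tying the weight to the multiplier removes a separate decay schedule: the pull toward the guide vanishes exactly when the safety constraint stops binding.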

Taken together, these works show that RLG can operate through hard or soft constraints, symbolic temporal structure, and safe fallback behavior. They also show that guidance need not be a scalar reward bonus; it can be a runtime intervention or a formal state-space augmentation.

5. Guidance from self-generated experience and decentralized intent

Some guidance methods reuse the learner’s own past trajectories. "Learning Diverse Policies with Soft Self-Generated Guidance" introduces POSE, which stores self-generated trajectories in a memory

[…]

and constrains policy improvement so current trajectories remain near promising stored trajectories in behavior space, measured with Maximum Mean Discrepancy (Wang et al., 2024). The guidance objective is not imitation but trajectory-space regularization: […]. The paper’s emphasis is that even imperfect or suboptimal trajectories can be useful guidance, because they may enter promising regions of the state space in sparse or deceptive environments.
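Maximum Mean Discrepancy itself is straightforward to compute from samples. The sketch below uses the standard biased estimate of squared MMD with a Gaussian kernel; representing each trajectory as a list of visited-state vectors, the kernel choice, and the bandwidth are assumptions for illustration rather than details taken from the paper.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between two sets of vectors."""
    def mean_k(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))
    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)


# States visited by the current trajectory and by two stored trajectories.
current = [(0.0, 0.0), (1.0, 0.0)]
near = [(0.0, 0.1), (1.0, 0.1)]
far = [(5.0, 5.0), (6.0, 5.0)]

d_near = mmd_squared(current, near)
d_far = mmd_squared(current, far)
```

A regularizer built on this distance penalizes the policy according to how far its state visitation drifts from the stored trajectories, without requiring it to reproduce their actions.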

In decentralized multi-robot coverage, guidance has been used to encode likely cooperative intent. "Decentralized Coverage Path Planning with Reinforcement Learning and Dual Guidance" introduces DODGE, which combines artificial potential field guidance

[…]

for coverable nodes and heuristic guidance

[…]

for candidate actions (Liu et al., 2022). APF guidance is inserted as a node feature, while heuristic guidance adjusts decoder attention scores before masking and softmax. On […] maps, the reported overlap rates are […], […], and […] for 8, 14, and 20 agents, compared with […], […], and […] for AWSTC.

These methods make a broader point: guidance can be extracted from prior trajectories or from structured approximations of others’ future claims. In both cases, RLG supplies an intermediate signal about where useful behavior is likely to lie, while leaving the final policy adaptive.

6. Applications, limitations, and broader extensions

Robotic manipulation has turned guidance into a real-time supervisory layer. "Accelerating Robotic Reinforcement Learning with Agent Guidance" proposes Agent-guided Policy Search, in which a multimodal agent implemented with Qwen3-VL-235B-A22B-Instruct provides either action guidance or exploration pruning, triggered asynchronously by FLOAT, an optimal-transport failure detector over DINOv2 embeddings (Chen et al., 12 Feb 2026). In USB insertion, AGPS reaches […] success by step 400 and […] by step 600, while in Chinese knot hanging it reaches […] success by step 3000 and […] by step 4000. The paper’s central interpretation is that the agent acts as a semantic world model that supplies intrinsic value priors over task-relevant regions.

Aerospace guidance has used RL guidance in a more literal control-theoretic sense. "Adaptive Guidance with Reinforcement Meta-Learning" trains recurrent PPO policies for Mars landing, asteroid landing, engine-failure accommodation, and radar-only guidance/navigation; the recurrent hidden state serves as an implicit online identifier of latent dynamics, and recurrent policies outperform both non-recurrent policies and DR/DV in the reported tasks (Gaudet et al., 2019). "Reinforcement Learning for Angle-Only Intercept Guidance of Maneuvering Targets" learns a policy that maps stabilized seeker angles and their changes,

[…]

directly to binary divert-thruster commands, and reports better hit statistics than augmented ZEM despite not requiring range estimation (Gaudet et al., 2019). "Reinforcement Learning for Gliding Projectile Guidance and Control" uses PPO to map a six-component observation […] to two continuous control commands and reports mean miss distances of […] m without wind and […] m with an east wind of […], compared with […] m and […] m for a PID baseline (Cahn et al., 30 Nov 2025).

The term has also been extended beyond classical sequential control. "Inference-Time Alignment Control for Diffusion Models with Reinforcement Learning Guidance" reinterprets an RL-fine-tuned diffusion or flow-matching model as an implicitly reward-conditioned model and combines the base and RL-aligned outputs at inference time: […]. The paper argues that the guidance scale […] is mathematically equivalent to changing the KL coefficient from […] to […], thereby turning a fixed post-training alignment decision into an inference-time control knob (Jin et al., 28 Aug 2025). This is not classical RL policy search, but it shows that “RLG” has become a broader alignment term for RL-induced control over generative distributions.
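One way to see why a guidance scale can behave like a KL coefficient is the standard closed form for KL-regularized reward maximization. The derivation below is a sketch under assumptions, not the paper's argument: the symbols $p_{\text{ref}}$ (base model), $p_{\text{RL}}$ (aligned model), $r$ (reward), $\beta_{\text{KL}}$ (KL coefficient), and $w$ (guidance scale) are chosen for illustration, the combination rule is assumed to be linear in the score, and the identity is stated at the level of the data distribution rather than at each noise level.

```latex
\begin{aligned}
p_{\text{RL}}(x) &\propto p_{\text{ref}}(x)\,\exp\!\bigl(r(x)/\beta_{\text{KL}}\bigr)
  && \text{maximizer of } \mathbb{E}_{p}[r] - \beta_{\text{KL}}\,\mathrm{KL}(p \,\|\, p_{\text{ref}}) \\[4pt]
\nabla_x \log p_{\text{RL}} - \nabla_x \log p_{\text{ref}}
  &= \tfrac{1}{\beta_{\text{KL}}}\,\nabla_x r \\[4pt]
\nabla_x \log p_{\text{ref}} + w\bigl(\nabla_x \log p_{\text{RL}} - \nabla_x \log p_{\text{ref}}\bigr)
  &= \nabla_x \log p_{\text{ref}} + \tfrac{w}{\beta_{\text{KL}}}\,\nabla_x r \\
  &= \nabla_x \log\Bigl[p_{\text{ref}}\,\exp\!\bigl(r/(\beta_{\text{KL}}/w)\bigr)\Bigr]
\end{aligned}
```

Under these assumptions, scaling the difference between the aligned and base scores by $w$ yields the score of the distribution that KL-regularized training with coefficient $\beta_{\text{KL}}/w$ would target, so $w > 1$ corresponds to weaker regularization and $w < 1$ to stronger regularization.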

Across the literature, several limitations recur. Guidance often optimizes a modified objective rather than the original task objective; its utility depends on the quality, calibration, or relevance of the guidance source; and it may fail when terminal rewards are absent, when teachers or advisors are misleading, or when structured priors are badly grounded (Gangwani et al., 2020, Shoaeinaeini et al., 2024, Chen et al., 12 Feb 2026). A plausible implication is that RLG is most reliable when the auxiliary signal is informative but not over-trusted, and when the method contains an explicit mechanism for attenuating or overriding that signal as learning progresses.