Trajectory-level Reward Shaping in RL
- Trajectory-level Reward Shaping (TLRS) is a reinforcement learning technique that adjusts rewards based on entire agent trajectories rather than individual steps.
- It leverages expert demonstrations, language guidance, human feedback, and subgoal aggregation to enhance credit assignment and sample efficiency.
- TLRS is applied in robotics, high-dimensional control, and language models, offering enhanced performance in environments with sparse, delayed, or complex reward structures.
Trajectory-level reward shaping (TLRS) refers to any reinforcement learning (RL) technique that alters the reward signal at the granularity of entire agent trajectories or temporally extended segments, instead of, or in addition to, conventional state-action reward feedback. TLRS methods evaluate, generate, or modify reward signals based on properties of the whole trajectory—such as its similarity to expert demonstrations, alignment with linguistic instructions, aggregated statistics, or compliance with human feedback—to produce denser, more informative, and temporally aware supervision for learning effective policies in environments with sparse, delayed, misspecified, or complex reward structures.
1. Fundamental Concepts and Classifications
TLRS encompasses a spectrum of methodologies unified by the principle of shaping, generating, or reconstructing reward signals based on trajectory-level (rather than per-step) information. The motivations include sample efficiency, improved credit assignment, human-in-the-loop correction, leveraging non-Markovian demonstrations, and facilitating interpretability or alignment.
Approaches can be broadly classified as follows:
| TLRS Class | Description | Representative References |
|---|---|---|
| Model-based Internal Prediction | Internal predictive models shape reward by comparing actual and expert-predicted state transitions | (Kimura et al., 2018) |
| Guidance Rewards via Trajectory Smoothing | Trajectory-distribution smoothing redistributes sparse or delayed rewards as sequence-level guidance | (Gangwani et al., 2020) |
| Reward Decomposition via Expert/Language | Intermediate or auxiliary trajectory rewards generated via expert-trajectory subsequence matching or language alignment | (Zhao et al., 27 Jul 2025; Cao et al., 2023; Carta et al., 2022) |
| Human Feedback and Preference | Humans mark, rank, or critique whole trajectories to induce reward modifications | (Gajcin et al., 2023; Muslimani et al., 8 Mar 2025) |
| Potential- or Subgoal-based Aggregation | Subgoal series, dynamic aggregation, or history-dependent reward transformation based on trajectory-segment accomplishment | (Okudo et al., 2021) |
| Semi-supervised and SSL-based Approaches | Leverage unlabeled (zero-reward) transitions and trajectory consistency for learned shaping signals | (Li et al., 31 Jan 2025; Zhou et al., 10 Jun 2025) |
| Task-specific Control Specification | Reward correction terms constructed so that global dynamic or control-theoretic constraints are respected | (Lellis et al., 2023; Lyu et al., 2019) |
| Response-level and LLM-specific | Trajectory (response-level) rewards assigned to entire language outputs, supporting unbiased token-level policy gradients | (He et al., 3 Jun 2025) |
2. Internal Model and Trajectory Matching Approaches
Internal model-based TLRS derives reward signals by measuring the temporal consistency between the agent's actual trajectories and an internal predictive model learned from expert demonstration trajectories, with or without actions. For example, given a set of expert state-only trajectories $\{\tau^E = (s_0, s_1, \dots, s_T)\}$, the agent learns a predictive model (often a recurrent network) trained to maximize the likelihood of the next expert state given the history. During RL, the reward is computed from the prediction error,

$$\hat{r}_t = f\big(\lVert \hat{s}_{t+1} - s_{t+1} \rVert\big),$$

where $\hat{s}_{t+1}$ is the internal model's prediction and $f$ is a shaping function (e.g., linear or Gaussian). This construction provides a dense, temporally informed reward aligned with expert behavior, facilitating faster convergence even when hand-crafted rewards are sparse or unavailable (Kimura et al., 2018).
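As a minimal illustration of this idea, the sketch below shapes a per-step reward from the prediction error of a forward model; the stand-in model `predict_next`, the Gaussian shaping function, and the scale `sigma` are illustrative assumptions, not the exact construction of Kimura et al. (2018).

```python
import numpy as np

def gaussian_shaping(error, sigma=0.5):
    """Map a prediction error to a bounded shaping reward in (0, 1]."""
    return float(np.exp(-(error ** 2) / (2.0 * sigma ** 2)))

def shaped_reward(env_reward, s_next, s_next_pred, beta=1.0):
    """Add a dense term rewarding agreement with the internal model's
    prediction of the next state (trained on expert state trajectories)."""
    error = np.linalg.norm(s_next - s_next_pred)
    return env_reward + beta * gaussian_shaping(error)

# Toy usage: the internal model expects the state to drift by +0.1 per step.
predict_next = lambda s: s + 0.1          # stand-in for a learned recurrent model
s, s_next = np.array([0.0, 0.0]), np.array([0.11, 0.09])
print(shaped_reward(env_reward=0.0, s_next=s_next, s_next_pred=predict_next(s)))
```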
Trajectory-level reward shaping via exact subsequence (partial-formula) matching, as in RL-based alpha-factor mining, defines a potential function

$$\Phi(s_t) = \frac{c_t}{N},$$

where $c_t$ is the number of expert-demonstration subsequences matching the agent's expression up to step $t$, and $N$ is the total number of such subsequences. Shaping rewards are the difference in these ratios across timesteps:

$$F(s_t, s_{t+1}) = \Phi(s_{t+1}) - \Phi(s_t).$$

This matches expert structure at the subsequence level and can achieve significant improvements in both efficiency and performance, with a reported 9.29% gain in predictive Rank IC over prior methods (Zhao et al., 27 Jul 2025).
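A minimal sketch of this potential-based scheme follows, counting exact prefix matches of the agent's partial expression against a set of expert subsequences; the token representation and matching rule are simplified assumptions rather than the full alpha-factor grammar of Zhao et al. (27 Jul 2025).

```python
def potential(agent_prefix, expert_subsequences):
    """Phi(s_t) = c_t / N: fraction of expert subsequences whose prefix
    exactly matches the agent's expression built so far."""
    n = len(expert_subsequences)
    c = sum(1 for sub in expert_subsequences
            if sub[:len(agent_prefix)] == agent_prefix)
    return c / n if n else 0.0

def shaping_reward(prefix_t, prefix_t1, expert_subsequences):
    """F = Phi(s_{t+1}) - Phi(s_t): difference of matching ratios."""
    return (potential(prefix_t1, expert_subsequences)
            - potential(prefix_t, expert_subsequences))

# Toy usage with token lists standing in for partial formulas.
experts = [["open", "close", "sub"], ["open", "volume", "div"], ["close", "ma5", "sub"]]
print(shaping_reward(["open"], ["open", "close"], experts))
```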
3. Guidance Rewards, Smoothing, and Aggregation Strategies
TLRS methods can convert sparse or delayed rewards into dense guidance signals using trajectory-space smoothing. By defining a surrogate RL objective that smooths over trajectory distributions (e.g., via a mixture distribution centered around a reference trajectory), the expected return of a trajectory $\tau$ is redistributed to its constituent state-action pairs as guidance rewards:

$$\tilde{r}(s_t, a_t) \propto R(\tau) = \sum_{t'} r(s_{t'}, a_{t'}), \qquad (s_t, a_t) \in \tau.$$

This guidance reward, interpreted as a uniform or maximum-entropy credit assignment over the trajectory, is integrated into standard RL algorithms (Q-learning, actor-critic, distributional methods) with minimal modification, resulting in sample-efficient learning under sparsity or long-horizon delay (Gangwani et al., 2020).
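The sketch below illustrates the redistribution idea: every transition of a finished episode receives the episode's (min-max normalized) return as a guidance reward before being stored in a replay buffer. The normalization over recently seen returns is an assumption in the spirit of, not a verbatim copy of, Gangwani et al. (2020).

```python
from collections import deque

class GuidanceRewardBuffer:
    """Relabel every transition of a finished episode with a guidance reward
    derived from the whole episode's return (uniform trajectory-level credit)."""

    def __init__(self, history=100):
        self.recent_returns = deque(maxlen=history)
        self.storage = []

    def add_episode(self, transitions):
        # transitions: list of (state, action, env_reward, next_state, done)
        episode_return = sum(t[2] for t in transitions)
        self.recent_returns.append(episode_return)
        lo, hi = min(self.recent_returns), max(self.recent_returns)
        guidance = (episode_return - lo) / (hi - lo) if hi > lo else 0.0
        for (s, a, _, s2, done) in transitions:
            # Every (s, a) on the trajectory shares the same guidance reward.
            self.storage.append((s, a, guidance, s2, done))

buffer = GuidanceRewardBuffer()
buffer.add_episode([(0, 1, 0.0, 1, False), (1, 0, 0.0, 2, True)])   # return 0.0
buffer.add_episode([(0, 1, 0.0, 1, False), (1, 1, 2.0, 2, True)])   # return 2.0
print([round(t[2], 2) for t in buffer.storage])                      # [0.0, 0.0, 1.0, 1.0]
```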
Dynamic trajectory-aggregation approaches partition agent experiences into abstract states using a pre-specified subgoal series, aggregating reward over multi-step segments. In the SMDP formulation, the shaping potential is defined over these abstract trajectory segments, either with pre-given subgoal indices or learned value functions, yielding potential-based update rules such as

$$F = \gamma^{k}\,\Phi(z_{t+k}) - \Phi(z_t),$$

where $z_t$ is the abstract (aggregated) state and $k$ is the number of primitive steps spanned by the segment. This allows reward propagation across subtask boundaries, benefiting high-dimensional and continuous domains (Okudo et al., 2021).
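A compact sketch of subgoal-based aggregation follows, assuming the abstract state is simply the count of achieved subgoals and the potential is proportional to that count; Okudo et al. (2021) additionally consider learned potentials.

```python
def abstract_state(achieved_subgoals):
    """Aggregate a trajectory segment into the number of subgoals achieved so far."""
    return len(achieved_subgoals)

def smdp_shaping(phi, z, z_next, k, gamma=0.99):
    """Potential-based shaping over abstract states spanning k primitive steps:
    F = gamma^k * Phi(z') - Phi(z)."""
    return (gamma ** k) * phi(z_next) - phi(z)

phi = lambda z: float(z)   # assumed potential: number of subgoals reached
z = abstract_state(["reach_door"])
z_next = abstract_state(["reach_door", "open_door"])
print(smdp_shaping(phi, z, z_next, k=7))
```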
4. Human Feedback, Alignment, and Preferences
TLRS serves as the foundation for reward alignment and human-in-the-loop RL. In iterative human-feedback pipelines, agents present trajectory summaries to users, who flag undesirable behaviors at the sequence or segment level and annotate them with explanations (feature-based, action-based, or rule-based). These annotations are augmented into datasets using randomized perturbations that preserve the key flagged features, enabling neural reward-shaping models to generalize human critique over the trajectory space (Gajcin et al., 2023). The resulting models add a shaping term to the environment reward, e.g.,

$$r'(s, a) = r_{\text{env}}(s, a) + r_{\text{shape}}(s, a),$$

where $r_{\text{shape}}$ is learned from the augmented annotations. This supports effective correction of misspecified reward functions and achieves performance on par with expert policies using minimal human input.
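The following sketch shows the additive structure of such a correction: a small shaping model (here hand-coded rather than learned) penalizes trajectory features that users flagged, on top of the environment reward. The feature names and penalty values are hypothetical.

```python
def shaping_from_critique(features, flagged_rules):
    """Sum penalties/bonuses for trajectory features that human critiques flagged.
    flagged_rules maps a feature name to a shaping value (learned in practice)."""
    return sum(flagged_rules.get(name, 0.0) for name, active in features.items() if active)

def corrected_reward(env_reward, features, flagged_rules):
    """r'(s, a) = r_env(s, a) + r_shape(s, a)."""
    return env_reward + shaping_from_critique(features, flagged_rules)

# Hypothetical example: users flagged driving on the shoulder as undesirable.
rules = {"on_shoulder": -1.0, "signals_before_turn": +0.2}
step_features = {"on_shoulder": True, "signals_before_turn": False}
print(corrected_reward(env_reward=0.5, features=step_features, flagged_rules=rules))
```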
Measurement of reward alignment with human intentions can be formalized by the Trajectory Alignment Coefficient, defined as the Kendall Tau-b correlation between the human's ranking of trajectories and the ranking induced by the current reward function:

$$\tau_b = \frac{n_c - n_d}{\sqrt{(n_c + n_d + t_H)\,(n_c + n_d + t_R)}},$$

where $n_c$ and $n_d$ count concordant and discordant trajectory pairs, and $t_H$, $t_R$ count ties in the human and reward-induced rankings, respectively. Notably, the coefficient is invariant to potential-based reward shaping and linear reward transformations. It enables practitioners to diagnose and select reward functions that encode human objectives, as evidenced by a 41% increase in success rate in user studies (Muslimani et al., 8 Mar 2025).
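Because the coefficient is a Kendall Tau-b statistic over trajectory rankings, it can be computed directly with standard tools, as sketched below; the example rankings are illustrative.

```python
from scipy.stats import kendalltau

# Human preference ranking over five trajectories (1 = most preferred).
human_rank = [1, 2, 3, 4, 5]
# Ranking induced by the candidate reward function's trajectory returns.
reward_rank = [2, 1, 3, 5, 4]

# kendalltau uses the tau-b variant by default, handling ties in either ranking.
tau_b, _ = kendalltau(human_rank, reward_rank)
print(f"Trajectory Alignment Coefficient ~ {tau_b:.2f}")
```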
5. Language-based and Auxiliary Objective TLRS
Recent work exploits natural language as a trajectory-level auxiliary objective for reward shaping. In language-guided RL, agents convert instructions into a set of questions by masking key content (question generation, QG) and then, during exploration, produce trajectories that are evaluated by a question-answering (QA) system. Intrinsic rewards proportional to QA confidence are assigned when the agent's trajectory unambiguously explains aspects of the instruction. The total auxiliary reward for a trajectory $\tau$ aggregates the QA confidence over the generated questions:

$$r_{\text{aux}}(\tau) = \sum_{q \in Q} p_{QA}\!\big(a_q \mid q, \tau\big),$$

where $a_q$ is the masked answer to question $q$. Density and policy invariance are preserved by a neutralization step at episode end (Carta et al., 2022), yielding improved exploration efficiency in sparse language-conditioned control.
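A schematic sketch of the QG/QA loop is shown below, where a stub QA model returns a confidence for each masked question given the trajectory so far; the stub and the confidence threshold are assumptions standing in for the trained QA system of Carta et al. (2022).

```python
def generate_questions(instruction, keywords):
    """QG: mask each keyword in the instruction to form a question."""
    return [(instruction.replace(word, "<mask>"), word) for word in keywords]

def qa_confidence(question, answer, trajectory_events):
    """Stub QA model: confident only if the masked word is witnessed in the trajectory."""
    return 1.0 if answer in trajectory_events else 0.0

def auxiliary_reward(instruction, keywords, trajectory_events, threshold=0.5):
    """Sum QA confidences over all generated questions that the trajectory answers."""
    total = 0.0
    for question, answer in generate_questions(instruction, keywords):
        conf = qa_confidence(question, answer, trajectory_events)
        if conf >= threshold:
            total += conf
    return total

events = ["pick up", "key", "door"]
print(auxiliary_reward("pick up the key then open the door", ["key", "door"], events))
```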
In video-language alignment frameworks, trajectory-language pairs are encoded with an image encoder and a Transformer-based language model (e.g., ResNet-18 and BERT, respectively), and the affinity between the trajectory and the instruction is predicted. The predicted alignment is added as a shaping component to the environment reward:

$$r'(s_t) = r_{\text{env}}(s_t) + \lambda\, F_{\phi}(\tau_{0:t}, \ell),$$

where $F_{\phi}$ is the learned trajectory-language affinity model and $\ell$ the instruction. Guidance at the trajectory level accelerates difficult RL tasks, as shown in Montezuma's Revenge, where intermediate alignment-based rewards substantially improved task completion rates (Cao et al., 2023).
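A minimal sketch of alignment-based shaping with stub encoders: cosine similarity between pooled trajectory and instruction embeddings stands in for the learned ResNet/BERT affinity model, and the weight `lam` is an assumed hyperparameter.

```python
import numpy as np

def affinity(trajectory_embedding, instruction_embedding):
    """Stub for the learned trajectory-language affinity (cosine similarity)."""
    a, b = trajectory_embedding, instruction_embedding
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def shaped_reward(env_reward, frame_embeddings, instruction_embedding, lam=0.1):
    """r' = r_env + lam * affinity(trajectory, instruction)."""
    trajectory_embedding = np.mean(frame_embeddings, axis=0)   # mean-pool frame features
    return env_reward + lam * affinity(trajectory_embedding, instruction_embedding)

frames = np.random.rand(8, 16)        # 8 frames, 16-dim stub visual features
instruction = np.random.rand(16)      # stub language embedding
print(shaped_reward(env_reward=0.0, frame_embeddings=frames, instruction_embedding=instruction))
```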
6. Semi-supervised and Consistency-driven TLRS
In environments where the majority of transitions carry zero reward, semi-supervised learning (SSL) and consistency regularization offer a scalable TLRS solution. SSRS learns representations over the trajectory space using both labeled (rewarded) and unlabeled (zero-reward) data. A novel double-entropy data augmentation multiplies submatrices of a state matrix by their Shannon entropy,

$$\tilde{S}_i = H(S_i)\cdot S_i,$$

where $H(\cdot)$ denotes the Shannon entropy of the submatrix's normalized entries, yielding invariance to input perturbations and enhanced clustering in latent space. Consistency losses enforce that reward predictions remain stable across augmentations, while monotonicity constraints further regularize the learned reward predictions (Li et al., 31 Jan 2025). Empirical results show up to 4× higher best scores in sparse-reward Atari games, with a 15.8% gain attributed to the augmentation strategy.
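The sketch below shows the augmentation step, scaling each submatrix of a state matrix by the Shannon entropy of its normalized entries; the block size and normalization are assumptions about implementation details not fixed in the summary above.

```python
import numpy as np

def shannon_entropy(block, eps=1e-12):
    """Entropy of a submatrix's entries after normalizing them to a distribution."""
    p = np.abs(block).ravel()
    p = p / (p.sum() + eps)
    return float(-(p * np.log(p + eps)).sum())

def double_entropy_augment(state, block=2):
    """Multiply each (block x block) submatrix of the state matrix by its entropy."""
    augmented = state.astype(float).copy()
    h, w = state.shape
    for i in range(0, h, block):
        for j in range(0, w, block):
            sub = augmented[i:i + block, j:j + block]
            augmented[i:i + block, j:j + block] = shannon_entropy(sub) * sub
    return augmented

state = np.arange(16, dtype=float).reshape(4, 4)
print(double_entropy_augment(state))
```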
Intra-trajectory consistency approaches to reward modeling for LLMs further refine sequence-level reward propagation by enforcing that adjacent partial generations connected by a high-probability next token receive consistent rewards. A regularization loss of the form

$$\mathcal{L}_{\text{consist}} = \sum_{t} \pi(y_{t+1}\mid x, y_{\le t})\,\big(r_{\phi}(x, y_{\le t+1}) - r_{\phi}(x, y_{\le t})\big)^{2}$$

propagates response-level supervision into aligned fine-grained token rewards, boosting RLHF and best-of-N response quality (Zhou et al., 10 Jun 2025).
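A sketch of the consistency regularizer described above, penalizing reward differences between adjacent prefixes in proportion to the policy's next-token probability; the exact weighting and loss form in Zhou et al. (10 Jun 2025) may differ.

```python
def consistency_loss(prefix_rewards, next_token_probs):
    """Weighted squared difference between rewards of adjacent partial generations.
    prefix_rewards[t]   : reward-model score of the prefix y_{<=t}
    next_token_probs[t] : policy probability of the token extending y_{<=t} to y_{<=t+1}
    """
    assert len(prefix_rewards) == len(next_token_probs) + 1
    loss = 0.0
    for t, p in enumerate(next_token_probs):
        delta = prefix_rewards[t + 1] - prefix_rewards[t]
        loss += p * delta ** 2          # high-probability continuations => similar rewards
    return loss / len(next_token_probs)

# Toy usage: four prefix scores and three next-token probabilities.
print(consistency_loss([0.1, 0.15, 0.6, 0.62], [0.9, 0.2, 0.95]))
```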
7. Applications, Impact, and Future Directions
TLRS has led to substantial advances in a diverse set of RL domains:
- Robotics and Control: Time-to-reach models (Lyu et al., 2019) and reward correction terms (Lellis et al., 2023) inject physical priors and temporal constraints, improving exploration and enforcing global control requirements for tasks like pendulum balancing and lunar landing.
- High-dimensional and Sparse Settings: Object-goal navigation (Madhavan et al., 2022) and financial formula mining (Zhao et al., 27 Jul 2025) demonstrate that TLRS enables robust policy learning in environments where standard reward formulation is ill-posed or computationally expensive, or where domain knowledge is fragmented.
- Real-time Games and Multi-task RL: Carefully engineered TLRS functions (boundary, tag, energy shaping, etc.) facilitate fast convergence, task specialization, and generalization in competitive games (Kliem et al., 2023).
- LLMs and Sequence Modeling: Response-level reward shaping formalizes credit assignment across entire model outputs, mathematically ensuring that scalar end-of-trajectory feedback supports unbiased token-level policy gradients (see Trajectory Policy Gradient Theorem) (He et al., 3 Jun 2025). Algorithms such as TRePO exploit this equivalence for efficient and practical LLM RL alignment.
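To make the response-level setting concrete, the sketch below broadcasts a single scalar trajectory reward to every token of a sampled response as its advantage in a REINFORCE-style update; this is a generic illustration of scalar end-of-trajectory feedback driving token-level gradients, not the specific TRePO algorithm of He et al. (3 Jun 2025).

```python
def token_level_policy_gradient_loss(token_logprobs, response_reward, baseline=0.0):
    """REINFORCE-style loss where the whole response shares one scalar reward:
    every token's advantage is (R - b), so the trajectory-level signal drives
    token-level gradients without per-token reward engineering."""
    advantage = response_reward - baseline
    # Negative sign: minimizing this loss raises log-probability when advantage > 0.
    return -sum(lp * advantage for lp in token_logprobs)

# Toy usage: log-probabilities of a 4-token response scored 0.8 by the reward model.
print(token_level_policy_gradient_loss([-0.2, -1.1, -0.7, -0.3],
                                        response_reward=0.8, baseline=0.5))
```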
Structural properties, such as alignment invariance under potential-based shaping, support TLRS’s compatibility with theoretical guarantees for policy invariance and reward evaluation stability. TLRS methods increasingly employ modular reward architectures, allowing hierarchical, adaptive, or cross-modal reward shaping with human-in-the-loop or unsupervised data.
Future research directions include further integration with LLMs for semantic alignment (Zhao et al., 27 Jul 2025), richer augmentation and meta-learning for reward functions, and deeper exploration of gradient-based adaptive TLRS methods for robust, interpretable, and scalable RL in complex domains.