Stage-Aligned Reward (SAR)
- Stage-Aligned Reward (SAR) is a reinforcement learning paradigm that decomposes sequential tasks into discrete or interpolated stages with customized reward and cost functions.
- It enables precise control in complex robotic and long-horizon tasks by aligning rewards with natural task phases to improve data efficiency and policy performance.
- SAR supports varied implementations such as strict gating and soft interpolation, offering robustness in acrobatic locomotion, manipulation, and other multi-phase applications.
Stage-Aligned Reward (SAR) is a class of reward-shaping paradigms in reinforcement learning (RL) that decomposes sequential decision-making tasks into discrete or smoothly interpolated stages, allowing for the definition of tailored reward and cost functions for each stage. SAR addresses the challenge of crafting monolithic reward signals for complex, long-horizon, or multi-phase robotic tasks by aligning reward structure to the natural semantic or dynamical decomposition of the target task. This approach has been instantiated across various domains, including acrobatic robotics, demonstration-augmented visual RL, trajectory optimization, and long-horizon manipulation, as evidenced by the literature (Kim et al., 24 Sep 2024, Escoriza et al., 3 Mar 2025, Peng et al., 2020, Chen et al., 29 Sep 2025).
1. Formal Definition and Variants of Stage-Aligned Reward
SAR partitions an episode into disjoint stages $k = 1, \dots, K$, with each stage corresponding to a subset of the state-time domain, identified by a task-specific stage scheduler (or stage-indicator function) $\sigma(s_t)$ or $\sigma(s_t, t)$. The scheduler inspects the agent's state to return the active stage index:

$$k_t = \sigma(s_t, t) \in \{1, \dots, K\}.$$
Reward and cost functions $r_k$ and $c_k$ (or auxiliary shaping terms) are only active when the agent is in stage $k$. In the strict gating case, all other stage-specific terms are zeroed. In soft or interpolated SAR, reward contributions from different stages are blended using continuous coefficients that vary as a function of task progress (Peng et al., 2020).
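A minimal sketch of this gating structure, assuming hypothetical state features and a three-stage back-flip-style task (the feature names, thresholds, and stage boundaries below are illustrative, not taken from the cited works):

```python
from dataclasses import dataclass
from typing import Callable, Dict

# A per-stage reward term r_k: (state, action) -> scalar.
StageReward = Callable[[dict, dict], float]

@dataclass
class StageGatedReward:
    scheduler: Callable[[dict], int]       # sigma: state -> active stage index
    stage_rewards: Dict[int, StageReward]  # r_k for each stage k

    def __call__(self, state: dict, action: dict) -> float:
        k = self.scheduler(state)          # identify the active stage
        # Strict gating: only the active stage's reward term contributes;
        # all other stage-specific terms are implicitly zeroed.
        return self.stage_rewards[k](state, action)

# Hypothetical deterministic scheduler: transitions are triggered by
# state features such as ground contact and body pitch.
def backflip_scheduler(state: dict) -> int:
    if state["on_ground"] and abs(state["pitch"]) < 0.2:
        return 0   # crouch / take-off preparation
    if not state["on_ground"]:
        return 1   # airborne rotation
    return 2       # landing / recovery
```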
Multiple implementations exist:
- Strict per-stage activation, as in acrobatic robot control (Kim et al., 24 Sep 2024)
- Smooth interpolation between guidance signals, as in robotic reaching (Peng et al., 2020)
- Continuous, data-driven within-stage progress bonuses, as in demonstration-augmented RL (Escoriza et al., 3 Mar 2025) and vision-based manipulation (Chen et al., 29 Sep 2025)
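The soft-interpolation variant (the second and third items above) can be sketched as a progress-dependent blending of per-stage reward values; the Gaussian-softmax blending schedule below is an assumption chosen for illustration, not the weighting used in the cited papers:

```python
import numpy as np

def soft_stage_weights(progress: float, num_stages: int, sharpness: float = 10.0) -> np.ndarray:
    """Continuous blending coefficients over stages as a function of
    normalized task progress in [0, 1]; stages are centered on equally
    spaced progress values and the weights sum to one."""
    centers = (np.arange(num_stages) + 0.5) / num_stages
    logits = -sharpness * (progress - centers) ** 2
    w = np.exp(logits - logits.max())
    return w / w.sum()

def blended_reward(progress: float, stage_reward_values: np.ndarray) -> float:
    """Blend per-stage reward values instead of hard-switching between them."""
    w = soft_stage_weights(progress, len(stage_reward_values))
    return float(np.dot(w, stage_reward_values))
```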
2. Design and Learning of Per-Stage Reward and Cost Functions
For each stage $k$, SAR specifies a vector of reward terms $r_k$ and, optionally, constraint cost terms $c_k$. These functions are typically objective-specific:
- In acrobatic tasks, $r_k$ may represent height, speed, or rotation rewards, while $c_k$ handles safety costs such as undesired contacts (Kim et al., 24 Sep 2024).
- In model-based RL, a learned discriminator provides a dense reward adjustment within each stage, based on visual or proprioceptive inputs; this is calibrated to preserve reward monotonicity across stage boundaries (Escoriza et al., 3 Mar 2025).
- In manipulation, semantically meaningful progress within each stage is estimated by a neural regressor trained on natural-language annotated demonstration data, ensuring invariance to duration and style (Chen et al., 29 Sep 2025).
- For robotic reaching, separate posture and stride rewards are developed for “fast approach” vs. “fine adjustment” phases, with incentive mechanisms selecting or blending the contributions (Peng et al., 2020).
The per-stage reward functions are either user-defined (expert-driven), learned from demonstration data, or a hybrid. In all cases, clear separation of objectives ensures each stage provides a focused learning signal.
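As a concrete illustration, a hypothetical per-stage specification for a jump-and-land skill might pair one focused reward term with a shared safety cost per stage (all feature names and coefficients here are invented for exposition):

```python
# Hypothetical per-stage reward terms r_k and cost terms c_k.

def takeoff_reward(state):    # r_0: reward vertical speed at take-off
    return state["base_vel_z"]

def flight_reward(state):     # r_1: reward body rotation rate while airborne
    return state["pitch_rate"]

def landing_reward(state):    # r_2: reward an upright, low-velocity landing
    return -abs(state["pitch"]) - 0.1 * abs(state["base_vel_z"])

def contact_cost(state):      # c_k: penalize undesired body contacts
    return float(state["illegal_contact"])

STAGE_TERMS = {
    0: {"rewards": [takeoff_reward], "costs": [contact_cost]},
    1: {"rewards": [flight_reward],  "costs": [contact_cost]},
    2: {"rewards": [landing_reward], "costs": [contact_cost]},
}
```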
3. Algorithmic Integration into Reinforcement Learning
SAR can be embedded in both model-free and model-based RL in several ways:
- Constrained Multi-Objective RL (CMORL): The problem is formulated as maximizing the aggregate of stage-specific cumulative rewards, subject to cost constraints:

$$\max_{\pi} \; \sum_{k=1}^{K} J_{r_k}(\pi) \quad \text{s.t.} \quad J_{c_k}(\pi) \le d_k, \quad k = 1, \dots, K,$$

where $J_{r_k}$ and $J_{c_k}$ denote the cumulative stage-$k$ reward and cost, and $d_k$ are constraint thresholds.
The CoMOPPO algorithm extends PPO to this setting by computing per-stage advantages, normalizing across objectives, and aggregating them for the policy update (Kim et al., 24 Sep 2024); a simplified sketch of this aggregation step appears after this list.
- Demonstration-Augmented Model-Based RL: SAR is used to densify sparse rewards in long-horizon manipulation by learning stage classifiers and within-stage reward regressors from demonstrations. Dense, stage-aligned rewards are injected into all model learning losses, including world dynamics, policy, and reward heads (Escoriza et al., 3 Mar 2025).
- Actor-Critic Methods: In deep actor-critic implementations, SAR assigns per-stage reward weights (a "hard" switch or a "soft" interpolation), branches the reward signal accordingly, and updates the policy within the standard learning loop (Peng et al., 2020).
- Reward Modeling for Imitation: In SARM, a video-based model predicts both stage and within-stage progress, producing a monotonic pseudo-reward used to filter and weight demonstration data for robust policy cloning (Chen et al., 29 Sep 2025).
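A minimal sketch of the per-objective advantage normalization and aggregation step mentioned for CoMOPPO above; this is a simplified reading of the idea (per-stage reward and cost advantages, normalized per objective, then combined into one surrogate advantage), and the Lagrange-multiplier treatment of the constraints is an assumption, not the published implementation:

```python
import numpy as np

def aggregate_stage_advantages(reward_adv: np.ndarray,
                               cost_adv: np.ndarray,
                               lagrange_multipliers: np.ndarray,
                               eps: float = 1e-8) -> np.ndarray:
    """reward_adv, cost_adv: (K, T) per-stage advantages over a batch of T steps.
    lagrange_multipliers: (K,) weights enforcing the per-stage cost constraints.
    Returns a single (T,) advantage for a PPO-style surrogate loss."""
    # Normalize each objective separately so no single stage dominates the update.
    norm_r = (reward_adv - reward_adv.mean(axis=1, keepdims=True)) / \
             (reward_adv.std(axis=1, keepdims=True) + eps)
    norm_c = (cost_adv - cost_adv.mean(axis=1, keepdims=True)) / \
             (cost_adv.std(axis=1, keepdims=True) + eps)
    # Aggregate: sum reward objectives, subtract constraint-weighted costs.
    return norm_r.sum(axis=0) - (lagrange_multipliers[:, None] * norm_c).sum(axis=0)
```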
4. Stage Scheduling, Alignment, and Progress Estimation
Stage alignment is enforced in SAR via deterministic or learned scheduling:
- Deterministic Schedulers: State-dependent rules using features such as target distance, body posture, or contact events define stage transitions. Only the current stage’s reward and cost are evaluated (Kim et al., 24 Sep 2024, Peng et al., 2020).
- Data-Driven Alignment: Stage labels are automatically derived from annotated demonstrations, or learned discriminators continuously predict the active stage and estimate progress within it. This approach is robust to variable-length trajectories and demonstration heterogeneity (Chen et al., 29 Sep 2025, Escoriza et al., 3 Mar 2025).
- Soft Transitioning: When transitions are not cleanly separable, soft weighting interpolates rewards from overlapping stages, mitigating instabilities at stage boundaries (Peng et al., 2020).
- Reset and Recovery: Upon deviation or failure (e.g., unsafe body contact), policies may reset or remain in the current stage until constraints are satisfied, discouraging undesired behavior via per-stage costs (Kim et al., 24 Sep 2024).
Stage alignment thus provides both structural priors and progress feedback, supporting both exploration and credit assignment in RL.
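One simple way to combine a predicted stage index with a within-stage progress estimate into a monotonic, stage-aligned pseudo-reward is to normalize global progress over the number of stages; the linear scaling below is a sketch consistent with the progress-based formulations above, not the exact mapping used in the cited works:

```python
def stage_aligned_progress(stage_index: int, within_stage_progress: float,
                           num_stages: int) -> float:
    """Map (stage, progress-in-stage) to a global progress value in [0, 1]
    that increases monotonically as the task advances through its stages."""
    p = min(max(within_stage_progress, 0.0), 1.0)   # clamp the predictor output
    return (stage_index + p) / num_stages

def progress_reward(prev_progress: float, curr_progress: float) -> float:
    """A dense reward can then be the increase in global progress per step."""
    return curr_progress - prev_progress
```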
5. Empirical Evaluation and Benchmarking
SAR methods demonstrate substantial gains across complex robotic tasks:
- Acrobatic Locomotion: For 5-stage back-flip, side-flip, and two-hand walk on quadruped and humanoid platforms, SAR (via CoMOPPO) achieves over 90% success in sim-to-real transfer, with >30% higher flip success and zero body collisions versus monolithic reward or P3O baselines. Stage completion time variance is reduced by 40%, and policies avoid getting stuck in intermediate stages (Kim et al., 24 Sep 2024).
- Demonstration-Augmented Manipulation: Dense SAR leads to 40–70% improved data efficiency and robust policy learning from sparse demonstrations for complex humanoid and manipulation tasks (Escoriza et al., 3 Mar 2025).
- Trajectory Planning: Soft SAR accelerates convergence by up to 46.9%, increases mean return by 4.4–15.5%, and reduces variance by 21.9–63.2%. Success rates for reach tasks approach 99.6% (Peng et al., 2020).
- Long-Horizon Deformable Manipulation: In T-shirt folding, SARM reward modeling achieves a mean squared error of 0.009 on held-out demonstrations and strong rollout classification correlation, substantially outperforming prior methods. Reward-Aligned Behavior Cloning with SARM delivers 83% success when folding from a flattened state and 67% from a crumpled state, compared to ≤8% for vanilla BC (Chen et al., 29 Sep 2025).
These results attest to SAR’s superiority in sample efficiency, stability, progression fidelity, and safety across diverse RL pipelines.
| Domain | Key SAR Approach | Reported Gains |
|---|---|---|
| Acrobatic RL (Kim et al., 24 Sep 2024) | CoMOPPO per-stage | +30% flip success, 0 collision, 40% lower stage-time var. |
| Manipulation (Escoriza et al., 3 Mar 2025) | Demo-aug. MB RL SAR | 40–70% higher data efficiency |
| Robot Reaching (Peng et al., 2020) | Soft/Hard SAR | 46.9% faster convergence, 99.6% success |
| Folding (Chen et al., 29 Sep 2025) | SARM + RA-BC | 83% fold (flat), 67% (crumpled), on rollouts |
6. Generalization and Task Applicability
SAR generalizes to any sequential skill where the objective can be decomposed into substages, whether they are defined by task semantics, motion features, or domain priors. Examples include:
- Pick-and-place: “move to grasp,” “grasp stabilization,” “transport,” “release” (Peng et al., 2020, Kim et al., 24 Sep 2024)
- Legged and aerial acrobatics: distinct flight, takeoff, and landing phases (Kim et al., 24 Sep 2024)
- Deformable object manipulation: semantically segmented stages (unfolding, aligning, folding) (Chen et al., 29 Sep 2025)
- General long-horizon RL where exploration and reward delays inhibit learning
There is no fundamental limit on the number of stages; tasks may involve few or many stages, as long as transitions can be defined or learned (Kim et al., 24 Sep 2024). Automated stage discovery, for example by clustering state-action distributions or using auxiliary “phase” networks, is an open research direction.
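Until such automated discovery matures, decompositions are typically hand-specified. For instance, the pick-and-place decomposition above might be written as a simple stage machine with hand-crafted transition predicates (a hypothetical sketch; all feature names and thresholds are illustrative):

```python
# Hypothetical pick-and-place stage machine: each stage advances when its
# transition predicate on the observed state becomes true.
PICK_AND_PLACE_STAGES = [
    ("move_to_grasp",       lambda s: s["gripper_to_object_dist"] < 0.02),
    ("grasp_stabilization", lambda s: s["object_grasped"] and s["grasp_force_ok"]),
    ("transport",           lambda s: s["object_to_goal_dist"] < 0.02),
    ("release",             lambda s: not s["object_grasped"]),
]

def advance_stage(state: dict, stage_idx: int) -> int:
    """Move to the next stage once the active stage's predicate is satisfied."""
    _name, done = PICK_AND_PLACE_STAGES[stage_idx]
    if done(state) and stage_idx < len(PICK_AND_PLACE_STAGES) - 1:
        return stage_idx + 1
    return stage_idx
```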
7. Future Directions, Limitations, and Open Problems
Current SAR implementations rely on task-specific scheduler design, demonstration quality for progress prediction, and accurate cost/reward calibration. Limitations include possible brittleness to scheduler mis-specification, handling of stochastic transitions, and the need for semantically aligned annotations in data-rich approaches. Open problems include:
- Automated, robust, or unsupervised stage discovery
- Integration of SAR with language-based goal specification
- Extension to multi-agent or hierarchical RL with cross-stage dependencies
A plausible implication is that as more complex, long-horizon robotic or embodied intelligence tasks emerge, SAR-style decompositions—combined with structured reward modeling—will become a common paradigm for both policy optimization and scalable imitation learning (Chen et al., 29 Sep 2025, Kim et al., 24 Sep 2024).