Position Reinforcement in RL

Updated 4 July 2026

Position reinforcement is the use of RL to modulate positional feedback and control in tasks ranging from robot pose adjustment to screen-slot allocation.
It employs methods such as residual feedback learning, state-dependent stiffness, and impedance adaptation to optimize position-sensitive decision processes.
Design patterns include bounded action spaces and sharply defined reward functions that directly tie positional attainment to long-horizon task success.

Position reinforcement denotes a family of reinforcement-learning formulations in which the learned policy regulates positional behavior, position-dependent decisions, or the effective position-related objectives seen by a controller. In the cited literature, the term does not refer to a single standardized algorithm. Instead, it spans several technical uses: reshaping position or pose feedback in contact-rich manipulation, learning state-dependent stiffness around position controllers, selecting where to look in active visual exploration, allocating feed slots in industrial recommender systems, maintaining central trajectories without explicit localization, regulating inventory position in market making, and optimizing finishing position in race strategy (Ranjbar et al., 2021, Kim et al., 2021, Pardyl et al., 2024, Shi et al., 2023, Martini et al., 2022, Bakshaev, 2020, Thomas et al., 7 Jan 2025). This suggests that position reinforcement is best understood as an RL-centered design pattern in which “position” may mean Cartesian pose, spatial glimpse placement, screen-slot allocation, vehicle placement, inventory, or rank position, depending on the problem domain.

1. Conceptual scope and semantic variants

Across the literature, position reinforcement is consistently tied to the use of RL for modifying either a positional target, a positional behavior, or a position-sensitive decision process. In contact-rich robotics, the emphasis is literal physical position: RL modifies pose feedback, stiffness, or force-position trade-offs so that a robot reaches and maintains useful contact configurations despite uncertainty (Ranjbar et al., 2021, Chang et al., 2022). In active visual exploration, it means learning glimpse position and scale over an image with continuous actions (Pardyl et al., 2024). In multi-channel feeds, it denotes allocating channel types to visible positions on a screen (Shi et al., 2023). In autonomous navigation, it denotes learning central, goal-directed motion without explicit localization inputs (Martini et al., 2022). In market making and Formula One strategy, “position” means inventory and finishing order, respectively (Bakshaev, 2020, Thomas et al., 7 Jan 2025).

Domain	Meaning of position reinforcement	Representative paper
Contact-rich manipulation	Modify position/pose feedback, residuals, or impedance	(Ranjbar et al., 2021)
Dexterous manipulation	Learn stiffness from position-control experiences	(Kim et al., 2021)
Active visual exploration	Select glimpse position and scale	(Pardyl et al., 2024)
Feed allocation	Assign content channels to screen positions	(Shi et al., 2023)
Vineyard navigation	Maintain centered trajectory without localization	(Martini et al., 2022)
Cavitation control	Drive a bubble to a target position	(Klapcsik et al., 2023)
Object pushing	Reach and keep a final object position within 1 cm	(Bergmann et al., 2024)
Market making	Penalize and regulate inventory position	(Bakshaev, 2020)
Race strategy	Improve finishing position through pit decisions	(Thomas et al., 7 Jan 2025)

A recurring misconception is that position reinforcement is synonymous with position-only control. The surveyed work shows the opposite: in many contact-rich settings, pure position control is explicitly described as brittle, unsafe, or insufficient, and RL is introduced precisely to augment position control with feedback shaping, stiffness modulation, force regulation, or hybrid force-position behavior (Kim et al., 2021, Zhi et al., 27 May 2025). Another misconception is that the term is inherently robotic. The feed-allocation, market-making, race-strategy, and explainability papers use the same vocabulary of “position” in non-robotic but still sequential decision-making settings (Shi et al., 2023, Bakshaev, 2020, Thomas et al., 7 Jan 2025, Krajna et al., 2022).

2. Feedback shaping, impedance adaptation, and hybrid force-position control

The most explicit formulation of position reinforcement in robotics appears in residual feedback learning for peg insertion. Standard residual policy learning adds a learned residual to a controller output,

$\mathbf{u}_t = \mathbf{u}_c(\mathbf{s}_t) + \mathbf{r}_\theta(\mathbf{s}_t),$

but the paper argues that internal feedback loops can treat the residual as an external disturbance to be rejected, especially when the controller runs at $1\ \mathrm{kHz}$ and the RL policy at $\sim 40\ \mathrm{Hz}$. Residual Feedback Learning instead modifies the controller’s effective feedback, creating a virtual position or pose so that the controller cooperates with the RL correction rather than fighting it. In the paper’s position-reinforcement interpretation, RL computes $\Delta \mathbf{x}_\theta(\mathbf{s}_t)$ and uses $\mathbf{e}_x' = \mathbf{x}_d - (\mathbf{x} - \Delta \mathbf{x}_\theta)$, thereby reshaping the position objective in task space (Ranjbar et al., 2021). Hybrid Residual Reinforcement Learning then combines residual feedback and residual torque action in a $14$-dimensional joint-space action, pairing wide spatial steering with decisive micro-actions for jam release (Ranjbar et al., 2021).
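The difference between the two residual schemes can be sketched with a scalar proportional controller. This is a minimal illustration, not the paper's implementation: the gain `kp`, the function names, and the numbers are assumptions chosen for readability. It shows that shifting the perceived position by $\Delta x$ is the same as adding $\Delta x$ to the error the controller already regulates, so the correction enters through the feedback path rather than as an added output.

```python
def residual_action(u_c: float, r: float) -> float:
    """Standard residual policy learning: add the residual to the controller output."""
    return u_c + r


def residual_feedback_error(x_d: float, x: float, dx: float) -> float:
    """Residual feedback: the controller sees the virtual position x - dx."""
    return x_d - (x - dx)


def p_controller(error: float, kp: float = 50.0) -> float:
    """Hypothetical proportional controller standing in for the inner loop."""
    return kp * error


# Illustrative values (metres); not taken from the paper.
x_d, x, dx = 0.10, 0.08, 0.005

e_plain = x_d - x                                # error without any RL correction
e_shaped = residual_feedback_error(x_d, x, dx)   # error seen with the virtual position
u_plain = p_controller(e_plain)
u_shaped = p_controller(e_shaped)
u_added = residual_action(u_plain, 0.25)         # residual applied at the output instead
```

With a pure proportional law the two schemes can produce the same command for a suitable residual; the distinction the paper draws concerns controllers with internal feedback loops, which this sketch does not model.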

A closely related but distinct approach appears in SCAPE, where the central issue is not modifying the perceived position but learning state-dependent stiffness around a position-control policy. SCAPE starts from the observation that most RL systems in dexterous manipulation control only pose while keeping internal gains fixed. It augments position-control demonstrations with passive stiffness labels $k_{\text{passive}}$, then learns stiffness control with DDPG + HER, a Q-filter, and an imitation regulator. The control law is standard task-space impedance,

$F_t = K(x_t)e_t + D(x_t)\dot e_t,\qquad \tau_t = J(x_t)^\top F_t,$

with stiffness modulation applied primarily along the grasp axis. The paper’s claim is that, for fragile or frictional contact, position-only RL is unsafe and brittle, whereas state-dependent stiffness control reduces forces and improves robustness to uncertainty (Kim et al., 2021).
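The impedance law above can be evaluated numerically as follows. This is a hedged sketch in plain Python: the two-dimensional task space, the three-joint Jacobian, and all gain values are assumptions for illustration, and the helper `grasp_axis_stiffness` is a hypothetical stand-in for the learned state-dependent stiffness.

```python
def mat_vec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def grasp_axis_stiffness(k_grasp, k_fixed=50.0):
    """Hypothetical K(x): only the first (grasp) axis is modulated."""
    return [[k_grasp, 0.0], [0.0, k_fixed]]


def impedance_torque(K, D, e, e_dot, J):
    """Compute F = K e + D e_dot and tau = J^T F."""
    F = [ke + de for ke, de in zip(mat_vec(K, e), mat_vec(D, e_dot))]
    tau = mat_vec(transpose(J), F)
    return F, tau


K = grasp_axis_stiffness(200.0)          # stiff along the grasp axis
D = [[10.0, 0.0], [0.0, 5.0]]            # assumed damping
e, e_dot = [0.01, 0.02], [0.0, -0.1]     # position error and its rate
J = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.2]]   # assumed 2x3 Jacobian

F, tau = impedance_torque(K, D, e, e_dot, J)
```

Lowering `k_grasp` reduces the force produced by the same position error, which is the mechanism the stiffness policy exploits for fragile or frictional contact.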

Impedance adaptation by RL with Contact DMPs makes the same position–force trade-off explicit at the level of stiffness scheduling. DMPs encode both position and force trajectories from demonstration, and SAC adapts $K^p$ and $K^o$ online so that high stiffness reinforces position tracking in free or slip-prone axes, while low stiffness prioritizes force tracking in contact directions. The admittance equation,

[…]

is used to explain why high stiffness yields small displacements and thus favors position tracking, whereas low stiffness favors force tracking (Chang et al., 2022). The same paper reports that SAC-based impedance adaptation improves robustness over fixed impedance on adhesive strip application, including under a […] offset (Chang et al., 2022).

The more recent unified policy for legged loco-manipulation generalizes the same theme: RL co-learns position and force control without force sensors by embedding force commands into impedance-style target definitions. The end-effector target is

[…]

and the base target velocity is

[…]

The policy estimates forces from observation history and compensates them through joint-target residuals tracked by a PD controller, yielding position tracking, force application, force tracking, and compliant interaction within one policy (Zhi et al., 27 May 2025).

Door opening with a mobile manipulator provides another variant. The non-learning controller uses a velocity-level position–force law

[…]

where the force-feedback term is a PI compensator driven by wrist wrench limits, while SAC learns joint and base velocities that minimize force, smoothness penalties, and time. The RL system reduces the maximum force required by […] times and improves motion smoothness by […] times on the reported push–CCW door experiments, but the adaptive position–force controller is described as more versatile across door directions and widths (Kang et al., 2023).

3. State, action, and reward design patterns

Despite large domain variation, the surveyed position-reinforcement systems share a small set of recurrent design choices: compact state abstractions, bounded action spaces, and reward functions that tie long-horizon success to positional attainment or position-sensitive risk.

In robotic manipulation, sparse success rewards are common. Peg insertion with residual feedback learning uses […] only if […] with […], and otherwise […]. The same system uses RGB images plus pose and wrench in the vision phase, and relative end-effector position, Euler angles, and wrench in the contact phase, with PPO in PyTorch, a CNN for vision, and a shared LSTM for contact-rich control (Ranjbar et al., 2021). Precision-focused pushing adopts an even stricter fixed-horizon objective: the reward is […] whenever the object lies outside a $1\ \mathrm{cm}$ ball around the goal and […] otherwise. Because episodes do not terminate early, overshoot is penalized implicitly by additional time spent outside tolerance. The policy uses a GRU over latent object and goal masks plus end-effector position, and the action includes both planar position offsets and a timing variable […] that sets how long the commanded offset is held (Bergmann et al., 2024).
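The implicit overshoot penalty of a fixed-horizon tolerance reward can be seen in a small sketch. The reward values `-1.0` and `0.0`, the tolerance of 0.01 m, and both trajectories are assumptions for illustration and are not taken from the pushing paper.

```python
import math


def tolerance_reward(obj_xy, goal_xy, tol=0.01, outside=-1.0, inside=0.0):
    """Sparse reward: `outside` when the object is beyond `tol` of the goal."""
    return inside if math.dist(obj_xy, goal_xy) <= tol else outside


def episode_return(trajectory, goal_xy):
    """Sum rewards over a fixed horizon; the episode never terminates early."""
    return sum(tolerance_reward(p, goal_xy) for p in trajectory)


goal = (0.0, 0.0)

# Object positions (metres) at each step of two hypothetical five-step episodes.
direct = [(0.03, 0.0), (0.015, 0.0), (0.005, 0.0), (0.005, 0.0), (0.005, 0.0)]
overshoot = [(0.03, 0.0), (0.005, 0.0), (-0.02, 0.0), (-0.012, 0.0), (0.004, 0.0)]

return_direct = episode_return(direct, goal)
return_overshoot = episode_return(overshoot, goal)
```

Both episodes end inside the tolerance ball, but the overshooting one spends more steps outside it and therefore collects a lower return, without any explicit overshoot term in the reward.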

In classical low-dimensional control, the reward-design issue is isolated directly. The cart-position study compares three Q-learning rewards and concludes that a discontinuous threshold reward gives the best regulation performance. Under that reward, the learned controller reaches exactly […] after […] and remains there, whereas the quadratic and piecewise-linear rewards produce oscillation or off-target steady behavior (Mukherjee, 2021). The cavitation-bubble controller uses a different shaped reward,

[…]

with continuous DDPG actions […] in […], and reports […] success over a […] grid of initial and target positions (Klapcsik et al., 2023).
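The three reward families compared in the cart-position study can be sketched next to the standard tabular Q-learning update. The functional forms, tolerance, width, and learning parameters below are assumptions for illustration; the study's exact definitions are not reproduced here.

```python
def threshold_reward(x, target=0.0, tol=0.05):
    """Discontinuous reward: 1 inside the tolerance band, 0 outside."""
    return 1.0 if abs(x - target) <= tol else 0.0


def quadratic_reward(x, target=0.0):
    """Smooth reward: negative squared position error."""
    return -(x - target) ** 2


def piecewise_linear_reward(x, target=0.0, width=1.0):
    """Continuous reward: falls linearly to 0 at distance `width`."""
    return max(0.0, 1.0 - abs(x - target) / width)


def q_update(q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: Q(s,a) += alpha * (r + gamma * max Q(s',.) - Q(s,a))."""
    td_target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (td_target - q[s][a])
    return q[s][a]


# Two states, two actions; state 1 already has some learned value.
q_table = [[0.0, 0.0], [0.0, 1.0]]
updated = q_update(q_table, s=0, a=1, r=threshold_reward(0.04), s_next=1)
```

Near the target the quadratic and piecewise-linear rewards change very little between neighbouring positions, while the threshold reward jumps at the band edge, which is one plausible reading of why it separates on-target from near-target behavior more sharply.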

In navigation and flight control, positional rewards are combined with shaping terms that reflect geometry or control smoothness. Vineyard navigation uses heading alignment […], distance progress […], and sparse terminal rewards […] for success and […] for collision or reverse yaw breach, all within a maximum-entropy SAC objective (Martini et al., 2022). The thrust-vector quadrotor controller uses a dense two-term distance reward with […] and […], while pushing low-level control into a PID attitude loop; the learned thrust-vector policy reaches approximately […] mean reward after […] steps on randomized target tracking, whereas the direct RPM baseline reaches approximately […] (Mahran et al., 20 Dec 2025). Energy-aware AUV control instead penalizes absolute position errors, orientation error magnitude, action changes, and total thruster usage with different […] weights in TQC-HP and TQC-EA (Boré et al., 25 Feb 2025).

In strategic and allocation settings, rewards are position-sensitive in a different sense. MDDL uses GMV as the RL signal for screen-level feed allocation and adds imitation on strategy data through the Weighted Exposure Ratio, a position-aware scalar built from exposure probabilities and per-position CTR (Shi et al., 2023). Formula One strategy maps finishing position to […] FIA points, penalizes invalid tyre actions by […], penalizes extra stops after the first valid pit by […], and gives […] otherwise, thereby turning final rank position into the dominant terminal return (Thomas et al., 7 Jan 2025). Market making introduces an exponential inventory penalty,

[…]

so that position reinforcement becomes explicit inventory compression inside SAC (Bakshaev, 2020).

so that position reinforcement becomes explicit inventory compression inside SAC (Bakshaev, 2020).
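The two position-sensitive rewards in this paragraph can be sketched as follows. The FIA points table for the top ten finishers is standard; everything else is an assumption for illustration: the penalty magnitudes, the order in which conditions are checked, the use of unscaled points, the zero reward on ordinary laps, and the specific exponential form and coefficients of the inventory penalty.

```python
import math

# Standard FIA points for the top ten finishers (fastest-lap bonus ignored).
FIA_POINTS = {1: 25, 2: 18, 3: 15, 4: 12, 5: 10, 6: 8, 7: 6, 8: 4, 9: 2, 10: 1}


def race_step_reward(final_lap, finishing_position=None,
                     invalid_tyre_action=False, extra_stop=False,
                     invalid_penalty=-10.0, extra_stop_penalty=-1.0):
    """Per-lap reward; penalty values and check order are hypothetical."""
    if invalid_tyre_action:
        return invalid_penalty
    if extra_stop:
        return extra_stop_penalty
    if final_lap:
        # Positions outside the top ten score no points.
        return float(FIA_POINTS.get(finishing_position, 0))
    return 0.0


def inventory_penalty(inventory, scale=0.01, rate=0.5):
    """Assumed exponential form: zero when flat, growing with |inventory|."""
    return -scale * (math.exp(rate * abs(inventory)) - 1.0)


win_reward = race_step_reward(final_lap=True, finishing_position=1)
small_position = inventory_penalty(2)
large_position = inventory_penalty(4)
```

In both cases the positional quantity dominates the return: the race reward is concentrated at the final lap, and the inventory term grows faster than linearly so that large positions are discouraged much more strongly than small ones.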

4. Spatial exploration, allocation, and positional strategy outside direct control

Position reinforcement is not limited to controlling a mechanical end effector. In active visual exploration, the controlled position is the observation itself. AdaGlimpse formulates glimpse placement as a continuous-control MDP with action

$a_t = (x_t, y_t, z_t),$

where $x_t$ and $y_t$ are normalized top-left coordinates and $z_t$ is normalized scale. The reward is the reduction in task loss achieved by the new glimpse, so the policy is positively reinforced for selecting positions and scales that reduce task loss. The reported behavior is coarse-to-fine: the first glimpse is wide-scale, then subsequent glimpses zoom into informative regions, improving reconstruction, classification, and segmentation efficiency relative to fixed-grid baselines (Pardyl et al., 2024).
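A continuous glimpse action of this kind can be mapped to a pixel crop and scored with a loss-reduction reward as in the sketch below. The mapping from normalized scale to crop side, the `min_frac` parameter, and the function names are hypothetical; the paper's exact parameterization is not reproduced here.

```python
def glimpse_box(action, image_size, min_frac=0.1):
    """Map normalized (x, y, scale) to a square crop (left, top, side) in pixels."""
    x, y, z = (min(1.0, max(0.0, v)) for v in action)   # clamp to [0, 1]
    side = (min_frac + (1.0 - min_frac) * z) * image_size
    left = x * (image_size - side)                      # keep the crop inside the image
    top = y * (image_size - side)
    return left, top, side


def delta_loss_reward(prev_loss, new_loss):
    """Positive when the new glimpse lowers the task loss."""
    return prev_loss - new_loss


# A wide first glimpse followed by a small, centred one (illustrative values).
wide = glimpse_box((0.0, 0.0, 1.0), image_size=224)
zoom = glimpse_box((0.5, 0.5, 0.0), image_size=100)
reward = delta_loss_reward(prev_loss=0.8, new_loss=0.5)
```

Clamping the action and scaling the offsets by `image_size - side` keeps every crop inside the image, so the policy cannot gain reward by selecting out-of-bounds regions.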

In industrial recommendation, the controlled object is screen position rather than physical pose. MDDL treats feed construction as a slate MDP in which the action is a […]-dimensional binary vector specifying whether each slot is occupied by video or graphic-text content. Strategy data and random data are handled differently: strategy data receives a position-aware imitation loss based on WER, and random data receives standard TD learning. Offline, MDDL achieves reward […], AVG-OD […], and STD-OD […], a […] reward lift over the best baseline; online on Meituan’s food delivery platform it yields CTR […] and GMV […] during the reported A/B test (Shi et al., 2023).
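The binary slate action and a position-aware exposure scalar can be sketched as follows. The bit-to-channel convention and, in particular, the formula in `weighted_exposure_ratio` are assumptions: the text only states that WER is built from exposure probabilities and per-position CTR, so this is one plausible form rather than the paper's definition.

```python
CHANNELS = ("graphic_text", "video")  # assumed convention: bit 1 means video


def decode_slate(action):
    """Turn a binary slot vector into a list of channel names, top slot first."""
    return [CHANNELS[bit] for bit in action]


def weighted_exposure_ratio(action, exposure_prob, ctr):
    """Hypothetical WER: share of CTR-weighted exposure assigned to video slots."""
    weights = [p * c for p, c in zip(exposure_prob, ctr)]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w for w, bit in zip(weights, action) if bit == 1) / total


action = [1, 0, 1, 0]                 # video in slots 1 and 3
exposure = [1.0, 0.8, 0.5, 0.2]       # chance that each slot is actually seen
ctr = [0.10, 0.10, 0.05, 0.05]        # per-position click-through rate

slate = decode_slate(action)
wer = weighted_exposure_ratio(action, exposure, ctr)
```

Because upper slots carry more exposure, moving a video from a lower slot to the top slot changes this scalar far more than swapping two lower slots, which is what makes the quantity position-aware.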

In vehicular networking, the “position” being reinforced is the placement of relay-capable vehicles relative to blockers. The mmWave V2X paper models each controllable vehicle as an A3C agent acting on local feature planes that encode nearby vehicle types and, in PTCL/PTDL, predicted relay lengths for candidate positions. The per-step reward is proportional to current relay length with a penalty for prohibited moves, and the learned policy can increase coverage to about […] that of random movement in the reported setting with […], […], and […] (Taya et al., 2018).

Race strategy and market making show a more abstract positional usage. In RSRL, the objective is literal finishing position: the agent pits or stays out lap by lap to maximize FIA-point-shaped return, reaching average finishing position […] on the 2023 Bahrain Grand Prix test race versus […] for the best baseline (Thomas et al., 7 Jan 2025). In market making, SAC manages hedging and skew so that inventory position remains small unless compensated by spread revenue; the paper explicitly introduces the position penalty to improve convergence and stabilize learning (Bakshaev, 2020). These formulations are structurally different from robot pose control, but they retain the same core pattern: RL is used to regulate an indexed or ordered notion of position over time.

5. Safety, constraints, and characteristic failure modes

A defining property of position-reinforcement methods is that they usually operate inside a pre-existing control, safety, or operational envelope. As a result, many papers devote substantial design effort to bounded actions, rate limits, safe initialization, or selective activation.

Residual feedback learning bounds residual actions and feedback, zero-initializes the last actor layer, runs the policy only in selected FSM states, and relies on the $1\ \mathrm{kHz}$ impedance loop for smooth actuation. Its central failure mode is controller-feedback conflict: when a controller runs faster than the RL policy, standard residual actions are progressively rejected, and success drops with additional controller-only buffer steps, whereas residual feedback remains robust (Ranjbar et al., 2021). SCAPE handles safety through force penalties, hard fragility thresholds during evaluation, low-pass filtering of quasi-static force estimates, Q-filtering to avoid cloning poor demonstrations, and an imitation regulator that switches to self-imitation after a target success rate. The reported ablations show that RL from scratch fails catastrophically, and that unsafe imitation persists without the Q-filter (Kim et al., 2021).
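Three of the safeguards listed for residual feedback learning (bounded residuals, a zero-initialized last layer, and activation only in selected FSM states) can be combined in a few lines. This is a sketch under stated assumptions: the class name, the `tanh` squashing, the bound of 0.01, and the FSM state names are hypothetical, and a real actor would be a neural network rather than a single linear layer.

```python
import math

ACTIVE_STATES = {"contact", "insert"}  # hypothetical FSM states where RL may act


class BoundedResidualHead:
    """Final actor layer: zero-initialized, with outputs squashed to a fixed bound."""

    def __init__(self, in_dim, out_dim, max_residual):
        self.w = [[0.0] * in_dim for _ in range(out_dim)]  # zero init: no residual at start
        self.b = [0.0] * out_dim
        self.max_residual = max_residual

    def __call__(self, features):
        pre = [sum(wi * f for wi, f in zip(row, features)) + b
               for row, b in zip(self.w, self.b)]
        return [self.max_residual * math.tanh(p) for p in pre]


def gated_residual(head, features, fsm_state):
    """Return the residual only in active FSM states, otherwise zeros."""
    if fsm_state not in ACTIVE_STATES:
        return [0.0] * len(head.b)
    return head(features)


head = BoundedResidualHead(in_dim=3, out_dim=2, max_residual=0.01)
initial = gated_residual(head, [1.0, -2.0, 0.5], "contact")
```

With zero-initialized weights the untrained policy leaves the base controller untouched, and the `tanh` bound guarantees that even a badly trained head cannot shift the feedback by more than `max_residual`.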

Several systems enforce safety through explicit physical bounds. Impedance adaptation with DMPs constrains translational stiffness to […], rotational stiffness to […], and rate limits to […] and […] per cycle (Chang et al., 2022). Underwater manipulation under position and torque constraints uses strong penalties for leaving joint limits and environment-side torque saturation; in the reported comparison with MPC it achieves overshoot […] versus […] and settling time […] versus […], but with higher energy […] versus […] because the reward does not penalize effort explicitly (Carlucho et al., 2020). TQC-based AUV control similarly trades performance against power by changing […]: TQC-HP beats a tuned PID on settling time and some RMSE metrics, whereas TQC-EA consumes approximately […] less power on average but with lower performance (Boré et al., 25 Feb 2025).
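Stiffness bounds combined with a per-cycle rate limit, as described for impedance adaptation, can be enforced with a small projection step. The numeric limits below are placeholders chosen for illustration and are not the values used in the cited paper.

```python
def limit_stiffness(k_prev, k_cmd, k_min, k_max, max_step):
    """Apply a per-cycle rate limit, then clamp to the allowed stiffness range."""
    step = max(-max_step, min(max_step, k_cmd - k_prev))   # rate limit
    return max(k_min, min(k_max, k_prev + step))           # absolute bounds


# Hypothetical translational limits: 100..1000 with at most 50 change per cycle.
rate_limited = limit_stiffness(500, 900, 100, 1000, 50)
upper_clamped = limit_stiffness(980, 1200, 100, 1000, 50)
lower_clamped = limit_stiffness(120, 0, 100, 1000, 50)

# Repeated application ramps the stiffness toward the commanded value.
k = 500
for _ in range(3):
    k = limit_stiffness(k, 900, 100, 1000, 50)
```

Applying the rate limit before the range clamp means the policy can request any value, while the stiffness actually sent to the controller changes smoothly and never leaves the safe interval.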

The characteristic failure modes differ by domain but often expose the same underlying theme: a position objective alone is not enough when dynamics are hidden, delayed, or constraint-saturated. In precision pushing, low sliding friction leads to overshoot and rapid corrective movements, motivating the GRU-based architecture and friction-biased sampling (Bergmann et al., 2024). In vineyard navigation, wide plant gaps and extreme depth noise can trigger row switching or failure, even though the agent generally degrades gracefully by slowing down (Martini et al., 2022). In feed allocation, strategy data cause severe overestimation because of state-action imbalance, which is why MDDL separates imitation on strategy data from RL on random data (Shi et al., 2023). In door opening, the SAC policy improves smoothness and force but does not converge when training mixes all opening directions in the reported setup (Kang et al., 2023).

A broader misconception is that adding RL automatically improves robustness. The evidence is more conditional. Several papers explicitly report that training at full difficulty fails, that sparse reward alone is insufficient, or that performance depends critically on curricula, demonstrations, domain randomization, or carefully chosen reward structure (Ranjbar et al., 2021, Kim et al., 2021, Bergmann et al., 2024).

6. Explainability, generalization, and open problems

The position paper on explainability in RL argues that RL explanations cannot be treated as a trivial extension of supervised XAI because of credit assignment, delayed rewards, non-i.i.d. data, exploration–exploitation trade-offs, and partial observability. It proposes a taxonomy spanning scope, timing, time horizon, environment type, policy type, and agent cardinality, and emphasizes three pillars for honest explanation: proactivity, risk attitudes, and epistemological constraints (Krajna et al., 2022). This framework is directly relevant to position reinforcement because many of the surveyed systems are sequential controllers whose positional decisions only become intelligible when explained over trajectories rather than at single time steps.

Some application papers operationalize this requirement. RSRL supplements its DRQN policy with TimeSHAP feature importance, VIPER surrogate trees, and counterfactuals. On Bahrain 2023, the surrogate reaches accuracy […] and F1 score […] over […] random simulations, while the counterfactual analysis yields an average of […] features changed and average distance […] to reach the closest counterfactual (Thomas et al., 7 Jan 2025). These explanations are framed in terms of gaps ahead and behind, tyre degradation, and race progress, thereby exposing how finishing-position optimization is mediated by interpretable state variables rather than opaque Q-values. MDDL, although not an explainability paper, similarly treats overestimation indicators AVG-OD and STD-OD as quantities that must be monitored because inflated values can silently corrupt position allocation (Shi et al., 2023).
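Surrogate fidelity of the kind reported for the VIPER tree can be measured by comparing the surrogate's actions with the policy's actions on the same states. The sketch below assumes a binary action space in which 1 means "pit" and treats that action as the positive class; the action sequences are made up for illustration.

```python
def fidelity(policy_actions, surrogate_actions, positive=1):
    """Return (accuracy, F1) of the surrogate measured against the policy's actions."""
    pairs = list(zip(policy_actions, surrogate_actions))
    tp = sum(1 for p, s in pairs if p == positive and s == positive)
    fp = sum(1 for p, s in pairs if p != positive and s == positive)
    fn = sum(1 for p, s in pairs if p == positive and s != positive)
    accuracy = sum(1 for p, s in pairs if p == s) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1


# Hypothetical lap-by-lap decisions: 1 = pit, 0 = stay out.
policy = [0, 0, 1, 0, 1, 0, 0, 1]
surrogate = [0, 0, 1, 0, 0, 0, 1, 1]

accuracy, f1 = fidelity(policy, surrogate)
```

Because pit stops are rare, accuracy alone can look high for a surrogate that never pits, which is why an F1 score on the pit action is a useful companion metric.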

Generalization remains uneven across the literature. Some methods transfer across uncertainty levels, embodiments, or platforms: the unified legged force-position policy is demonstrated on both a quadrupedal manipulator and a humanoid robot and improves downstream imitation-learning success rates by approximately […] across four tasks (Zhi et al., 27 May 2025); SCAPE transfers directly from MuJoCo to the NuFingers testbed without fine-tuning (Kim et al., 2021); the vineyard policy generalizes across Jackal and Husky UGVs (Martini et al., 2022). Other methods remain explicitly preliminary or simulation-only, such as the TQC AUV controller (Boré et al., 25 Feb 2025) and the quadrotor thrust-vector controller (Mahran et al., 20 Dec 2025).

Several open questions recur. One is whether feedback shaping, gain shaping, and force estimation can be unified more systematically; some papers identify gain shaping as a natural extension but do not implement it (Ranjbar et al., 2021). Another is whether sparse but precise position rewards should be preferred over dense shaping; the cart-position study favors discontinuous threshold rewards, whereas other domains rely on dense penalties or delta-loss shaping (Mukherjee, 2021, Pardyl et al., 2024). A third is whether position reinforcement should remain a domain-specific term or become a broader methodological category. The present literature supports the latter only in a loose sense: the common denominator is not a single algorithm but the repeated use of RL to reshape, allocate, or stabilize position-sensitive objectives under uncertainty, constraints, or delayed consequences (Krajna et al., 2022).