PoLAR: Factorizing Extent and Mode in Latent Actions for Robot Policy Learning

Published 19 Jun 2026 in cs.RO, cs.AI, and cs.LG | (2606.21139v1)

Abstract: Latent action pretraining learns representations of visual change from pairs of observations, but existing methods typically encode each transition as a single unstructured representation that entangles transition extent and transition mode. We introduce Polar Latent Actions with Radial structure (PoLAR), which imposes a radial-direction structure on latent actions, encouraging radius to encode transition extent and direction to retain transition mode. PoLAR uses temporal offset between two observations as a weak proxy for transition extent, encouraging latent action from observation pairs separated by larger temporal gaps to occupy larger radii. We instantiate this structure in hyperbolic space, whose expanding volume with radius offers a natural fit for more diverse transition modes at larger extents. Across in-task and large-scale pretraining settings, PoLAR improves downstream policy performance in simulation and real-world robot experiments, outperforming latent action baselines and strong pretrained VLAs. These results suggest that the geometry of the latent action space is an important design choice for transferring visual pretraining to downstream robot policy learning.


Authors (5)

Summary

The paper introduces PoLAR, which factorizes latent actions into radial extent and directional mode to enhance robot policy generalization.
It employs radial ordering and hyperbolic geometry to align transition extents with temporal offsets, improving action informativeness and robustness.
Experimental results demonstrate PoLAR's superior performance in simulated and real-world robot manipulation tasks relative to baseline methods.

Motivation and Background

Latent action pretraining aims to extract compact codes representing transitions between states, supporting both policy learning and world modeling in robotics. Traditional methods encode transitions as unstructured latent vectors, entangling transition extent (how much changes) and transition mode (what changes). This entanglement impairs policy generalization, particularly across varying horizons, since similar transition modes executed at different extents are not explicitly related within the latent space.

PoLAR (Polar Latent Actions with Radial structure) proposes a geometric factorization: utilizing radius in latent space to encode transition extent and direction to encode mode. By leveraging temporal offset between observation pairs as a weak proxy for transition extent, PoLAR imposes a radial ordering such that larger temporal gaps correspond to latent actions with larger radii. This separation facilitates more robust and interpretable policy learning. The model further instantiates this structure within a hyperbolic geometry, exploiting the exponential angular capacity at larger radii to support a greater diversity of transition modes.
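
The radial ordering idea can be illustrated with a small sketch. The summary does not give the exact form of the loss, so the pairwise hinge below is an assumption chosen for clarity, and the function name `radial_ranking_loss` and the `margin` parameter are hypothetical.

```python
def radial_ranking_loss(radii, offsets, margin=0.1):
    """Mean hinge penalty over pairs whose radii violate the temporal-offset order.

    Assumed form (not taken from the paper): whenever sample i spans a larger
    temporal offset than sample j, its radius should exceed r_j by `margin`.
    """
    total, count = 0.0, 0
    for i in range(len(radii)):
        for j in range(len(radii)):
            if offsets[i] > offsets[j]:
                # Penalize only when r_i fails to exceed r_j by the margin.
                total += max(0.0, margin - (radii[i] - radii[j]))
                count += 1
    return total / count if count else 0.0


# Radii that grow with the temporal offset incur no penalty.
ordered = radial_ranking_loss([0.2, 0.5, 0.9], [1, 4, 8])
# Radii that shrink as the offset grows are penalized.
reversed_order = radial_ranking_loss([0.9, 0.5, 0.2], [1, 4, 8])
```

Because the penalty depends only on the relative order of radii, it treats temporal offset as a weak ordering signal rather than as a regression target for the radius.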

Methodology

Radially Structured Latent Actions

PoLAR follows the canonical latent action pipeline: an inverse dynamics model (IDM) encodes the visual change between a pair of observations into a latent action, which is optionally quantized to obtain discrete latents. A forward dynamics model (FDM) then reconstructs the future observation from the current observation and the latent action. PoLAR augments this pipeline with two components:

Radial losses: Two losses are imposed. $\mathcal{L}_{\mathrm{ord}}$ orders transitions by anchor-relative distance, and $\mathcal{L}_{\mathrm{rad}}$ encourages radius to increase with temporal offset. They work in tandem: $\mathcal{L}_{\mathrm{ord}}$ anchors transition extent locally, while $\mathcal{L}_{\mathrm{rad}}$ orders extents globally.
Hyperbolic geometry: Latent actions are lifted from the tangent space to the Poincaré ball. The radius and direction are computed hyperbolically, providing increased angular capacity for encoding diverse transition modes at larger extents.

For discrete latent actions, PoLAR uses a factorized codebook with $R$ radii and $D$ direction codes. Each latent action is quantized into one shared radial index and $C$ direction indices per latent slot, enforcing a sharp factorization between extent and mode.
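
One way to picture the factorized codebook is the nearest-code lookup below. It is a simplified sketch under stated assumptions, not the paper's quantizer: the shared radial index is chosen from the mean slot radius, each slot receives a single direction index by cosine similarity (the $C$ indices per slot are not modeled), radii are Euclidean norms, and all names are hypothetical.

```python
import math


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def quantize_factorized(slots, radius_codes, direction_codes):
    """Return (shared radial index, list of per-slot direction indices).

    Assumes every entry of `direction_codes` is a non-zero vector.
    """
    radii = [_norm(s) for s in slots]
    mean_radius = sum(radii) / len(radii)
    # One radial index shared by all slots: the nearest radius code.
    radial_idx = min(range(len(radius_codes)),
                     key=lambda k: abs(radius_codes[k] - mean_radius))
    direction_idx = []
    for slot, r in zip(slots, radii):
        unit = [x / r for x in slot] if r > 0 else list(slot)
        # Cosine similarity between the slot direction and each direction code.
        sims = [sum(a * b for a, b in zip(unit, code)) / _norm(code)
                for code in direction_codes]
        direction_idx.append(max(range(len(sims)), key=sims.__getitem__))
    return radial_idx, direction_idx


radius_codes = [0.5, 1.0, 2.0]                               # R = 3 radii
direction_codes = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]  # D = 4
slots = [[0.9, 0.1], [0.0, -1.1]]
radial_idx, direction_idx = quantize_factorized(slots, radius_codes, direction_codes)
```

Keeping the radial index separate from the direction indices means that changing only the radial index alters how far the transition goes while leaving the selected directions untouched.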

Experimental Evaluation

Task Suite and Simulation

PoLAR is evaluated across simulated and real-world tabletop manipulation tasks including RoboMimic, MimicGen, SimplerEnv-WidowX, and real robot executions.

Figure 1: Evaluation tasks for PoLAR, spanning multiple simulated and real-world manipulation domains.

PoLAR consistently outperforms flat and Euclidean baselines in simulated policy learning. For continuous latent action-conditioned diffusion policies, PoLAR boosts success rates over both radial-agnostic and purely Euclidean models.

Figure 2: PoLAR delivers superior performance in simulation, with marked gains versus both baselines and pretrained VLAs.

Real-World Robot Manipulation

Real robot experiments show that PoLAR with VLA achieves the highest success rate across all evaluated tasks. The gains are not limited to early-stage subtasks but persist across compound manipulations, indicating practical robustness.

Figure 3: PoLAR with VLA achieves highest real-world success rates across multiple manipulation tasks.

Role of Radial Structure

Temporal offset correlates strongly with physical state change, and PoLAR converts this weak supervision into a radial latent structure. As temporal offset increases, PoLAR radii increase steadily, whereas flat baselines show little radial variation.

Figure 4: Temporal offset reliably proxies for transition extent, and PoLAR radii increase accordingly; flat baselines lack such radial organization.

Qualitative Decoder Analysis

Fixing direction tokens and sweeping radius produces progressively larger visual transitions, confirming that PoLAR's radial token explicitly modulates transition extent.
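
The sweep can be sketched geometrically: hold a direction fixed and place points at increasing hyperbolic radii. This is an illustration only, assuming a Poincaré ball of curvature $-1$, where a point at hyperbolic distance $r$ from the origin has Euclidean norm $\tanh(r/2)$; decoding each point with the FDM is outside the sketch, and `sweep_radius` is a hypothetical name.

```python
import math


def sweep_radius(direction, hyperbolic_radii):
    """Points along one fixed direction at the given hyperbolic radii (c = 1)."""
    n = math.sqrt(sum(x * x for x in direction))
    unit = [x / n for x in direction]
    # Euclidean norm tanh(r / 2) corresponds to hyperbolic distance r.
    return [[math.tanh(r / 2.0) * u for u in unit] for r in hyperbolic_radii]


points = sweep_radius([3.0, 4.0], [0.5, 1.0, 2.0])
```

Every point in `points` shares the same direction and lies strictly inside the unit ball, so any difference between their decoded transitions would be attributable to radius alone.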

Figure 5: Radius token controls transition extent; increasing radial token yields larger decoded transitions with fixed mode.

Policy Learning Advantages

Action informativeness: PoLAR latent actions share the most mutual information with ground-truth actions and explain the most action variance, outperforming both factorized and unstructured baselines.
Error robustness: When token prediction errors occur, PoLAR's factorization ensures mispredicted tokens remain proximate in latent space, translating to smaller decoded action errors.
Multi-horizon efficiency: PoLAR benefits markedly from multi-horizon latent policy training, because cross-horizon action targets share directional structure and differ mainly in radius, which induces fewer conflicting gradients.

Geometry Choice: Hyperbolic vs Euclidean

Both hyperbolic and Euclidean variants of PoLAR learn radii reflecting temporal offset, but hyperbolic geometry provides exponentially increasing angular capacity at larger radii. This supports greater diversity among transition modes without sacrificing action-predictive information. Hyperbolic PoLAR demonstrates higher action informativeness and yields better downstream policy performance.
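
The capacity argument can be made concrete with a standard fact from hyperbolic geometry, stated here for curvature $-1$ as an illustration rather than as a derivation from the paper. The circumference of a circle of radius $r$ is

$$
C_{\mathbb{E}}(r) = 2\pi r, \qquad C_{\mathbb{H}}(r) = 2\pi \sinh r \approx \pi e^{r} \quad (r \gg 1),
$$

in the Euclidean and hyperbolic planes respectively. In $n$ dimensions the area of a geodesic sphere scales as $r^{\,n-1}$ in Euclidean space and as $\sinh^{\,n-1} r$ in hyperbolic space. The room available for distinct directions at a fixed radius therefore grows polynomially in the Euclidean case and exponentially in the hyperbolic case.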

Failure Modes and Robustness

Representative rollouts highlight PoLAR's successes in complex, real-world manipulations. Failure analyses reveal PoLAR errors tend to cluster in grasp failures or incomplete sequential actions. Baseline method failures are broader, including incorrect object selection and more frequent task incompletions.

Figure 6: Representative successful real-world rollouts for PoLAR.

Figure 7: Observed PoLAR failures typically involve grasping or incomplete sequential actions.

Figure 8: Baseline failures span broader error modes, including incorrect object selection and failure in sequential manipulation tasks.

Additional radius-sweep visualizations further reinforce the decoupling of extent and mode, with radial manipulation yielding controlled, larger transitions.

Figure 9: Additional radius sweep examples showing larger visual transitions for increasing radii, mode preserved.

Implications and Future Directions

The explicit factorization of transition extent and mode within latent actions, instantiated via hyperbolic geometry, addresses a principal shortcoming in prior latent action models. Practically, this facilitates policy transfer across varying horizons and tasks, yielding more robust downstream policies in both simulation and reality. More broadly, the results suggest that geometric inductive biases in latent spaces improve action informativeness, robustness, and policy generalization.

Future work should extend PoLAR to multi-view and wrist-camera scenarios, as well as integrate stronger temporal supervision for cyclic or non-progressive behaviors. The separation of mode and extent offers promising avenues for compositional latent action design (cf. (2604.03340)), algebraic consistency (Tang et al., 11 May 2026), and rotational representation (Li et al., 13 May 2026). As geometric latent action modeling matures, integrating PoLAR structure with large-scale VLAs, world models, and embodied RL settings could structurally enhance robot skill acquisition and cross-embodiment transferability.

Conclusion

PoLAR introduces and validates a geometric latent action factorization for robot policy learning, separating transition extent and mode via radial-direction encoding and temporal supervision. Empirical results in both simulation and real robotic platforms underline PoLAR's superiority in success rates, action informativeness, and robustness. This work demonstrates that latent action space geometry is a salient lever for improving downstream control, with implications for broader adoption in embodied AI and vision-language-action modeling (2606.21139).
