Posterior Optimization with Clipped Objective for Bridging Efficiency and Stability in Generative Policy Learning

Published 2 Apr 2026 in cs.RO | (2604.01860v1)

Abstract: Expressive generative models have advanced robotic manipulation by capturing complex, multi-modal action distributions over temporally extended trajectories. However, fine-tuning these policies via RL remains challenging due to instability and sample inefficiency. We introduce Posterior Optimization with Clipped Objective (POCO), a principled RL framework that formulates policy improvement as a posterior inference problem tailored for temporal action chunks. Through an Expectation-Maximization procedure, POCO distills a reward-weighted implicit posterior into the policy without likelihood estimation. Furthermore, POCO adopts an offline-to-online paradigm that anchors online exploration to pre-trained priors, and its model-agnostic design scales to fine-tune large VLA models without architectural modifications. Evaluations across 7 simulation benchmarks and 4 contact-rich real-world tasks demonstrate that POCO prevents catastrophic policy collapse, outperforms SOTA baselines, and achieves a 96.7% success rate on real-world tasks. Videos are available at our project website https://cccedric.github.io/poco/.

Abstract PDF Upgrade to Chat

Authors (7)

Summary

The paper introduces POCO, which reframes policy improvement as posterior inference with a robust clipped surrogate objective to stabilize fine-tuning.
It employs an implicit E-M procedure over temporal action chunks, ensuring sample efficiency and effective adaptation in offline-to-online settings.
Experimental results demonstrate POCO's enhanced success rates and stability over baselines in both simulated and real-world robotic manipulation tasks.

Posterior Optimization with Clipped Objective: Sample-Efficient and Stable Policy Improvement for Generative Robotic Policies

Introduction and Problem Setting

The POCO framework targets the offline-to-online adaptation of high-capacity, likelihood-free generative policies—such as diffusion or flow matching models and Vision-Language-Action (VLA) models—for complex robotic manipulation. Prior RL methods for such policies are limited by a stability-efficiency dilemma: off-policy approaches are sample-efficient but exacerbate policy collapse due to OOD value overestimation, while on-policy methods are stable but impractically data-inefficient in physical systems. This work formulates policy improvement as a posterior inference problem over temporal action chunks, distilling reward-weighted posteriors via a clipped objective, thereby addressing the conflicting needs of efficient data usage and stable learning.

Figure 1: Conceptual overview contrasting off-policy, on-policy, and POCO's sample-efficient and stable improvement paradigm.

Theoretical Framework and Methodology

From Policy Optimization to Posterior Inference

POCO recasts RL policy improvement as inference, applying a maximum-entropy formulation to express reward-weighted trajectory optimality. The framework consistently treats the current policy as a prior over action chunks and refines this prior through a reward-informed posterior via an Expectation-Maximization (E-M) process. This yields a variational ELBO objective for policy adaptation, replacing explicit single-step likelihood computations with chunk-level reward-weighted inference.

To render inference tractable for expressive policies lacking closed-form likelihoods, POCO utilizes implicit posterior evaluation: it samples candidate action chunks from the current policy and assigns them normalized importance weights exponentially proportional to their critic Q-value estimates. This strategy supports efficient credit assignment in high-frequency, temporal-chunked action spaces, addressing the deficiencies of single-step updates in preventing compounding errors.

Figure 2: The POCO framework: Stage I—offline expert demonstration pre-training; Stage II—sample-based E-M fine-tuning via implicit posterior and clipped objective.

POCO Loss and Stability Guarantee

The update operator is constructed to be model-agnostic, supporting all expressive likelihood-free generative models. The M-step of E-M is translated into a supervised regression loss with a clipped surrogate objective: the gradient step size attributable to high-weight, potentially OOD, actions is explicitly capped, mathematically enforcing a local trust region around the data manifold and preventing catastrophic collapse of the pre-trained prior. Crucially, the loss includes both a BC (behavior cloning) regularizer and a posterior guidance term weighted by a tunable scaling hyperparameter $\beta$ , affording explicit control over the exploitation-exploration regularization.

This approach is robust to inaccuracies in Q-value estimation during the initial online adaptation phase, as the clip threshold $\zeta$ and guidance scale $\beta$ can be tuned to minimize performance degradation from overestimated or noisy sample rewards.

Experimental Evaluation

The efficacy of POCO is demonstrated across comprehensive simulation benchmarks ("scene", "puzzle-3x3", "cube-double/triple", "lift", "can", "square") and real-world, contact-rich manipulation tasks (Pick Cube, Route Cable, Insert USB, Assemble SSD, among others). The evaluation protocol covers both pure online RL and offline-to-online fine-tuning settings, with particular attention to success rate, sample efficiency, and the stability of learned trajectories.

Figure 3: Simulated task benchmarks exhibiting varying action dimensionality, sparse-reward dynamics, and long-horizon requirements.

Figure 4: Real-robot platform featuring human-in-the-loop reward assignment and hardware safety mechanisms.

Figure 5: Diverse real-world manipulation tasks, including precision insertion and deformable object handling.

POCO outperforms relevant state-of-the-art methods:

In online RL, it exhibits notably higher sample efficiency and rapid convergence, especially on complex long-horizon and sparse-reward tasks where direct Q-gradient-based baselines (e.g., FQL) fail due to instability.
In offline-to-online finetuning, POCO avoids catastrophic forgetting that plagues QC, FQL, and even advanced inference-time control methods (DSRL, ReinFlow), achieving stable monotonic improvements.
For contact-rich real-robot tasks, POCO attains an average success rate of 96.7%, outperforming both BC initialization (52.5%) and other RL fine-tuning methods by a wide margin.
Figure 6: POCO learning curves in online training, indicating superior sample efficiency and avoidance of OOD-induced collapse.

Figure 7: Offline-to-online adaptation stability: POCO preserves prior performance and enables steady improvement.

Figure 8: Real-world fine-tuning: POCO achieves consistently high final performance with rapid, safe policy adaptation.

Ablation Studies

Ablation on the clip threshold $\zeta$ and posterior guidance scale $\beta$ reveal their criticality: excessively large values for either parameter cause over-fitting to noisy Q-values and early policy degradation, while excessively small values preclude meaningful adaptation, effectively reverting to BC. Proper tuning produces robust, monotonic learning dynamics, with safe exploitation of Q-based posterior information and maintenance of performance guarantees.

Figure 9: Ablation on $\zeta$ (clip threshold) in offline-to-online simulation: excessively permissive thresholds lead to instability.

Figure 10: Ablation on $\beta$ (posterior guidance scale): inappropriate weighting can either neuter exploration or destabilize learning.

Implications and Scalability to Foundation Models

A key strength is the model-agnostic, likelihood-free formulation: POCO extends directly to large VLA models (e.g., $\pi_{0.5}$ , GR00T N1.6), requiring only partial adaptation of flow-based action modules while keeping frozen encoders and transformers. Experimentally, this yields substantial improvement over supervised finetuning baseline success rates on challenging real-world tasks (e.g., from 63.3%→86.7% on GR00T N1.6 Pick Pen).

The framework's reliance on critic accuracy spotlights future research directions, including the integration of automated dense reward signals (e.g., world-model-based reward shaping) and structured, prior-driven exploration methods to further enhance sample efficiency and stability for open-ended robotic learning.

Conclusion

POCO offers a theoretically grounded and empirically validated solution to the stability-efficiency trade-off endemic to policy improvement with high-capacity generative models. The implicit E-M inference formulation, combined with a clipped objective, guarantees robust, monotonic adaptation in both simulated and physical domains, including when scaling to billion-parameter VLA architectures. The framework stands as a practical tool for enabling RL-driven, safe, and efficient deployment of generative policies for complex, real-world robotic manipulation, and motivates further work at the intersection of RL, foundation model adaptation, and robust value-based control.

Markdown Report Issue