Self-Referential Policy Optimization

Updated 3 July 2026

Self-Referential Policy Optimization is a framework for Vision-Language-Action models that uses the policy’s current successful trajectories as adaptive references to reward progress.
It encodes trajectory data in a latent world-model space and clusters successful behaviors to assign progress rewards based on proximity, avoiding reliance on external demonstrations.
Empirical results show dramatic gains, with success rates improving from 48.9% to 99.2% on LIBERO, highlighting SRPO’s efficiency and robustness in sparse-reward settings.

Self-Referential Policy Optimization (SRPO), in the sense used by the arXiv paper that introduced the name, is a reinforcement-learning post-training framework for Vision-Language-Action (VLA) models in which the policy uses its own successful trajectories from the current rollout batch as references for rewarding failed trajectories. Rather than relying on external demonstrations during RL or manually engineered intermediate rewards, SRPO encodes trajectories in a pretrained world model’s latent space, clusters successful behaviors, and assigns a progress-wise reward to failures according to their proximity to those successful patterns. The resulting rewards are optimized with a GRPO/PPO-style clipped objective and KL regularization, and the method was introduced for robotic manipulation on LIBERO and LIBERO-Plus (Fei et al., 19 Nov 2025).

1. Terminological scope and acronym ambiguity

Strictly speaking, “Self-Referential Policy Optimization” is the expansion used by the VLA-RL method in "SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models" (Fei et al., 19 Nov 2025). In the current literature, however, the acronym SRPO is overloaded, and the exact expansion matters because several unrelated methods use the same four letters for distinct objectives, domains, and mathematical constructions.

Expansion	Domain	Paper
Self-Referential Policy Optimization	VLA reinforcement learning	(Fei et al., 19 Nov 2025)
Self-Improving Robust Preference Optimization	Offline RLHF / preference optimization	(Choi et al., 2024)
two-Staged history-Resampling Policy Optimization	Cross-domain RL post-training for LLMs	(Zhang et al., 19 Apr 2025)
Self-Reflection enhanced reasoning with Group Relative Policy Optimization	Multimodal reasoning with GRPO	(Wan et al., 2 Jun 2025)
Sample-Routed Policy Optimization	RLVR for LLM post-training	(Li et al., 2 Apr 2026)
State Regularized Policy Optimization	RL under dynamics shift	(Xue et al., 2023)

This ambiguity is not merely terminological. The VLA method titled “Self-Referential Policy Optimization” is centered on sparse-reward embodied control and self-derived trajectory references, whereas the preference-optimization, multimodal-reasoning, RLVR, and dynamics-shift methods use the same acronym for different mechanisms and different formal objects (Fei et al., 19 Nov 2025). Related work outside the acronym itself also contributes to the broader self-referential landscape, including self-generated preference optimization, comparison-conditioned preference optimization, in-context self-optimization, closed-loop optimizer evolution, and retrieval-conditioned self-reference (Lee et al., 27 Jul 2025, Li et al., 29 Dec 2025, Yu et al., 2 Mar 2026, Liu et al., 25 Apr 2026, Zhao et al., 2023).

2. Problem setting in VLA reinforcement learning

SRPO was proposed for a setting in which VLA models already perform robotic manipulation well under supervised training, yet remain constrained by heavy reliance on expert demonstrations, demonstration bias, and severe reward sparsity during RL post-training (Fei et al., 19 Nov 2025). In this formulation, binary task success indicators waste informative structure in failed trajectories: long-horizon rollouts are expensive, many failures contain partial progress, and sparse $0/1$ rewards make group-based optimization inefficient.

The method assumes a policy

$\pi_\theta(a_t \mid o_t, l),$

where $o_t$ is the observation and $l$ is the language goal. The environment is written as

$o_t = O(z_t), \qquad a_t \sim \pi_\theta(\cdot \mid o_t, l), \qquad z_{t+1} \sim E(\cdot \mid z_t, a_t),$

with $z_t$ denoting the latent environment state. A trajectory therefore has the form

$\{(z_0, o_0, a_0, z_1, o_1, a_1, \ldots, z_T, o_T)\}.$
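The notation above can be read as a simple rollout loop. The following is a minimal sketch, assuming a toy one-dimensional environment; `toy_policy`, `toy_observe`, `toy_transition`, and `binary_reward` are hypothetical stand-ins for $\pi_\theta$, $O$, $E$, and the sparse success indicator, not components of the actual system.

```python
import random
from typing import Callable, List, Optional, Tuple

Step = Tuple[int, int, Optional[int]]  # (z_t, o_t, a_t); the final step has a_T = None


def rollout(
    policy: Callable[[int, str, random.Random], int],
    observe: Callable[[int], int],
    transition: Callable[[int, int, random.Random], int],
    z0: int,
    instruction: str,
    horizon: int,
    rng: random.Random,
) -> List[Step]:
    """Collect (z_0, o_0, a_0, ..., z_T, o_T) for a toy environment."""
    z, steps = z0, []
    for _ in range(horizon):
        o = observe(z)                    # o_t = O(z_t)
        a = policy(o, instruction, rng)   # a_t ~ pi_theta(. | o_t, l)
        steps.append((z, o, a))
        z = transition(z, a, rng)         # z_{t+1} ~ E(. | z_t, a_t)
    steps.append((z, observe(z), None))   # terminal state carries no action
    return steps


# Hypothetical toy stand-ins: a 1-D "reach the goal" task.
def toy_policy(o: int, instruction: str, rng: random.Random) -> int:
    return 1 if rng.random() < 0.8 else 0   # move forward most of the time


def toy_observe(z: int) -> int:
    return z                                 # fully observed toy state


def toy_transition(z: int, a: int, rng: random.Random) -> int:
    return z + a                             # deterministic toy dynamics


def binary_reward(steps: List[Step], goal: int) -> int:
    return int(steps[-1][0] >= goal)         # sparse terminal success in {0, 1}


trajectory = rollout(toy_policy, toy_observe, toy_transition,
                     z0=0, instruction="reach the goal", horizon=10,
                     rng=random.Random(0))
```

The binary indicator at the end discards everything about how far a failed rollout progressed, which is the information SRPO aims to recover.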

SRPO is applied as a post-training method rather than as a from-scratch RL algorithm. The reported pipeline begins with one-shot SFT from an official OpenVLA checkpoint using one trajectory per task, followed by online RL post-training. For the reported simulation experiments, the policy receives only third-view image observations and a language instruction (Fei et al., 19 Nov 2025). The initial one-shot supervised model is important because self-reference requires the current policy to generate at least some successful trajectories that can serve as references.

Conceptually, SRPO targets a gap between two prior regimes. On one side are sparse-reward GRPO-style VLA methods that use binary terminal success and discard most information in failures; on the other are denser process-reward methods that often require expert demonstrations, handcrafted subtask decompositions, or domain-specific reward engineering. SRPO is explicitly designed to provide denser progress supervision without external demonstration references during RL and without manual intermediate reward design (Fei et al., 19 Nov 2025).

3. Self-referential reward construction

The defining operation in SRPO is to treat the policy’s own successful trajectories from the current batch as references for scoring failed ones (Fei et al., 19 Nov 2025). Let the environment reward be $R(z_{0:T}, l)$. The successful observation trajectories in the current rollout batch are

$\mathcal{S} = \left\{ o^{(i)}_{0:T} \; ; \; R(z^{(i)}_{0:T}, l)=1,\ \forall i \right\}.$

Trajectories outside this set are treated as failures.

Each trajectory is encoded by a pretrained world-model encoder $\mathcal{W}$:

$h_i = \mathcal{W}\left(o^{(i)}_{0:T}\right).$

The successful trajectory representations are then clustered with DBSCAN, producing representative success centers $\mathcal{C} = \{c_1, \ldots, c_K\}$. For any trajectory $i$, SRPO computes the squared Euclidean distance to the nearest success center:

$d_i = \min_{c \in \mathcal{C}} \lVert h_i - c \rVert_2^2.$

The trajectory-level reward is then defined as

$g_i = \begin{cases} 1, & o^{(i)}_{0:T} \in \mathcal{S}, \\ \varphi\!\left(\dfrac{\mu_d - d_i}{\sigma_d}\right), & \text{otherwise}, \end{cases}$

where $\varphi$ maps into $(0, 1)$, and $\mu_d$ and $\sigma_d$ are the mean and standard deviation of failed-trajectory distances, so that failures nearer a success center score higher. In the implementation described in the paper, $\varphi$ is a sigmoid, and the progress reward is scaled with coefficient $\alpha$, with a single setting of $\alpha$ reported as best (Fei et al., 19 Nov 2025).
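The reward construction described in this section can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the paper's implementation: embeddings are taken as precomputed vectors (standing in for world-model latents), the small `dbscan_centers` routine is a simplified DBSCAN whose cluster means serve as success centers, and `eps`, `min_pts`, and the value of `alpha` are arbitrary illustrative choices.

```python
import math
from statistics import mean, pstdev
from typing import List, Optional, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dbscan_centers(points: List[Vector], eps: float, min_pts: int) -> List[List[float]]:
    """Simplified DBSCAN; returns the mean of each cluster, dropping noise points."""
    n = len(points)
    neighbors = [[j for j in range(n) if sq_dist(points[i], points[j]) <= eps ** 2]
                 for i in range(n)]
    labels: List[Optional[int]] = [None] * n   # None = unvisited, -1 = noise
    n_clusters = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1
            continue
        labels[i] = n_clusters
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = n_clusters          # noise becomes a border point
            elif labels[j] is None:
                labels[j] = n_clusters
                if len(neighbors[j]) >= min_pts:
                    frontier.extend(neighbors[j])   # expand from core points only
        n_clusters += 1
    centers = []
    for k in range(n_clusters):
        members = [points[i] for i in range(n) if labels[i] == k]
        centers.append([mean(col) for col in zip(*members)])
    return centers


def progress_rewards(embeddings: List[Vector], successes: List[bool],
                     eps: float, min_pts: int, alpha: float) -> List[float]:
    """Successes get 1.0; failures get alpha * sigmoid of standardized closeness."""
    success_pts = [h for h, ok in zip(embeddings, successes) if ok]
    centers = dbscan_centers(success_pts, eps, min_pts)
    if not centers:
        raise ValueError("no success cluster found; self-reference needs successes")
    dists = [min(sq_dist(h, c) for c in centers) for h in embeddings]
    failed = [d for d, ok in zip(dists, successes) if not ok]
    mu = mean(failed) if failed else 0.0
    sigma = (pstdev(failed) if failed else 0.0) or 1.0   # guard against zero spread
    rewards = []
    for d, ok in zip(dists, successes):
        if ok:
            rewards.append(1.0)
        else:
            z = (mu - d) / sigma                 # closer than average -> positive
            rewards.append(alpha / (1.0 + math.exp(-z)))
    return rewards


# Three successes near the origin, one near-miss failure, one distant failure.
embeddings = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.5, 0.5], [3.0, 3.0]]
successes = [True, True, True, False, False]
rewards = progress_rewards(embeddings, successes, eps=0.3, min_pts=2, alpha=0.5)
```

In this toy batch the near-miss failure receives a larger reward than the distant one, while both stay below the terminal success reward of 1.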

Two aspects of this construction are central. First, the reward is trajectory-level rather than fine-grained stepwise shaping; the paper explicitly states that it opts for trajectory-level rewards. Second, the comparison is performed in a latent world representation rather than in raw pixel space. The method uses V-JEPA 2 as the pretrained world model, and the rationale given is that latent world representations are compressed, behaviorally meaningful, more transferable, and better aligned with progress patterns across environments than either raw pixels or general-purpose image embeddings such as ImageBind (Fei et al., 19 Nov 2025).

This mechanism makes the “self-referential” label precise. The policy is not asked to imitate external expert trajectories during RL. Instead, it uses its own current successes as the reference manifold against which failures are evaluated. A failure that is behaviorally close to successful modes receives a higher reward than one that is distant, so failed rollouts become informative rather than uniformly negligible.

4. Optimization objective and training pipeline

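Once rewards $\{g_i\}_{i=1}^{M}$ have been assigned to the $M$ trajectories in a group, SRPO follows a GRPO/PPO-style policy optimization scheme (Fei et al., 19 Nov 2025). The policy ratio is

$r_{i,t}(\theta) = \dfrac{\pi_\theta(a^{(i)}_t \mid o^{(i)}_t, l)}{\pi_{\theta_{\text{old}}}(a^{(i)}_t \mid o^{(i)}_t, l)}.$

Trajectory rewards are normalized within the group to obtain advantages:

$\hat{A}_i = \dfrac{g_i - \mu_g}{\sigma_g}.$

The paper also writes group statistics in terms of $g_i$:

$\mu_g = \frac{1}{M} \sum_{i=1}^{M} g_i,$

$\sigma_g = \sqrt{\frac{1}{M} \sum_{i=1}^{M} (g_i - \mu_g)^2}.$

The clipped surrogate objective is

$L^{\text{CLIP}}(\theta) = \mathbb{E}_{i,t}\left[\min\left(r_{i,t}(\theta)\,\hat{A}_i,\ \operatorname{clip}\left(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_i\right)\right],$

and KL regularization to a reference policy is written as

$L^{\text{KL}}(\theta) = D_{\text{KL}}\left(\pi_\theta \,\|\, \pi_{\text{ref}}\right).$

The overall SRPO objective is given as

$L^{\text{SRPO}}(\theta) = L^{\text{CLIP}}(\theta) - \beta\, L^{\text{KL}}(\theta).$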
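A GRPO/PPO-style update of this kind can be illustrated numerically. The sketch below assumes per-step log-probabilities are already available as nested lists (one inner list per trajectory); the function names, the default `clip_eps` and `beta` values, and the use of the non-negative KL estimator popularized by GRPO are assumptions for illustration, not details confirmed for SRPO.

```python
import math
from statistics import mean, pstdev
from typing import List

LogProbs = List[List[float]]  # one inner list of per-step log-probs per trajectory


def group_advantages(rewards: List[float], tiny: float = 1e-8) -> List[float]:
    """Standardize trajectory rewards within the rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(g - mu) / (sigma + tiny) for g in rewards]


def clipped_surrogate(logp_new: LogProbs, logp_old: LogProbs,
                      advantages: List[float], clip_eps: float = 0.2) -> float:
    """PPO-style clipped surrogate, averaged over all steps of all trajectories."""
    terms = []
    for new_i, old_i, adv in zip(logp_new, logp_old, advantages):
        for lp_new, lp_old in zip(new_i, old_i):
            ratio = math.exp(lp_new - lp_old)
            clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
            terms.append(min(ratio * adv, clipped * adv))
    return mean(terms)


def kl_penalty(logp_new: LogProbs, logp_ref: LogProbs) -> float:
    """Per-sample estimator r - log r - 1 with r = pi_ref / pi_theta (assumed choice)."""
    terms = []
    for new_i, ref_i in zip(logp_new, logp_ref):
        for lp_new, lp_ref in zip(new_i, ref_i):
            log_r = lp_ref - lp_new
            terms.append(math.exp(log_r) - log_r - 1.0)
    return mean(terms)


def srpo_style_objective(logp_new: LogProbs, logp_old: LogProbs, logp_ref: LogProbs,
                         rewards: List[float], clip_eps: float = 0.2,
                         beta: float = 0.01) -> float:
    """Clipped surrogate minus a KL penalty; to be maximized with respect to theta."""
    adv = group_advantages(rewards)
    return (clipped_surrogate(logp_new, logp_old, adv, clip_eps)
            - beta * kl_penalty(logp_new, logp_ref))


# One success and three failures with different progress rewards.
rewards = [1.0, 0.4, 0.1, 0.1]
logp = [[-1.0, -1.2] for _ in rewards]
advantages = group_advantages(rewards)
objective_at_start = srpo_style_objective(logp, logp, logp, rewards)
```

Because failures carry graded rewards rather than a uniform zero, the group standardization assigns them distinct advantages even when only one rollout succeeds.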

Operationally, the training loop is straightforward. A one-shot SFT policy is rolled out online; trajectories are partitioned into successes and failures; all trajectories are embedded by the pretrained world model; successful embeddings are clustered; each failed trajectory receives a progress-wise reward from its nearest success cluster center; rewards are normalized into group-relative advantages; and the policy is updated by the clipped objective with KL regularization (Fei et al., 19 Nov 2025). The paper reports 8 samples per group, a training batch size of 64, and a mini-batch size of 128, together with the learning rate used for SRPO post-training.

A related real-world variant replaces online RL with offline RL that integrates Advantage-Weighted Regression (AWR) while retaining the self-referential progress-reward idea. In that setting, a cumulative progress reward, the incremental progress derived from it, and a normalized advantage are used for offline optimization on real-robot data (Fei et al., 19 Nov 2025).
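For orientation, the generic AWR weighting pattern looks roughly as follows. This sketch is an assumption-heavy illustration rather than the paper's formulation: it takes incremental progress to be the difference between consecutive cumulative progress values, standardizes those increments into advantages, and converts them into exponential regression weights; `temperature` and `max_weight` are hypothetical hyperparameters.

```python
import math
from statistics import mean, pstdev
from typing import List


def awr_weights(cumulative_progress: List[float],
                temperature: float = 1.0,
                max_weight: float = 20.0) -> List[float]:
    """Per-step AWR weights from a sequence of cumulative progress values."""
    # Assumed definition: incremental progress between consecutive steps.
    increments = [b - a for a, b in zip(cumulative_progress, cumulative_progress[1:])]
    mu, sigma = mean(increments), pstdev(increments)
    advantages = [(d - mu) / (sigma + 1e-8) for d in increments]
    # Standard AWR form: exponentiated advantage, clipped for numerical stability.
    return [min(math.exp(a / temperature), max_weight) for a in advantages]


weights = awr_weights([0.0, 0.1, 0.4, 0.5, 0.5])
```

In an AWR-style update, each weight would multiply the log-likelihood of the logged action at that step, so steps that made more progress than average are imitated more strongly.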

5. Empirical performance, ablations, and reward quality

The headline empirical results are reported on LIBERO and LIBERO-Plus using a modified OpenVLA backbone, OpenVLA* (Fei et al., 19 Nov 2025). On LIBERO, the one-shot supervised baseline OpenVLA*-One achieves an average success rate of 48.9%. The paper then reports Offline SRPO at 92.5% average success and Online SRPO at 99.2% average success. The full LIBERO suite numbers for Online SRPO are 98.8% on Spatial, 100.0% on Object, 99.4% on Goal, and 98.6% on Long, corresponding to an average gain of +50.3 percentage points over the one-shot baseline. The abstract describes the change from 48.9% to 99.2% as a 103% relative improvement, achieved in just 200 RL steps.
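The relationship between the suite scores, the absolute gain, and the relative improvement can be checked with a few lines of arithmetic. The snippet uses only the figures quoted above; the variable names are illustrative.

```python
# Online SRPO success rates per LIBERO suite, as quoted in the text.
suite_scores = {"Spatial": 98.8, "Object": 100.0, "Goal": 99.4, "Long": 98.6}
baseline = 48.9  # OpenVLA*-One average success rate

online_avg = sum(suite_scores.values()) / len(suite_scores)
abs_gain = online_avg - baseline            # percentage points
rel_gain = 100.0 * abs_gain / baseline      # percent relative to the baseline

print(round(online_avg, 1), round(abs_gain, 1), round(rel_gain))
# prints: 99.2 50.3 103
```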

Training efficiency is emphasized. The paper reports reaching strong performance in 79 steps on Spatial, 59 on Object, 103 on Goal, and 219 on Long. This is presented as especially important for long-horizon robotic RL, where sparse rewards usually make failures computationally expensive and statistically uninformative (Fei et al., 19 Nov 2025).

On LIBERO-Plus, SRPO is presented as robust under seven perturbation dimensions: Camera, Robot-Init, Language, Light, Background, Noise, and Layout. In the zero-shot setting, the paper reports 59.6 total for Online SRPO, versus 19.4 for OpenVLA*-One and 51.1 for OpenVLA*-Full. With augmented data, Online SRPO reaches 82.1, compared with 30.7 for OpenVLA*-One and 73.0 for OpenVLA*-Full. The abstract summarizes the zero-shot robustness gain as a 167% performance improvement on LIBERO-Plus (Fei et al., 19 Nov 2025).

The reward mechanism itself is benchmarked against pixel-level and ImageBind-based progress rewards. On the reported Progress Reward Benchmark, the paper gives the following values:

Method	Key score pattern	Paper
Pixel-level	SC 0.125, Mono 0.498, SMD 2.100	(Fei et al., 19 Nov 2025)
ImageBind	SC 0.957, Mono 0.837, SMD 18.111	(Fei et al., 19 Nov 2025)
SRPO	SC 0.998, Mono 0.992, SMD 188.799	(Fei et al., 19 Nov 2025)

These numbers are used to argue that latent world representations provide more monotonic progress signals and stronger success-failure separation than either direct pixel comparison or general-purpose image embeddings.

The ablations isolate the self-referential mechanism itself. Replacing in-batch successful references with a fixed set of 50 expert trajectories per task still improves over GRPO, but learns more slowly, needs about 1.4× the training steps, and underperforms full SRPO. Removing clustering and instead comparing to the single nearest successful trajectory yields similar early learning but worse later performance, which the paper attributes to the loss of robustness when multiple successful strategy modes emerge. For the progress-reward scaling coefficient $\alpha$, the paper reports a performance ordering over the tested values and identifies the setting that best balances progress supervision against terminal success (Fei et al., 19 Nov 2025).

The method is also carried into real-robot experiments on X-ARM 7 tasks, including putting an apple or pear onto a plate, folding towels, cleaning a whiteboard, and Select Poker. In that offline-AWR setting, the paper reports average gains of +66.8% for $\pi_0$ and +86.7% for $\pi_0$-FAST (Fei et al., 19 Nov 2025).

6. Relation to adjacent self-referential and self-improving methods

The VLA version of SRPO occupies one point within a broader research pattern in which models improve using their own prior outputs, self-generated preferences, or endogenous evaluation signals, but the exact locus of self-reference differs markedly across papers.

Method	Form of self-reference	Paper
Self-Referential Policy Optimization	Successful trajectories in the current batch supervise failed trajectories in latent world space	(Fei et al., 19 Nov 2025)
Self-Improving Robust Preference Optimization	A self-improvement policy revises the model’s own completion	(Choi et al., 2024)
SGPO	The same model answers, refines its answer, and then learns from the refined-vs-original pair	(Lee et al., 27 Jul 2025)
InSPO	The policy is trained as a comparison-conditioned policy, conditioning on an alternative response	(Li et al., 29 Dec 2025)
ICPO / ME-ICPO	Test-time self-improvement uses histories of previous responses and rewards without parameter updates	(Yu et al., 2 Mar 2026)
Escher-Loop	Optimizer agents improve task agents and themselves in a closed-loop evolutionary system	(Liu et al., 25 Apr 2026)
Self-Reference	The agent retrieves its own past trajectories as a conditioning signal in URL	(Zhao et al., 2023)

Within preference optimization, Self-Improving Robust Preference Optimization is closely related in spirit because it explicitly defines a self-improvement policy that conditions on a model-generated completion and produces an improved completion. Its core claim is that preference optimization can be reframed as a min-max game between a generative policy and a self-improvement policy, and it supports recursive inference-time self-revision with the same model (Choi et al., 2024). SGPO likewise unifies policy and improver into a single model, but its main mechanism is on-policy self-generated preference data for DPO, after a one-time external bootstrap of the improver (Lee et al., 27 Jul 2025). InSPO moves self-reference into the policy class itself, optimizing a comparison-conditioned policy that sees an alternative response during training but incurs zero extra inference overhead at deployment (Li et al., 29 Dec 2025).

Other lines push self-reference into the optimizer rather than into pairwise response revision. ICPO and its practical variant ME-ICPO interpret self-reflection as in-context policy optimization: the model uses its own previous responses and self-assessed or externally observed rewards to improve future responses without modifying parameters (Yu et al., 2 Mar 2026). Escher-Loop places the optimizer population itself inside the optimization loop: optimizer agents improve task agents, then use the resulting task performance as the signal by which optimizers themselves are scored and evolved (Liu et al., 25 Apr 2026). At a lower-level control and memory interface, Self-Reference in unsupervised RL retrieves the agent’s own historical trajectories from replay and conditions actor and critic on that retrieved history (Zhao et al., 2023).

A plausible implication is that “self-reference” is best understood not as a single algorithmic template but as a design principle with several realizations: self-revision of outputs, comparison-conditioned preference learning, retrieval over self-history, in-context policy adaptation, and closed-loop optimization of optimizers. The VLA method titled SRPO is the variant in which self-reference is instantiated as trajectory-to-trajectory comparison against in-batch successes in latent world space (Fei et al., 19 Nov 2025).

7. Limitations and interpretation

The VLA SRPO paper presents clear assumptions and constraints (Fei et al., 19 Nov 2025). The method depends on a pretrained world-model encoder whose latent space must be sufficiently informative for progress estimation; the reported implementation uses V-JEPA 2, and the paper explicitly argues that the benefit depends on latent representation quality rather than on arbitrary world-model usage. It also relies on the current policy generating at least some successful trajectories, since those are the reference set from which progress rewards are computed. Without such successes, the self-referential reward mechanism cannot operate as intended.

The reward is intentionally trajectory-level rather than finely localized. This avoids hand-engineered dense shaping but also means the method does not provide token-level, state-level, or action-level fault localization. Performance further depends on practical choices such as the reward-scaling coefficient $\alpha$, the clustering procedure, and the quality of the latent success manifold (Fei et al., 19 Nov 2025). In real-world robotics, the paper does not advocate unrestricted online RL; instead it adapts the core idea to an offline AWR-style regime, reflecting safety and data-collection constraints.

A second limitation is interpretive rather than algorithmic: acronym ambiguity is pervasive. Because several unrelated arXiv papers use SRPO for different expansions, exact naming must be stated whenever the term appears in scholarly writing. In the exact arXiv sense, Self-Referential Policy Optimization refers to the VLA framework of (Fei et al., 19 Nov 2025); using the acronym without clarification risks conflating it with unrelated work in preference optimization, multimodal reasoning, RLVR, or dynamics-shift regularization (Choi et al., 2024, Wan et al., 2 Jun 2025, Li et al., 2 Apr 2026, Xue et al., 2023).

More broadly, SRPO’s central contribution is a specific answer to sparse-reward RL in robotics: use the policy’s own current successes as adaptive references, measure failed behaviors in a transferable latent world representation, and convert proximity to successful modes into a usable learning signal. This suggests that self-reference becomes most valuable when binary task success masks substantial partial progress and when a stable latent representation exists that can compare behaviors without manual task decomposition (Fei et al., 19 Nov 2025).