Works cited in this article:
- (Fei et al., 19 Nov 2025) — "SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models"
- (Choi et al., 2024) — "Self-Improving Robust Preference Optimization"
- (Zhang et al., 19 Apr 2025) — "SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM"
- (Wan et al., 2 Jun 2025) — "SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning"
- (Xue et al., 2023) — "State Regularized Policy Optimization on Data with Dynamics Shift"
- (Li et al., 2 Apr 2026) — "Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing"
- (Li et al., 29 Dec 2025) — "InSPO: Unlocking Intrinsic Self-Reflection for LLM Preference Optimization"
- (Yu et al., 2 Mar 2026) — "Provable and Practical In-Context Policy Optimization for Self-Improvement"
- (Liu et al., 25 Apr 2026) — "Escher-Loop: Mutual Evolution by Closed-Loop Self-Referential Optimization"
- (Lee et al., 27 Jul 2025) — "SGPO: Self-Generated Preference Optimization based on Self-Improver"

Self-Referential Policy Optimization (SRPO) denotes a reinforcement-learning post-training framework for Vision-Language-Action (VLA) models in which the model’s own successful trajectories, generated within the current training batch, are used as a self-reference for assigning progress-wise reward to failed attempts in latent world-model space (Fei et al., 19 Nov 2025). In contemporary arXiv usage, however, the acronym is overloaded: other papers use SRPO for “Self-Improving Robust Preference Optimization,” “two-Staged history-Resampling Policy Optimization,” “Self-Reflection enhanced reasoning with Group Relative Policy Optimization,” “State Regularized Policy Optimization,” and “Sample-Routed Policy Optimization,” so the expansion must be specified explicitly in technical discussion (Choi et al., 2024, Zhang et al., 19 Apr 2025, Wan et al., 2 Jun 2025, Xue et al., 2023, Li et al., 2 Apr 2026).
1. Nomenclature and scope
The exact expansion “Self-Referential Policy Optimization” appears in the VLA paper “SRPO: Self-Referential Policy Optimization for Vision-Language-Action Models” (Fei et al., 19 Nov 2025). By contrast, several other papers use the same acronym for unrelated objectives or training recipes, which makes acronym-level disambiguation necessary rather than optional.
| Expansion | Reference | Setting |
|---|---|---|
| Self-Referential Policy Optimization | (Fei et al., 19 Nov 2025) | VLA RL for robotic manipulation |
| Self-Improving Robust Preference Optimization | (Choi et al., 2024) | Offline RLHF / preference optimization |
| two-Staged history-Resampling Policy Optimization | (Zhang et al., 19 Apr 2025) | Cross-domain RL on LLMs |
| Self-Reflection enhanced reasoning with Group Relative Policy Optimization | (Wan et al., 2 Jun 2025) | Multimodal reasoning with GRPO |
| State Regularized Policy Optimization | (Xue et al., 2023) | RL under dynamics shift |
| Sample-Routed Policy Optimization | (Li et al., 2 Apr 2026) | RLVR for LLM post-training |
Within the exact sense of “Self-Referential Policy Optimization,” the defining claim is that RL for VLA manipulation should no longer rely only on sparse binary success indicators. Instead, successful trajectories already produced by the policy should be reused as in-batch references for scoring failures, thereby replacing discarded failed rollouts with graded progress information (Fei et al., 19 Nov 2025).
2. Problem formulation in vision-language-action reinforcement learning
SRPO is introduced for VLA models that are typically trained from expert demonstrations and therefore inherit strong dependence on demonstration data and demonstration bias. In the paper’s framing, supervised VLA training tends to overfit small downstream datasets and remain constrained by the demonstrated state-action distribution, while standard RL post-training is hindered by severe reward sparsity because most existing methods use a binary success reward: $1$ if the task is completed and $0$ otherwise (Fei et al., 19 Nov 2025).
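The formal setup uses an observation $o_t$ at step $t$, a language goal $l$, and an action $a_t$ drawn from the policy
$$a_t \sim \pi_\theta(\cdot \mid o_t, l).$$
Environment dynamics are written as
$$o_{t+1} \sim P(\cdot \mid o_t, a_t).$$
A trajectory is therefore
$$\tau = (o_0, a_0, o_1, a_1, \ldots, o_T).$$
The sparse terminal environment reward is denoted
$$R(\tau) \in \{0, 1\},$$
and the successful trajectories in the current batch $\{\tau_i\}_{i=1}^{M}$ are
$$\mathcal{S} = \{\tau_i : R(\tau_i) = 1\}.$$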
SRPO is used as a post-training method on top of supervised initialization: the reported pipeline begins with one-shot SFT from an official OpenVLA checkpoint using one trajectory per task, then applies SRPO online RL post-training (Fei et al., 19 Nov 2025).
3. Self-reference in latent world space
The core innovation is to treat the current batch’s successful trajectories as a reference set rather than using expert demonstrations or manually designed subgoal rewards. Those successful trajectories are not used individually in raw observation space. Instead, each trajectory $\tau_i$ is encoded by a pretrained world-model encoder $\mathcal{W}$:
$$h_i = \mathcal{W}(\tau_i).$$
The paper uses a V-JEPA model as the latent world model and applies DBSCAN to the successful trajectory representations to obtain representative success patterns, i.e. cluster centers $\{c_k\}$ (Fei et al., 19 Nov 2025).
For each trajectory $\tau_i$, SRPO computes the squared Euclidean distance to the nearest successful center:
$$d_i = \min_k \lVert h_i - c_k \rVert_2^2.$$
This distance is then converted into a trajectory-level progress-wise reward $g_i$ by passing it through an activation function $\phi$ with bounded output, so that smaller distances yield larger rewards. The implementation uses a sigmoid for $\phi$ together with a progress reward scaling coefficient $\alpha$, whose value is chosen by a reward-weight sweep (Fei et al., 19 Nov 2025).
This construction is explicitly trajectory-level rather than fine-grained step shaping. The intended effect is that a failure closer to a successful behavioral mode receives more reward than a failure far from any successful mode. The latent-space comparison is meant to avoid the fragility of raw-pixel matching and the weaker robotics specificity of generic image embeddings such as ImageBind (Fei et al., 19 Nov 2025).
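The reward construction in this section can be illustrated with a minimal Python sketch. It assumes the success cluster centers have already been computed (for example by DBSCAN) and are passed in as plain vectors. The exact shaping formula is an assumption made for illustration: successes keep a reward of 1.0 and failures receive `alpha * sigmoid(-z)`, where `z` is the batch-standardized distance to the nearest center. The function names, the standardization step, and the default `alpha` are hypothetical placeholders, not values taken from the paper.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two latent vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center_distance(h: Vector, centers: Sequence[Vector]) -> float:
    """Distance from one trajectory embedding to the closest success center."""
    return min(squared_distance(h, c) for c in centers)


def progress_rewards(
    embeddings: Sequence[Vector],
    succeeded: Sequence[bool],
    centers: Sequence[Vector],
    alpha: float = 0.5,  # placeholder scaling coefficient, not the paper's value
) -> List[float]:
    """Assumed shaping: 1.0 for successes, alpha * sigmoid(-z) for failures."""
    dists = [nearest_center_distance(h, centers) for h in embeddings]
    mean = sum(dists) / len(dists)
    # Fall back to 1.0 when all distances are identical (zero spread).
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / len(dists)) or 1.0
    rewards = []
    for d, ok in zip(dists, succeeded):
        if ok:
            rewards.append(1.0)
        else:
            z = (d - mean) / std
            # alpha * sigmoid(-z): closer failures (smaller z) earn more reward.
            rewards.append(alpha / (1.0 + math.exp(z)))
    return rewards
```

With one success center at the origin, a failure whose embedding lies near the origin receives a larger reward than one far away, and every failure reward stays strictly below `alpha`, so under this assumed form a failure can never outscore a success.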
4. Optimization objective and training procedure
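Once SRPO has assigned self-referential trajectory rewards, optimization follows a GRPO-style clipped objective with KL regularization. For the $i$-th trajectory in a group of $M$ rollouts, the policy ratio at step $t$ is
$$r_{i,t}(\theta) = \frac{\pi_\theta(a_{i,t} \mid o_{i,t}, l)}{\pi_{\theta_{\mathrm{old}}}(a_{i,t} \mid o_{i,t}, l)}.$$
Trajectory rewards $g_i$ are normalized into group-relative advantages:
$$\hat{A}_i = \frac{g_i - \mu_g}{\sigma_g},$$
with group statistics
$$\mu_g = \frac{1}{M} \sum_{j=1}^{M} g_j, \qquad \sigma_g = \sqrt{\frac{1}{M} \sum_{j=1}^{M} (g_j - \mu_g)^2}.$$
The clipped surrogate term is
$$L^{\mathrm{CLIP}}_{i,t}(\theta) = \min\!\left( r_{i,t}(\theta)\, \hat{A}_i,\ \mathrm{clip}\!\left(r_{i,t}(\theta),\, 1 - \epsilon,\, 1 + \epsilon\right) \hat{A}_i \right),$$
and KL regularization is
$$L^{\mathrm{KL}}(\theta) = D_{\mathrm{KL}}\!\left( \pi_\theta \,\Vert\, \pi_{\mathrm{ref}} \right).$$
The overall objective combines the two terms:
$$J(\theta) = \mathbb{E}\!\left[ \frac{1}{M} \sum_{i=1}^{M} \frac{1}{T_i} \sum_{t=0}^{T_i - 1} L^{\mathrm{CLIP}}_{i,t}(\theta) \right] - \beta\, L^{\mathrm{KL}}(\theta),$$
where $T_i$ is the number of actions in trajectory $\tau_i$, $\epsilon$ is the clipping range, and $\beta$ weights the KL penalty.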
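As a rough illustration of a GRPO-style update signal, the sketch below normalizes a group of trajectory rewards into advantages and evaluates a clipped surrogate with a KL penalty. It is a scalar, framework-free sketch under stated assumptions: one ratio per trajectory rather than per step, a precomputed KL estimate passed in as a number, and hypothetical defaults for `clip_eps` and `beta`. None of these names or values come from the paper.

```python
import math
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one rollout group (mean 0, unit spread)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO/GRPO-style clipped surrogate for a single ratio."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def grpo_objective(
    ratios: Sequence[float],
    rewards: Sequence[float],
    kl: float,
    beta: float = 0.01,     # hypothetical KL weight
    clip_eps: float = 0.2,  # hypothetical clipping range
) -> float:
    """Mean clipped surrogate over the group minus a KL penalty (to maximize)."""
    advantages = group_advantages(rewards)
    surrogate = sum(
        clipped_term(r, a, clip_eps) for r, a in zip(ratios, advantages)
    ) / len(advantages)
    return surrogate - beta * kl
```

The clipping only binds in the direction that would otherwise reward large policy moves: a ratio of 1.5 with a positive advantage is capped at 1.2 times the advantage, while the same ratio with a negative advantage is left unclipped so the penalty is not softened.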
Operationally, the training loop is: initialize from one-shot SFT; roll out the current policy; partition trajectories into successes and failures; encode trajectories with the pretrained world model; cluster successful representations with DBSCAN; compute latent distances and progress-wise rewards; normalize them into advantages; and update the policy with the clipped ratio objective plus KL regularization (Fei et al., 19 Nov 2025).
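The loop described above can be sketched as a single iteration function. This is a structural sketch only: `rollout`, `encode`, `cluster`, `reward_fn`, and `update` are hypothetical callables standing in for the policy rollout, the world-model encoder, DBSCAN clustering, the progress-reward rule, and the clipped policy update. The fallback used when a batch contains no success is an assumption made for illustration, not something the source specifies.

```python
import math
from typing import Callable, List, Sequence, Tuple

Embedding = Sequence[float]


def srpo_step(
    rollout: Callable[[], List[Tuple[object, bool]]],
    encode: Callable[[object], Embedding],
    cluster: Callable[[List[Embedding]], List[Embedding]],
    reward_fn: Callable[[List[Embedding], List[bool], List[Embedding]], List[float]],
    update: Callable[[List[object], List[float]], None],
    eps: float = 1e-8,
) -> List[float]:
    """Run one hypothetical SRPO iteration and return the advantages used."""
    # 1. Roll out the current policy; each item is (trajectory, success flag).
    batch = rollout()
    trajectories = [traj for traj, _ in batch]
    succeeded = [ok for _, ok in batch]

    # 2. Encode every trajectory with the pretrained world model.
    embeddings = [encode(traj) for traj in trajectories]

    # 3. Cluster only the successful embeddings to get reference centers.
    success_embeddings = [h for h, ok in zip(embeddings, succeeded) if ok]
    if success_embeddings:
        centers = cluster(success_embeddings)
        # 4. Progress-wise rewards from latent distances to the centers.
        rewards = reward_fn(embeddings, succeeded, centers)
    else:
        # Assumption: without any in-batch success there is no self-reference,
        # so every trajectory keeps the sparse reward of zero.
        rewards = [0.0] * len(batch)

    # 5. Normalize rewards into group-relative advantages.
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    advantages = [(r - mean) / (std + eps) for r in rewards]

    # 6. Hand trajectories and advantages to the policy update.
    update(trajectories, advantages)
    return advantages
```

Passing the stages in as callables keeps the sketch independent of any particular simulator, encoder, or optimizer. It also makes the dependency on in-batch successes visible: when none exist, all advantages collapse to zero and the update carries no learning signal.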
5. Empirical performance, reward quality, and ablations
On LIBERO, the reported baseline is OpenVLA*-One with average success 48.9, while Offline SRPO reaches 92.5 and Online SRPO reaches 99.2 (Fei et al., 19 Nov 2025).
| Setting | Average success (%) | Note |
|---|---|---|
| OpenVLA*-One | 48.9 | one-trajectory-per-task SFT |
| Offline SRPO | 92.5 | offline post-training |
| Online SRPO | 99.2 | about 200 RL steps |
The paper reports per-suite online SRPO results of 98.8 on Spatial, 100.0 on Object, 99.4 on Goal, and 98.6 on Long, and describes the jump from 48.9% to 99.2% as a 103% relative improvement (Fei et al., 19 Nov 2025). On LIBERO-Plus, Online SRPO reaches 59.6 in the zero-shot setting versus 19.4 for OpenVLA*-One, and 82.1 with augmented data versus 30.7 for OpenVLA*-One. The abstract summarizes the LIBERO-Plus gain as a 167% performance improvement; from the quoted numbers, that figure matches the augmented-data comparison, $(82.1 - 30.7)/30.7 \approx 1.67$, whereas the zero-shot comparison corresponds to a larger relative gain of about 207% (Fei et al., 19 Nov 2025).
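The relative-improvement figures can be checked directly from the quoted success rates. The helper below is a hypothetical convenience function, and the inputs are only the numbers stated in this section.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


# LIBERO average: one-shot SFT baseline vs. Online SRPO.
print(round(relative_improvement(48.9, 99.2), 1))  # 102.9

# LIBERO-Plus, zero-shot setting.
print(round(relative_improvement(19.4, 59.6), 1))  # 207.2

# LIBERO-Plus, augmented-data setting.
print(round(relative_improvement(30.7, 82.1), 1))  # 167.4
```

On these inputs, the 103% figure is reproduced by the LIBERO pair and the 167% figure by the LIBERO-Plus augmented-data pair, while the zero-shot pair gives a larger relative gain.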
The reward-quality analysis compares pixel-level progress reward, ImageBind-based reward, and SRPO’s latent world representation reward on five benchmark metrics: SC, Mono, MMD, JS, and SMD (Fei et al., 19 Nov 2025). The paper interprets the results as evidence that latent world representations give more temporally monotonic progress signals and better success-failure separation than pixel-level or generic multimodal embeddings.
Ablations isolate two particularly important design choices. Replacing current-batch self-reference with a fixed set of 50 expert trajectories per task still improves over GRPO, but trains more slowly, needs about 1.4× the training steps, and underperforms full SRPO. Removing success clustering and comparing only to the nearest successful trajectory gives similar early learning but worse later performance, especially once multiple successful modes emerge. In addition, a sweep over the progress reward weight supports the claim that progress reward should be strong but not dominant (Fei et al., 19 Nov 2025).
6. Relation to adjacent self-referential optimization methods
Several adjacent papers instantiate “self-reference” in different ways. In “Self-Improving Robust Preference Optimization,” the self-improvement policy $\pi(y' \mid x, y)$ conditions on a context $x$ and an in-context completion $y$ and outputs an improved completion $y'$; the same LLM can serve as both the ordinary generative policy $\pi(y \mid x)$ and the reviser $\pi(y' \mid x, y)$, and the paper’s main robustness claim is that the optimal self-improvement policy and robust generative policy are independent of the behavior distribution $\mu$ (Choi et al., 2024).
In SGPO, the improver and policy are unified into a single model: the model first samples its own answer, then the same model refines that answer, and the refined response becomes the preferred item in a DPO update. The paper presents this as on-policy self-improvement, but also states that the improver is bootstrapped using an external LLM, so it is not fully self-referential from initialization (Lee et al., 27 Jul 2025).
In InSPO, the trainable policy is expanded from $\pi_\theta(y \mid x)$ to $\pi_\theta(y \mid x, y')$, so generation is conditioned on both the prompt $x$ and an alternative response $y'$. The paper describes this as “intrinsic self-reflection” and emphasizes that the self-reflective mechanism is distilled into a standard autoregressive policy with zero extra inference overhead (Li et al., 29 Dec 2025).
ICPO shifts the same theme to inference time rather than parameter updates: the model improves its response through multi-round self-reflection at inference by conditioning on its own prior responses and self-assessed or externally observed rewards, and the practical algorithm ME-ICPO performs no parameter updates at test time (Yu et al., 2 Mar 2026). Escher-Loop broadens self-reference further into a closed-loop population system in which optimizer agents refine task agents and themselves, and optimizer scores are derived from the task outcomes they induce (Liu et al., 25 Apr 2026).
7. Limitations, boundaries, and interpretation
For the VLA SRPO method itself, several constraints are explicit. The approach depends on a pretrained world-model encoder with useful latent representations, requires the current policy to generate at least some successful trajectories so that self-references exist, uses trajectory-level shaping rather than finer-grained shaping by design, and is sensitive to reward scaling and clustering choices. In real-world robotics, the paper does not use online RL for safety and cost reasons, but instead adapts the idea to an offline RL variant integrating Advantage-Weighted Regression with SRPO-style self-referential progress reward (Fei et al., 19 Nov 2025).
Across the broader literature, “self-referential” does not denote a single canonical mechanism. SGPO is self-generated during its preference phase but externally bootstrapped; ICPO is explicitly self-improving but performs no parameter updates at test time; InSPO makes training comparison-aware by conditioning on alternative responses while keeping deployment prompt-only (Lee et al., 27 Jul 2025, Yu et al., 2 Mar 2026, Li et al., 29 Dec 2025). This suggests that the term refers less to one fixed update rule than to a family of constructions in which a model’s own outputs, alternative responses, or induced outcomes are fed back into optimization.
In the exact sense established by the VLA paper, however, SRPO is a specific reward-shaping and policy-optimization framework: successful trajectories produced by the current policy define the reference set; latent world representations provide the comparison geometry; failed trajectories receive graded progress credit rather than uniform zeros; and a group-relative clipped objective turns that self-reference into efficient RL post-training for robotic manipulation (Fei et al., 19 Nov 2025).