Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs

Published 11 May 2026 in cs.RO and cs.AI | (2605.10094v2)

Abstract: Vision-Language-Action (VLA) models show strong potential for general-purpose robotic manipulation, yet their closed-loop reliability often degrades under local deployment conditions. Existing evaluations typically treat test episodes as independent zero-shot trials. However, real robots often operate repeatedly in the same or slowly changing environments, where successful executions provide environment-verified evidence of reliable behavior patterns. We study this persistent-deployment setting, asking whether a partially competent frozen VLA can improve its reliability by reusing its successful test-time experience. We propose an online success-memory guided test-time adaptation framework for generative VLAs. During deployment, the robot stores progress-calibrated successful observation-action segments in a long-term memory. At inference, it retrieves state-relevant action chunks, filters inconsistent candidates via trajectory-level consistency, and aggregates them into an elite action prior. To incorporate this prior into action generation, we introduce confidence-adaptive prior guidance, which injects the elite prior into an intermediate state of the flow-matching action sampler and adjusts the guidance strength based on retrieval confidence. This design allows the frozen VLA to exploit environment-specific successful experience while preserving observation-conditioned generative refinement. This retrieve-then-steer mechanism enables lightweight, non-parametric test-time adaptation without requiring parameter updates. Simulation and real-world experiments show improved task success and closed-loop stability, especially in long-horizon and multi-stage tasks.


Authors (10)

Summary

The paper introduces a non-parametric test-time adaptation framework that uses an online progress-calibrated success memory to enhance VLA reliability.
It combines cosine similarity and dynamic time warping to retrieve and aggregate elite behavior priors from verified trajectories.
Experimental results show improved success rates in both simulation and real-world scenarios without requiring further model training.

Problem Statement and Motivation

The paper "Retrieve-then-Steer: Online Success Memory for Test-Time Adaptation of Generative VLAs" (2605.10094) addresses the fundamental misalignment between the evaluation of Vision-Language-Action (VLA) models and their real-world deployment conditions. While generative VLAs—particularly those employing diffusion or flow-matching architectures—demonstrate strong few-shot and zero-shot generalization for robotic manipulation, their performance in persistent, closed-loop deployments is subject to instability due to distribution shifts and environment-specific idiosyncrasies. Existing test paradigms typically reset after every episode and discard successful or failed experience, ignoring that in practice, robots often operate in stable or slowly changing settings where successful actions are strong evidence of reliable behavior under the current physical and visual context.

The key contribution of this paper is to reconceptualize VLA deployment as a persistent and adaptive process capable of leveraging accumulated, environment-verified, successful trajectories during test time. The authors propose a non-parametric test-time adaptation (TTA) framework in which a frozen VLA policy can self-improve reliability using only its own history of environment-specific successful behaviors, eschewing parameter updates, extra supervision, or human feedback.

Methodology

The core framework centers on an online progress-calibrated memory system, which continuously accrues successful observation-action segments during deployment. The process is as follows:

Online Progress-Calibrated Success Memory: During each episode, candidate observation-action chunks are buffered. A pretrained VLAC critic model evaluates progress at intervals relative to a reference demonstration and identifies the point of maximal progress (the "progress peak"). Observation-action prefixes up to this point are admitted into the memory as verified, reusable behaviors, provided the episode passes a success threshold. Because admission relies on high-precision progress discrimination, contamination of the memory with failed or regressive experience is minimized.
Retrieval and Trajectory-Level Consistency Filtering: At inference, given the current observation, state-relevant segments are efficiently retrieved from memory using cosine similarity. Top-K similar candidates are filtered with a similarity threshold, then pruned using pairwise Dynamic Time Warping (DTW) to ensure candidate actions are not just state-similar but also trajectory-consistent.
Elite Prior Aggregation: Rather than replaying the nearest neighbor directly, retrieved action chunks are fused into an "elite" action prior via soft, similarity-weighted aggregation, with separate handling for each action space (Euclidean, SO(3), discrete).
Confidence-Adaptive Prior Guidance: The elite prior is softly injected at an intermediate step of the generative sampler (flow-matching or diffusion), with the guidance strength modulated by a retrieval confidence score, itself computed from similarity and DTW dispersion statistics. High confidence enforces strong guidance; low confidence defaults to the original VLA sampler, maintaining robustness against misretrieval.
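
The retrieval, filtering, and aggregation steps above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The function names, the thresholds (`k`, `sim_min`, `dtw_max`, `temp`), and the medoid-based pruning rule are all assumptions made for illustration, and only Euclidean actions with equal-length chunks are handled.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)

def dtw(x, y):
    """Classic dynamic-programming DTW distance between two action chunks
    (lists of action vectors), with Euclidean per-step cost."""
    n, m = len(x), len(y)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(x[i - 1], y[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def retrieve_elite_prior(query, memory, k=4, sim_min=0.8, dtw_max=1.0, temp=0.1):
    """memory: list of (state_embedding, action_chunk) pairs.
    Returns (prior_chunk, kept_similarities, kept_dtw_distances),
    or (None, [], []) when nothing passes the similarity threshold."""
    # 1) Cosine retrieval: keep the top-k entries, then apply the threshold.
    scored = sorted(((cosine(query, emb), chunk) for emb, chunk in memory),
                    key=lambda t: t[0], reverse=True)[:k]
    cands = [(s, c) for s, c in scored if s >= sim_min]
    if not cands:
        return None, [], []
    # 2) Pairwise DTW. Assumed pruning rule: keep candidates close to the
    #    medoid (the chunk with the smallest total DTW distance to the rest).
    n = len(cands)
    pair = [[dtw(cands[i][1], cands[j][1]) for j in range(n)] for i in range(n)]
    medoid = min(range(n), key=lambda i: sum(pair[i]))
    keep = [i for i in range(n) if pair[medoid][i] <= dtw_max]
    # 3) Soft aggregation: softmax over similarities, then a weighted mean.
    sims = [cands[i][0] for i in keep]
    top = max(sims)
    w = [math.exp((s - top) / temp) for s in sims]
    z = sum(w)
    w = [wi / z for wi in w]
    chunks = [cands[i][1] for i in keep]
    horizon, dim = len(chunks[0]), len(chunks[0][0])
    prior = [[sum(wi * ch[t][d] for wi, ch in zip(w, chunks)) for d in range(dim)]
             for t in range(horizon)]
    return prior, sims, [pair[medoid][i] for i in keep]

# Toy memory: A and B agree; C is state-similar but follows a different
# trajectory; D comes from an unrelated state.
memory = [
    ([1.0, 0.0], [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0]]),     # A
    ([0.9, 0.1], [[0.0, 0.0], [0.1, 0.0], [0.2, 0.1]]),     # B
    ([1.0, 0.05], [[0.0, 0.0], [-1.0, 1.0], [-2.0, 2.0]]),  # C
    ([0.0, 1.0], [[5.0, 5.0], [5.0, 5.0], [5.0, 5.0]]),     # D
]
query = [1.0, 0.0]
prior, sims, dists = retrieve_elite_prior(query, memory)
```

In this toy setup, D is rejected by the similarity threshold and C by the DTW check, so the prior is a weighted blend of A and B only. Averaging is valid here because the actions are Euclidean; rotations or discrete actions would need their own aggregation rule, as the summary notes.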

This pipeline enables test-time adaptation and resilience to environment-specific variations, purely through memory and reweighting mechanisms, sidestepping any need for policy retraining or additional manual labels.
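
Confidence-adaptive prior guidance can be sketched on a toy Euler sampler. Everything in the block below is an assumption made for illustration: the confidence formula, the blending rule, the injection step, and the hand-written `toy_velocity` field that stands in for the frozen VLA's learned flow. The paper's actual equations are not reproduced in this summary.

```python
import math

def retrieval_confidence(sims, dtw_dists, tau=1.0):
    """Hypothetical score in [0, 1]: high mean similarity and low mean DTW
    distance among the kept candidates give a score close to 1."""
    if not sims:
        return 0.0
    mean_sim = sum(sims) / len(sims)
    mean_dtw = sum(dtw_dists) / len(dtw_dists)
    return max(0.0, min(1.0, mean_sim)) * math.exp(-mean_dtw / tau)

def sample_with_prior(velocity, x0, prior=None, confidence=0.0,
                      steps=10, inject_step=4, lam_max=0.8):
    """Euler-integrate dx/dt = velocity(x, t) from t=0 (noise) to t=1 (action).
    At inject_step, blend the intermediate state toward the prior."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        if prior is not None and k == inject_step:
            lam = lam_max * confidence  # guidance strength scales with confidence
            # Place the prior on the straight noise-to-data path at time t.
            x_prior = [(1 - t) * n + t * p for n, p in zip(x0, prior)]
            x = [(1 - lam) * xi + lam * pi for xi, pi in zip(x, x_prior)]
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

MODEL_MODE = [0.0, 0.0]  # where the toy "frozen model" pulls samples

def toy_velocity(x, t):
    """Stand-in for the frozen policy's velocity field (not a real VLA)."""
    return [m - xi for m, xi in zip(MODEL_MODE, x)]

noise = [0.5, -0.5]
elite_prior = [1.0, 1.0]
baseline = sample_with_prior(toy_velocity, noise)
unguided = sample_with_prior(toy_velocity, noise, elite_prior, confidence=0.0)
guided = sample_with_prior(toy_velocity, noise, elite_prior, confidence=1.0)
```

With zero confidence the sampler reduces to the baseline, which mirrors the fallback behavior described above. With high confidence the sample is pulled toward the prior, and the remaining Euler steps still pass through the model's velocity field, so the output is refined rather than replayed verbatim.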

Experimental Results

The framework is benchmarked on both simulation (LIBERO-10, SimplerEnv) and real-world (OpenArm, ALOHA-PiPER) robotic manipulation tasks. Key findings are:

LIBERO-10 Benchmarks: The proposed method improves the base flow-matching model π0 from 81.6% to 84.4% average success rate, and the stronger π0.5 model from 92.4% to 94.4%. Relative to the previous best test-time steering method, TACO, absolute gains are observed, especially on long-horizon, multi-stage tasks. On tasks such as "Moka Pots on Stove," gains exceed +6%.
SimplerEnv: CogACT baseline is improved from 75.8% to 79.5% success rate.
Real-World Robotic Manipulation: Success rates increase on all evaluated tasks:
- Test-tube placement (OpenArm): full-completion rate improves from 18% to 24%, and mean completed stages increase.
- Bowl stacking and cube handoff (OpenArm): bowl stacking success increases from 72% to 80%, and cube handoff from 40% to 52%.
- Bimanual T-shirt folding (ALOHA-PiPER) under visual domain shift: average success rate increases from 39% to 48%.
Ablations: Memory quality is shown to be critical: unverified or poorly filtered memory degrades performance sharply. Aggregation and trajectory-level filtering outperform naive nearest-neighbor replay. Dynamic, confidence-adaptive prior injection outperforms fixed-strength or post-hoc interpolation.
Efficiency: The method introduces moderate inference overhead (1.10× baseline), predominantly due to retrieval and filtering. Prior-guided generative sampling is actually faster than baseline sampling in isolation.

Theoretical and Practical Implications

Practically, this method demonstrates that VLAs, when framed within a persistent deployment paradigm, can achieve substantial improvements in reliability and closed-loop stability with no additional training. The non-parametric, memory-augmented adaptation decouples test-time adaptation from the need for further gradients or manual relabeling, making it compatible with safety-critical or resource-constrained deployments.

Theoretically, this exposes a key axis of adaptability in deployed embodied intelligence systems: non-parametric behavioral priors accumulate verifiable evidence and fundamentally alter the generative distribution at inference, without violating the frozen assumption of the base model. The retrieve-then-steer mechanism establishes a lightweight, scalable meta-algorithm for leveraging high-confidence past experience in continual adaptation.

Limitations and Future Directions

The efficacy of success memory hinges on recurring structure in the deployment environment; rapidly changing tasks or physical setups may dilute the utility of stored priors. Effective early adaptation presupposes that the base policy is not catastrophically incompetent, since a minimal set of successful seed episodes is necessary. The quality of the success estimator and the granularity of state and trajectory retrieval also bound the potential gains. Failure modes include retrieval of action priors that are state-similar but contextually misaligned, which misguides the policy. Current memory management is FIFO-based and may need to evolve toward hierarchical, task-aware, or uncertainty-driven schemes in truly lifelong learning contexts.
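
For concreteness, a FIFO memory with progress-peak admission might look like the following minimal sketch. The class name, capacity, and threshold are hypothetical, and the progress values are assumed to come from an external critic as plain floats in [0, 1]. No real critic model is called.

```python
from collections import deque

class SuccessMemory:
    """Bounded FIFO store of (observation, action_chunk) segments."""

    def __init__(self, capacity=1000, success_threshold=0.9):
        # deque(maxlen=...) silently evicts the oldest entries when full.
        self.entries = deque(maxlen=capacity)
        self.success_threshold = success_threshold

    def admit_episode(self, segments, progress):
        """segments[i] is one buffered segment; progress[i] is the critic's
        progress estimate after that segment. Admits the prefix up to the
        progress peak if the peak passes the threshold.
        Returns the number of admitted segments."""
        if not segments or len(segments) != len(progress):
            return 0
        # Index of the first maximal progress value (the "progress peak").
        peak = max(range(len(progress)), key=progress.__getitem__)
        if progress[peak] < self.success_threshold:
            return 0  # episode never looked successful: admit nothing
        prefix = segments[:peak + 1]
        self.entries.extend(prefix)
        return len(prefix)

mem = SuccessMemory(capacity=3, success_threshold=0.9)
n_ok = mem.admit_episode(["s0", "s1", "s2", "s3"], [0.2, 0.6, 0.95, 0.7])
n_fail = mem.admit_episode(["f0", "f1"], [0.1, 0.3])
n_new = mem.admit_episode(["t0", "t1"], [0.5, 0.92])
```

The segment after the peak (`s3`, where progress regresses) is discarded, the failed episode contributes nothing, and the third episode pushes out the two oldest entries. A task-aware or uncertainty-driven scheme would replace the `deque` eviction with an explicit scoring rule.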

Future work may encompass more expressive and uncertainty-aware success verification, meta-learning for memory management, and explicit environment change detection, as well as adaptation for highly dynamic or adversarial deployment settings where prior experience may be only weakly predictive.

Conclusion

This work introduces a principled and effective approach to test-time adaptation for generative VLA models, leveraging online success memory as a reusable, confidence-weighted behavioral prior. In simulation and real-world evaluations, the retrieve-then-steer mechanism consistently improves closed-loop performance, especially on long-horizon, multi-stage manipulation, without any policy parameter update. This establishes a new direction for non-parametric persistent adaptation in embodied AI, with significant implications for robust real-world deployment of large generative robotic policies (2605.10094).
