EventVLA: Event-Driven Visual Evidence Memory for Long-Horizon Vision-Language-Action Policies

Published 18 Jun 2026 in cs.CV | (2606.20092v1)

Abstract: Memory remains a critical bottleneck for long-horizon robotic manipulation, as standard Vision-Language-Action (VLA) policies often fail when task-relevant cues become occluded or unobservable over time. While existing memory-augmented methods utilize historical context, they either suffer from severe information bottlenecks, incur high latency via decoupled dual systems, or rely on unselective buffers that accumulate massive visual redundancies. To address these limitations, we introduce EventVLA, an end-to-end framework founded on the concept of sparse visual evidence memory that comprises two core components: foundational visual anchors to retain initial and short-term contexts, and a dynamic Keyframe Evidence Memory (KEM) module. Specifically, KEM directly predicts future keyframe probabilities from the VLA's latent embeddings to autonomously capture and store sparse, task-critical visual events. This foresight-driven mechanism empowers the policy to dynamically evaluate the future causal utility of current observations, preserving transient visual evidence before it becomes unobservable. Furthermore, we propose RoboTwin-MeM, a diagnostic benchmark specifically designed to evaluate non-Markovian manipulation tasks with interactive visual evidence. Extensive evaluations show that across 17 memory-requiring simulation tasks and 4 real-world bimanual tasks, EventVLA achieves an average success rate improvement of +40% over state-of-the-art memory-augmented VLAs.

Authors (13)

Summary

The paper presents a novel EventVLA framework that integrates foundational visual anchors with dynamic keyframe evidence memory to capture fleeting, task-critical events.
It employs a foresight-driven scheduling mechanism using autoregressive transformer outputs, enhancing memory retention and mitigating occlusion issues in long-horizon tasks.
Experimental evaluations on the RoboTwin-MeM benchmark and real-world scenarios demonstrate up to a 90% success rate and a 40% improvement over previous state-of-the-art methods.

Motivation and Problem Statement

The paper addresses a critical bottleneck in memory-aware vision-language-action (VLA) policies deployed for long-horizon robotic manipulation. Standard VLA models operate under a strict Markovian assumption: they presuppose that all task-relevant information remains visible in the current observation. In realistic physical environments, however, dynamic occlusion and transient interactions frequently render essential intermediate states unobservable. Existing memory-augmented frameworks fall short in one of three ways: recurrent networks suffer from severe information bottlenecks, dual-system architectures incur high latency and error propagation, and unselective memory buffers accumulate computationally wasteful redundancy. The central challenge is to design an efficient mechanism that robustly retains sparse, task-critical visual evidence without overwhelming computational resources.

EventVLA Architecture and Keyframe Evidence Memory

EventVLA is introduced as an end-to-end framework that incorporates sparse visual evidence memory tailored for non-Markovian, long-horizon manipulation. Its memory system is structured around two main components:

Foundational Visual Anchors (VA): Deterministic selection of the initial observation (capturing invariant global layout) and a short-term sliding history window (providing local motion cues). This approach efficiently addresses tasks structured around persistent spatial layouts and immediate temporal context, but is fundamentally insufficient for transient evidence retention.
Keyframe Evidence Memory (KEM) Module: A dynamic memory mechanism designed to autonomously capture and store sparse, interaction-driven events. The KEM module predicts future keyframe probabilities directly from the VLA's latent embeddings (autoregressive transformer outputs) via chunk-wise projections. This foresight-driven process enables proactive scheduling of memory writes, locking in critical frames (such as object exposures or sequence demonstrations) before they disappear from the observable workspace. KEM operates in parallel with the action heads, leveraging both visual and action-conditioned cues. To keep memory strictly sparse, it combines a chunk-wise prediction horizon with a learnable threshold, non-maximum suppression (NMS), and a temporal cooldown.
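The sparsity filters named above (threshold, NMS, temporal cooldown) can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function name `select_keyframes`, the greedy NMS variant, the fixed `threshold` (learnable in the paper), and all default values are assumptions made for this example.

```python
from typing import List


def select_keyframes(
    probs: List[float],
    threshold: float = 0.5,   # assumed fixed here; the paper describes it as learnable
    nms_radius: int = 1,      # hypothetical suppression radius, in time steps
    cooldown: int = 5,        # hypothetical minimum gap between memory writes
) -> List[int]:
    """Pick sparse write indices from one chunk of predicted keyframe probabilities."""
    # 1. Thresholding: discard steps with low predicted keyframe probability.
    candidates = [i for i, p in enumerate(probs) if p >= threshold]

    # 2. Greedy temporal NMS: visit candidates from most to least confident and
    #    drop any candidate within nms_radius steps of one already kept.
    candidates.sort(key=lambda i: (-probs[i], i))
    kept: List[int] = []
    for i in candidates:
        if all(abs(i - j) > nms_radius for j in kept):
            kept.append(i)
    kept.sort()

    # 3. Cooldown: walk forward in time and enforce a minimum gap between writes.
    selected: List[int] = []
    for i in kept:
        if not selected or i - selected[-1] >= cooldown:
            selected.append(i)
    return selected


chunk_probs = [0.1, 0.6, 0.9, 0.7, 0.2, 0.1, 0.8, 0.3, 0.1, 0.95]
print(select_keyframes(chunk_probs))              # [2, 9]
print(select_keyframes(chunk_probs, cooldown=3))  # [2, 6, 9]
```

In this sketch the cooldown is applied within a single chunk only. A deployed policy would presumably also track the last write time across consecutive chunks.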

EventVLA is trained end to end, combining the standard action generation loss with a sequence-averaged binary cross-entropy (BCE) objective that supervises keyframe prediction with temporally smoothed labels. An automated VLM-based annotation pipeline avoids manual labeling costs and supplies precise keyframe supervision for large-scale demonstrations.
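One way to picture this objective is the sketch below. It assumes Gaussian smoothing around each annotated keyframe and a scalar weight `kem_weight` on the BCE term. The source only states that the labels are temporally smoothed and that the two losses are combined, so the smoothing kernel, the weight, and every function name here are illustrative assumptions.

```python
import math
from typing import List


def soft_keyframe_labels(length: int, keyframes: List[int], sigma: float = 2.0) -> List[float]:
    """Turn hard keyframe indices into soft targets in [0, 1].

    Assumption: a Gaussian bump of width sigma around each annotated keyframe.
    """
    return [
        max((math.exp(-0.5 * ((t - k) / sigma) ** 2) for k in keyframes), default=0.0)
        for t in range(length)
    ]


def sequence_bce(probs: List[float], targets: List[float], eps: float = 1e-7) -> float:
    """Binary cross-entropy averaged over all time steps of the sequence."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(probs)


def total_loss(action_loss: float, probs: List[float], targets: List[float],
               kem_weight: float = 1.0) -> float:
    """Joint objective: action loss plus a weighted keyframe BCE (weight is hypothetical)."""
    return action_loss + kem_weight * sequence_bce(probs, targets)


targets = soft_keyframe_labels(length=7, keyframes=[3], sigma=1.0)
predicted = [0.05, 0.2, 0.6, 0.9, 0.6, 0.2, 0.05]
print(total_loss(action_loss=0.8, probs=predicted, targets=targets))
```

Soft targets mean that a prediction one or two steps away from the annotated frame is penalized only mildly, which matches the tolerance for event-timing ambiguity that the ablations attribute to smoothed labels.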

RoboTwin-MeM Benchmark and Evaluation Protocol

To rigorously evaluate intermediate memory capabilities, the paper introduces RoboTwin-MeM, a diagnostic simulation benchmark built on the RoboTwin 2.0 platform. Unlike conventional suites (e.g., RMBench) dominated by persistent, observable states, RoboTwin-MeM isolates genuinely non-Markovian tasks requiring transient event retention. Each task is parameterized by the number of required intermediate keyframes $n$, spanning a tiered hierarchy from $n=1$ to $n=5$. Tasks include occlusion-driven object picking, sequential button-press counting, spatial imitation, and randomized route reproduction. Each task comes with explicit language-grounded instructions and dense annotation, supporting evaluation of in-context memory retention, counting logic, and transient event capture.

Experimental Results and Ablation Studies

EventVLA was evaluated across 17 simulation tasks and 4 real-world bimanual manipulation tasks. On RMBench, a pure visual-anchor variant of EventVLA achieves an average success rate of 67.8%, outperforming both memoryless and prior memory-augmented baselines. On RoboTwin-MeM, however, this anchor-only variant drops to 18.0%, whereas full EventVLA (VA+KEM) achieves 75.2%, a gain of 57.2 percentage points from adding KEM. Against previous state-of-the-art memory-augmented VLAs, the paper reports an average success rate improvement of +40%. In real-world evaluations, EventVLA consistently surpasses both reactive and memory-augmented baselines, with success rates up to 90%.

Core ablation analysis reveals:

Implicit memory (latent feature aggregation) causes severe information loss compared to explicit raw frame concatenation.
Hard binary label supervision destabilizes the model, whereas temporally smoothed soft labels provide essential tolerance for physical event ambiguity.
NMS and buffer capacity management are crucial: removing NMS floods memory with redundant frames, and an undersized buffer prematurely evicts critical evidence.
Shrinking chunk size (foresight window) drastically reduces proactive event scheduling, neutralizing KEM's ability to anticipate and capture future task-critical states.
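The interaction between redundant writes and buffer capacity can be illustrated with a toy memory that combines the two visual anchors with a fixed-capacity FIFO keyframe buffer. This is a sketch under stated assumptions: the class `EvidenceMemory`, its method names, the ordering of the assembled context, and the use of strings as stand-ins for image frames are all hypothetical.

```python
from collections import deque
from typing import Deque, List, Optional


class EvidenceMemory:
    """Toy memory: initial frame + short sliding window + FIFO keyframe buffer."""

    def __init__(self, window: int = 2, capacity: int = 4) -> None:
        self.initial: Optional[str] = None                   # anchor: first observation
        self.recent: Deque[str] = deque(maxlen=window)       # anchor: sliding history window
        self.keyframes: Deque[str] = deque(maxlen=capacity)  # sparse event memory (FIFO)

    def observe(self, frame: str, is_keyframe: bool = False) -> None:
        if self.initial is None:
            self.initial = frame
        self.recent.append(frame)
        if is_keyframe:
            # A full deque silently drops its oldest entry: FIFO eviction.
            self.keyframes.append(frame)

    def context(self) -> List[str]:
        """Frames handed to the policy, de-duplicated, in an assumed order."""
        ordered = ([self.initial] if self.initial is not None else [])
        ordered += list(self.keyframes) + list(self.recent)
        ctx: List[str] = []
        for frame in ordered:
            if frame not in ctx:
                ctx.append(frame)
        return ctx


# Sparse writes: the single critical frame f1 survives until the end.
sparse = EvidenceMemory(window=1, capacity=2)
for t in range(6):
    sparse.observe(f"f{t}", is_keyframe=(t == 1))
print(sparse.context())   # ['f0', 'f1', 'f5']

# Redundant writes (as if NMS were disabled): f2 and f3 push f1 out of the buffer.
flooded = EvidenceMemory(window=1, capacity=2)
for t in range(6):
    flooded.observe(f"f{t}", is_keyframe=(t in {1, 2, 3}))
print(flooded.context())  # ['f0', 'f2', 'f3', 'f5']
```

The second run shows why write sparsity and capacity have to be tuned together: every redundant write shortens the lifetime of older evidence under FIFO eviction.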

Inference profiling confirms the computational feasibility of EventVLA's multi-frame attention and event-driven memory scheduling, maintaining throughputs suitable for physical robot deployment.

Implications, Limitations, and Future Directions

EventVLA demonstrates that integrating foresight-driven, dynamic sparse event memory with foundational visual anchors effectively bridges the gap between Markovian and non-Markovian control in robotic policies. The framework is scalable, achieves strong empirical gains, and is robust to occlusion dynamics in both simulation and real-world settings. The diagnostic RoboTwin-MeM benchmark exposes deficiencies in prior memory approaches and offers a rigorous platform for evaluating intermediate state retention.

Practical implications include a substantial improvement in long-horizon physical task execution, with real-time feasibility under typical robotic control constraints. Theoretical implications point toward the necessity of integrating proactive future prediction and dynamic temporal abstraction in embodied AI memory systems.

Potential future directions involve hierarchical or compressed event memory to accommodate extremely long-horizon, high-density scenarios, and further exploration of memory scheduling strategies beyond fixed FIFO eviction. Extensions could leverage more advanced chunk-wise prediction architectures or non-local attention mechanisms to enhance temporal reasoning.

Conclusion

EventVLA advances the design and evaluation of memory in vision-language-action policies for non-Markovian long-horizon robotic manipulation. Through its combination of foundational visual anchors and foresight-driven keyframe evidence memory, it achieves substantial improvements over state-of-the-art memory frameworks, both in simulation and challenging physical environments. RoboTwin-MeM establishes a robust benchmark for intermediate state retention, enabling more rigorous analysis of memory-augmented robotic policies. The framework's practical and theoretical contributions lay the groundwork for future scalability and generalization in embodied AI memory systems (2606.20092).