Memory equalization for CUA grounding

Establish whether memory of prior user-interface interactions (e.g., UI element locations, navigation paths, and successful or failed actions) equalizes the grounding performance of small vision–language models (VLMs) relative to large VLMs in Computer Use Agent (CUA) tasks. Specifically, determine whether a small (approximately 7B-parameter) VLM augmented with UI-layout memory attains grounding accuracy comparable to that of a larger VLM without memory, within a small tolerance as formalized by the Memory Equalization definition.

Background

The paper defines a Memory Equalization hypothesis: memory is an equalizer if a small model with memory performs within a small tolerance ε of a large model without memory. Evidence from text-based agent tasks (OpenClaw) shows this effect with ε = 0, but the authors note that it has not been tested for GUI grounding in computer-use settings.
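The equalization criterion above can be sketched as a simple predicate. This is a minimal illustration, not the paper's evaluation code; the function name, signature, and the accuracy numbers below are all hypothetical.

```python
def memory_equalizes(acc_small_with_mem: float,
                     acc_large_no_mem: float,
                     eps: float = 0.02) -> bool:
    """Memory Equalization predicate (illustrative): the small model
    with memory 'equalizes' the large model without memory if its
    accuracy falls within eps of the large model's accuracy."""
    return acc_small_with_mem >= acc_large_no_mem - eps

# Hypothetical accuracies for illustration only:
print(memory_equalizes(0.81, 0.82, eps=0.02))  # True: within tolerance
print(memory_equalizes(0.70, 0.82, eps=0.02))  # False: gap exceeds eps
```

With ε = 0, as reported for the text-based OpenClaw tasks, the small model with memory would need to match or exceed the large model outright.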

In the CUA setting, the proposed memory consists of retrieved, application-specific interaction history such as element coordinates and tool layouts. The authors speculate that such memory could reduce the need for expensive escalations by helping small models avoid repeated grounding from scratch, but they explicitly flag this as an untested conjecture.
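The retrieval scheme described above amounts to a keyed store of previously grounded UI elements. The sketch below shows one minimal way such a memory could work; the class, its methods, and the Photoshop coordinates are illustrative assumptions, not the authors' implementation.

```python
from typing import Optional, Tuple


class UILayoutMemory:
    """Toy application-specific memory of grounded UI elements,
    keyed by (application, element name) -> screen coordinates.
    Illustrative only; not from the paper."""

    def __init__(self) -> None:
        self._store: dict[Tuple[str, str], Tuple[int, int]] = {}

    def record(self, app: str, element: str, xy: Tuple[int, int]) -> None:
        # Store the coordinates of an element the agent has grounded before.
        self._store[(app, element)] = xy

    def lookup(self, app: str, element: str) -> Optional[Tuple[int, int]]:
        # A hit lets a small VLM reuse prior grounding; a miss would
        # require grounding from scratch (or escalating to a larger model).
        return self._store.get((app, element))


mem = UILayoutMemory()
mem.record("Photoshop", "Save button", (1180, 42))
print(mem.lookup("Photoshop", "Save button"))  # (1180, 42): cache hit
print(mem.lookup("Photoshop", "Export menu"))  # None: would re-ground
```

Under this view, the open question is whether such cached layouts close enough of the grounding gap that escalations to the large model become rare.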

References

For CUA grounding, we conjecture that memory of UI layouts partially equalizes the models—a warm 7B VLM that “remembers” where the Save button is in Photoshop does not need to re-ground it from scratch. This conjecture remains untested.

Adaptive Vision-Language Model Routing for Computer Use Agents (2603.12823 - Liu et al., 13 Mar 2026), Section 4.4 (Memory as a Model-Size Equalizer)