MemGUI-Eval: Memory Benchmark Engine
- MemGUI-Eval is an evaluation engine that uses a three-stage, LLM-driven pipeline called Progressive Scrutiny to assess memory in mobile GUI agents.
- It produces seven hierarchical metrics measuring short-term recall, long-term learning, and execution efficiency through multi-attempt (pass@k) evaluations.
- The system exposes memory failure modes and resource-performance trade-offs to guide architectural improvements in stateful GUI agents.
MemGUI-Eval is the evaluation engine at the core of MemGUI-Bench, a benchmark specifically constructed to diagnose memory capabilities in mobile GUI agents. Unlike generic task-completion judges, MemGUI-Eval deploys a staged, LLM-driven pipeline—"Progressive Scrutiny"—and produces seven carefully defined hierarchical metrics that quantify both short-term and long-term memory, as well as execution efficiency. Its pipeline is embedded in MemGUI-Bench's 128-task suite, targeting 26 Android applications and leveraging a high proportion of cross-temporal and cross-spatial memory tasks (89.8% of the suite). Evaluation is performed over multiple attempts (pass@k), enabling assessment of both immediate recall and cross-session learning. By integrating scalable automation with semantic and visual reasoning, MemGUI-Eval surfaces memory failure modes and resource-performance trade-offs in state-of-the-art GUI agents (Liu et al., 3 Feb 2026).
1. Role within the MemGUI-Bench Framework
MemGUI-Eval constitutes the assessment backbone of MemGUI-Bench, which is designed to overcome the limitations of previous GUI agent benchmarks (with only 5.2–11.8% memory-centric tasks and no cross-session learning probes). MemGUI-Bench comprises (a) a memory-intensive, multitask suite, (b) an execution environment supporting stateful, snapshot-based resets and retries for multi-attempt evaluation (pass@k), and (c) MemGUI-Eval, which ingests agent trajectories—action logs, screenshots, and UI trees—during or after interaction with the Android emulator. After a given attempt or upon reaching a step limit, MemGUI-Eval adjudicates task completion and processes all necessary evidence for subsequent metric computation.
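The paper does not publish a trajectory schema; as a rough illustration of the evidence bundle described above, a per-attempt record might look as follows (all names in this Python sketch are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One agent action plus the evidence captured around it (hypothetical schema)."""
    index: int
    action: str            # e.g. "tap(id='send_button')" from the action log
    screenshot_path: str   # screenshot captured after the action
    ui_tree: dict          # serialized Android UI hierarchy at this step

@dataclass
class AttemptTrajectory:
    """Everything MemGUI-Eval ingests for one attempt at one task."""
    task_id: str
    goal: str                          # natural-language task goal
    attempt: int                       # 1-based index within the pass@k budget
    steps: list[Step] = field(default_factory=list)
    hit_step_limit: bool = False
```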
2. Progressive Scrutiny: Three-Stage Judging
MemGUI-Eval's "Progressive Scrutiny" mechanism structurally mimics expert human graders by escalating from lightweight to rich evidence only as needed. The process is decomposed into three stages:
- Stage 1: Cost-Effective Triage
The Triage Judge LLM receives only the task goal, action log, and last three screenshots. A "Success" verdict is issued only if criteria are unmistakably met; ambiguity triggers escalation to Stage 2.
- Stage 2: Full Semantic Analysis
Input includes the full action and step log with all screenshots:
- Step Descriptor generates structured summaries for each step, as JSON objects with "action_description" and "ui_description" fields (a sketch follows this list).
- The Semantic Judge assimilates these summaries, the goal, and trailing screenshots to comprehensively check goal satisfaction. In cases of ambiguity, it specifies additional "required_steps" (key screenshots) needed for final judgment.
- For multi-unit memory task failures, the IRR Analyzer computes the Information Retention Rate (IRR).
- Stage 3: Targeted Visual Verification
The Visual Judge, another LLM, is presented only with the requested historical screenshots and accompanying text context, and must issue a definitive binary (success/failure) decision and, where applicable, an IRR computation. This selective provision of evidence avoids token overload even in large-context models.
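The source specifies the two JSON keys produced by the Step Descriptor; a sketch of one step summary, with invented values, might look like this:

```python
# One Step Descriptor summary; the two keys are specified by the paper,
# the values here are invented for illustration.
step_summary = {
    "action_description": "Tapped the 'Compose' button in the mail app's toolbar.",
    "ui_description": "A blank draft screen with empty 'To' and 'Subject' fields.",
}
```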
This staged pipeline enables high-precision, scalable evaluation at reduced computational cost, escalating from high-confidence verdicts to deep, targeted analysis only when required (Liu et al., 3 Feb 2026).
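A minimal control-flow sketch of Progressive Scrutiny, assuming hypothetical judge wrappers around the LLM calls and the trajectory record sketched earlier:

```python
from dataclasses import dataclass

@dataclass
class SemanticVerdict:
    verdict: str               # "Success", "Failure", or "Uncertain"
    required_steps: list[int]  # indices of screenshots requested for Stage 3

def judge_attempt(traj, triage_llm, step_descriptor, semantic_llm, visual_llm):
    """Progressive Scrutiny control flow (sketch). The three-stage escalation
    follows the paper; the judge objects and their .judge(...) methods are
    hypothetical wrappers around the underlying LLM calls."""
    # Stage 1: cost-effective triage on the goal, action log, and the last
    # three screenshots only; "Success" is returned only for unmistakable cases.
    verdict = triage_llm.judge(goal=traj.goal,
                               actions=[s.action for s in traj.steps],
                               screenshots=traj.steps[-3:])
    if verdict == "Success":
        return "Success"

    # Stage 2: full semantic analysis over structured per-step summaries.
    summaries = [step_descriptor.describe(s) for s in traj.steps]
    result: SemanticVerdict = semantic_llm.judge(goal=traj.goal,
                                                 summaries=summaries,
                                                 screenshots=traj.steps[-3:])
    if result.verdict != "Uncertain":
        return result.verdict

    # Stage 3: targeted visual verification on only the requested screenshots,
    # keeping the context small instead of resending the whole trajectory.
    requested = [traj.steps[i] for i in result.required_steps]
    return visual_llm.judge(goal=traj.goal, screenshots=requested)
```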
3. Metric Suite: Seven Hierarchical Measures
MemGUI-Eval outputs a comprehensive set of evaluative metrics, structured across three interdependent dimensions:
| Dimension | Metric | Definition |
|---|---|---|
| Short-Term Memory Fidelity | Success Rate (SR) | Fraction of tasks completed successfully on the first attempt |
| Short-Term Memory Fidelity | Information Retention Rate (IRR) | Fraction of required information units correctly recalled, averaged across multi-unit memory tasks |
| Short-Term Memory Fidelity | Memory-Task Proficiency Ratio (MTPR) | Ratio of memory-task SR to standard-task SR |
| Long-Term Learning | pass@k Success Rate | Fraction of tasks solved within k attempts |
| Long-Term Learning | Failure Recovery Rate (FRR) | Harmonically weighted rate of recovery after an initial failure, rewarding earlier recovery |
| Execution Efficiency | Average Step Ratio | Agent steps relative to reference-trajectory steps |
| Execution Efficiency | Avg Time/Step & Cost/Step | Wall-clock time and API-token cost per executed step |
These metrics quantify immediate task completion (SR), memory-specific correct recall (IRR), relative memory challenge (MTPR), learning over repeated attempts (pass@k), rapidity of recovery from initial failure (FRR), optimality versus reference trajectories (step ratio), and computational/economic cost (time and API-tokens/step) (Liu et al., 3 Feb 2026).
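As a hedged restatement of the prose definitions above, the short-term and efficiency metrics can be sketched as follows (function and argument names are illustrative, not from the source):

```python
def success_rate(outcomes: list[bool]) -> float:
    """SR: share of tasks judged successful on the first attempt (in %)."""
    return 100.0 * sum(outcomes) / len(outcomes)

def information_retention_rate(recalled: int, required: int) -> float:
    """IRR: share of required information units correctly reproduced (in %),
    averaged over multi-unit memory tasks in the benchmark."""
    return 100.0 * recalled / required

def memory_task_proficiency_ratio(sr_memory: float, sr_standard: float) -> float:
    """MTPR: memory-task SR relative to standard-task SR; values well below 1
    indicate that memory demands, not general competence, drive failures."""
    return sr_memory / sr_standard

def average_step_ratio(agent_steps: int, reference_steps: int) -> float:
    """Step ratio against the reference trajectory (1.0 = optimal)."""
    return agent_steps / reference_steps
```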
4. Integration of pass@k: Capturing Cross-Session Learning
MemGUI-Eval operationalizes pass@k evaluation through the underlying snapshot framework. For each task, agents are allowed up to k attempts (typically k = 3). After each failed attempt (or budget exhaustion), the task is reset from a predefined snapshot, and MemGUI-Eval re-adjudicates each new attempt. This methodology explicitly reveals both immediate and cross-attempt learning. Metrics such as pass@k and FRR (harmonically weighted to reward faster recovery) quantitatively register improvements across retries, exposing the degree to which agents can accumulate and utilize experience for memory-intensive tasks.
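A sketch of the multi-attempt loop and a plausible harmonic FRR weighting, assuming hypothetical env.restore_snapshot and agent.run wrappers around the emulator harness; the paper's exact FRR weighting may differ:

```python
def evaluate_task(env, agent, task, judge, k=3):
    """pass@k sketch: reset from a snapshot and retry after each failure.
    env.restore_snapshot and agent.run are hypothetical wrappers around the
    benchmark's stateful Android-emulator harness."""
    for attempt in range(1, k + 1):
        env.restore_snapshot(task.snapshot_id)   # stateful, snapshot-based reset
        traj = agent.run(env, task)              # run until success or step limit
        if judge(traj) == "Success":
            return attempt                       # attempt index of first success
    return None                                  # never succeeded within budget

def failure_recovery_rate(first_successes: list) -> float:
    """FRR sketch with harmonic weighting: recovery at attempt 2 scores 1,
    at attempt 3 scores 1/2, and so on; tasks solved at attempt 1 are excluded."""
    initially_failed = [a for a in first_successes if a is None or a > 1]
    if not initially_failed:
        return 0.0
    recovered = sum(1.0 / (a - 1) for a in initially_failed if a is not None)
    return 100.0 * recovered / len(initially_failed)
```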
For example, in long-term learning scenarios, Agent-S2 improves its SR from 27.3% at pass@1 to 49.2% at pass@3, with an FRR of 21.5%, indicating that significant recovery occurs between attempts 1 and 2 (Liu et al., 3 Feb 2026).
5. LLM-as-Judge Mechanisms and Prompting
Each evaluation stage in MemGUI-Eval is governed by distinct LLM roles, primarily using Gemini 2.5 (Flash or Pro), with customized system prompts:
- Triage Judge: Passes a task only when the success conditions are "indisputable"; otherwise defers with "Uncertain."
- Step Descriptor: Converts step-level screenshot-action context into structured JSON representations.
- Semantic Judge: Verifies every goal component and refers unresolved cases to Stage 3 by enumerating the required evidence.
- IRR Analyzer: Calculates fine-grained information retention for memory task failures, following explicit formulas.
- Visual Judge: Given minimal but necessary historical screenshots and context, must issue a final binary decision.
By integrating LLMs as multi-level graders, MemGUI-Eval systematically balances cost and reliability and constrains the subjectivity often observed with unconstrained LLM-as-judge paradigms (Liu et al., 3 Feb 2026).
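The source prompts are not quoted here; the following configuration sketch paraphrases the constraints each role enforces (model assignments and prompt wording are assumptions):

```python
# Hypothetical role configuration; the paper describes the constraints each
# prompt enforces, but the wording and Flash/Pro assignments below are
# illustrative, not the source prompts.
JUDGE_ROLES = {
    "triage": {
        "model": "gemini-2.5-flash",
        "system_prompt": ("Declare 'Success' only if the goal is indisputably "
                          "met from the action log and final screenshots; "
                          "otherwise answer 'Uncertain' to defer."),
    },
    "semantic": {
        "model": "gemini-2.5-pro",
        "system_prompt": ("Verify every component of the goal against the step "
                          "summaries. If evidence is missing, list the indices "
                          "of the needed screenshots under 'required_steps'."),
    },
    "visual": {
        "model": "gemini-2.5-pro",
        "system_prompt": ("Given only the requested screenshots and context, "
                          "return a final binary verdict: Success or Failure."),
    },
}
```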
6. Diagnosis: Memory Case Studies and Failure Typology
MemGUI-Eval supports fine-grained assessment of cross-temporal (long-term) and cross-spatial (multi-app/screen) memory. For instance, top frameworks achieve 46.4%–50.0% SR on single-app tasks, but fall to 10.0%–30.0% on complex four-app tasks, with commensurate IRR declines. This documents the substantial stress multi-app retention places on short-term memory modules.
Analysis of 343 non-timeout failures yields five principal failure modes:
- Partial Memory Hallucination (PMH): IRR ∈ (0%, 100%); only part of the required information is recalled.
- Process Memory Hallucination (ProcMH): IRR = 0; procedural/context loss.
- Output Memory Hallucination (OMH): IRR = 0; failure at output or transcription.
- Knowledge Deficiency (KD): Application/domain knowledge omissions.
- Intent Misunderstanding (IM): Goal/instruction misinterpretation.
Memory hallucinations collectively account for 58.9% of non-timeout failures. These diagnostic categories, along with the "required_steps" triggers and IRR quantification, enable systematic error attribution and guide architecture refinement (Liu et al., 3 Feb 2026).
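A rough classification sketch based on the typology above; the IRR conditions follow the definitions, while the keyword checks stand in for the judge's qualitative attribution:

```python
def classify_failure(irr, judge_notes: str) -> str:
    """Map a non-timeout failure to the five modes above (sketch).
    The keyword checks are placeholders for the judge's qualitative notes."""
    if irr is not None and 0 < irr < 100:
        return "PMH"    # partial recall only
    if irr == 0:
        # procedural/context loss vs. failure at output or transcription
        return "ProcMH" if "process" in judge_notes else "OMH"
    if "knowledge" in judge_notes:
        return "KD"     # application/domain knowledge omission
    return "IM"         # goal/instruction misinterpretation
```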
7. Quantitative Insights and Architectural Implications
MemGUI-Eval empirically exposes large capability gaps in memory-intensive regimes: compared to standard benchmarks, GUI agents exhibit 4–10× larger deficits (MTPR ≈ 0.1–0.45). Framework-based agents (Agent-S2, M3A, T3A) surpass end-to-end models by a significant margin, with SR@1 of 22.7–32.8% versus only 0–6.2% for end-to-end baselines.
The detailed failure patterning and step-level cost/proficiency metrics motivate five core design implications:
- Multi-granularity memory buffers to reduce partial memory hallucinations.
- Hierarchical task decomposition and persistent goal trackers to prevent process/context losses.
- Strategic exploitation of long-context capabilities in models with sizable context windows.
- Persistent, explicit long-term memory for experience reuse and cross-session learning.
- Hybrid architectures that blend classical framework structure with lightweight, end-to-end learning for optimal memory/cost trade-offs.
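As an illustration of the first implication, a multi-granularity buffer might keep raw recent steps alongside distilled persistent facts so that partial recall degrades more gracefully; this is a sketch of the design idea, not an implementation from the paper:

```python
from collections import deque

class MultiGranularityMemory:
    """Illustrative multi-granularity buffer: fine-grained recent steps,
    a rolling episode summary, and coarse persistent facts."""
    def __init__(self, step_window: int = 10):
        self.step_buffer = deque(maxlen=step_window)  # recent step detail
        self.episode_summary = ""                     # mid-granularity summary
        self.facts = {}                               # persistent key facts

    def record_step(self, step_summary: str) -> None:
        self.step_buffer.append(step_summary)

    def commit_fact(self, key: str, value: str) -> None:
        self.facts[key] = value                       # survives buffer eviction

    def context(self) -> str:
        """Assemble the prompt context from all three granularities."""
        return "\n".join([self.episode_summary, *self.step_buffer,
                          *(f"{k}: {v}" for k, v in self.facts.items())])
```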
The top-performing hybrid (M3A) achieves 21.9% pass@3 at only 31% of the token cost of the most memory-robust configuration (Agent-S2), demonstrating the value of architectural optimization.
MemGUI-Eval's diagnostic, multi-metric, and staged methodology represents a substantial advance in the systematic, resource-aware, and interpretable evaluation of memory in complex, realistic GUI agent settings (Liu et al., 3 Feb 2026).