MemGUI-Eval: Memory Benchmark Engine
- MemGUI-Eval is an evaluation engine that uses a three-stage, LLM-driven pipeline called Progressive Scrutiny to assess memory in mobile GUI agents.
- It produces seven hierarchical metrics measuring short-term recall, long-term learning, and execution efficiency through multi-attempt (pass@k) evaluations.
- The system exposes memory failure modes and resource-performance trade-offs to guide architectural improvements in stateful GUI agents.
MemGUI-Eval is the evaluation engine at the core of MemGUI-Bench, a benchmark specifically constructed to diagnose memory capabilities in mobile GUI agents. Unlike generic task-completion judges, MemGUI-Eval deploys a staged, LLM-driven pipeline—"Progressive Scrutiny"—and produces seven carefully defined hierarchical metrics that quantify both short-term and long-term memory, as well as execution efficiency. Its pipeline is embedded in MemGUI-Bench's 128-task suite, targeting 26 Android applications and leveraging a high proportion of cross-temporal and cross-spatial memory tasks (89.8% of the suite). Evaluation is performed over multiple attempts (pass@k), enabling assessment of both immediate recall and cross-session learning. By integrating scalable automation with semantic and visual reasoning, MemGUI-Eval surfaces memory failure modes and resource-performance trade-offs in state-of-the-art GUI agents (Liu et al., 3 Feb 2026).
1. Role within the MemGUI-Bench Framework
MemGUI-Eval constitutes the assessment backbone of MemGUI-Bench, which is designed to overcome the limitations of previous GUI agent benchmarks (with only 5.2–11.8% memory-centric tasks and no cross-session learning probes). MemGUI-Bench comprises (a) a memory-intensive, multitask suite, (b) an execution environment supporting stateful, snapshot-based resets and retries for multi-attempt evaluation (pass@k), and (c) MemGUI-Eval, which ingests agent trajectories—action logs, screenshots, and UI trees—during or after interaction with the Android emulator. After a given attempt or upon reaching a step limit, MemGUI-Eval adjudicates task completion and processes all necessary evidence for subsequent metric computation.
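The paper does not publish a trajectory schema; as a rough illustration of the evidence bundle described above, a per-attempt record might look as follows (all names in this Python sketch are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One agent action plus the evidence captured around it (hypothetical schema)."""
    index: int
    action: str            # e.g. "tap(id='send_button')" from the action log
    screenshot_path: str   # screenshot captured after the action
    ui_tree: dict          # serialized Android UI hierarchy at this step

@dataclass
class AttemptTrajectory:
    """Everything MemGUI-Eval ingests for one attempt at one task."""
    task_id: str
    goal: str                          # natural-language task goal
    attempt: int                       # 1-based index within the pass@k budget
    steps: list[Step] = field(default_factory=list)
    hit_step_limit: bool = False
```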
2. Progressive Scrutiny: Three-Stage Judging
MemGUI-Eval's "Progressive Scrutiny" mechanism structurally mimics expert human graders by escalating from lightweight to rich evidence only as needed. The process is decomposed into three stages:
- Stage 1: Cost-Effective Triage
The Triage Judge LLM receives only the task goal, action log, and last three screenshots. A "Success" verdict is issued only if criteria are unmistakably met; ambiguity triggers escalation to Stage 2.
- Stage 2: Full Semantic Analysis
Input includes the full action and step log with all screenshots:
- Step Descriptor generates structured summaries for each step, as JSON objects with "action_description" and "ui_description" fields (a sketch follows this list).
- The Semantic Judge assimilates these summaries, the goal, and trailing screenshots to comprehensively check goal satisfaction. In cases of ambiguity, it specifies additional "required_steps" (key screenshots) needed for final judgment.
- For multi-unit memory task failures, the IRR Analyzer computes the Information Retention Rate (IRR).
- Stage 3: Targeted Visual Verification
The Visual Judge, another LLM, is presented only with the requested historical screenshots and accompanying text context, and must issue a definitive binary (success/failure) decision and, where applicable, an IRR computation. This selective provision of evidence avoids token overload even in large-context models.
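The source specifies the two JSON keys produced by the Step Descriptor; a sketch of one step summary, with invented values, might look like this:

```python
# One Step Descriptor summary; the two keys are specified by the paper,
# the values here are invented for illustration.
step_summary = {
    "action_description": "Tapped the 'Compose' button in the mail app's toolbar.",
    "ui_description": "A blank draft screen with empty 'To' and 'Subject' fields.",
}
```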
This staged pipeline enables high-precision, scalable evaluation at reduced computational cost, escalating from high-confidence verdicts to deep, targeted analysis only when required (Liu et al., 3 Feb 2026).
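A minimal control-flow sketch of Progressive Scrutiny, assuming hypothetical judge wrappers around the LLM calls and the trajectory record sketched earlier:

```python
from dataclasses import dataclass

@dataclass
class SemanticVerdict:
    verdict: str               # "Success", "Failure", or "Uncertain"
    required_steps: list[int]  # indices of screenshots requested for Stage 3

def judge_attempt(traj, triage_llm, step_descriptor, semantic_llm, visual_llm):
    """Progressive Scrutiny control flow (sketch). The three-stage escalation
    follows the paper; the judge objects and their .judge(...) methods are
    hypothetical wrappers around the underlying LLM calls."""
    # Stage 1: cost-effective triage on the goal, action log, and the last
    # three screenshots only; "Success" is returned only for unmistakable cases.
    verdict = triage_llm.judge(goal=traj.goal,
                               actions=[s.action for s in traj.steps],
                               screenshots=traj.steps[-3:])
    if verdict == "Success":
        return "Success"

    # Stage 2: full semantic analysis over structured per-step summaries.
    summaries = [step_descriptor.describe(s) for s in traj.steps]
    result: SemanticVerdict = semantic_llm.judge(goal=traj.goal,
                                                 summaries=summaries,
                                                 screenshots=traj.steps[-3:])
    if result.verdict != "Uncertain":
        return result.verdict

    # Stage 3: targeted visual verification on only the requested screenshots,
    # keeping the context small instead of resending the whole trajectory.
    requested = [traj.steps[i] for i in result.required_steps]
    return visual_llm.judge(goal=traj.goal, screenshots=requested)
```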
3. Metric Suite: Seven Hierarchical Measures
MemGUI-Eval outputs a comprehensive set of evaluative metrics, structured across three interdependent dimensions:
| Dimension | Metric | Definition |
|---|---|---|
| Short-Term Memory Fidelity | Success Rate (SR) | Fraction of tasks completed successfully on the first attempt |
| Short-Term Memory Fidelity | Information Retention Rate (IRR) | Fraction of required information units correctly recalled, averaged across multi-unit memory tasks |
| Short-Term Memory Fidelity | Memory-Task Proficiency Ratio (MTPR) | Ratio of memory-task SR to standard-task SR |
| Long-Term Learning | pass@k Success Rate | Fraction of tasks solved within k attempts |
| Long-Term Learning | Failure Recovery Rate (FRR) | Harmonically weighted rate of recovery after an initial failure, rewarding earlier recovery |
| Execution Efficiency | Average Step Ratio | Agent steps relative to reference-trajectory steps |
| Execution Efficiency | Avg Time/Step & Cost/Step | Wall-clock time and API-token cost per executed step |
These metrics quantify immediate task completion (SR), memory-specific correct recall (IRR), relative memory challenge (MTPR), learning over repeated attempts (pass@k), rapidity of recovery from initial failure (FRR), optimality versus reference trajectories (step ratio), and computational/economic cost (time and API-tokens/step) (Liu et al., 3 Feb 2026).
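As a hedged restatement of the prose definitions above, the short-term and efficiency metrics can be sketched as follows (function and argument names are illustrative, not from the source):

```python
def success_rate(outcomes: list[bool]) -> float:
    """SR: share of tasks judged successful on the first attempt (in %)."""
    return 100.0 * sum(outcomes) / len(outcomes)

def information_retention_rate(recalled: int, required: int) -> float:
    """IRR: share of required information units correctly reproduced (in %),
    averaged over multi-unit memory tasks in the benchmark."""
    return 100.0 * recalled / required

def memory_task_proficiency_ratio(sr_memory: float, sr_standard: float) -> float:
    """MTPR: memory-task SR relative to standard-task SR; values well below 1
    indicate that memory demands, not general competence, drive failures."""
    return sr_memory / sr_standard

def average_step_ratio(agent_steps: int, reference_steps: int) -> float:
    """Step ratio against the reference trajectory (1.0 = optimal)."""
    return agent_steps / reference_steps
```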
4. Integration of pass@k: Capturing Cross-Session Learning
MemGUI-Eval operationalizes pass@k evaluation through the underlying snapshot framework. For each task, agents are allowed up to k attempts (typically k = 3). After each failed attempt (or budget exhaustion), the task is reset from a predefined snapshot, and MemGUI-Eval re-adjudicates each new attempt. This methodology explicitly reveals both immediate and cross-attempt learning. Metrics such as pass@k and FRR (harmonically weighted to reward faster recovery) quantitatively register improvements across retries, exposing the degree to which agents can accumulate and utilize experience for memory-intensive tasks.
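A sketch of the multi-attempt loop and a plausible harmonic FRR weighting, assuming hypothetical env.restore_snapshot and agent.run wrappers around the emulator harness; the paper's exact FRR weighting may differ:

```python
def evaluate_task(env, agent, task, judge, k=3):
    """pass@k sketch: reset from a snapshot and retry after each failure.
    env.restore_snapshot and agent.run are hypothetical wrappers around the
    benchmark's stateful Android-emulator harness."""
    for attempt in range(1, k + 1):
        env.restore_snapshot(task.snapshot_id)   # stateful, snapshot-based reset
        traj = agent.run(env, task)              # run until success or step limit
        if judge(traj) == "Success":
            return attempt                       # attempt index of first success
    return None                                  # never succeeded within budget

def failure_recovery_rate(first_successes: list) -> float:
    """FRR sketch with harmonic weighting: recovery at attempt 2 scores 1,
    at attempt 3 scores 1/2, and so on; tasks solved at attempt 1 are excluded."""
    initially_failed = [a for a in first_successes if a is None or a > 1]
    if not initially_failed:
        return 0.0
    recovered = sum(1.0 / (a - 1) for a in initially_failed if a is not None)
    return 100.0 * recovered / len(initially_failed)
```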
For example, in long-term learning scenarios, Agent-S2 improves its SR from 27.3% at pass@1 to 49.2% at pass@3, with an FRR of 21.5%, indicating that significant recovery occurs between attempts 1 and 2 (Liu et al., 3 Feb 2026).
5. LLM-as-Judge Mechanisms and Prompting
Each evaluation stage in MemGUI-Eval is governed by distinct LLM roles, primarily using Gemini 2.5 (Flash or Pro), with customized system prompts:
- Triage Judge: Passes a task only when the success conditions are "indisputable"; otherwise defers with "Uncertain."
- Step Descriptor: Converts step-level screenshot-action context into structured JSON representations.
- Semantic Judge: Verifies every goal component and refers unresolved cases to Stage 3 by enumerating the required evidence.
- IRR Analyzer: Calculates fine-grained information retention for memory task failures, following explicit formulas.
- Visual Judge: Given minimal but necessary historical screenshots and context, must issue a final binary decision.
By integrating LLMs as multi-level graders, MemGUI-Eval systematically balances cost and reliability and constrains the subjectivity often observed with unconstrained LLM-as-judge paradigms (Liu et al., 3 Feb 2026).
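The source prompts are not quoted here; the following configuration sketch paraphrases the constraints each role enforces (model assignments and prompt wording are assumptions):

```python
# Hypothetical role configuration; the paper describes the constraints each
# prompt enforces, but the wording and Flash/Pro assignments below are
# illustrative, not the source prompts.
JUDGE_ROLES = {
    "triage": {
        "model": "gemini-2.5-flash",
        "system_prompt": ("Declare 'Success' only if the goal is indisputably "
                          "met from the action log and final screenshots; "
                          "otherwise answer 'Uncertain' to defer."),
    },
    "semantic": {
        "model": "gemini-2.5-pro",
        "system_prompt": ("Verify every component of the goal against the step "
                          "summaries. If evidence is missing, list the indices "
                          "of the needed screenshots under 'required_steps'."),
    },
    "visual": {
        "model": "gemini-2.5-pro",
        "system_prompt": ("Given only the requested screenshots and context, "
                          "return a final binary verdict: Success or Failure."),
    },
}
```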
6. Diagnosis: Memory Case Studies and Failure Typology
MemGUI-Eval supports fine-grained assessment of cross-temporal (long-term) and cross-spatial (multi-app/screen) memory. For instance, top frameworks achieve 46.4%–50.0% SR on single-app tasks, but fall to 10.0%–30.0% on complex four-app tasks, with commensurate IRR declines. This documents the substantial stress multi-app retention places on short-term memory modules.
Analysis of 343 non-timeout failures yields five principal failure modes:
- Partial Memory Hallucination (PMH): IRR ∈ (0%, 100%); only part of the required information is recalled.
- Process Memory Hallucination (ProcMH): IRR = 0; procedural/context loss.
- Output Memory Hallucination (OMH): IRR = 0; failure at output or transcription.
- Knowledge Deficiency (KD): Application/domain knowledge omissions.
- Intent Misunderstanding (IM): Goal/instruction misinterpretation.
Memory hallucinations collectively account for 58.9% of non-timeout failures. These diagnostic categories, along with the "required_steps" triggers and IRR quantification, enable systematic error attribution and guide architecture refinement (Liu et al., 3 Feb 2026).
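A rough classification sketch based on the typology above; the IRR conditions follow the definitions, while the keyword checks stand in for the judge's qualitative attribution:

```python
def classify_failure(irr, judge_notes: str) -> str:
    """Map a non-timeout failure to the five modes above (sketch).
    The keyword checks are placeholders for the judge's qualitative notes."""
    if irr is not None and 0 < irr < 100:
        return "PMH"    # partial recall only
    if irr == 0:
        # procedural/context loss vs. failure at output or transcription
        return "ProcMH" if "process" in judge_notes else "OMH"
    if "knowledge" in judge_notes:
        return "KD"     # application/domain knowledge omission
    return "IM"         # goal/instruction misinterpretation
```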
7. Quantitative Insights and Architectural Implications
MemGUI-Eval empirically exposes large capability gaps in memory-intensive regimes: compared to standard benchmarks, GUI agents exhibit 4–10× larger deficits (MTPR ≈ 0.1–0.45). Framework-based agents (Agent-S2, M3A, T3A) surpass end-to-end models by a significant margin, with SR@1 of 22.7–32.8% versus only 0–6.2% for end-to-end baselines.
The detailed failure patterning and step-level cost/proficiency metrics motivate five core design implications:
- Multi-granularity memory buffers to reduce partial memory hallucinations.
- Hierarchical task decomposition and persistent goal trackers to prevent process/context losses.
- Strategic exploitation of long-context capabilities in models with sizable context windows.
- Persistent, explicit long-term memory for experience reuse and cross-session learning.
- Hybrid architectures that blend classical framework structure with lightweight, end-to-end learning for optimal memory/cost trade-offs.
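As an illustration of the first implication, a multi-granularity buffer might keep raw recent steps alongside distilled persistent facts so that partial recall degrades more gracefully; this is a sketch of the design idea, not an implementation from the paper:

```python
from collections import deque

class MultiGranularityMemory:
    """Illustrative multi-granularity buffer: fine-grained recent steps,
    a rolling episode summary, and coarse persistent facts."""
    def __init__(self, step_window: int = 10):
        self.step_buffer = deque(maxlen=step_window)  # recent step detail
        self.episode_summary = ""                     # mid-granularity summary
        self.facts = {}                               # persistent key facts

    def record_step(self, step_summary: str) -> None:
        self.step_buffer.append(step_summary)

    def commit_fact(self, key: str, value: str) -> None:
        self.facts[key] = value                       # survives buffer eviction

    def context(self) -> str:
        """Assemble the prompt context from all three granularities."""
        return "\n".join([self.episode_summary, *self.step_buffer,
                          *(f"{k}: {v}" for k, v in self.facts.items())])
```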
The top-performing hybrid (M3A) achieves 21.9% pass@3 at only 31% of the token cost of the most memory-robust configuration (Agent-S2), demonstrating the value of architectural optimization.
MemGUI-Eval's diagnostic, multi-metric, and staged methodology represents a substantial advance in the systematic, resource-aware, and interpretable evaluation of memory in complex, realistic GUI agent settings (Liu et al., 3 Feb 2026).