OGRBench: GUI Outcome Reward Benchmark

Updated 4 July 2026

OGRBench is a trajectory-level evaluation benchmark for GUI outcome reward modeling that judges full-interaction success.
It aggregates real GUI trajectories from diverse platforms including Android, Ubuntu, Windows, macOS, and Web to ensure cross-platform diversity.
The benchmark assesses critic performance using binary success indicators and metrics such as accuracy, precision, recall, and F1 score.

OmniGUIRewardBench, commonly abbreviated OGRBench, is a benchmark for trajectory-level GUI outcome reward modeling. It was introduced as the evaluation counterpart to "OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards" and is defined there as the first “holistic cross-platform ORM benchmark spanning Mobile, Web, and Desktop environments”. In later work, "StainFlow: Entity-Stain Tracking and Evidence Linking for Process Rewards in GUI Agents" describes OGRBench as “a cross-platform GUI trajectory-level reward benchmark” used for “inference-time trajectory completion judgment.” In both formulations, the core task is the same: given an instruction and a full GUI trajectory, determine whether the task was successfully completed (Li et al., 19 Mar 2026, Hao et al., 5 Jun 2026).

1. Historical position and benchmark identity

OGRBench emerged from a specific gap in GUI-agent evaluation. The OS-Themis paper argues that reinforcement learning for GUI agents is highly sensitive to reward quality, yet prior evaluation practice for GUI rewards was fragmented, narrow in domain, and often misaligned with the actual trajectory-level outcome judgment required for online RL and self-improvement loops. Existing GUI benchmarks largely measured execution success rather than critic quality, while the few critic-oriented benchmarks were described as platform-limited, step-focused, or susceptible to leakage and contamination. OGRBench was introduced to address this problem directly by evaluating whether a critic can infer the final success or failure of a full GUI interaction trajectory (Li et al., 19 Mar 2026).

The benchmark’s identity is therefore narrower than a general GUI-agent benchmark and broader than a platform-specific critic testbed. It is not a benchmark for policy execution itself, nor a dataset of dense process rewards. It is an outcome reward model benchmark centered on full-trajectory binary judgment. StainFlow reuses OGRBench in exactly this role: not for online RL training, but as an offline benchmark for whether a reward-model-like verifier can read an entire GUI trajectory and correctly decide whether the task was completed (Hao et al., 5 Jun 2026).

A common misconception is that OGRBench is a benchmark for online training reward shaping. The published description is more specific. In the StainFlow study, AndroidWorld is used for online RL, whereas OGRBench is used only for offline trajectory completion judgment (Hao et al., 5 Jun 2026).

2. Dataset composition and provenance

OGRBench is built from real GUI trajectories sampled from five existing benchmark environments spanning five platform categories. The source benchmarks and platform mapping are:

AndroidWorld for Android/mobile
OSWorld for Ubuntu
WindowsAgentArena for Windows
MacOSArena for macOS
WebArena-Lite-v2 for Web

The labels are binary and are automatically determined by each benchmark’s built-in evaluation rules rather than by human trajectory-level annotation. The benchmark therefore inherits success/failure labels from the source environments (Li et al., 19 Mar 2026).

The reported dataset size is 1,409 trajectories total, with 700 positive and 709 negative samples. The appendix statistics in OS-Themis give the following platform-wise counts:

Source / Platform	Positive	Negative
OSWorld / Ubuntu	393	348
AndroidWorld / Android	98	90
WindowsAgentArena / Windows	94	119
MacOSArena / macOS	16	61
WebArena / Web	99	91
Total	700	709

The authors state that they use stratified sampling to ensure broad task coverage while keeping the overall positive ratio between 0.45 and 0.55. macOS is the stated exception: its positive/negative ratio is imbalanced because current models rarely solve those tasks, and the benchmark preserves task diversity rather than artificially balancing the split by oversampling similar positives or discarding negatives (Li et al., 19 Mar 2026).
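The balance claim can be checked directly from the platform-wise counts. The following is a minimal sketch that recomputes the totals and positive ratios from the table; the dictionary keys and the helper name `positive_ratio` are illustrative choices, not identifiers from the OS-Themis release.

```python
# Per-platform (positive, negative) counts copied from the table above.
# The key strings are illustrative labels, not official subset names.
COUNTS = {
    "Ubuntu (OSWorld)": (393, 348),
    "Android (AndroidWorld)": (98, 90),
    "Windows (WindowsAgentArena)": (94, 119),
    "macOS (MacOSArena)": (16, 61),
    "Web (WebArena)": (99, 91),
}


def positive_ratio(pos: int, neg: int) -> float:
    """Fraction of trajectories labeled as successful."""
    return pos / (pos + neg)


total_pos = sum(pos for pos, _ in COUNTS.values())
total_neg = sum(neg for _, neg in COUNTS.values())
overall_ratio = positive_ratio(total_pos, total_neg)

if __name__ == "__main__":
    for name, (pos, neg) in COUNTS.items():
        ratio = positive_ratio(pos, neg)
        print(f"{name:30s} n={pos + neg:4d}  positive ratio={ratio:.3f}")
    print(f"{'Overall':30s} n={total_pos + total_neg:4d}  positive ratio={overall_ratio:.3f}")
```

Under these counts the overall positive ratio is roughly 0.497, inside the stated 0.45 to 0.55 band, while macOS sits near 0.208, which matches the exception the authors describe.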

Trajectory diversity is also enforced through the choice of generating agents. The OS-Themis paper lists trajectories produced by the Qwen3-VL series (4B, 8B, 235B), UI-TARS variants (1.5-7B, 72B-DPO), ScaleCUA (7B, 32B), and Claude-Sonnet-4.5. This suggests that OGRBench is intended to test critic generalization across heterogeneous trajectory styles rather than overfitting to one policy family (Li et al., 19 Mar 2026).

3. Task definition and evaluation protocol

OGRBench is a trajectory-level binary classification benchmark. Each example is a full GUI trajectory, represented by the interaction sequence together with the agent’s outputs. In OS-Themis, the trajectory is formalized as

$\tau = \{(s_t, a_t, m_t)\}_{t=1}^T,$

where $s_t$ is the state or screenshot, $a_t$ is the action, and $m_t$ is agent-generated metadata at step $t$. The benchmark asks a reward framework to predict whether the task succeeded overall (Li et al., 19 Mar 2026).

StainFlow writes the completion-judgment problem in the form

$R(\tau)\in\{0,1\},\quad (r_0,\ldots,r_{T-1},\hat{y})=\phi(g,\tau), \quad r_t\in[0,1],$

where OGRBench is evaluated through the final prediction $\hat{y}$, not through RL training. This formalization makes explicit that a method may internally produce dense step rewards while OGRBench itself remains an outcome-level evaluation benchmark (Hao et al., 5 Jun 2026).
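The two formalizations above can be read as a small interface. The sketch below is one possible encoding, not code from either paper: the class names `Step` and `Example`, the `Critic` type alias, the use of strings for states and actions, and the placeholder `toy_critic` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Step:
    """One element (s_t, a_t, m_t) of a trajectory."""
    state: str     # s_t: reference to a screenshot or state dump (assumed to be a path or ID)
    action: str    # a_t: the action the agent issued
    metadata: str  # m_t: agent-generated metadata for this step


@dataclass(frozen=True)
class Example:
    """One benchmark item: instruction g, trajectory tau, and binary label R(tau)."""
    instruction: str
    trajectory: Tuple[Step, ...]
    label: int  # 1 = task completed, 0 = not completed


# phi(g, tau) -> ((r_0, ..., r_{T-1}), y_hat), with each r_t in [0, 1].
Critic = Callable[[str, Sequence[Step]], Tuple[List[float], int]]


def judge(critic: Critic, example: Example) -> int:
    """Run a critic and keep only the outcome prediction y_hat."""
    step_rewards, y_hat = critic(example.instruction, example.trajectory)
    if len(step_rewards) != len(example.trajectory):
        raise ValueError("expected one step reward per trajectory step")
    if y_hat not in (0, 1):
        raise ValueError("y_hat must be 0 or 1")
    return y_hat  # step_rewards stay internal to the method and are not scored


def toy_critic(instruction: str, trajectory: Sequence[Step]) -> Tuple[List[float], int]:
    """Placeholder critic (not from either paper): success iff the last action is 'done'."""
    rewards = [0.5] * len(trajectory)
    y_hat = int(bool(trajectory) and trajectory[-1].action == "done")
    return rewards, y_hat
```

The point of the design is that `judge` discards the dense rewards: a framework is free to compute them, but only the final binary prediction is compared with the label.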

The reported metrics are Accuracy, Precision, Recall, and F1 overall, plus per-subset Accuracy and F1. The benchmark does not provide gold step-level rewards, partial-credit labels, or milestone annotations as dataset supervision. Those structures, when present, are generated by the evaluated framework at inference time. The OS-Themis paper also does not specify train/validation/test splits for OGRBench; it presents the resource as an evaluation set (Li et al., 19 Mar 2026).
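For concreteness, the four overall metrics and the per-subset breakdown can be computed as in the sketch below. It uses the standard definitions for binary classification with success as the positive class; the function names, the `(subset, label, prediction)` record layout, and the choice to return 0.0 when a denominator is zero are assumptions of this example rather than details specified by the benchmark.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def binary_metrics(labels: Sequence[int], preds: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 with label 1 (success) as the positive class."""
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    n = len(labels)
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.0.
    accuracy = (tp + tn) / n if n else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def per_subset_metrics(records: Iterable[Tuple[str, int, int]]) -> Dict[str, Dict[str, float]]:
    """Group (subset, label, prediction) triples by subset and score each group."""
    labels_by_subset: Dict[str, List[int]] = defaultdict(list)
    preds_by_subset: Dict[str, List[int]] = defaultdict(list)
    for subset, label, pred in records:
        labels_by_subset[subset].append(label)
        preds_by_subset[subset].append(pred)
    return {
        name: binary_metrics(labels_by_subset[name], preds_by_subset[name])
        for name in labels_by_subset
    }
```

Reporting precision and recall alongside accuracy matters on skewed subsets such as macOS, where a critic that always predicts failure would still score high accuracy.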

Another common misconception is that OGRBench contains gold process decompositions. The published benchmark description states the opposite: milestone chains and intermediate evidence structures belong to the critic framework, not to the benchmark annotation itself (Li et al., 19 Mar 2026).

4. Methodological role in GUI reward modeling

OGRBench is designed to test a specific failure mode of GUI reward modeling: the difficulty of making reliable whole-trajectory outcome judgments in long-horizon, stochastic, cross-platform environments. The OS-Themis paper frames the problem as one of critic design. Its solution decomposes a trajectory into milestones, verifies them locally, audits the completeness of the milestone set, and then aggregates the deliberation into a final binary reward decision. The benchmark is the main instrument used to validate that design (Li et al., 19 Mar 2026).
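The four stages named above (decompose, verify, audit, aggregate) can be laid out as a control-flow skeleton. This is only a hedged sketch of the stage ordering: the function `milestone_judgment`, its callable parameters, the text-only trajectory type, and in particular the aggregation rule (success only if the milestone set is audited as complete and every milestone is verified) are assumptions for illustration, not the OS-Themis implementation.

```python
from typing import Callable, List, Sequence

# Simplification for this sketch: each step is a short text description.
Trajectory = Sequence[str]


def milestone_judgment(
    instruction: str,
    trajectory: Trajectory,
    propose: Callable[[str, Trajectory], List[str]],
    verify: Callable[[str, Trajectory], bool],
    audit: Callable[[str, List[str]], bool],
) -> int:
    """Return a binary outcome reward from a milestone-style deliberation."""
    milestones = propose(instruction, trajectory)            # 1. decompose into milestones
    verdicts = [verify(m, trajectory) for m in milestones]   # 2. verify each milestone
    complete = audit(instruction, milestones)                # 3. audit the milestone set
    # 4. aggregate (assumed rule, see lead-in): all checks must pass.
    return int(complete and bool(milestones) and all(verdicts))


# Hypothetical stand-ins for model calls, used only to exercise the skeleton.
def toy_propose(instruction: str, trajectory: Trajectory) -> List[str]:
    return [part.strip() for part in instruction.split(" then ")]


def toy_verify(milestone: str, trajectory: Trajectory) -> bool:
    return milestone in trajectory


def toy_audit(instruction: str, milestones: List[str]) -> bool:
    return len(milestones) > 0
```

In a real critic each callable would wrap a vision-language model query; the skeleton only shows where the benchmark's binary label is compared, namely against the single integer returned at the end.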

StainFlow uses OGRBench to argue for a different mechanism: entity-centered process evidence organization rather than milestone scripts or fixed local judging windows. In that work, OGRBench is explicitly an offline testbed for whether a verifier can integrate instruction $g$, full trajectory $\tau$, verified key nodes, final entity state, and tail evidence into an accurate completion prediction $\hat{y}$. The benchmark is thus treated as a stress test for two capabilities: tolerating multiple valid execution paths and recovering long-range visual evidence (Hao et al., 5 Jun 2026).

The StainFlow analysis is especially important for understanding why OGRBench is difficult. It reports that the evidence span needed for candidate key-node verification is substantially larger than a one-step neighborhood: average spans exceed 10 on Ubuntu, Android/Mobile, Windows, and macOS, reach 14.1 on macOS, and exceed 7.6 even on Web. This supports the claim that fixed local windows are mismatched to OGRBench-style trajectory judgment (Hao et al., 5 Jun 2026).

A plausible implication is that OGRBench is less a benchmark of single-frame correctness than a benchmark of evidence aggregation under temporal dispersion. That interpretation is consistent with both OS-Themis and StainFlow, though the benchmark itself remains outcome-labeled rather than process-labeled.

5. Reported results and what they demonstrate

In the OS-Themis study, OGRBench is used to compare three judge frameworks: DigiRL, ZeroGUI, and OS-Themis. The headline result is that all evaluated models achieve their best performance under OS-Themis. Averaged across evaluated backbones, OS-Themis outperforms DigiRL by +18.8% accuracy, +29.6% precision, +16.9% recall, and +26.2% F1, and outperforms ZeroGUI by +7.7% accuracy, +5.1% precision, +13.0% recall, and +13.4% F1. The strongest single configuration reported there is Qwen3-VL-235B under OS-Themis, with 88.0 accuracy, 92.8 precision, 82.3 recall, and 87.2 F1 (Li et al., 19 Mar 2026).

The OS-Themis paper also reports cross-platform means under OS-Themis of 80.3 Acc / 78.5 F1 on Ubuntu, 84.0 / 82.1 on Mobile, 79.4 / 72.0 on Windows, 89.9 / 70.3 on macOS, and 85.3 / 82.1 on Web. The appendix statistics further indicate that only a subset of trajectory steps are outcome-critical: across all 1,409 tasks under Qwen3-VL-235B, the authors report 27,882 total steps, 9,918 total milestones, 19.79 average steps per task, 7.04 average milestones per task, and a 35.57% global step-level milestone ratio (Li et al., 19 Mar 2026).
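The per-task averages and the milestone ratio follow from the three reported totals. The short check below recomputes them; the constant names are illustrative, and the interpretation of the ratio as total milestones divided by total steps is an assumption that happens to reproduce the reported figure.

```python
# Totals as reported for Qwen3-VL-235B under OS-Themis.
TOTAL_TASKS = 1409
TOTAL_STEPS = 27882
TOTAL_MILESTONES = 9918

avg_steps_per_task = TOTAL_STEPS / TOTAL_TASKS
avg_milestones_per_task = TOTAL_MILESTONES / TOTAL_TASKS
# Assumed definition: share of steps that correspond to a milestone.
milestone_ratio_pct = 100 * TOTAL_MILESTONES / TOTAL_STEPS

if __name__ == "__main__":
    print(f"average steps per task:      {avg_steps_per_task:.2f}")
    print(f"average milestones per task: {avg_milestones_per_task:.2f}")
    print(f"step-level milestone ratio:  {milestone_ratio_pct:.2f}%")
```

Rounded to two decimals, these come out as 19.79, 7.04, and 35.57%, consistent with the figures quoted above.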

In the later StainFlow study, OGRBench is reused with a different evaluator set: Qwen3-VL-8B, Qwen3.5-VL-9B, Qwen3.5-VL-27B, GPT-5, and Gemini-3-Flash. There, the strongest reported result is StainFlow with Gemini-3-Flash, reaching 88.2 overall accuracy and 88.2 F1. The paper states this as a 1.8% relative accuracy gain over the strongest baseline, OS-Themis, whose corresponding Gemini-3-Flash configuration obtains 86.6 overall accuracy. With the same verifier, StainFlow reports 87.5 precision and 89.0 recall, compared with OS-Themis at 88.4 precision and 84.1 recall; the method therefore trades 0.9 points of precision for 4.9 points of recall and achieves the best overall F1 (Hao et al., 5 Jun 2026).
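The relationships among these reported numbers can be verified with a few lines of arithmetic. The sketch below recomputes F1 as the harmonic mean of the rounded precision and recall values and derives the relative accuracy gain; the helper name `f1_from` is an illustrative choice, and because the inputs are already rounded, agreement is expected only to one decimal place.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)


# StainFlow vs. OS-Themis with Gemini-3-Flash, as reported in the StainFlow study.
stainflow_f1 = f1_from(87.5, 89.0)
relative_acc_gain_pct = 100 * (88.2 - 86.6) / 86.6
precision_given_up = 88.4 - 87.5
recall_gained = 89.0 - 84.1

# Qwen3-VL-235B under OS-Themis, as reported in the OS-Themis study.
themis_235b_f1 = f1_from(92.8, 82.3)

if __name__ == "__main__":
    print(f"StainFlow F1 from P/R:      {stainflow_f1:.1f}")
    print(f"relative accuracy gain:     {relative_acc_gain_pct:.1f}%")
    print(f"precision given up:         {precision_given_up:.1f}")
    print(f"recall gained:              {recall_gained:.1f}")
    print(f"OS-Themis 235B F1 from P/R: {themis_235b_f1:.1f}")
```

The recomputed values round to 88.2 F1, a 1.8% relative gain, 0.9 and 4.9 points for the precision/recall trade, and 87.2 F1 for the earlier OS-Themis configuration, so the reported figures are internally consistent.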

Subset-level results in StainFlow further show that, with Gemini-3-Flash, the method is best on Ubuntu, Mobile, macOS, and Web, while remaining competitive but not best on Windows. This matters because it demonstrates that OGRBench is sensitive not only to overall critic quality but also to the way evidence is organized across different GUI ecosystems (Hao et al., 5 Jun 2026).

Because these two result sets come from different studies with different evaluated backbones and frameworks, they should not be collapsed into a single unified leaderboard. What they jointly establish is narrower and more important: OGRBench is sufficiently challenging that improvements in critic architecture, evidence selection, and trajectory reasoning protocol measurably change benchmark performance.

6. Relation to adjacent benchmarks and principal limitations

OGRBench occupies a distinct niche among GUI and multimodal evaluation resources. "MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents" is a broad benchmark for GUI automation agents across Windows, macOS, Linux, iOS, Android, and Web, with four levels ranging from GUI content understanding to task collaboration and an efficiency-sensitive online metric, EQA. However, it is not a reward-model benchmark and does not provide explicit pairwise trajectory preferences or dense reward annotations (Wang et al., 25 Jul 2025). By contrast, OGRBench is specifically about critic correctness on full GUI trajectories.

Outside the GUI domain, several reward benchmarks provide useful methodological comparisons but are not domain matches. "Multimodal RewardBench 2" is a benchmark for omni reward models on interleaved text and image tasks and is explicitly not itself a GUI benchmark (Hu et al., 18 Dec 2025). "Omni-Reward: Towards Generalist Omni-Modal Reward Modeling with Free-Form Preferences" introduces Omni-RewardBench, which covers text, image, video, audio, and 3D, but does not define OGRBench and does not include GUI tasks (Jin et al., 27 Oct 2025). "Omni-RRM" proposes rubric-grounded omni reward modeling across text, image, video, and audio, again without GUI trajectory evaluation (Kong et al., 31 Jan 2026). "EQA-RM" and EQARewardBench show how a domain-specific reward benchmark can be built around trajectory-conditioned multimodal judgment in embodied QA rather than GUI interaction (Chen et al., 12 Jun 2025). These neighboring efforts suggest broader benchmark design patterns, but OGRBench remains unusual in targeting cross-platform GUI outcome rewards directly.

The published limitations of OGRBench are correspondingly specific. First, the benchmark is outcome-focused: it evaluates binary success/failure judgment rather than dense process rewards, calibrated uncertainty, or graded partial credit (Li et al., 19 Mar 2026). Second, its labels are inherited from source benchmark evaluators, not from human adjudication, so it does not report inter-annotator agreement (Li et al., 19 Mar 2026). Third, the benchmark does not specify official train/dev/test splits or a protected evaluation server (Li et al., 19 Mar 2026). Fourth, platform balance is imperfect, especially on macOS, where positives are scarce because current models rarely solve the tasks (Li et al., 19 Mar 2026). Fifth, the benchmark does not by itself expose step-level failure categories such as grounding error, planning error, or action-format error; those distinctions are left to evaluated frameworks.

These constraints also define OGRBench’s significance. It does not attempt to solve every aspect of GUI-agent evaluation. Instead, it standardizes one problem that RL-based GUI systems urgently need solved: reliable trajectory-level outcome judgment across heterogeneous GUI environments. That focus explains why OGRBench has become a natural evaluation target for frameworks such as OS-Themis and StainFlow, and why it is best understood as a benchmark for generalist GUI rewards, not for GUI control alone (Li et al., 19 Mar 2026, Hao et al., 5 Jun 2026).