GUI-Owl-1.5: GUI Reasoning & Automation Suite
- GUI-Owl-1.5 is a multi-faceted suite integrating ontology reasoning, GUI automation, and visual defect detection.
- It utilizes a Java-based OWL reasoner with a JavaFX interface and advanced vision-language agents with transformer architectures.
- OwlEye employs a deep CNN with Grad-CAM for precise UI bug localization and detection, validated through rigorous benchmarks.
GUI-Owl-1.5 is a multi-faceted suite of systems and models for graphical user interface (GUI) reasoning, automation, understanding, and visual defect detection. It encompasses three major threads: (1) a modern, Java-based desktop ontology reasoner and editor for OWL (Web Ontology Language); (2) a state-of-the-art vision-language agent family for GUI automation and tool use; and (3) a deep learning system for UI display issue detection and localization. Each subsystem advances its respective area with tailored models, algorithms, and rigorous benchmarks across the desktop and mobile GUI research landscape (Abicht, 2023; Xu et al., 15 Feb 2026; Liu et al., 2020).
1. Model Families and System Architecture
1.1 GUI-Owl: OWL Reasoner and Desktop Toolkit
GUI-Owl version 1.5 is a cross-platform desktop environment for editing, visualizing, and reasoning over OWL ontologies. It is implemented in Java (SE 8+) with a JavaFX GUI, supporting Windows 10/11, Ubuntu 20.04+, and macOS 10.15+ (Abicht, 2023). Its architecture follows an MVC-style pattern, with a JavaFX-based presentation layer, controller layer for event handling and ontology session management, an OWL API 5.x-based reasoning engine adapter (leveraging HermiT 1.4.3, ELK 0.5.0, and Pellet 2.6.5), and a persistence layer for ontology I/O. User-facing features include a tree-based ontology explorer, drag-and-drop canvas editor, SPARQL query builder, result panels, batch reasoning, and plugin management.
1.2 GUI-Owl-1.5: Multi-Platform Vision-Language Agent Suite
GUI-Owl-1.5, as described by Xu et al. (15 Feb 2026), is a family of open-source vision-language models built on Qwen3-VL backbones with two specialized heads:
- Instruct Head: Lightweight, for low-latency, high-throughput action conclusion and tool-calling.
- Thinking Head: Deeper, chain-of-thought (CoT) generation with structured intermediate reasoning steps and enhanced planning, implemented as a transformer decoder stacked atop the backbone.
Model sizes include 2B, 4B, 8B, 32B, and 235B-A22B parameters, deployed across desktop, mobile, browser, and cloud-edge collaborative platforms.
1.3 GUI-Owl-1.5 ("OwlEye"): Deep Learning for Visual UI Bug Detection
The OwlEye 1.5 system is a standalone deep convolutional neural network (CNN) for GUI screenshot-based display issue detection (Liu et al., 2020). It consists of 12 convolutional layers (with batch normalization and ReLU activations), six max-pooling layers, and four fully-connected layers leading to a binary classifier ("normal UI" vs. "display issue"), post-processed by Grad-CAM for bug localization.
2. Training Pipelines and Data Sources
2.1 Hybrid Data Flywheel in Vision-Language Agents
GUI-Owl-1.5 agent models are trained via a hybrid data flywheel combining:
- Real-device rollouts: Cloud sandboxed device exploration with checkpointing and subtask validation.
- Virtual environment trajectories: GUI tasks sampled in web-rendered or simulated environments, generating atomic tool-use or RPA scripts.
Tasks are specified as paths through directed acyclic graphs (DAGs), and training batches maintain a real:virtual ratio (α ≈ 0.3). Data quality is reinforced with interface-consistency loss and filtered by per-subtask critic functions (Xu et al., 15 Feb 2026).
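The batch-mixing rule can be illustrated with a minimal sketch. It assumes that α is the fraction of each batch drawn from real-device rollouts (the text gives only a real:virtual ratio with α ≈ 0.3 and does not say which side α refers to). The function name `mix_batch` and the trajectory tuples are hypothetical.

```python
import random

def mix_batch(real, virtual, batch_size, alpha=0.3, seed=None):
    """Draw round(alpha * batch_size) real-device trajectories and fill the
    rest of the batch from virtual-environment trajectories.

    Assumption: alpha is the real-device fraction of the batch.
    """
    rng = random.Random(seed)
    n_real = round(alpha * batch_size)
    n_virtual = batch_size - n_real
    if n_real > len(real) or n_virtual > len(virtual):
        raise ValueError("not enough trajectories to fill the batch")
    batch = rng.sample(real, n_real) + rng.sample(virtual, n_virtual)
    rng.shuffle(batch)  # avoid ordering the batch by data source
    return batch

# Toy pools of placeholder trajectories.
real_pool = [("real", i) for i in range(50)]
virtual_pool = [("virtual", i) for i in range(200)]

batch = mix_batch(real_pool, virtual_pool, batch_size=20, alpha=0.3, seed=0)
n_real = sum(1 for source, _ in batch if source == "real")
print(n_real, len(batch) - n_real)  # prints "6 14"
```

Sampling without replacement keeps a single trajectory from appearing twice in one batch. A production pipeline would presumably apply the critic-based filtering before this step.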
2.2 Thought-Synthesis and CoT Augmentation
To ensure step-wise reasoning traces even for supervised data, a dedicated pipeline synthesizes thought tokens per agent step. The pipeline combines VLM-generated observations, memory updates, LLM-based reflections, and step conclusion synthesis, appending the results to standard instruction-following data.
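As a rough illustration of how the four synthesized components might be attached to a supervised sample, the sketch below uses a small record type. All field names, the tag format, and the `augment_step` helper are assumptions made for this example, not the format used by Xu et al.

```python
from dataclasses import dataclass

@dataclass
class StepThought:
    observation: str  # VLM-generated description of the current screen
    memory: str       # running task memory after this step
    reflection: str   # LLM-based critique of the previous action
    conclusion: str   # short statement of what this step achieves

def render_thought(thought: StepThought) -> str:
    """Serialise the four components into one thought string (hypothetical tags)."""
    parts = [
        ("observation", thought.observation),
        ("memory", thought.memory),
        ("reflection", thought.reflection),
        ("conclusion", thought.conclusion),
    ]
    return "\n".join(f"[{name}] {text}" for name, text in parts)

def augment_step(sample: dict, thought: StepThought) -> dict:
    """Return a copy of an instruction-following sample with a thought attached."""
    augmented = dict(sample)  # leave the original sample untouched
    augmented["thought"] = render_thought(thought)
    return augmented

sample = {"instruction": "Open Settings", "action": "click(120, 48)"}
thought = StepThought(
    observation="Home screen with a gear icon in the top bar.",
    memory="Nothing done yet.",
    reflection="No previous action to review.",
    conclusion="Clicking the gear icon should open Settings.",
)
augmented = augment_step(sample, thought)
```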
2.3 Centralized Dataset Construction in OwlEye
OwlEye 1.5’s dataset comprises:
- 4,470 manually labeled crowdtesting screenshots with UI bugs (component occlusion, text overlap, missing image, NULL value, blurred screens) and 4,470 bug-free control screenshots, from 562 Android crowdtesting tasks.
- Extensive heuristics-based data augmentation, producing an additional 7,820 synthetic bug images and 7,820 synthetic “bug-free” controls.
- The dataset is split by application into training (21,980), validation (1,000), and test sets (1,600) (Liu et al., 2020).
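Splitting by application means that all screenshots of one app land in the same split, so the test set contains only unseen apps. A minimal sketch of such a grouped split follows. The record layout, the `split_by_app` name, and the 10%/10% fractions are illustrative assumptions and do not reproduce the split sizes reported above.

```python
import random
from collections import defaultdict

def split_by_app(samples, val_frac=0.1, test_frac=0.1, seed=0):
    """Split (app_id, screenshot, label) records so that no app spans two splits.

    The fractions apply to apps, not images, and are illustrative only.
    """
    by_app = defaultdict(list)
    for record in samples:
        by_app[record[0]].append(record)

    apps = sorted(by_app)  # deterministic order before shuffling
    random.Random(seed).shuffle(apps)

    n_val = max(1, int(len(apps) * val_frac))
    n_test = max(1, int(len(apps) * test_frac))
    val_apps = apps[:n_val]
    test_apps = apps[n_val:n_val + n_test]
    train_apps = apps[n_val + n_test:]

    def gather(app_ids):
        return [r for app in app_ids for r in by_app[app]]

    return gather(train_apps), gather(val_apps), gather(test_apps)

# Toy data: 10 hypothetical apps with 4 screenshots each.
samples = [(f"app{a}", f"app{a}_{i}.png", i % 2)
           for a in range(10) for i in range(4)]
train, val, test = split_by_app(samples)
```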
3. Algorithms, Reasoning, and Learning
3.1 OWL Reasoning Engines
GUI-Owl 1.5 delegates inference to several established reasoners:
- Tableau/Hypertableau (HermiT): Supports full OWL 2 DL (SROIQ(D)), for which consistency checking is N2ExpTime-complete in the worst case.
- ELK: Completion-rule algorithm for OWL 2 EL, yielding practical O(n³) or better classification.
- Rule-Based Engine (OWL 2 RL): Forward-chaining datalog with PTIME data complexity.
GUI-Owl supports TBox classification, ontology consistency, instance checks, SPARQL query answering, and justification/explanation for entailments (Abicht, 2023).
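To make the forward-chaining idea behind the OWL 2 RL engine concrete, the sketch below applies two rules from the OWL 2 RL/RDF rule set (`scm-sco`, subclass transitivity, and `cax-sco`, type propagation) until no new facts appear. It is a toy fixpoint loop over Python sets, not the engine used by GUI-Owl, and the function name and example ontology are made up for illustration.

```python
def forward_chain(subclass_of, types):
    """Saturate a tiny knowledge base under two OWL 2 RL style rules.

    subclass_of: set of (sub, sup) pairs
    types:       set of (individual, cls) pairs

    scm-sco: (A sub B), (B sub C)  ->  (A sub C)
    cax-sco: (x type A), (A sub B) ->  (x type B)
    """
    sub = set(subclass_of)
    typ = set(types)
    changed = True
    while changed:
        new_sub = {(a, c) for (a, b) in sub for (b2, c) in sub if b == b2} - sub
        new_typ = {(x, b) for (x, a) in typ for (a2, b) in sub if a == a2} - typ
        changed = bool(new_sub or new_typ)
        sub |= new_sub
        typ |= new_typ
    return sub, typ

# Hypothetical example ontology.
subclasses = {("Dog", "Mammal"), ("Mammal", "Animal")}
assertions = {("rex", "Dog")}
closed_sub, closed_typ = forward_chain(subclasses, assertions)
print(("rex", "Animal") in closed_typ)  # prints "True"
```

Each pass only adds facts and the number of possible pairs is finite, so the loop terminates; this monotone saturation is what gives rule-based OWL 2 RL reasoning its polynomial data complexity.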
3.2 MRPO: Multi-Platform RL Policy Optimization
For GUI agent RL, GUI-Owl-1.5 introduces MRPO, a grouped PPO variant accommodating device-conditioned policies. It employs oversample-then-subsample groupings to maintain on-policy guarantees and avoid collapse, with alternating stage-wise training per platform (mobile, desktop, web) to reduce gradient interference.
Key components:
- A rollout group size and a per-batch oversample factor, used for the oversample-then-subsample grouping.
- PPO-style surrogate loss with clipping.
- Alternating gradient steps per device family (Xu et al., 15 Feb 2026).
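The PPO-style clipped surrogate named above can be written out in a few lines. The sketch below shows the standard PPO clipping objective together with two assumptions that the text does not confirm: advantages are normalised within a rollout group, and platforms alternate in simple round-robin order. The function names and the value ε = 0.2 are illustrative, and this is not the MRPO implementation.

```python
import math
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Normalise rewards within one rollout group (assumed grouping scheme)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO loss: -mean(min(r * A, clip(r, 1 - eps, 1 + eps) * A))."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)

def platform_schedule(platforms, n_stages):
    """Round-robin stage order across device families (assumed schedule)."""
    return [platforms[i % len(platforms)] for i in range(n_stages)]

advantages = group_advantages([1.0, 0.0, 1.0, 0.0])
loss = clipped_surrogate_loss(
    logp_new=[-0.9, -1.1, -1.0, -1.0],
    logp_old=[-1.0, -1.0, -1.0, -1.0],
    advantages=advantages,
)
stages = platform_schedule(["mobile", "desktop", "web"], n_stages=5)
```

The clipping bounds how far a single update can move the policy away from the one that generated the rollouts, which is the usual motivation for keeping updates close to on-policy.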
3.3 CNN for Visual Bug Detection
OwlEye 1.5 passes GUI screenshots through 12 conv layers, followed by four FC layers, applying cross-entropy for binary defect classification. Grad-CAM is used for post-hoc region localization. Training leverages image-level labels; bounding-box supervision is not used (Liu et al., 2020).
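The Grad-CAM step can be sketched independently of any deep learning framework. Given the last convolutional layer's feature maps and the gradients of the predicted class score with respect to them, each channel is weighted by its spatially averaged gradient, the weighted maps are summed, and negative values are discarded. The plain-list implementation and toy inputs below are illustrative assumptions, not OwlEye's code.

```python
def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap from K feature maps of size H x W.

    activations[k][i][j]: feature map k of the last conv layer
    gradients[k][i][j]:   d(class score) / d(activations[k][i][j])
    Returns an H x W heatmap scaled to [0, 1].
    """
    n_channels = len(activations)
    height, width = len(activations[0]), len(activations[0][0])

    # Channel weights: global average pooling of the gradients.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]

    # Weighted sum over channels followed by ReLU.
    cam = [
        [
            max(0.0, sum(weights[k] * activations[k][i][j] for k in range(n_channels)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam

# Toy input: channel 0 supports the class, channel 1 opposes it.
acts = [[[1.0, 0.0], [0.0, 0.0]],
        [[0.0, 0.0], [0.0, 1.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]],
         [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(acts, grads)
print(heatmap)  # prints "[[1.0, 0.0], [0.0, 0.0]]"
```

In practice the heatmap is upsampled to the screenshot resolution and overlaid on the image, which is how a classifier trained only on image-level labels can still point to the suspicious region.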
4. Benchmarks and Empirical Results
4.1 OWL Reasoner and Editor
On a 10K-class, 20K-axiom biomedical ontology, GUI-Owl 1.5 achieves:
| Reasoner | Classification (s) | Memory (MB) |
|---|---|---|
| HermiT | 4.8 | 380 |
| ELK | 1.2 | 220 |
| Pellet | 6.5 | n/a |
For standard ontologies: GALEN (15K axioms): 12.7s; SWRC (1.2K): 0.45s. GUI integration adds 10–15% overhead to CLI runs, while ELK-based classification scales nearly linearly to 4 CPU cores. Batch mode and headless operation are supported (Abicht, 2023).
4.2 Agent Model Performance
In live GUI automation, grounding, and tool-usage benchmarks, GUI-Owl-1.5 achieves:
| Model Variant | OSWorld | AndroidWorld | WebArena | ScreenSpotPro | OSWorld-MCP | MobileWorld |
|---|---|---|---|---|---|---|
| 8B-Inst | 52.3 | 69.0 | 45.7 | 39.4 | 41.8 | 41.8 |
| 8B-Think | 52.9 | 71.6 | 46.7 | 40.8 | 38.8 | 33.3 |
| 32B-Inst | 56.5 | 69.8 | – | 46.6 | 47.6 | 46.8 |
| UI-TARS-2 (72B) | 53.1 | 73.3 | 40.2 | 52.9 | – | – |
On ScreenSpotPro for high-res grounding, 32B-Instruct reaches 80.3%, surpassing prior best 70.9%. GUI Knowledge Bench score: 75.45 (32B-Inst); O3: 73.30; Gemini-2.5-Pro: 71.69 (Xu et al., 15 Feb 2026).
Ablation findings: Excluding virtual environment data drops PC-Eval from 75.4% to 42.0%. Unified CoT removal reduces OSWorld and AndroidWorld scores significantly (52.9% → 47.4% and 71.6% → 65.0%, respectively).
4.3 Display Bug Detection and Localization
On a 1,600-image test set:
- Precision: 0.85, Recall: 0.848, F1: 0.849.
- By category (P/R): occlusion (0.859/0.814), text overlap (0.818/0.806), missing image (0.855/0.904), NULL value (0.855/0.808), blurred (0.888/0.800).
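The overall F1 value follows from the reported precision and recall as their harmonic mean, which can be checked directly:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

overall_f1 = f1_score(0.85, 0.848)
print(round(overall_f1, 3))  # prints "0.849"
```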
Against 13 baselines (including SIFT/SURF/ORB + SVM/KNN/NB/RF, and an 8-layer MLP), OwlEye 1.5 outperforms every baseline by at least 17% in recall and at least 50% in precision. Grad-CAM-based localization is rated as correct in 90% of cases by end-users (Kendall’s W = 0.946). In a deployment across 329 Android apps, 57 with display issues were detected, and 26 issues were confirmed or fixed, including in popular apps (Perfect Piano, Thunder VPN, etc.) (Liu et al., 2020).
5. Supported Features, User Interaction, and Limitations
5.1 GUI-Owl Desktop Toolkit
Key user-facing features (Abicht, 2023):
- Ontology Explorer: browsable class/property tree with unsatisfiable node highlighting.
- Visual Ontology Canvas: interactive editing of class intersections, property restrictions.
- SPARQL Builder: SELECT/WHERE pattern fields with IRI auto-completion.
- Explanation View: minimal axiom set explanations.
- Batch reasoning and plugin management.
- Live OWL 2 profile conformance validation.
Known limitations:
- OWL 2 RL is restricted to rule fragments; large ABox queries are slow.
- Pellet adapter (2.6.5) issues warnings under Java 11+.
- Performance degrades with >10 simultaneous ontologies (memory leak in Canvas Editor).
- No OWL 2 Full/probabilistic reasoning.
- Partial explanations of n-ary property chains (issue #142, pending fix in v1.6).
5.2 Agent Suite and Cloud Demo
- Supports device- and platform-conditioned policy optimization.
- 8B-Inst (laptop/edge) and 32B-Think (cloud) collaborative inference.
- Live cloud-sandbox demo: https://github.com/X-PLUG/MobileAgent (Xu et al., 15 Feb 2026).
5.3 OwlEye Defect Detection
- Integration with DroidBot for automated screenshot collection and bug reporting.
- No bounding-box supervision; localization is via Grad-CAM.
- Demonstrates cross-platform and multilingual generality with >70% initial cross-platform accuracy.
6. Comparative Context and Impact
GUI-Owl-1.5 systems collectively advance the state of the art in GUI understanding and automation:
- The OWL reasoner/editor establishes a robust standard for ontology reasoning environments, while maintaining moderate overhead and batch/high-performance capabilities.
- The agent suite surpasses previous models in multi-device GUI automation (e.g., outperforming UI-TARS-2 and O3 in both automation and knowledge-intensive tasks).
- OwlEye 1.5 raises detection and localization precision/recall substantially over both traditional hand-engineered and earlier deep learning baselines, reflecting superior architectural depth and training data augmentation (Abicht, 2023; Xu et al., 15 Feb 2026; Liu et al., 2020).
Deployment in real-world app audits confirms practical impact, with defect detections leading to confirmed fixes in commercial applications. Integration of chain-of-thought and environment RL scaling provides a foundation for further research in explainable, generalizable agentic GUI systems.
7. Reproducibility and Resources
- All agent code and pretrained models are open-sourced under Apache-2.0.
- Comprehensive documentation, user guides, and live cloud demos are provided online.
- Pretraining: 1.2T tokens (UI, QA/VQA, world-model); SFT: 300M turns; MRPO RL: ~100M environment steps per device (Xu et al., 15 Feb 2026).
- For the OWL reasoner GUI: source, documentation, and plugins are accessible at the specified GitHub and documentation URLs (Abicht, 2023). The OwlEye code and dataset structure support turnkey deployment and evaluation, with full data splits and augmentation scripts available (Liu et al., 2020).