FocusUI: Efficient UI Grounding

Updated 4 July 2026

FocusUI is a framework for efficient UI grounding that uses instruction-guided token selection and PosPad to preserve positional continuity in high-resolution screenshots.
It reduces computational burden by significantly cutting visual tokens, which results in faster inference and lower GPU memory usage while maintaining spatial structure.
Empirical benchmarks across mobile, desktop, and web interfaces demonstrate its robust performance in accurately localizing UI elements on dense and complex layouts.

FocusUI is a framework for efficient UI grounding that reduces the computational burden of processing high-resolution screenshots while preserving the positional structure needed for precise localization. It addresses a specific failure mode in GUI-oriented vision-language modeling: standard visual token pruning can reduce cost, but it often disrupts positional continuity and degrades grounding accuracy on dense, high-resolution interfaces. FocusUI therefore combines instruction-aware visual token selection with a position-preserving compression strategy, PosPad, and reports strong accuracy–efficiency tradeoffs on four grounding benchmarks (Ouyang et al., 7 Jan 2026).

1. Definition, scope, and naming

FocusUI is the formal name of the method introduced in "FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection". Its target task is UI grounding: given a screenshot and a natural-language instruction, the model must localize the corresponding UI element or region. The framework is not a general-purpose UI design system, nor is it a generic visual token pruning method. Its scope is narrower and more technical: efficient multimodal inference for grounding on high-resolution graphical user interfaces.

A common point of confusion is the relation between FocusUI and Focus. The latter is a different framework, introduced in "Think Twice, Click Once: Enhancing GUI Grounding via Fast and Slow Systems", which combines fast prediction with staged reasoning through interface summarization and focused analysis. FocusUI, by contrast, is centered on token-efficiency and positional continuity rather than dual-system reasoning (Tang et al., 9 Mar 2025).

The problem setting is motivated by the scale of visual tokenization in modern VLMs. A 2K screenshot can produce about 4700 visual tokens, and some evaluation settings exceed 5000 visual tokens per sample. In the reported analysis, visual tokens account for at least 84.3% and often more than 95% of the full multimodal sequence, which increases prefill cost, decoding-time memory, latency, and GPU memory consumption (Ouyang et al., 7 Jan 2026).

2. Problem formulation and failure mode

FocusUI treats efficient UI grounding as a distinct research problem because UI grounding is unusually sensitive to spatial precision. Screenshots are high resolution, interfaces are structurally dense, and the difference between a correct and incorrect prediction may be only a few pixels or patches. The method therefore begins from two observations.

First, most screenshot patches are irrelevant to a given instruction. Second, simply dropping low-scoring patches is not sufficient, because UI grounding depends on preserving the positional progression of the original visual sequence. The paper attributes this to the use of Multimodal Rotary Position Embeddings (M-RoPE) in models such as Qwen2-VL: direct deletion creates positional jumps in the flattened patch sequence and distorts height/width positional structure.

This yields the central claim of FocusUI: efficient UI grounding requires both relevance filtering and position preservation. The paper operationalizes this in two stages. A Query-Guided Saliency Scorer predicts patch relevance conditioned on the instruction. Then PosPad compresses dropped contiguous spans into special markers placed at the last index of each span, preserving sequence continuity.

The patch grid is defined by image height $H$, width $W$, and patch size $p$, with

$G_h = \left\lfloor \frac{H}{p} \right\rfloor,\qquad G_w = \left\lfloor \frac{W}{p} \right\rfloor.$

For Qwen2.5-VL-based models, the paper uses $p=14$; for Qwen3-VL-based models, $p=16$. Each patch receives a learned saliency score, and for retention ratio $r$, the model keeps $K=\lfloor rM \rfloor$ visual tokens, where $M$ is the original token count (Ouyang et al., 7 Jan 2026).
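
As a minimal sketch of the bookkeeping above, the grid dimensions and the keep budget can be computed as follows. The function names and the 2560×1440 example resolution are illustrative assumptions, not taken from the paper, and the token count $M$ is passed in directly rather than derived from the grid.

```python
import math


def patch_grid(height: int, width: int, patch: int) -> tuple[int, int]:
    """Return (G_h, G_w) = (floor(H / p), floor(W / p))."""
    return height // patch, width // patch


def tokens_to_keep(num_tokens: int, retention: float) -> int:
    """Return K = floor(r * M) for retention ratio r and token count M."""
    return math.floor(retention * num_tokens)


# Hypothetical 2560x1440 screenshot with the p=14 setting.
g_h, g_w = patch_grid(1440, 2560, 14)
print(g_h, g_w)                    # 102 182
print(tokens_to_keep(4700, 0.5))   # 2350
```

Because $K$ uses a floor, the kept count never exceeds the requested fraction of the original tokens.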

3. Saliency supervision and PosPad

The saliency mechanism is trained with a hybrid dense target called the Instruction-to-Patch saliency score. This score fuses two signals.

The first is an instruction-conditioned bounding-box score:

$S_{\mathrm{bbox}[i,j]} = \frac{\mathrm{area}(R_{i,j} \cap b_{gt})}{p^2},$

where $R_{i,j}$ is the patch cell and $b_{gt}$ is the ground-truth target box. This yields soft supervision: fully covered patches score $1$, disjoint patches score $0$, and boundary patches receive intermediate values.
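
The overlap score can be illustrated with a short sketch. It assumes that each patch cell is an axis-aligned $p \times p$ pixel square indexed by row $i$ and column $j$, and that the box is given as pixel corners `(x0, y0, x1, y1)`; both conventions and the function name are assumptions made for the example.

```python
def bbox_saliency(grid_h, grid_w, patch, box):
    """Per-patch score: area(patch cell ∩ ground-truth box) / patch**2."""
    x0, y0, x1, y1 = box
    scores = []
    for i in range(grid_h):
        top, bottom = i * patch, (i + 1) * patch
        row = []
        for j in range(grid_w):
            left, right = j * patch, (j + 1) * patch
            # Overlap extents are clamped at zero for disjoint rectangles.
            overlap_w = max(0, min(right, x1) - max(left, x0))
            overlap_h = max(0, min(bottom, y1) - max(top, y0))
            row.append(overlap_w * overlap_h / patch**2)
        scores.append(row)
    return scores


# A 2x3 grid of 10-pixel patches and a box covering x in [5, 20], y in [0, 10].
print(bbox_saliency(2, 3, 10, (5, 0, 20, 10)))
# [[0.5, 1.0, 0.0], [0.0, 0.0, 0.0]]
```

The half-covered patch receives 0.5, which is the soft boundary supervision described above.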

The second is a rule-based UI-graph score designed to suppress large homogeneous regions. Adjacent patches are grouped by a union-find procedure when their pixel vectors are sufficiently similar. Each connected component is then assigned a weight determined by its size, so that large uniform regions receive smaller weights, while small distinctive regions receive larger ones.
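
The grouping step can be sketched with a small union-find over the patch grid. This is an illustration under stated assumptions: each patch is summarized by one color tuple, similarity is Euclidean distance below a hypothetical `threshold`, and adjacency is 4-connected. It stops at component sizes and does not implement the paper's size-to-weight formula.

```python
def component_sizes(colors, threshold):
    """Group 4-adjacent similar patches; return each patch's component size."""
    rows, cols = len(colors), len(colors[0])
    parent = list(range(rows * cols))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    def similar(u, v):
        # Assumed similarity test: Euclidean distance below the threshold.
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5 < threshold

    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols and similar(colors[i][j], colors[i][j + 1]):
                union(i * cols + j, i * cols + j + 1)
            if i + 1 < rows and similar(colors[i][j], colors[i + 1][j]):
                union(i * cols + j, (i + 1) * cols + j)

    sizes = {}
    for idx in range(rows * cols):
        root = find(idx)
        sizes[root] = sizes.get(root, 0) + 1
    return [[sizes[find(i * cols + j)] for j in range(cols)] for i in range(rows)]


white, black, green = (255, 255, 255), (0, 0, 0), (10, 200, 30)
grid = [[white, white, black],
        [white, white, green]]
print(component_sizes(grid, threshold=1.0))
# [[4, 4, 1], [4, 4, 1]]
```

In this toy grid the white background forms one component of size 4, while the two distinctive patches remain singletons, which is the contrast the UI-graph score is meant to exploit.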

The fused supervision combines the bounding-box score and the UI-graph score into a single dense target, and the scorer is trained against this target with a KL-divergence loss.
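
A KL-divergence objective of this kind can be sketched generically. The following is not the paper's exact loss: it assumes the target scores are normalized to sum to one, the scorer outputs logits passed through a softmax, and the direction is KL(target || prediction). All three choices, and the function names, are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) for two discrete distributions."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0
    )


def saliency_kl_loss(target_scores, predicted_logits):
    """Normalize target scores (assumed to have a positive sum), then compare."""
    total = sum(target_scores)
    target = [s / total for s in target_scores]
    return kl_divergence(target, softmax(predicted_logits))


# Uniform logits match a uniform target, so the loss is zero.
print(saliency_kl_loss([1, 1, 1, 1], [0.0, 0.0, 0.0, 0.0]))
```

A target concentrated on one patch against uniform predictions over four patches gives a loss of about $\ln 4$, so the loss grows as predicted saliency spreads away from the target.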

PosPad is the second major component. After top-$K$ selection, the dropped visual tokens are partitioned into maximal contiguous spans. Each dropped span is replaced by a single learnable marker token placed at the last index of that span. The transformed sequence therefore retains all kept tokens plus one marker per dropped span. If the $M-K$ dropped tokens form $n$ contiguous spans, the new sequence length is

$K + n.$

This does not preserve every original token, but it preserves the sequence’s coarse positional progression in a way that standard deletion does not. A supplementary ablation compares sequence-first, sequence-middle, and sequence-end placement and finds sequence-end consistently best, especially at low retention (Ouyang et al., 7 Jan 2026).
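
The span-compression rule can be sketched on a flattened keep mask. This is a simplified illustration: it tracks only the one-dimensional sequence index rather than the height/width positions used by M-RoPE, and the `"tok"`/`"pad"` labels are hypothetical stand-ins for visual tokens and the learnable marker.

```python
def pospad(keep_mask):
    """Keep selected tokens; replace each maximal dropped span with one marker
    placed at the last original index of that span."""
    out = []
    n = len(keep_mask)
    for idx, keep in enumerate(keep_mask):
        if keep:
            out.append(("tok", idx))
        elif idx + 1 == n or keep_mask[idx + 1]:
            # idx is the final position of a dropped span.
            out.append(("pad", idx))
    return out


mask = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]
sequence = pospad(mask)
print(sequence)
# [('tok', 0), ('pad', 2), ('tok', 3), ('tok', 4), ('pad', 5), ('tok', 6), ('pad', 9)]
print(len(sequence))  # 7: four kept tokens plus three dropped spans
```

Each marker keeps its span's last original index, so kept tokens retain their original positions and every gap between them is represented by a token rather than left as an unmarked jump.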

4. Training setup, losses, and implementation

FocusUI is implemented on Qwen2.5-VL and Qwen3-VL backbones, with reported variants including FocusUI-3B, FocusUI-7B, and a Qwen3-VL-2B version. The training corpus follows GUI-Actor and includes about 1M screenshots from public GUI datasets: UGround, GUI-Env, GUI-Act, AndroidControl, AMEX, and Wave-UI. The raw set contains 1,012K screenshots and 9.6M elements; after OmniParser-based annotation checking, the filtered set contains 976K screenshots and 7.4M elements. The filtering removes samples whose IoU between ground-truth and OmniParser-detected boxes is below 0.3, eliminating 22.9% of elements (Ouyang et al., 7 Jan 2026).
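
The IoU-based filter can be sketched as follows. The threshold of 0.3 comes from the description above; the matching rule (keep an element if its best-matching detected box reaches the threshold), the corner-format boxes, and the function names are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def keep_element(gt_box, detected_boxes, threshold=0.3):
    """Keep an annotation if some detected box overlaps it with IoU >= threshold."""
    best = max((iou(gt_box, d) for d in detected_boxes), default=0.0)
    return best >= threshold


print(keep_element((0, 0, 10, 10), [(5, 0, 15, 10)]))  # True  (IoU = 1/3)
print(keep_element((0, 0, 10, 10), [(8, 0, 18, 10)]))  # False (IoU = 1/9)
```

The reported element counts are consistent with the stated removal rate: $(9.6 - 7.4)/9.6 \approx 22.9\%$.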

The overall training loss combines two terms: a standard next-token prediction loss, and an attention loss that aligns grounding attention with target patch overlap. The attention module transforms the action-related hidden state and the selected visual features, computes attention scores between them, and supervises the resulting distribution against normalized overlap labels.

Training is two-stage: saliency scorer pretraining for 1 epoch, followed by full-model fine-tuning for 1 epoch, with a stage-specific learning rate. Other implementation details include DeepSpeed ZeRO-2, FlashAttention-2, bfloat16, and 8× NVIDIA H200 GPUs. Reported training times are about 12 hours for stage 1 and about 36 hours for stage 2 on the 3B model, or 48 hours for the 7B model. During training, the retention ratio $r$ is sampled uniformly from a fixed interval, and the UI-graph grouping uses a fixed similarity threshold (Ouyang et al., 7 Jan 2026).

5. Benchmarks and empirical results

FocusUI is evaluated on four UI grounding benchmarks: ScreenSpot-V2 with 1,272 samples across mobile, desktop, and web; ScreenSpot-Pro with 1,581 samples from 23 professional applications; OS-World-G with 564 samples; and UI-Vision with 5,790 desktop-centric samples from 83 applications. ScreenSpot-Pro is especially important because it combines high-resolution screenshots and complex layouts; its average resolution is 3267×1727, and its maximum is 6016×3384 (Ouyang et al., 7 Jan 2026).

The headline result is on ScreenSpot-Pro: FocusUI-7B achieves 48.3 at $r=100\%$, compared with 44.6 for GUI-Actor-7B, a gain of 3.7 points. At 30% visual token retention, FocusUI-7B reaches 45.1, which is only a 3.2-point drop from the full-token setting. The efficiency table reports that on Qwen2.5-VL this setting corresponds to 1.44× faster inference and 17% lower peak GPU memory.

The broader pattern is similar across benchmarks. For FocusUI-7B:

ScreenSpot-V2: 93.1 at $r=100\%$, 91.8 at $r=30\%$
ScreenSpot-Pro: 48.3 at $r=100\%$, 45.1 at $r=30\%$
OS-World-G: 54.4 at $r=100\%$, 53.9 at $r=30\%$
UI-Vision: 24.9 at $r=100\%$, 23.8 at $r=30\%$
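
The size of the accuracy drop can be tabulated directly from these figures. The sketch below treats the first number in each row as the full-token score and the second as the reduced-retention score, which matches the ScreenSpot-Pro figures quoted earlier; the helper name is illustrative.

```python
results = {
    "ScreenSpot-V2": (93.1, 91.8),
    "ScreenSpot-Pro": (48.3, 45.1),
    "OS-World-G": (54.4, 53.9),
    "UI-Vision": (24.9, 23.8),
}


def retention_summary(full, reduced):
    """Return (absolute drop in points, percent of full-token accuracy kept)."""
    drop = round(full - reduced, 1)
    kept_pct = round(100 * reduced / full, 1)
    return drop, kept_pct


for name, (full, reduced) in results.items():
    drop, kept_pct = retention_summary(full, reduced)
    print(f"{name}: -{drop} points, {kept_pct}% of full-token accuracy")
# ScreenSpot-V2: -1.3 points, 98.6% of full-token accuracy
# ScreenSpot-Pro: -3.2 points, 93.4% of full-token accuracy
# OS-World-G: -0.5 points, 99.1% of full-token accuracy
# UI-Vision: -1.1 points, 95.6% of full-token accuracy
```

The largest drop, in both absolute and relative terms, occurs on ScreenSpot-Pro, the benchmark with the highest-resolution screenshots.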

The paper’s most pointed comparison is against generic pruning methods. On ScreenSpot-Pro at 30% retention, a dense FocusUI baseline scores 43.8, while FocusUI with saliency scorer + PosPad scores 40.6. By contrast:

Fast-V: 4.8
HiPrune: 18.0
Vision-Zip: 18.9

This is the paper’s primary evidence that generic visual pruning fails on UI grounding because of broken positional information. Ablations reinforce the same conclusion. On ScreenSpot-Pro with 50% retention:

CLIP + direct drop: 28.5
CLIP + full padding: 38.7
CLIP + PosPad: 38.2
Ins2Patch + direct drop: 29.2
Ins2Patch + full padding: 42.1
Ins2Patch + PosPad: 42.3

The saliency fusion itself is also validated: at 50% retention on ScreenSpot-Pro, UI-graph labeling alone gives 41.1, bbox-based labeling alone gives 39.8, and fused supervision gives 42.3 (Ouyang et al., 7 Jan 2026).

6. Position within UI grounding research

FocusUI belongs to a broader line of work that replaces generic multimodal assumptions with UI-specific inductive biases, but its particular contribution is efficiency rather than interface summarization, action-conditioned representation learning, or explicit UI-element reasoning. This distinguishes it from several adjacent methods.

Relative to Focus, which decomposes grounding into interface summarization, focused visual analysis, and precise coordinate prediction, FocusUI does not add a slow reasoning path; it instead reduces token cost while preserving localization quality. Relative to UILoop, which argues for a Screen–UI Elements–Action paradigm and explicitly supervises localization, semantic function, and practical usage of key UI elements, FocusUI remains a grounding framework rather than a full element-centric reasoning system. A plausible implication is that these approaches are complementary: FocusUI addresses the efficiency bottleneck of high-resolution screenshots, while element-centric methods address interpretability and semantic decomposition (Li et al., 8 Apr 2026).

The paper states one main limitation explicitly: current efficiency gains come primarily from spatial token reduction, whereas GUI interaction often unfolds over temporal or multi-round sequences. Future work is therefore suggested in the temporal dimension. More broadly, the method’s analysis indicates that precision-sensitive GUI tasks cannot be treated as ordinary visual token pruning problems. Its durable contribution is the argument that in UI grounding, position-aware token reduction is not an optimization detail but a task requirement (Ouyang et al., 7 Jan 2026).