FocusUI: Efficient UI Grounding

Updated 4 July 2026
  • FocusUI is a framework for efficient UI grounding that uses instruction-guided token selection and PosPad to preserve positional continuity in high-resolution screenshots.
  • It reduces computational burden by significantly cutting visual tokens, which results in faster inference and lower GPU memory usage while maintaining spatial structure.
  • Empirical benchmarks across mobile, desktop, and web interfaces demonstrate its robust performance in accurately localizing UI elements on dense and complex layouts.

FocusUI is a framework for efficient UI grounding that reduces the computational burden of processing high-resolution screenshots while preserving the positional structure needed for precise localization. It addresses a specific failure mode in GUI-oriented vision-language modeling: standard visual token pruning can reduce cost, but it often disrupts positional continuity and degrades grounding accuracy on dense, high-resolution interfaces. FocusUI therefore combines instruction-aware visual token selection with a position-preserving compression strategy, PosPad, and reports strong accuracy–efficiency tradeoffs on four grounding benchmarks (Ouyang et al., 7 Jan 2026).

1. Definition, scope, and naming

FocusUI is the formal name of the method introduced in "FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection". Its target task is UI grounding: given a screenshot and a natural-language instruction, the model must localize the corresponding UI element or region. The framework is not a general-purpose UI design system, nor is it a generic visual token pruning method. Its scope is narrower and more technical: efficient multimodal inference for grounding on high-resolution graphical user interfaces.

A common point of confusion is the relation between FocusUI and Focus. The latter is a different framework, introduced in "Think Twice, Click Once: Enhancing GUI Grounding via Fast and Slow Systems", which combines fast prediction with staged reasoning through interface summarization and focused analysis. FocusUI, by contrast, is centered on token-efficiency and positional continuity rather than dual-system reasoning (Tang et al., 9 Mar 2025).

The problem setting is motivated by the scale of visual tokenization in modern VLMs. A 2K screenshot can produce about 4700 visual tokens, and some evaluation settings exceed 5000 visual tokens per sample. In the reported analysis, visual tokens account for at least 84.3% and often more than 95% of the full multimodal sequence, which increases prefill cost, decoding-time memory, latency, and GPU memory consumption (Ouyang et al., 7 Jan 2026).

2. Problem formulation and failure mode

FocusUI treats efficient UI grounding as a distinct research problem because UI grounding is unusually sensitive to spatial precision. Screenshots are high resolution, interfaces are structurally dense, and the difference between a correct and incorrect prediction may be only a few pixels or patches. The method therefore begins from two observations.

First, most screenshot patches are irrelevant to a given instruction. Second, simply dropping low-scoring patches is not sufficient, because UI grounding depends on preserving the positional progression of the original visual sequence. The paper attributes this to the use of Multimodal Rotary Position Embeddings (M-RoPE) in models such as Qwen2-VL: direct deletion creates positional jumps in the flattened patch sequence and distorts height/width positional structure.

This yields the central claim of FocusUI: efficient UI grounding requires both relevance filtering and position preservation. The paper operationalizes this in two stages. A Query-Guided Saliency Scorer predicts patch relevance conditioned on the instruction. Then PosPad compresses dropped contiguous spans into special markers placed at the last index of each span, preserving sequence continuity.

The patch grid is defined by image height $H$, width $W$, and patch size $p$, with

$$G_h = \left\lfloor \frac{H}{p} \right\rfloor,\qquad G_w = \left\lfloor \frac{W}{p} \right\rfloor.$$

For Qwen2.5-VL-based models, the paper uses $p=14$; for Qwen3-VL-based models, $p=16$. Each patch receives a learned saliency score, and for retention ratio $r$, the model keeps $K=\lfloor rM \rfloor$ visual tokens, where $M$ is the original token count (Ouyang et al., 7 Jan 2026).
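To make the grid and budget arithmetic concrete, the following minimal Python sketch computes the grid dimensions and the kept-token count for one screenshot. The 2560×1440 resolution, the function names, and the simplification that the token count equals the number of patches (ignoring any patch merging a backbone may apply) are illustrative assumptions, not details taken from the paper.

```python
import math


def patch_grid(height: int, width: int, patch_size: int) -> tuple[int, int]:
    """Return (G_h, G_w): the number of whole patches along each axis."""
    return height // patch_size, width // patch_size


def kept_token_count(num_tokens: int, retention_ratio: float) -> int:
    """Return K = floor(r * M), the number of visual tokens to keep."""
    return math.floor(retention_ratio * num_tokens)


# Hypothetical 2560x1440 screenshot with patch size p = 14.
g_h, g_w = patch_grid(height=1440, width=2560, patch_size=14)
num_tokens = g_h * g_w  # assumption: one visual token per patch
kept = kept_token_count(num_tokens, retention_ratio=0.3)
print(g_h, g_w, num_tokens, kept)
```

Because both grid dimensions are floored, any pixels beyond the last whole patch along an axis do not contribute a patch in this simplified view.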

3. Saliency supervision and PosPad

The saliency mechanism is trained with a hybrid dense target called the Instruction-to-Patch saliency score. This score fuses two signals.

The first is an instruction-conditioned bounding-box score:

$$S_{\mathrm{bbox}}[i,j] = \frac{\mathrm{area}(R_{i,j} \cap b_{gt})}{p^2},$$

where $R_{i,j}$ is the patch cell and $b_{gt}$ is the ground-truth target box. This yields soft supervision: fully covered patches score $1$, disjoint patches score $0$, and boundary patches receive intermediate values.
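The bounding-box score can be sketched in a few lines of plain Python. This is an illustrative implementation of the overlap-fraction idea only; the function name, the `(x0, y0, x1, y1)` pixel box convention, and the toy 3×3 grid are assumptions made for the example.

```python
def bbox_saliency(grid_h, grid_w, patch_size, box):
    """Score each patch by the fraction of its area covered by `box`.

    `box` is (x0, y0, x1, y1) in pixels; the result is indexed [row][col].
    """
    x0, y0, x1, y1 = box
    scores = []
    for i in range(grid_h):
        top, bottom = i * patch_size, (i + 1) * patch_size
        row = []
        for j in range(grid_w):
            left, right = j * patch_size, (j + 1) * patch_size
            # Width and height of the patch/box intersection (0 if disjoint).
            overlap_w = max(0, min(right, x1) - max(left, x0))
            overlap_h = max(0, min(bottom, y1) - max(top, y0))
            row.append(overlap_w * overlap_h / patch_size ** 2)
        scores.append(row)
    return scores


# Toy example: 3x3 grid of 10-pixel patches, target box from (5, 5) to (20, 20).
scores = bbox_saliency(grid_h=3, grid_w=3, patch_size=10, box=(5, 5, 20, 20))
for row in scores:
    print(row)
```

In this toy case the center patch is fully covered, the patches along the box boundary receive fractional scores, and the patches outside the box score zero.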

The second is a rule-based UI-graph score designed to suppress large homogeneous regions. Adjacent patches are grouped by a union-find procedure when their pixel vectors are sufficiently similar. Each connected component is then assigned a weight that decreases with its size, so large uniform regions receive smaller weights, while small distinctive regions receive larger ones.
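The grouping step can be illustrated with a small union-find sketch. Several things here are simplifying assumptions rather than the paper's procedure: each patch is reduced to a single scalar instead of a pixel vector, similarity is an absolute difference against a threshold, and the inverse-size weight is a hypothetical stand-in chosen only because it shrinks as components grow.

```python
def component_sizes(values, threshold):
    """Group 4-connected patches whose values differ by <= threshold.

    Returns a grid giving, for each patch, the size of its component.
    """
    rows, cols = len(values), len(values[0])
    parent = list(range(rows * cols))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for i in range(rows):
        for j in range(cols):
            idx = i * cols + j
            if j + 1 < cols and abs(values[i][j] - values[i][j + 1]) <= threshold:
                union(idx, idx + 1)
            if i + 1 < rows and abs(values[i][j] - values[i + 1][j]) <= threshold:
                union(idx, idx + cols)

    roots = [find(k) for k in range(rows * cols)]
    counts = {}
    for root in roots:
        counts[root] = counts.get(root, 0) + 1
    return [[counts[roots[i * cols + j]] for j in range(cols)] for i in range(rows)]


def inverse_size_weights(sizes):
    """Hypothetical weighting (NOT the paper's formula): 1 / component size."""
    return [[1.0 / s for s in row] for row in sizes]


# Toy grid: a uniform background (0) with two distinctive patches (9 and 5).
toy = [[0, 0, 0], [0, 9, 0], [0, 0, 5]]
sizes = component_sizes(toy, threshold=1)
weights = inverse_size_weights(sizes)
print(sizes)
```

The seven background patches merge into one component and share a small weight, while the two isolated patches each form a singleton component and keep the largest weight.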

The two signals are fused into a single supervision target, and the scorer is trained against it with a KL-divergence loss.
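As a rough illustration of KL-divergence supervision over patches, the sketch below compares a normalized target saliency map with a softmax over predicted scores. The normalization of the target, the softmax on the predictions, the direction of the divergence, and all function names are assumptions made for this example and may differ from the paper's exact loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_saliency_loss(target_scores, predicted_logits, eps=1e-8):
    """KL(target || predicted) over patches.

    Assumes `target_scores` are non-negative with a positive sum.
    """
    total = sum(target_scores)
    target = [s / total for s in target_scores]
    predicted = softmax(predicted_logits)
    return sum(
        t * math.log((t + eps) / (q + eps))
        for t, q in zip(target, predicted)
        if t > 0  # zero-mass target patches contribute nothing
    )


# Toy example with four patches, where only the first is relevant.
uninformed = kl_saliency_loss([1, 0, 0, 0], [0.0, 0.0, 0.0, 0.0])
focused = kl_saliency_loss([1, 0, 0, 0], [5.0, 0.0, 0.0, 0.0])
print(uninformed, focused)
```

The loss falls as the predicted distribution concentrates on the patches that carry target mass, which is the behavior a saliency scorer is trained toward.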

PosPad is the second major component. After top-$K$ selection, the dropped visual tokens are partitioned into maximal contiguous spans. Each dropped span is replaced by a single learnable marker token placed at the last index of that span. The transformed sequence therefore retains all kept tokens plus one marker per dropped span: if $D$ of the original $M$ tokens are dropped and these form $S$ contiguous spans, the new sequence length is

$$M - D + S.$$

This does not preserve every original token, but it preserves the sequence’s coarse positional progression in a way that standard deletion does not. A supplementary ablation compares sequence-first, sequence-middle, and sequence-end placement and finds sequence-end consistently best, especially at low retention (Ouyang et al., 7 Jan 2026).
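The span-compression rule can be sketched on a flattened token sequence as follows. This is a minimal illustration operating on integer positions only; the function name and the `"PAD"` label are hypothetical, and the real method operates on embeddings with a learnable marker rather than string labels.

```python
def pospad_layout(num_tokens, kept_indices, pad_label="PAD"):
    """Return (original_position, kind) pairs after span compression.

    Kept tokens stay at their original positions. Each maximal run of
    dropped tokens is replaced by one marker at the run's last index.
    """
    kept = set(kept_indices)
    layout = []
    for idx in range(num_tokens):
        if idx in kept:
            layout.append((idx, "keep"))
        elif idx + 1 == num_tokens or (idx + 1) in kept:
            # idx is the final position of a maximal dropped span.
            layout.append((idx, pad_label))
    return layout


# Toy sequence of 10 visual tokens where positions 2, 3, and 7 are kept.
layout = pospad_layout(num_tokens=10, kept_indices=[2, 3, 7])
print(layout)
```

In this toy case seven tokens are dropped in three spans, so the compressed sequence has three kept tokens plus three markers. Direct deletion would instead leave only positions 2, 3, and 7, with nothing marking the gaps between them.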

4. Training setup, losses, and implementation

FocusUI is implemented on Qwen2.5-VL and Qwen3-VL backbones, with reported variants including FocusUI-3B, FocusUI-7B, and a Qwen3-VL-2B version. The training corpus follows GUI-Actor and includes about 1M screenshots from public GUI datasets: UGround, GUI-Env, GUI-Act, AndroidControl, AMEX, and Wave-UI. The raw set contains 1,012K screenshots and 9.6M elements; after OmniParser-based annotation checking, the filtered set contains 976K screenshots and 7.4M elements. The filtering removes samples whose IoU between ground-truth and OmniParser-detected boxes is below 0.3, eliminating 22.9% of elements (Ouyang et al., 7 Jan 2026).
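The IoU-based annotation check can be sketched as below. How ground-truth boxes are matched to detector boxes is not described here, so this example assumes each ground-truth box is compared against its best-overlapping detection; the function names and the toy boxes are likewise illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    inter_w = max(0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def filter_elements(gt_boxes, detected_boxes, min_iou=0.3):
    """Keep ground-truth boxes whose best detection overlap is >= min_iou.

    Assumption: best-match comparison against all detected boxes.
    """
    return [
        gt
        for gt in gt_boxes
        if max((iou(gt, det) for det in detected_boxes), default=0.0) >= min_iou
    ]


# Toy data: the first annotation is well supported, the second is not.
gt_boxes = [(0, 0, 10, 10), (50, 50, 60, 60)]
detected_boxes = [(0, 0, 10, 8), (55, 55, 65, 65)]
kept_boxes = filter_elements(gt_boxes, detected_boxes)
print(kept_boxes)
```

The first annotation overlaps a detection with IoU 0.8 and survives, while the second reaches only about 0.14 and is removed under the 0.3 threshold.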

The overall training loss combines the standard next-token prediction loss with a term that aligns grounding attention with target patch overlap. The attention module transforms the action-related hidden state and the selected visual features, computes attention scores, and supervises the resulting distribution against normalized overlap labels.

Training is two-stage: saliency scorer pretraining for 1 epoch, followed by full-model fine-tuning for 1 epoch, with the learning rate set per stage. Other implementation details include DeepSpeed ZeRO-2, FlashAttention-2, bfloat16, and 8× NVIDIA H200 GPUs. Reported training times are about 12 hours for stage 1 and about 36 hours for stage 2 on the 3B model, or 48 hours for the 7B model. During training, the retention ratio $r$ is sampled uniformly rather than held fixed, and the UI-graph threshold is a fixed hyperparameter (Ouyang et al., 7 Jan 2026).

5. Benchmarks and empirical results

FocusUI is evaluated on four UI grounding benchmarks: ScreenSpot-V2 with 1,272 samples across mobile, desktop, and web; ScreenSpot-Pro with 1,581 samples from 23 professional applications; OS-World-G with 564 samples; and UI-Vision with 5,790 desktop-centric samples from 83 applications. ScreenSpot-Pro is especially important because it combines high-resolution screenshots and complex layouts; its average resolution is 3267×1727, and its maximum is 6016×3384 (Ouyang et al., 7 Jan 2026).

The headline result is on ScreenSpot-Pro: with full visual token retention, FocusUI-7B achieves 48.3, compared with 44.6 for GUI-Actor-7B, a gain of 3.7 points. At 30% visual token retention, FocusUI-7B reaches 45.1, which is only a 3.2-point drop from the full-token setting. The efficiency table reports that on Qwen2.5-VL this corresponds to 1.44× faster inference and 17% lower peak GPU memory.

The broader pattern is similar across benchmarks. For FocusUI-7B:

  • ScreenSpot-V2: 93.1 at 100% retention, 91.8 at 30% retention
  • ScreenSpot-Pro: 48.3 at 100% retention, 45.1 at 30% retention
  • OS-World-G: 54.4 at 100% retention, 53.9 at 30% retention
  • UI-Vision: 24.9 at 100% retention, 23.8 at 30% retention

The paper’s most pointed comparison is against generic pruning methods. On ScreenSpot-Pro at 30% retention, a dense FocusUI baseline scores 43.8, while FocusUI with saliency scorer + PosPad scores 40.6. By contrast:

  • Fast-V: 4.8
  • HiPrune: 18.0
  • Vision-Zip: 18.9

This is the paper’s primary evidence that generic visual pruning fails on UI grounding because of broken positional information. Ablations reinforce the same conclusion. On ScreenSpot-Pro with 50% retention:

  • CLIP + direct drop: 28.5
  • CLIP + full padding: 38.7
  • CLIP + PosPad: 38.2
  • Ins2Patch + direct drop: 29.2
  • Ins2Patch + full padding: 42.1
  • Ins2Patch + PosPad: 42.3
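The size of the position-preservation effect in this ablation can be read off with a short calculation. The snippet below only restates the six scores listed above and subtracts them; the variable names are illustrative.

```python
# ScreenSpot-Pro scores at 50% retention, as listed above.
ablation = {
    "CLIP": {"direct drop": 28.5, "full padding": 38.7, "PosPad": 38.2},
    "Ins2Patch": {"direct drop": 29.2, "full padding": 42.1, "PosPad": 42.3},
}

# Gain of PosPad over direct deletion for each scorer, in points.
pospad_gain = {
    scorer: round(scores["PosPad"] - scores["direct drop"], 1)
    for scorer, scores in ablation.items()
}

# Gap between PosPad and full padding for each scorer, in points.
padding_gap = {
    scorer: round(scores["PosPad"] - scores["full padding"], 1)
    for scorer, scores in ablation.items()
}

print(pospad_gain)
print(padding_gap)
```

With either scorer, replacing direct deletion with PosPad recovers roughly 10 to 13 points, while the difference between PosPad and full padding stays within half a point.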

The saliency fusion itself is also validated: at 50% reduction on ScreenSpot-Pro, UI-graph labeling only gives 41.1, bbox-based labeling only gives 39.8, and fused supervision gives 42.3 (Ouyang et al., 7 Jan 2026).

6. Position within UI grounding research

FocusUI belongs to a broader line of work that replaces generic multimodal assumptions with UI-specific inductive biases, but its particular contribution is efficiency rather than interface summarization, action-conditioned representation learning, or explicit UI-element reasoning. This distinguishes it from several adjacent methods.

Relative to Focus, which decomposes grounding into interface summarization, focused visual analysis, and precise coordinate prediction, FocusUI does not add a slow reasoning path; it instead reduces token cost while preserving localization quality. Relative to UILoop, which argues for a Screen–UI Elements–Action paradigm and explicitly supervises localization, semantic function, and practical usage of key UI elements, FocusUI remains a grounding framework rather than a full element-centric reasoning system. A plausible implication is that these approaches are complementary: FocusUI addresses the efficiency bottleneck of high-resolution screenshots, while element-centric methods address interpretability and semantic decomposition (Li et al., 8 Apr 2026).

The paper states one main limitation explicitly: current efficiency gains come primarily from spatial token reduction, whereas GUI interaction often unfolds over temporal or multi-round sequences. Future work is therefore suggested in the temporal dimension. More broadly, the method’s analysis indicates that precision-sensitive GUI tasks cannot be treated as ordinary visual token pruning problems. Its durable contribution is the argument that in UI grounding, position-aware token reduction is not an optimization detail but a task requirement (Ouyang et al., 7 Jan 2026).
