Multi-modal user interface control detection using cross-attention

Published 8 Apr 2026 in cs.CV and cs.AI | (2604.06934v1)

Abstract: Detecting user interface (UI) controls from software screenshots is a critical task for automated testing, accessibility, and software analytics, yet it remains challenging due to visual ambiguities, design variability, and the lack of contextual cues in pixel-only approaches. In this paper, we introduce a novel multi-modal extension of YOLOv5 that integrates GPT-generated textual descriptions of UI images into the detection pipeline through cross-attention modules. By aligning visual features with semantic information derived from text embeddings, our model enables more robust and context-aware UI control detection. We evaluate the proposed framework on a large dataset of over 16,000 annotated UI screenshots spanning 23 control classes. Extensive experiments compare three fusion strategies, i.e. element-wise addition, weighted sum, and convolutional fusion, demonstrating consistent improvements over the baseline YOLOv5 model. Among these, convolutional fusion achieved the strongest performance, with significant gains in detecting semantically complex or visually ambiguous classes. These results establish that combining visual and textual modalities can substantially enhance UI element detection, particularly in edge cases where visual information alone is insufficient. Our findings open promising opportunities for more reliable and intelligent tools in software testing, accessibility support, and UI analytics, setting the stage for future research on efficient, robust, and generalizable multi-modal detection systems.


Authors (5)

Summary

The paper proposes a multi-modal YOLOv5 extension that integrates GPT-generated text with visual embeddings via cross-attention to tackle UI control detection challenges.
It systematically compares three fusion methods—element-wise addition, weighted sum, and convolutional fusion—with convolutional fusion yielding the highest gains on challenging UI classes.
Empirical results on 16,155 screenshots demonstrate improved F1-scores and mAP, validating multi-modal fusion's effectiveness over baseline vision-only approaches.

Problem Formulation and Motivation

The paper addresses the task of user interface (UI) control detection from software screenshots, which is pivotal for automation in testing, accessibility, and UI analytics. Conventional detection methods, primarily vision-based, suffer from weaknesses such as susceptibility to visual ambiguities, inter-element similarity, and lack of semantic context—particularly in cluttered or visually homogeneous UI designs. These limitations motivate the augmentation of object detection models with complementary modalities, notably textual descriptions, to enrich context and aid disambiguation. The authors focus on leveraging text generated via LLMs for multi-modal feature fusion, aiming to mitigate the challenges inherent in pixel-only approaches.

Methodological Innovations

The core contribution is a multi-modal extension of YOLOv5, augmented by cross-attention modules that integrate GPT-generated textual descriptions with visual embeddings. The architecture retains the standard backbone, neck, and head structure of YOLOv5, with cross-attention introduced after C3 blocks in the neck to align image features with semantic vectors derived from textual input. Three fusion mechanisms are systematically analyzed: element-wise addition, weighted sum, and convolutional fusion.

Element-wise addition: Minimal complexity; text and image attended features are added directly.
Weighted sum: Introduces trainable weights to modulate contributions from each modality.
Convolutional fusion: Concatenates feature maps from both modalities and applies convolutions to capture spatially adaptive, non-linear feature interactions.
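
The cross-attention step and the three fusion strategies above can be sketched in plain Python on toy data. This is an illustrative sketch, not the authors' implementation: the function names are hypothetical, the attention uses identity projections instead of learned query/key/value weights, the weighted-sum weights are fixed scalars rather than trained parameters, and the convolutional fusion is assumed to be a single 1x1 convolution followed by ReLU (the summary does not state the kernel size or activation).

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_feats, text_feats):
    """Each image position (query) attends over the text vectors (keys and values).

    Assumption: identity projections, so queries, keys and values are the raw vectors.
    """
    dim = len(text_feats[0])
    attended = []
    for query in image_feats:
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in text_feats])
        attended.append(
            [sum(w * value[c] for w, value in zip(weights, text_feats)) for c in range(dim)]
        )
    return attended


def fuse_add(image_feats, attended):
    """Element-wise addition of image features and attended text features."""
    return [[x + a for x, a in zip(xr, ar)] for xr, ar in zip(image_feats, attended)]


def fuse_weighted(image_feats, attended, w_img=0.5, w_txt=0.5):
    """Weighted sum; in a real model w_img and w_txt would be trainable."""
    return [
        [w_img * x + w_txt * a for x, a in zip(xr, ar)]
        for xr, ar in zip(image_feats, attended)
    ]


def fuse_conv1x1(image_feats, attended, kernel, bias):
    """Concatenate channels per position, apply a 1x1 convolution, then ReLU.

    A 1x1 convolution is a per-position linear map, so kernel has one row of
    length 2 * dim for every output channel.
    """
    fused = []
    for xr, ar in zip(image_feats, attended):
        stacked = list(xr) + list(ar)  # channel concatenation
        fused.append([max(0.0, dot(row, stacked) + b) for row, b in zip(kernel, bias)])
    return fused


# Toy input: 3 spatial positions and 2 text vectors, each with 2 channels.
image = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text = [[1.0, 0.0], [0.0, 1.0]]

attended = cross_attention(image, text)
added = fuse_add(image, attended)
weighted = fuse_weighted(image, attended, w_img=0.7, w_txt=0.3)
# This hand-picked kernel sums matching channels, so it reproduces fuse_add.
conv = fuse_conv1x1(
    image, attended,
    kernel=[[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]],
    bias=[0.0, 0.0],
)
```

The hand-picked kernel shows that addition is a special case of the convolutional variant: a learned kernel can recover plain addition or any other mixing of the two modalities, which is one plausible reading of why it is the most expressive of the three.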

Text descriptions are generated by GPT-4o and converted to embeddings with OpenAI's text-embedding-3-large model. The model is fine-tuned on paired screenshot/text data.

Empirical Evaluation

Experiments are conducted on a custom dataset with 16,155 annotated UI screenshots spanning 23 control classes. The models are benchmarked on precision, recall, F1-score, and mAP@0.5. The baseline YOLOv5 model demonstrates strong performance on visually distinctive classes but fails on semantically complex or spatially ambiguous classes, such as Horizontal_Axis, Vertical_Axis, and Label_of_the_Textarea.
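
Detection metrics of this kind are typically computed by matching predicted boxes to ground-truth boxes of the same class at an intersection-over-union (IoU) threshold. The sketch below shows one common way to derive precision, recall, and F1 at an assumed threshold of 0.5. The function names, the `(x1, y1, x2, y2)` box format, and the greedy matching rule are assumptions for illustration, not the paper's evaluation code. Mean average precision additionally averages precision over recall levels and classes, which is not shown here.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(predictions, ground_truth, iou_threshold=0.5):
    """Greedy one-to-one matching of same-class boxes.

    predictions and ground_truth are lists of (class_name, box) tuples;
    predictions are assumed to be sorted by descending confidence.
    """
    matched = set()
    true_positives = 0
    for pred_class, pred_box in predictions:
        best_iou, best_idx = 0.0, None
        for idx, (gt_class, gt_box) in enumerate(ground_truth):
            if idx in matched or gt_class != pred_class:
                continue
            overlap = iou(pred_box, gt_box)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)  # each ground-truth box can be matched once
            true_positives += 1

    precision = true_positives / len(predictions) if predictions else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy example: one correct Button, one spurious Button, one missed Label.
ground_truth = [("Button", (0, 0, 10, 10)), ("Label", (20, 20, 30, 30))]
predictions = [("Button", (0, 0, 10, 10)), ("Button", (50, 50, 60, 60))]
precision, recall, f1 = precision_recall_f1(predictions, ground_truth)
```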

The multi-modal variants achieve the following results:

Element-wise addition (5 attention blocks): F1-score 0.719, mAP@0.5 0.681.
Weighted sum (5 attention blocks): F1-score 0.738, mAP@0.5 0.703.
Convolutional fusion (5 attention blocks): F1-score 0.761, mAP@0.5 0.732.

Convolutional fusion demonstrates substantial gains on difficult classes: Horizontal_Axis reaches an F1-score of 0.468 and an mAP@0.5 of 0.509, markedly surpassing baseline performance and outperforming previous UI detection models. All fusion strategies outperform the YOLOv5 baseline in recall and F1, with convolutional fusion exhibiting the best trade-off between detection accuracy and semantic robustness.

Ablation Analysis and Robustness

Ablation studies reveal the sensitivity of the multi-modal model to textual signal integrity:

Mismatched text ablation induces up to a 14.5% reduction in mAP@0.5, indicating a strong dependency on accurate context.
Partial text ablation (missing references for a class) results in substantial per-class drops in recall and mAP, further underscoring information dependence.
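
The two ablation conditions above could be constructed along the following lines. This is a hypothetical sketch: the summary does not say how the authors built their mismatched or partial inputs, so the helper names, the rotation scheme, and the sentence-level filtering are all assumptions. The rotation guarantees a mismatch only if every description is unique.

```python
import random


def mismatch_descriptions(pairs, seed=0):
    """Give every screenshot the description of a different screenshot.

    pairs is a list of (screenshot_id, description). Rotating the descriptions
    by a random non-zero offset ensures no screenshot keeps its own text.
    """
    if len(pairs) < 2:
        raise ValueError("need at least two pairs to create a mismatch")
    rng = random.Random(seed)
    screenshots = [shot for shot, _ in pairs]
    texts = [text for _, text in pairs]
    offset = rng.randrange(1, len(pairs))
    rotated = texts[offset:] + texts[:offset]
    return list(zip(screenshots, rotated))


def drop_class_mentions(description, class_phrase):
    """Remove every sentence that mentions the given class phrase (case-insensitive)."""
    sentences = [s.strip() for s in description.split(".") if s.strip()]
    kept = [s for s in sentences if class_phrase.lower() not in s.lower()]
    return ". ".join(kept) + ("." if kept else "")


# Hypothetical screenshot ids and descriptions, for illustration only.
pairs = [
    ("a.png", "A login form with two text fields."),
    ("b.png", "A bar chart with a horizontal axis."),
    ("c.png", "A settings page with several checkboxes."),
]
mismatched = mismatch_descriptions(pairs)
partial = drop_class_mentions(
    "A slider sits on the left. The chart has a horizontal axis.",
    "horizontal axis",
)
```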

Despite these vulnerabilities, the model retains improved performance over baseline for certain rare classes, suggesting that multi-modal fine-tuning imparts lasting benefit even when context is partially corrupted.

Computational analysis shows that convolutional fusion incurs a ~13% increase in parameter count and up to a 26% increase in inference time compared to the baseline, so it is best suited to scenarios that prioritize detection accuracy over deployment efficiency.
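
To make the overhead figures concrete, the short calculation below applies the reported 26% latency increase to a hypothetical baseline. The 10 ms baseline latency is an assumed value for illustration, not a number from the paper. It also shows that a 26% increase in per-image latency corresponds to a throughput drop of roughly 21%, not 26%.

```python
def apply_relative_increase(baseline, increase):
    """Scale a baseline quantity by a relative increase (0.26 means +26%)."""
    return baseline * (1.0 + increase)


def throughput_change(latency_increase):
    """Relative change in images per second when per-image latency grows."""
    return 1.0 / (1.0 + latency_increase) - 1.0


baseline_latency_ms = 10.0  # hypothetical baseline, not reported in the paper
fused_latency_ms = apply_relative_increase(baseline_latency_ms, 0.26)
fps_change = throughput_change(0.26)

print(f"latency: {baseline_latency_ms:.1f} ms -> {fused_latency_ms:.1f} ms")
print(f"throughput change: {fps_change:.1%}")
```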

Implications and Future Directions

Practical Impact

The proposed model offers enhanced robustness for UI control detection pipelines in automated testing and accessibility—directly addressing critical failure points of vision-only architectures. Improved detection of ambiguous classes benefits screen reader technologies and analytics platforms that rely on accurate UI element tabulation. The framework enables integration with LLMs for dynamic context generation, rendering it highly adaptable to evolving UI ecosystems.

Theoretical Consequences

On the theoretical front, the results empirically validate the efficacy of cross-modal feature fusion via cross-attention for structured image analysis. The systematic evaluation of fusion methods contributes to understanding the modality interaction landscape, with convolutional fusion emerging as the most expressive mechanism.

Limitations and Speculative Advancements

The sensitivity to the fidelity of the language input points toward the need for more robust fusion, such as noise-aware or confidence-weighted schemes, potentially using uncertainty modeling or adversarial data augmentation. Computational overhead constrains deployment in latency-sensitive or resource-bounded applications, prompting exploration of efficient attention variants or distillation-based approaches.
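
One simple reading of confidence-weighted fusion is to scale the text contribution by a scalar confidence in the description, so that unreliable text degrades gracefully toward the vision-only features. The sketch below is purely illustrative: the paper does not propose this exact form, the function name is hypothetical, and how the confidence value would be estimated is left open.

```python
def confidence_gated_fusion(image_feats, attended_text, confidence):
    """Add text features to image features, scaled by a confidence in [0, 1].

    confidence = 0 ignores the text entirely (vision-only fallback);
    confidence = 1 reduces to plain element-wise addition.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return [
        [x + confidence * a for x, a in zip(image_row, text_row)]
        for image_row, text_row in zip(image_feats, attended_text)
    ]


# Toy features: 2 spatial positions with 2 channels each.
image_feats = [[1.0, 2.0], [3.0, 4.0]]
attended_text = [[0.5, 0.5], [1.0, 0.0]]

vision_only = confidence_gated_fusion(image_feats, attended_text, 0.0)
fully_trusted = confidence_gated_fusion(image_feats, attended_text, 1.0)
```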

Promising avenues include extending the approach to interactive and dynamic UIs, integrating real-time captioning, and leveraging larger vision-language datasets. Self-supervised approaches for textual descriptor generation and alignment may further generalize the framework.

Conclusion

This paper establishes a new benchmark in UI control detection by demonstrating substantial performance gains through multi-modal fusion of visual and language modalities in YOLOv5. The integration of GPT-generated descriptions via cross-attention enables superior detection of semantically ambiguous or visually subtle controls. Although computational trade-offs and robustness to noisy input remain areas for enhancement, the framework's applicability spans automated testing, accessibility, and UI analytics. The findings lay a foundation for future research in efficient, generalizable, and robust multi-modal detection systems across software platforms.
