SeeClick Paradigm: Interactive Click-Driven Vision

Updated 5 March 2026

SeeClick Paradigm is a vision-centric, click-driven approach that iteratively refines predictions via user spatial input.
It decouples spatial reasoning from high-level decision making, enabling scalable, explainable, and cross-domain automation.
Applications span interactive segmentation, GUI automation, tracking, and click modeling, significantly reducing user input requirements.

The SeeClick Paradigm encompasses a family of vision-centric, click-driven methodologies for modeling, predicting, and utilizing discrete user attention and interaction in computational systems. Originating in interactive segmentation and visual attention studies, SeeClick has been generalized to GUI agents, tracking, scene graph generation, and click modeling in online search. It is unified by the loop in which a user (or model) first “sees” an image or result, then “clicks” to issue a prompt or signal, with the system updating its prediction or action accordingly. Recent work further abstracts SeeClick as a modular interface between perception and action, decoupling the spatial reasoning (“where to click”) from high-level policy or intent (“what to click”), thereby enabling scalable, explainable, and cross-domain automation.

1. Foundations and Principles

SeeClick can be formalized as an iterative, perception-driven human-in-the-loop process. The prototypical SeeClick loop incorporates:

Perceive: The user inspects the current output (e.g., segmentation mask, search results, GUI screen, video frame).
Click: The user provides a spatial cue (typically a point, but sometimes a region or bounding box), signaling intent or correction.
Model Update: The system propagates this input through an appropriately designed architecture (e.g., fused embeddings or a mask update).
Repeat: The process continues until a target metric (e.g., segmentation IoU, correct GUI action, adequate search result) is reached or user interaction ends.

This paradigm reframes interactive vision tasks from one-shot mapping to an incremental, data-efficient, and explainable interface, emphasizing the role of discrete spatial feedback over continuous input streams.
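The loop can be made concrete with a small simulation. The following is a minimal sketch, not taken from any of the cited systems: `simulated_click` is a hypothetical user model, and `toy_update` is an oracle stand-in for a real network that simply corrects a disk of pixels around each click. All function names, the disk radius, and the stopping thresholds are assumptions chosen for illustration.

```python
import numpy as np

def iou(pred, target):
    """Intersection-over-union of two boolean masks."""
    union = np.logical_or(pred, target).sum()
    return 1.0 if union == 0 else np.logical_and(pred, target).sum() / union

def simulated_click(pred, target):
    """Perceive + Click: pick one wrongly labelled pixel (hypothetical user model)."""
    errors = np.argwhere(pred != target)
    if len(errors) == 0:
        return None
    y, x = errors[len(errors) // 2]            # deterministic pick from the error list
    return int(y), int(x), bool(target[y, x])  # positive click if the pixel is on the object

def toy_update(pred, click, target, radius=3):
    """Model Update: oracle stand-in for a network; fixes a disk around the click."""
    y, x, _ = click
    yy, xx = np.ogrid[:pred.shape[0], :pred.shape[1]]
    disk = (yy - y) ** 2 + (xx - x) ** 2 <= radius ** 2
    new_pred = pred.copy()
    new_pred[disk] = target[disk]              # illustration only: uses the ground truth
    return new_pred

def seeclick_loop(target, target_iou=0.9, max_clicks=20):
    """Repeat: click and update until the IoU target or the click budget is reached."""
    pred = np.zeros_like(target, dtype=bool)
    clicks = []
    while iou(pred, target) < target_iou and len(clicks) < max_clicks:
        click = simulated_click(pred, target)
        if click is None:
            break
        pred = toy_update(pred, click, target)
        clicks.append(click)
    return pred, clicks
```

In a real system the oracle update would be replaced by a forward pass of the interactive model, and the simulated user by a person or a learned click policy; the control flow stays the same.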

Beyond segmentation, SeeClick arises wherever click signals modulate automated interpretation or response: simulating gaze via click-importance maps in BubbleView (Kim et al., 2017), refining object tracks (Wang et al., 2024), augmenting search engine click models with visual bias (Xu et al., 2021), grounding natural language in GUIs (Cheng et al., 2024), or generating video scene graphs with point prompts (Ruschel et al., 2025).

2. Instantiations Across Domains

Interactive Image Segmentation

In state-of-the-art segmentation pipelines (e.g., SimpleClick (Liu et al., 2022)), user clicks are encoded as spatial disk maps added to the input, and the system updates the mask estimate via a plain Vision Transformer (ViT) backbone. Masked pretraining and symmetric patch embedding layers enable efficient, scalable propagation of click information while preserving feature alignment. Segmentation losses (Normalized Focal Loss, Dice) train the model to reach accurate masks with as few clicks as possible.
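As an illustration of the disk-map encoding, the sketch below rasterizes a list of clicks into a two-channel binary map, one channel for positive clicks and one for negative clicks. The function name `encode_clicks`, the `(y, x, is_positive)` click format, and the default radius are assumptions made for this example rather than details confirmed by the cited papers.

```python
import numpy as np

def encode_clicks(shape, clicks, radius=5):
    """Return a (2, H, W) float32 array: channel 0 = positive clicks, channel 1 = negative.

    clicks is a list of (y, x, is_positive) tuples (hypothetical format).
    """
    h, w = shape
    maps = np.zeros((2, h, w), dtype=np.float32)
    yy, xx = np.ogrid[:h, :w]
    for y, x, is_positive in clicks:
        # Binary disk centred on the click location.
        disk = (yy - y) ** 2 + (xx - x) ** 2 <= radius ** 2
        maps[0 if is_positive else 1][disk] = 1.0
    return maps
```

The resulting array has the same spatial size as the image, so it can be concatenated with the image channels or embedded separately before fusion.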

PseudoClick (Liu et al., 2022) extends this by having the network automatically propose "pseudo-clicks"—predicted points of highest error—to accelerate convergence, blending human and model-driven interaction in the See→Click→Predict→PseudoClick→Predict sequence.
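The pseudo-click idea can be sketched as follows, assuming the network exposes a per-pixel estimate of where its current mask is likely wrong. The `error_prob` input, the confidence threshold, and the function name are hypothetical; the actual PseudoClick prediction head may differ.

```python
import numpy as np

def propose_pseudo_click(error_prob, mask, min_confidence=0.5):
    """Pick the pixel with the highest predicted error, or None if confidence is low.

    error_prob: (H, W) array in [0, 1], hypothetical per-pixel error estimate.
    mask: (H, W) boolean array, the current segmentation.
    """
    y, x = np.unravel_index(np.argmax(error_prob), error_prob.shape)
    if error_prob[y, x] < min_confidence:
        return None  # defer to the human for the next click
    # A likely error inside the mask is a false positive, so the click is negative;
    # a likely error outside the mask is a false negative, so the click is positive.
    is_positive = not bool(mask[y, x])
    return int(y), int(x), is_positive
```

Returning `None` below the threshold is one simple way to hand control back to the user when the model has no confident correction to offer.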

Visual Attention and Eye-Tracking Approximation

BubbleView (Kim et al., 2017) uses clicks as a proxy for gaze: each click uncovers a small circular region of a blurred image, so the click sequence records which regions participants deem important. Empirical studies show BubbleView clicks correlate highly with true eye fixations and can be systematically aggregated into importance/saliency maps, supporting model training, perceptual experiments, and UI/graphic evaluation.
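One common way to aggregate click locations into an importance map is to place a Gaussian at each click and normalize the sum. The sketch below assumes this approach; the function name, the `(y, x)` click format, and the value of `sigma` are illustrative and are not the BubbleView authors' exact procedure.

```python
import numpy as np

def click_importance_map(shape, clicks, sigma=10.0):
    """Sum one Gaussian per click and scale the result so the peak equals 1."""
    h, w = shape
    yy, xx = np.mgrid[:h, :w]
    heat = np.zeros((h, w), dtype=np.float64)
    for y, x in clicks:
        heat += np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2.0 * sigma ** 2))
    peak = heat.max()
    return heat / peak if peak > 0 else heat
```

Regions that attract clicks from many participants accumulate more mass, so the normalized map can be compared against eye-tracking fixation maps using standard saliency metrics.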

GUI Automation and Grounding

SeeClick has been adapted as a vision–language interface for accurate GUI element localization in the absence of structured metadata or accessibility tags (Cheng et al., 2024). Here, SeeClick formalizes GUI grounding as estimating the conditional distribution $p(y \mid s, x)$, where $s$ is the screenshot, $x$ is the natural-language instruction, and $y$ is the output click coordinates. Grounding is pre-trained over million-scale datasets with mixed tasks (text-to-point, text-to-box, box-to-text) and becomes the spatial module in a decoupled agent framework (Hoscilowicz et al., 2024), allowing specialized LLMs to plan high-level actions while SeeClick, or a lighter model such as TinyClick, localizes the relevant UI elements.
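The decoupled arrangement can be sketched as two interchangeable callables: a planner that decides what to click and a grounder that decides where. Everything in this example (the `Planner` and `Grounder` signatures, the `ClickAction` type, and the use of normalized coordinates) is a hypothetical interface for illustration, not the API of SeeClick or TinyClick.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Point = Tuple[float, float]                 # normalized (x, y) in [0, 1]
Planner = Callable[[str, object], str]      # (goal, screenshot) -> element description
Grounder = Callable[[object, str], Point]   # (screenshot, instruction) -> click point

@dataclass
class ClickAction:
    instruction: str
    x_px: int
    y_px: int

def agent_step(goal, screenshot, size, planner: Planner, grounder: Grounder) -> ClickAction:
    """One decoupled step: the planner picks the target, the grounder localizes it."""
    instruction = planner(goal, screenshot)   # high-level policy: what to click
    x, y = grounder(screenshot, instruction)  # spatial module: where to click
    width, height = size
    return ClickAction(instruction, round(x * width), round(y * height))
```

Because the two components only communicate through a text instruction and a point, either one can be swapped for a different model without retraining the other.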

Video Object Segmentation and Tracking

Click-based interactive video segmentation frameworks (CiVOS (Vujasinovic et al., 2022)) adopt the SeeClick paradigm by mapping user clicks to spatial heatmaps, updating segmentation masks in a modular interaction–propagation pipeline via memory networks and difference-aware fusion. For tracking (Wang et al., 2024), the Guided Click Refiner (GCR) transforms single-point clicks (optionally with text) into bounding boxes, initializing downstream trackers or segmenters with minimal latency and competitive accuracy.

Click Modeling in Search Engines

The SeeClick paradigm underpins vision-biased click models (Xu et al., 2021), where the probability of a user "seeing" an image search result—modeled via vision features—is combined with positional bias to better fit click logs. Regression-EM learns the mapping from image features to the vision bias term, significantly improving prediction accuracy and data efficiency under sparse observation.
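To show how a vision term can enter a click model, the sketch below assumes a simple position-based factorization in which the examination probability is the product of a rank-dependent position bias and a vision bias computed as a logistic function of image features. This parameterization, along with every name in the code, is an assumption for illustration and should not be read as the exact model of Xu et al.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def click_probability(rank, image_features, attractiveness, position_bias, w, b=0.0):
    """P(click) = P(examined | rank, image) * P(attractive), under the assumed factorization."""
    vision_bias = sigmoid(float(np.dot(w, image_features)) + b)  # learned from image features
    examine = position_bias[rank] * vision_bias
    return examine * attractiveness

def log_likelihood(clicks, probs, eps=1e-12):
    """Bernoulli log-likelihood of observed clicks (0/1) under predicted probabilities."""
    clicks = np.asarray(clicks, dtype=float)
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    return float(np.sum(clicks * np.log(probs) + (1.0 - clicks) * np.log(1.0 - probs)))
```

In an EM-style fit, the weights `w` would be updated by regressing the inferred examination posteriors on the image features, which is the role the text attributes to Regression-EM.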

3. Algorithmic and Architectural Components

SeeClick implementations typically employ the following architectural motifs:

Click Encoding: User prompts are represented as binary or soft disk maps (positive/negative), heatmaps, or coordinate tokens, fused into the visual input stream or as separate channels.
Patch-based Fusion: In segmentation and grounding models (e.g., SimpleClick (Liu et al., 2022)), the image and user input maps are patchified and linearly embedded before element-wise fusion; this is crucial for alignment in transformer backbones and for enabling pre-trained weights transfer.
Modular Decoupling: Especially in GUI agents (Cheng et al., 2024, Hoscilowicz et al., 2024), planning (via LLMs) and perception (via SeeClick-style vision-LLMs) are modularly separated, allowing each to be improved independently.
Prompt Propagation and Mask Refinement: In interactive video and scene graph tasks (Vujasinovic et al., 2022, Ruschel et al., 2025), interaction modules are paired with propagation or discovery modules that extend the impact of each click to temporally or semantically related regions.
Losses and Training: Multi-task objectives, often including normalized focal loss, cross-entropy, regression losses, and set-based matching, enforce spatial grounding, segmentation, and semantic alignment simultaneously.
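The patch-based fusion motif listed above can be sketched with plain NumPy: the image and the click maps are split into the same patch grid, each is projected by its own linear embedding, and the two token sequences are added element-wise. The function names, the weight shapes, and the omission of biases and positional embeddings are simplifying assumptions.

```python
import numpy as np

def patchify(x, patch):
    """(C, H, W) -> (num_patches, C * patch * patch); H and W must be multiples of patch."""
    c, h, w = x.shape
    x = x.reshape(c, h // patch, patch, w // patch, patch)
    # Reorder to (rows, cols, C, patch, patch) so each patch is contiguous.
    x = x.transpose(1, 3, 0, 2, 4)
    return x.reshape((h // patch) * (w // patch), c * patch * patch)

def fused_tokens(image, click_maps, w_img, w_clk, patch=16):
    """Embed image and click maps on the same patch grid, then fuse by addition."""
    img_tokens = patchify(image, patch) @ w_img       # (N, D)
    clk_tokens = patchify(click_maps, patch) @ w_clk  # (N, D), aligned patch by patch
    return img_tokens + clk_tokens
```

Because the click branch only adds to the image tokens, an all-zero click map leaves the image embedding unchanged, which is one reason this kind of fusion is compatible with reusing pre-trained image-embedding weights.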

4. Evaluation Metrics and Benchmarks

Domain-specific metrics quantify SeeClick performance:

Interactive Segmentation: Number of Clicks (NoC@τ) to reach a target intersection-over-union (IoU), mean IoU after $k$ clicks, or annotation time reduction. State-of-the-art examples: SimpleClick achieves NoC@90 = 4.15, a 21.8% reduction over prior methods (Liu et al., 2022); PseudoClick further reduces user effort by automatically proposing clicks (Liu et al., 2022).
Attention Approximation: Correlation coefficients (CC), normalized scanpath saliency (NSS), KL divergence, and agreement with eye-tracked data, as in BubbleView (Kim et al., 2017).
GUI Grounding: Click accuracy (fraction of predictions within the correct bounding box), separately for text versus icon elements and cross-platform (ScreenSpot benchmark (Cheng et al., 2024)).
Video Tracking and Scene Graphs: Prompt localization recall, spatial interaction recall, and end-to-end triplet recall at rank $K$ (Ruschel et al., 2025), as well as real-time tracking speed (Wang et al., 2024).
Click Modeling: Log-likelihood, perplexity, and MRR on search logs, with careful breakdowns by query frequency and position (Xu et al., 2021).
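Two of the metrics above are simple enough to state in code. The sketch below computes NoC@τ from a per-click IoU history and click accuracy from predicted points and ground-truth boxes. The cap of 20 clicks for samples that never reach the threshold is a common convention assumed here, and the function names and the `(x0, y0, x1, y1)` box format are illustrative.

```python
import numpy as np

def iou(pred, target):
    """Intersection-over-union of two boolean masks."""
    union = np.logical_or(pred, target).sum()
    return 1.0 if union == 0 else np.logical_and(pred, target).sum() / union

def noc(iou_per_click, threshold=0.9, max_clicks=20):
    """Number of Clicks: first click count whose IoU reaches the threshold.

    Returns max_clicks if the threshold is never reached (assumed convention).
    """
    for k, value in enumerate(iou_per_click[:max_clicks], start=1):
        if value >= threshold:
            return k
    return max_clicks

def click_accuracy(points, boxes):
    """Fraction of predicted (x, y) points inside their (x0, y0, x1, y1) target boxes."""
    hits = [x0 <= x <= x1 and y0 <= y <= y1
            for (x, y), (x0, y0, x1, y1) in zip(points, boxes)]
    return sum(hits) / len(hits) if hits else 0.0
```

For example, an IoU history of `[0.5, 0.8, 0.92]` gives NoC@90 = 3, and averaging this value over a dataset yields the figure reported in segmentation benchmarks.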

Recent RClicks benchmarks (Antonov et al., 2024) highlight the need for realistic, distribution-aware click simulation for robust segmentation model evaluation, showing that common baseline strategies systematically underestimate annotation costs by 5–30% and fail to capture user click variability.

5. Impact, Generalization, and Limitations

The SeeClick paradigm has led to substantial gains across several domains:

Annotation Efficiency: Dramatic reductions in user input demand for segmentation and tracking, with click-based pipelines matching or surpassing classic scribble- or box-based workflows (Liu et al., 2022, Vujasinovic et al., 2022, Wang et al., 2024).
Scalable Attention Modeling: BubbleView’s click-based data is now a standard for saliency model training, enabling large-scale, cost-effective, and privacy-preserving perceptual studies (Kim et al., 2017).
Cross-Domain Generalization: Pre-trained SeeClick models generalize well to medical imaging and out-of-domain GUI platforms without retraining (Liu et al., 2022, Cheng et al., 2024), suggesting underlying robustness of the click–response mapping.
Decoupled Modularity: Separating perception from decision making in GUI agents provides both robustness to hallucination and rapid upgradability, as specialized UI localizers can replace or augment the vision capabilities built into general-purpose LLMs (Hoscilowicz et al., 2024).

However, limitations persist:

Click Simulation Realism: Standard center-of-error click simulators fail to match real human behavior, resulting in inflated benchmark results and hidden robustness flaws (Antonov et al., 2024). Richer, distribution-aware metrics and training regimes are now promoted.
Ambiguity and Hierarchy: Single clicks may be ambiguous (small objects, overlapping regions, hierarchical containment) (Wang et al., 2024). Approaches using auxiliary text or guided click heads help mitigate this ambiguity, but challenges remain.
Speed and Feedback: While click input is faster than drawing, rapid, dense tasks (e.g., video) may still overload users relative to fully automated methods. Real-time constraints necessitate extremely lightweight fusion and inference designs.

6. Representative Table: SeeClick Paradigm Instantiations

Domain	Model/Framework	Key Metric(s)	Notable Result(s)
Image Segmentation	SimpleClick (Liu et al., 2022)	NoC@90	4.15 on SBD (↓21.8%); strong medical generalization
Visual Saliency	BubbleView (Kim et al., 2017)	CC, NSS, SIM, KL	CC≈0.84 on visualizations; 10–15 participants suffice
GUI Grounding	SeeClick (Cheng et al., 2024)	Click accuracy, task SR	53.4% click accuracy overall; clear gains on AITW, Mind2Web
Video Object Segmentation	CiVOS (Vujasinovic et al., 2022)	R-metric, AUC_JF	R=0.76 with clicks; approaches scribble-based methods
Tracking	ClickTrack (Wang et al., 2024)	Success rate, Precision, mAO	62.4% AUC with point; 65.0% with point+text; 31 FPS
Click Modeling	vUBM (Xu et al., 2021)	Log-likelihood, MRR	+5.5% LL; +3.4% MRR (low-frequency queries) over classical PBM

This table aggregates salient instantiations, underscoring the cross-domain applicability of SeeClick principles.

7. Future Directions and Methodological Implications

Emerging areas of SeeClick research target:

Dense, Multi-modal Prompt Fusion: Integrating click maps with language, audio, or continuous feedback for more expressive, context-aware response modeling.
Human-in-the-Loop Robustness: A shift toward training with realistic, user-sourced click distributions (e.g., RClicks) to harden models against deployment shift.
Rich UI Reasoning: Fusing SeeClick-style spatial grounding with retrieval-augmented and chain-of-thought LLMs (Hoscilowicz et al., 2024), enhancing both precision and explainability in GUI agents.
Continuous Distillation: Automated curation and continual learning on evolving GUI or visual domains, keeping SeeClick models relevant without manual annotation (Cheng et al., 2024).
Unified Cross-task Evaluation: Benchmarks harmonizing SeeClick across attention, segmentation, grounding, and search to yield comprehensive generalization diagnostics.

SeeClick, in various implementations, has established itself as a foundational interactive paradigm in modern computer vision, interactive machine learning, and multimodal agent research, with broad applicability and ongoing methodological innovation.