Scanpath Selection Policy Overview

Updated 4 June 2026

Scanpath Selection Policy is a framework that generates gaze fixations as a sequential, conditional probability process based on visual content and prior fixations.
It encompasses both deterministic and stochastic models, from static saliency maps to advanced deep neural architectures, to capture human-like variability in gaze behavior.
Key applications include simulating human visual attention, optimizing sensor bandwidth, and enabling task-driven perception in both 2D imagery and 3D volumetric data.

A scanpath selection policy defines the algorithmic or probabilistic mechanism by which a sequence of gaze fixations—spatial (and sometimes temporal) locations in an image, video, or volumetric data—is generated, typically to model or predict human visual attention. Modern scanpath policies are cast as conditional distributions or deterministic mappings that, at each step, use both the stimulus context (image/video/task) and past fixation history to select the next fixation point. This mechanism is central to computational models of visual attention, enabling not only prediction and simulation of human gaze, but also active perception, bandwidth-efficient sensing, and task-driven vision.

1. Mathematical Characterization of Scanpath Policies

The foundational perspective is to treat scanpath selection as a sequential, conditional probability process. For an image $I$ and a scanpath $S = (f_0, f_1, \ldots, f_N)$, most models factor the joint probability as:

$P(f_0, \ldots, f_N \mid I) = \prod_{t=0}^{N} P(f_t \mid f_0, \ldots, f_{t-1}, I)$

At each step, the policy computes $P(f_t \mid f_0, \ldots, f_{t-1}, I)$, commonly in the form of a priority/saliency map over pixel locations, from which the next fixation is either sampled or taken as the argmax. Extensions incorporate fixation duration, object-based representations, and 3D volumetric coordinates in task-adaptive settings (Kümmerer et al., 2021, Pham et al., 16 Jul 2025).
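The chain-rule factorization can be illustrated with a minimal sketch that scores a scanpath by summing per-step conditional log-probabilities. The function names (`normalize`, `toy_conditional`, `scanpath_log_likelihood`) and the 0.1 down-weighting of visited cells are assumptions made for illustration; they are not taken from any of the cited models.

```python
import math

def normalize(grid):
    """Scale a non-negative 2D grid so that its entries sum to 1."""
    total = sum(sum(row) for row in grid)
    return [[v / total for v in row] for row in grid]

def toy_conditional(history, priority):
    """Hypothetical stand-in for P(f_t | f_0..f_{t-1}, I).

    The priority map is damped at already-visited cells (factor 0.1 is an
    arbitrary illustrative choice) and then renormalized.
    """
    visited = set(history)
    damped = [[v * (0.1 if (r, c) in visited else 1.0)
               for c, v in enumerate(row)]
              for r, row in enumerate(priority)]
    return normalize(damped)

def scanpath_log_likelihood(scanpath, priority, conditional=toy_conditional):
    """Chain rule: log P(S | I) = sum over t of log P(f_t | f_0..f_{t-1}, I)."""
    log_p = 0.0
    for t, (r, c) in enumerate(scanpath):
        probs = conditional(scanpath[:t], priority)  # empty history at t = 0
        log_p += math.log(probs[r][c])
    return log_p

if __name__ == "__main__":
    priority = [[1.0, 1.0], [1.0, 1.0]]
    print(scanpath_log_likelihood([(0, 0), (1, 1)], priority))
```

Any model that exposes a normalized per-step map could be plugged in through the `conditional` argument; the toy conditional only makes the sketch self-contained.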

Table: Canonical Policy Formulations

Model Family	Conditional Policy	History Usage
Static saliency	$P(f_t \mid I) \propto S(x, y \mid I)$	None
Markov saliency	$P(f_t \mid f_{t-1}, I) \propto S(x, y) \cdot \text{IoR}(x, y; f_{t-1})$	1-step, with IoR
RNN/Transformer	$P(f_t \mid f_0, ..., f_{t-1}, I)$ via deep sequential module	Full history
Probabilistic, sampled	$P(f_t \mid \cdots)$ as categorical distribution, sampled at inference	Stochastic sampling
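
The Markov saliency row of the table can be sketched as follows. The inhibition-of-return (IoR) term is assumed here to be one minus a Gaussian bump centred on the previous fixation; this functional form, the `sigma` parameter, and the function names are illustrative assumptions, not a formulation taken from the cited work.

```python
import math
import random

def ior_mask(shape, prev, sigma=1.0):
    """Assumed IoR term: 1 - Gaussian centred on the previous fixation."""
    rows, cols = shape
    pr, pc = prev
    return [[1.0 - math.exp(-((r - pr) ** 2 + (c - pc) ** 2) / (2 * sigma ** 2))
             for c in range(cols)]
            for r in range(rows)]

def markov_step(saliency, prev, rng, sigma=1.0):
    """Sample f_t with probability proportional to S(x, y) * IoR(x, y; f_{t-1})."""
    rows, cols = len(saliency), len(saliency[0])
    mask = ior_mask((rows, cols), prev, sigma)
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    weights = [saliency[r][c] * mask[r][c] for r, c in cells]
    # random.choices normalizes the weights internally.
    return rng.choices(cells, weights=weights, k=1)[0]

def markov_scanpath(saliency, start, n_steps, seed=0):
    """Roll out a scanpath where each step depends only on the previous fixation."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(n_steps):
        path.append(markov_step(saliency, path[-1], rng))
    return path

if __name__ == "__main__":
    saliency = [[1.0, 2.0, 1.0], [2.0, 4.0, 2.0], [1.0, 2.0, 1.0]]
    print(markov_scanpath(saliency, start=(1, 1), n_steps=5))
```

Because the mask is exactly zero at the previous fixation, this sketch never refixates the same cell on consecutive steps; a softer mask would merely make refixation unlikely.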

2. Probabilistic and Deep Policy Architectures

Diverse strategies instantiate scanpath policies, ranging from analytical models that combine saliency with oculomotor bias to deep neural architectures. Notable representatives include:

GAN-based scanpath generators: PathGAN employs a generator network that, given an image and a noise input, autoregressively outputs fixation coordinates and dwell times, trained with a mixture of adversarial (cGAN) and L2 content losses for realism and diversity (Assens et al., 2018).
Bayesian recurrent models: Probabilistic time-evolving frameworks utilize ConvLSTM architectures with variational weights, outputting per-step spatial distributions, trained under spatio-temporal losses combining KL divergence, dynamic time warping, and center-bias regularization (Martin et al., 2022).
Multi-modal or individualized modules: Recent advances integrate observer encoding, user/task guidance, semantic segmentation, and fusion with language or 3D scene context. The fixation map at each step becomes observer-adaptive, task- or history-aware, and potentially object- or language-conditioned (Chen et al., 2024, Quan et al., 22 Apr 2026, Pham et al., 16 Jul 2025).

In most of these models the core mechanism is sequential: the next fixation is determined by the conditional output of a spatial probability or activation map, typically modulated by saliency, oculomotor bias, and the model's memory.

3. Deterministic vs. Stochastic Policies

Scanpath policies can be deterministic (sampling-free) or stochastic:

Deterministic policies: Some policies compute the next fixation as a (soft-)argmax of a map. For example, SalyPath360 obtains each fixation by applying spatial soft-argmax over a dedicated attention heatmap channel, yielding a fixed-length sequence devoid of recurrent dependencies (Kerkouri et al., 2022).
Stochastic (sampling) policies: Most policies, especially those that aim to mimic human variability, explicitly sample fixations from learned spatial probability maps. UMSS samples according to saliency modulated by foveal continuity and duration allocation (Wang et al., 2021), while PathGAN and probabilistic deep policies also inject random noise or sample via categorical distributions over the priority maps (Assens et al., 2018, Martin et al., 2022).

Stochasticity is often critical for matching the inter- and intra-observer variability observed in human scanpaths and is evaluated using distributional or sequence-matching metrics.
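
The contrast between the two policy types can be sketched on a single heatmap: a spatial soft-argmax returns the same expected coordinate every time, while categorical sampling returns a different cell per draw. This is a generic illustration, not the SalyPath360, UMSS, or PathGAN implementation; the function names and the `temperature` parameter are assumptions.

```python
import math
import random

def softmax2d(heatmap, temperature=1.0):
    """Turn a 2D activation map into a probability map."""
    peak = max(max(row) for row in heatmap)  # subtract the max for stability
    exps = [[math.exp((v - peak) / temperature) for v in row] for row in heatmap]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]

def soft_argmax(heatmap, temperature=1.0):
    """Deterministic policy: expected (row, col) under the softmax map."""
    probs = softmax2d(heatmap, temperature)
    row = sum(r * p for r, prow in enumerate(probs) for p in prow)
    col = sum(c * p for prow in probs for c, p in enumerate(prow))
    return row, col

def sample_fixation(heatmap, rng, temperature=1.0):
    """Stochastic policy: draw one cell from the categorical softmax map."""
    probs = softmax2d(heatmap, temperature)
    cells = [(r, c) for r in range(len(probs)) for c in range(len(probs[0]))]
    weights = [probs[r][c] for r, c in cells]
    return rng.choices(cells, weights=weights, k=1)[0]

if __name__ == "__main__":
    heatmap = [[0.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 0.0]]
    print(soft_argmax(heatmap))                       # identical on every call
    rng = random.Random(0)
    print([sample_fixation(heatmap, rng) for _ in range(5)])  # varies per draw
```

Lowering `temperature` sharpens the map, so both policies converge toward the hard argmax; raising it increases the spread of sampled fixations.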

4. Object-Based, Uncertainty-Guided, and Task-Driven Policies

Advanced approaches leverage task signals, object semantics, or scene uncertainty to drive scanpath selection:

Object-based and uncertainty-guided: Robotics-inspired policies use Bayesian filtering for scene segmentation and uncertainty estimation. Fixation selection follows a drift-diffusion process, accumulating saliency- and uncertainty-weighted evidence per object; saccades are triggered when the evidence for an object crosses a threshold, promoting balanced object inspection and temporal inhibition-of-return effects (Mengers et al., 2024).
Task-conditioned sensor attention: In foveated imaging systems, the scanpath policy is formulated as a sequential decision process over sensor viewport actions, optimizing downstream task reward (e.g., classification, tracking) under hard pixel budgets. The policy is trained, often by supervised imitation of an oracle, to allocate sensor bandwidth to the most informative regions (Xiao et al., 1 Jun 2026).
Object referring and language-conditioned: The ScanVLA architecture fuses vision and text via a large multimodal model and segmentation modules, utilizing a history-encoded decoder to select valid fixations aligned with the referent object, conditioned on the linguistic query (Quan et al., 22 Apr 2026).
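
The object-based, uncertainty-guided mechanism described above can be sketched as a per-object evidence accumulator. Several details are assumptions made for illustration and are not taken from Mengers et al. (2024): the drift is the product of saliency and uncertainty, evidence is reset to zero after a saccade, the fixated object does not accumulate, and fixating an object halves its uncertainty. All names and default parameters are hypothetical.

```python
import random

def drift_diffusion_scanpath(saliency, uncertainty, n_saccades, threshold=1.0,
                             dt=0.05, noise=0.05, seed=0, max_steps=100_000):
    """Return the sequence of fixated objects.

    saliency, uncertainty: dicts mapping object id -> non-negative float.
    """
    rng = random.Random(seed)
    evidence = {obj: 0.0 for obj in saliency}
    uncertainty = dict(uncertainty)  # local copy, updated as objects are inspected
    current = None
    path = []
    for _ in range(max_steps):
        if len(path) >= n_saccades:
            break
        for obj in evidence:
            if obj == current:
                continue  # assumption: the fixated object gathers no evidence
            drift = saliency[obj] * uncertainty[obj]
            evidence[obj] += drift * dt + noise * dt ** 0.5 * rng.gauss(0.0, 1.0)
        winner = max(evidence, key=evidence.get)
        if evidence[winner] >= threshold:
            # Threshold crossing triggers a saccade to that object.
            current = winner
            path.append(winner)
            evidence[winner] = 0.0       # assumption: reset acts as temporal IoR
            uncertainty[winner] *= 0.5   # assumption: inspection reduces uncertainty
    return path

if __name__ == "__main__":
    sal = {"cup": 1.0, "book": 0.6, "lamp": 0.3}
    unc = {"cup": 1.0, "book": 1.0, "lamp": 1.0}
    print(drift_diffusion_scanpath(sal, unc, n_saccades=6))
```

In this sketch, the reset and the uncertainty decay together keep a single salient object from monopolizing the scanpath, which is one simple way to obtain the balanced inspection behavior described above.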

This family of policies illustrates a trend toward more grounded, semantically precise, and utility-driven gaze modeling beyond general free-viewing tasks.
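
The budgeted, task-conditioned setting can be illustrated with a toy viewport selector. The sketch assumes access to a per-pixel informativeness grid (`info`, a hypothetical oracle score) and greedily picks fixed-size viewports until a hard pixel budget is exhausted. It is a hand-written greedy rule for illustration only; it is not the learned policy of Xiao et al. (1 Jun 2026), and all names are hypothetical.

```python
def greedy_viewports(info, viewport, pixel_budget):
    """Pick viewport top-left corners greedily under a hard pixel budget.

    info: 2D grid of per-pixel informativeness (assumed to be given).
    viewport: (height, width) of each sensor readout.
    Returns (list of chosen top-left corners, pixels spent).
    """
    h, w = viewport
    cost = h * w  # every readout costs its full pixel count
    rows, cols = len(info), len(info[0])
    seen = [[False] * cols for _ in range(rows)]
    actions, spent = [], 0
    while spent + cost <= pixel_budget:
        best, best_gain = None, 0.0
        for r in range(rows - h + 1):
            for c in range(cols - w + 1):
                # Only pixels not read before contribute new information.
                gain = sum(info[i][j]
                           for i in range(r, r + h)
                           for j in range(c, c + w)
                           if not seen[i][j])
                if gain > best_gain:
                    best, best_gain = (r, c), gain
        if best is None:
            break  # nothing informative is left
        r, c = best
        for i in range(r, r + h):
            for j in range(c, c + w):
                seen[i][j] = True
        actions.append(best)
        spent += cost
    return actions, spent

if __name__ == "__main__":
    info = [[5, 5, 0, 0],
            [5, 5, 0, 0],
            [0, 0, 3, 3],
            [0, 0, 3, 3]]
    print(greedy_viewports(info, viewport=(2, 2), pixel_budget=8))
```

A greedy rule of this kind is one plausible form an oracle could take when a policy is trained by supervised imitation; a learned policy would have to choose viewports without seeing `info` directly.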

5. Evaluation, Metrics, and Behavioral Alignment

Standard and enriched evaluation protocols are crucial for scanpath selection policies, as metric selection directly impacts conclusions regarding human-likeness and task alignment:

Per-fixation metrics: Area under the ROC curve (AUC), normalized scanpath saliency (NSS), and log-likelihood per fixation are widely used (Kümmerer et al., 2021).
Sequence metrics: Sequence Score, Dynamic Time Warping (DTW), MultiMatch, and Jarodzka vector distance capture spatio-temporal and alignment constraints; their sensitivity to dataset bias is actively debated (Pan et al., 16 Feb 2026, Martin et al., 2022).
Debiased composite metrics: The Gaze Consistency Score (GCS), incorporating baseline debiasing and movement similarity, exposes regimes where models achieve genuine behavioral alignment rather than exploiting trivial spatial biases (Pan et al., 16 Feb 2026).
Qualitative indicators: Human expert judgment, success at reproducing specific human-like tendencies (e.g., object inspection, temporal inhibition-of-return), and ablation-driven diagnostics refine the alignment assessment (Mengers et al., 2024, Pham et al., 16 Jul 2025).

Rigorous benchmarking typically involves both comparison to human inter-observer variability and ablation against center or oculomotor baselines.
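
Two of the metrics above are simple enough to sketch directly: NSS as the mean z-scored map value at fixated cells, and DTW as a dynamic-programming alignment cost between two scanpaths. The sketch assumes the population standard deviation for the z-score and Euclidean distance between fixations; published implementations may differ in these details, and the function names are hypothetical.

```python
import math

def nss(saliency, fixations):
    """Normalized scanpath saliency: mean z-scored map value at the fixations."""
    values = [v for row in saliency for v in row]
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return sum((saliency[r][c] - mean) / std for r, c in fixations) / len(fixations)

def dtw(path_a, path_b):
    """Dynamic time warping cost between two non-empty fixation sequences."""
    n, m = len(path_a), len(path_b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(path_a[i - 1], path_b[j - 1])  # Euclidean distance
            cost[i][j] = d + min(cost[i - 1][j],      # repeat path_b[j-1]
                                 cost[i][j - 1],      # repeat path_a[i-1]
                                 cost[i - 1][j - 1])  # advance both paths
    return cost[n][m]

if __name__ == "__main__":
    saliency = [[0.0, 0.0], [0.0, 4.0]]
    print(nss(saliency, [(1, 1)]))                     # fixation on the peak
    print(dtw([(0, 0), (1, 1)], [(0, 0), (0, 0), (1, 1)]))
```

NSS scores a single map against fixations and ignores order, whereas DTW compares two ordered sequences and tolerates differences in length, which is why the two families of metrics can rank the same model differently.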

6. Policy Adaptation: Individualization, 3D, and Application-Specific Regimes

Modern scanpath policies are increasingly individualized, volumetric, and adaptive to the task and to dataset statistics:

Individualized prediction: Observer encoders parameterize the policy by embedding user traits, which modulate both spatial guidance and dynamic prioritization, resulting in scanpath distributions specific to each observer (Chen et al., 2024).
3D scanpath selection: For volumetric data, the policy acts on 3D tokens with positional encoding, optionally pre-trained via synthetic 2D-to-3D lifted datasets, yielding scanpaths with both spatial and depth coordinates tailored to expert medical reading (Pham et al., 16 Jul 2025).
Constrained and task-driven regimes: Hard-attention models on object-centric datasets reveal conditions (peripheral “sweet spots”) where sensorimotor constraints induce optimal alignment to human gaze, confirmed by debiased composite metrics rather than raw task accuracy (Pan et al., 16 Feb 2026).

Table: Modern Regimes and Policy Components

Mode	State Representation	Policy Conditioning	Notable Example
General free-viewing	Image, past fixations	None or saliency	DeepGaze III (Kümmerer et al., 2021)
Individualized	Image, past fixations, user	Observer traits	ISP (Chen et al., 2024)
3D volumetric	Volume, past fixations	Segmentation, stop token	CT-Searcher (Pham et al., 16 Jul 2025)
Task-driven/foveated	Sensor history, tasks	Downstream reward, budget	Policy-based Foveation (Xiao et al., 1 Jun 2026)

7. Open Challenges and Future Directions

Outstanding issues in scanpath selection policy research include:

Multimodality and overfitting: How to learn and evaluate policies on datasets with ambiguous human gaze or multiple plausible scanpaths without mode collapse.
Integration of semantics and context: Efficiently leveraging high-level scene/object knowledge, multi-modal cues, and top-down goals for robust scanpath generation.
Interpretability and control: Balancing end-to-end data-driven policies with mechanistically interpretable control (e.g., explicit IoR, semantically aware guidance).
Evaluation under real-world constraints: Ensuring policy utility in real-time, low-resource, or bandwidth-limited applications, where active selection truly matters (Xiao et al., 1 Jun 2026).
Benchmark clarity: Broad adoption of debiased, context-sensitive metrics, and reporting of trivial/oculomotor baselines to reveal substantive progress (Pan et al., 16 Feb 2026).

These directions point toward policies that are not only statistically human-like, but also contextually adaptable, individualized, and ecologically valid for both modeling and applied machine vision.