Temporal Visual Screening Overview

Updated 4 July 2026

Temporal Visual Screening is a family of techniques that treats visual signals as dynamic processes rather than static images.
It employs temporal methods such as 3D CNNs, GRUs, and Dynamic Time Warping (DTW) to preserve sequence information from event-based sensing and video trajectories.
Reported results indicate that incorporating temporal dynamics often improves accuracy over frame-by-frame analysis.

Temporal Visual Screening, as an umbrella formulation suggested by recent work in vision, sensing, and computational screening, denotes screening or assessment in which the target signal is a temporally evolving visual process rather than a static image. In this sense, the monitored phenomenon may be an event stream generated by the eye region, rhythmic retinal venous pulsation, perceptual trajectory degradation in a video, 3D head-pose and facial dynamics during attention tasks, gaze-exploration behavior, or EEG-aligned ocular timing dynamics (Ren et al., 31 Mar 2025, Sheng et al., 2023, Liao et al., 2022, Qadir et al., 3 Jun 2026, Alcala-Durand et al., 1 Sep 2025, Sarkar et al., 14 May 2026). A unifying implication of these works is that screening depends on temporal ordering, duration, continuity, recurrence, or long-range dependence, so framewise inspection alone is usually insufficient.

1. Conceptual basis

Temporal Visual Screening differs from static visual screening by treating the visual target as a process. In event-based eye tracking, the relevant object is a dynamic ocular process containing gaze shifts, saccades, blinks, and resets, and the core claim is that accurate pupil localization depends on modeling temporal dynamics instead of treating observations independently (Ren et al., 31 Mar 2025). In retinal spontaneous venous pulsation assessment, the target is a weak rhythmic caliber change in the optic disc region, so the diagnostically relevant signal is a subtle cyclic deformation over time rather than a still-image feature (Sheng et al., 2023). In blind video quality assessment, temporal quality is represented by the geometry of frame trajectories in perceptual spaces associated with LGN and V1, and distortion is interpreted as curvature and fragmentation of those trajectories (Liao et al., 2022).

The same temporal logic appears in behavioral and neurophysiological screening. In school-age autism spectrum disorder screening, the signal of interest is not a static face image but evolving 3D head displacement and pose-invariant facial dynamics during VR-CPT attention tasks (Qadir et al., 3 Jun 2026). In Parkinson’s disease screening from visual exploration, the useful signal is the time-ordered pattern of fixations, saccades, dwell persistence, and returns to gaze clusters across six 15-second exploration tasks (Alcala-Durand et al., 1 Sep 2025). In EEG-based assessment of ocular response timing for mTBI-oriented evaluation, the problem is posed as temporal alignment between AR stimulus trajectories and EEG-derived features, with Dynamic Time Warping used after neural decoding (Sarkar et al., 14 May 2026).

This body of work suggests that Temporal Visual Screening is best understood as a family of screening problems in which the decisive evidence is temporal organization embedded in visual, oculomotor, or neuro-visual signals. The common goal is not merely to detect what is present, but to determine how a visual phenomenon unfolds.

2. Signal models and temporal representations

Representative TVS systems differ widely in modality, but they share explicit temporal signal parameterizations. Event-based eye tracking starts from an asynchronous event set

$\mathcal{E}=\left\{e_i=(x_i,y_i,t_i,p_i)\right\},$

where spatial location, timestamp, and polarity are preserved at event level. TDTracker ultimately uses a frame-based tensor

$\mathbf{I}\in\mathbb{R}^{C\times T\times H\times W},$

with Binary Map representation chosen to preserve temporal ordering explicitly along $T$ (Ren et al., 31 Mar 2025). This is a canonical TVS representation: the signal is sparse, irregular, and time-indexed before inference.
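As a rough illustration of how an event set can become a time-ordered binary tensor, the sketch below bins events into $T$ temporal slices and marks every (polarity, bin, row, column) cell that received at least one event. The function name `events_to_binary_map`, the two polarity channels, and the uniform time binning are assumptions made for illustration; the source does not specify TDTracker's Binary Map construction parameters.

```python
from typing import List, Tuple

# One event: (x, y, t, p). Polarity p in {0, 1} is an assumed encoding.
Event = Tuple[int, int, float, int]


def events_to_binary_map(events: List[Event], num_bins: int,
                         height: int, width: int) -> list:
    """Return a nested list shaped [2][num_bins][height][width] of 0/1 values."""
    grid = [[[[0] * width for _ in range(height)]
             for _ in range(num_bins)] for _ in range(2)]
    if not events:
        return grid
    t_min = min(e[2] for e in events)
    t_max = max(e[2] for e in events)
    span = (t_max - t_min) or 1.0  # guard against a single shared timestamp
    for x, y, t, p in events:
        # Uniform binning; the latest event is clamped into the final bin.
        b = min(int((t - t_min) / span * num_bins), num_bins - 1)
        grid[p][b][y][x] = 1  # binary occupancy: event counts are discarded
    return grid


events = [(0, 0, 0.00, 1), (1, 0, 0.01, 1), (1, 1, 0.09, 0), (1, 1, 0.10, 0)]
binary_map = events_to_binary_map(events, num_bins=2, height=2, width=2)
```

Keeping the bin index as its own axis is what preserves temporal ordering: two events at the same pixel in different bins remain distinguishable, whereas a single accumulated frame would merge them.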

In perceptual video screening, TPQI represents each frame by a reduced feature $\mathbf{x}_i\in\mathbb{R}^d$ in LGN- or V1-derived space, then evaluates local temporal trajectory units over three consecutive frames. The curvature term

$\theta_i$

measures local trajectory bending, the compactness term is

$S_i=\|\mathbf{x}_{i+1}-\mathbf{x}_{i-1}\|,$

and the selected local distortion descriptor is

$Q_i=\theta_i\times\sqrt{S_i}.$

Temporal pooling across a video yields

$Q_{TPQI}= \frac{ \log\left(\frac{1}{N-2}\sum_{i=2}^{N-1} Q_i^{LGN}\right) + \log\left(\frac{1}{N-2}\sum_{i=2}^{N-1} Q_i^{V1}\right) }{2},$

so the screening variable is not a frame score but a global temporal perceptual integrity index (Liao et al., 2022).
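The pooling above can be sketched numerically as follows. The text does not give the formula for $\theta_i$, so the code assumes it is the angle between the consecutive displacement vectors $\mathbf{x}_i-\mathbf{x}_{i-1}$ and $\mathbf{x}_{i+1}-\mathbf{x}_i$. The small `eps` inside the logarithm is also an added safeguard, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def _diff(a: Vector, b: Vector) -> List[float]:
    return [ai - bi for ai, bi in zip(a, b)]


def _norm(v: Vector) -> float:
    return math.sqrt(sum(c * c for c in v))


def local_scores(traj: Sequence[Vector], eps: float = 1e-12) -> List[float]:
    """Q_i = theta_i * sqrt(S_i) for every interior frame of the trajectory."""
    scores = []
    for i in range(1, len(traj) - 1):
        d_prev = _diff(traj[i], traj[i - 1])
        d_next = _diff(traj[i + 1], traj[i])
        denom = _norm(d_prev) * _norm(d_next)
        if denom < eps:
            theta = 0.0  # assumed convention: a stationary step counts as straight
        else:
            cos = sum(a * b for a, b in zip(d_prev, d_next)) / denom
            theta = math.acos(max(-1.0, min(1.0, cos)))
        s = _norm(_diff(traj[i + 1], traj[i - 1]))  # S_i as defined in the text
        scores.append(theta * math.sqrt(s))
    return scores


def tpqi(traj_lgn: Sequence[Vector], traj_v1: Sequence[Vector],
         eps: float = 1e-12) -> float:
    """Average of the log mean local score in the two feature spaces."""
    if len(traj_lgn) < 3 or len(traj_v1) < 3:
        raise ValueError("each trajectory needs at least three frames")
    mean_lgn = sum(local_scores(traj_lgn)) / (len(traj_lgn) - 2)
    mean_v1 = sum(local_scores(traj_v1)) / (len(traj_v1) - 2)
    return 0.5 * (math.log(mean_lgn + eps) + math.log(mean_v1 + eps))


straight = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
zigzag = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [3.0, 1.0]]
score_straight = tpqi(straight, straight)
score_zigzag = tpqi(zigzag, zigzag)
```

Under these assumptions a perfectly straight trajectory gives zero local scores and therefore a very negative pooled value, while the zigzag trajectory scores higher because every interior frame bends by a right angle.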

In 3D temporal ASD screening, DECA yields pose-invariant expression parameters

$\boldsymbol{\psi}\in\mathbb{R}^{50}$

and 3D head pose

$\boldsymbol{\theta}\in\mathbb{R}^{6}=[\theta_{\text{yaw}},\theta_{\text{pitch}},\theta_{\text{roll}},T_x,T_y,T_z],$

with frame-level fusion combining the expression and head-pose parameters into a single per-frame feature vector. A participant video is then represented as the time-ordered sequence of these per-frame vectors, which is screened at sequence level rather than frame level (Qadir et al., 3 Jun 2026).

Gaze-based Parkinson’s screening uses normalized gaze coordinates and converts them into both classic event summaries and state-like descriptors derived from Gaussian-mixture-defined High-Density Areas (HDAs). The HDA features include fractional occupancy, mean lifetime, mean interval length, and entropy of states. These quantities summarize how long gaze stays in a region, how often it returns, and how diversely it explores the stimulus (Alcala-Durand et al., 1 Sep 2025).
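To make the four state summaries concrete, the sketch below computes them from a per-sample sequence of HDA labels. The definitions are the usual ones for discrete state sequences and are assumptions here, not the paper's exact formulas. Occupancy is the fraction of samples in a state, lifetime is the mean run length, interval is the mean time spent elsewhere between consecutive visits, and entropy is the Shannon entropy of the occupancies. The name `hda_summaries` and the sampling step `dt` are hypothetical.

```python
import math
from itertools import groupby
from typing import Dict, List


def hda_summaries(states: List[int], dt: float = 1.0) -> Dict[str, object]:
    """Summarize a label sequence (one HDA label per gaze sample)."""
    n = len(states)
    # Collapse consecutive repeats into (label, run_length) pairs.
    runs = [(label, sum(1 for _ in grp)) for label, grp in groupby(states)]
    labels = sorted(set(states))
    occupancy = {k: states.count(k) / n for k in labels}
    lifetime, interval = {}, {}
    for k in labels:
        visits = [length for label, length in runs if label == k]
        lifetime[k] = dt * sum(visits) / len(visits)
        gaps, gap, seen = [], 0, False
        for label, length in runs:
            if label == k:
                if seen:
                    gaps.append(gap)  # samples spent away since the last visit
                seen, gap = True, 0
            elif seen:
                gap += length
        # A state visited only once has no return interval.
        interval[k] = dt * sum(gaps) / len(gaps) if gaps else float("nan")
    entropy = -sum(p * math.log(p) for p in occupancy.values() if p > 0)
    return {"occupancy": occupancy, "lifetime": lifetime,
            "interval": interval, "entropy": entropy}


summary = hda_summaries([0, 0, 1, 1, 1, 0, 2, 0, 0, 0])
```

In this toy sequence state 0 is visited three times with runs of 2, 1, and 3 samples, so its mean lifetime is 2 samples and its mean return interval is also 2 samples.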

In EEG-guided ocular timing assessment, the target trajectory is reduced to the L2 distance between the AR patch center and the AR window center, resampled at 100 Hz, while EEG is windowed into 2.56 s segments with 200 ms stride and transformed by 4-level RDWT before temporal decoding (Sarkar et al., 14 May 2026). This broad range of formulations suggests that TVS is not tied to one sensing technology; it is defined by the temporal structure of the evidence and the need to preserve or decode it.
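The windowing arithmetic described for the EEG stream can be sketched as below. The source gives the window length (2.56 s) and stride (200 ms) but not the EEG sampling rate used for windowing, so `sample_rate_hz` is left as a parameter. The 100 Hz value in the usage lines is an assumption borrowed from the trajectory resampling rate.

```python
from typing import List, Sequence


def sliding_windows(signal: Sequence[float], sample_rate_hz: float,
                    window_s: float = 2.56,
                    stride_s: float = 0.2) -> List[Sequence[float]]:
    """Cut a 1D signal into overlapping fixed-length windows."""
    win = round(window_s * sample_rate_hz)  # samples per window
    hop = round(stride_s * sample_rate_hz)  # samples between window starts
    # Only full windows are kept; a trailing partial window is dropped.
    return [signal[s:s + win] for s in range(0, len(signal) - win + 1, hop)]


signal = list(range(1000))  # ten seconds at the assumed 100 Hz
windows = sliding_windows(signal, sample_rate_hz=100.0)
```

At the assumed 100 Hz, each window holds 256 samples and consecutive windows start 20 samples apart, so adjacent windows overlap heavily.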

3. Screening architectures and operational pipelines

Several recurrent architectural patterns appear across TVS-oriented systems.

System	Temporal mechanism	Screening output
TDTracker (Ren et al., 31 Mar 2025)	3D CNN + Frequency-aware Module + GRU + Mamba	Pupil coordinates via 1D heat vectors
ODR-STL → NATM (Sheng et al., 2023)	Temporal clip selection + template-based stabilization	Stabilized ODR clips for SVP observation
TPQI (Liao et al., 2022)	HVS trajectory geometry over 3-frame units	Global temporal quality score
DECA + RNN (Qadir et al., 3 Jun 2026)	Full-sequence GRU/LSTM over 3D facial/head features	ASD vs TD classification
HDA + MoE (Alcala-Durand et al., 1 Sep 2025)	GMM-derived gaze-state summaries + expert fusion	PD screening score
RDWT + DTW (Sarkar et al., 14 May 2026)	Wavelet-domain EEG decoding + temporal alignment	Subject-specific ocular timing metrics

TDTracker is organized around a short-range/long-range decomposition. Its Implicit Temporal Dynamic branch uses 3D convolutions to extract local spatiotemporal structure, while its Explicit Temporal Dynamic cascade applies a Frequency-aware Module, GRU, and Mamba to model longer dependencies. Prediction is not direct coordinate regression; instead, the system outputs 1D probability vectors for the $x$ and $y$ coordinates, trains them with KL divergence against Gaussian-smoothed targets, and decodes coordinates by argmax (Ren et al., 31 Mar 2025). This is a distinctly temporal screening design because local transients and long-range ocular trajectories are separated rather than collapsed into one encoder.
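The heat-vector idea can be illustrated for a single axis as follows: build a Gaussian-smoothed target over pixel positions, compare a predicted distribution to it with KL divergence, and decode by argmax. This is a minimal sketch, not TDTracker's code. The vector length, `sigma`, and all function names are assumed for illustration.

```python
import math
from typing import List


def gaussian_target(center: float, length: int, sigma: float = 2.0) -> List[float]:
    """Normalized Gaussian bump over positions 0..length-1."""
    weights = [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(length)]
    total = sum(weights)
    return [w / total for w in weights]


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target: List[float], pred: List[float],
                  eps: float = 1e-12) -> float:
    """KL(target || pred); eps avoids log(0) and division by zero."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, pred) if t > 0)


def decode(pred: List[float]) -> int:
    """Argmax decoding: index of the most probable position."""
    return max(range(len(pred)), key=pred.__getitem__)


target_x = gaussian_target(center=12.0, length=32)
uniform_pred = softmax([0.0] * 32)
loss_uniform = kl_divergence(target_x, uniform_pred)
loss_perfect = kl_divergence(target_x, target_x)
decoded_x = decode(target_x)
```

One 1D vector per axis keeps the output size linear in the image side length, whereas a full 2D heatmap grows quadratically. The Gaussian smoothing gives partial credit to near misses during training.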

The retinal-video stabilization pipeline is structurally different but temporally analogous. ODR-STL first localizes the optic disc region frame by frame using Faster R-CNN, removes ODR-invisible and heavily jittered frames, and splits video into continuous usable clips. NATM then selects a smooth and sharp template period and performs noise-aware template matching, including specular suppression in blue and green channels, to align the ODR to a fixed field-of-view position (Sheng et al., 2023). The resulting output is not a diagnostic label; it is a temporally prepared clip in which the rhythmic venous signal becomes easier to inspect.
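The frame-rejection and clip-splitting step can be sketched as below, given one detected ODR center per frame (or `None` when the detector finds nothing). The source does not publish the rejection thresholds, so `max_jump_px` and `min_len` are hypothetical parameters. Treating a large jump as a cut that starts a new clip is also an assumption.

```python
import math
from typing import List, Optional, Sequence, Tuple

Center = Optional[Tuple[float, float]]  # None means the ODR is not visible


def split_usable_clips(centers: Sequence[Center], max_jump_px: float,
                       min_len: int) -> List[Tuple[int, int]]:
    """Return half-open frame ranges [start, end) of continuous usable clips."""
    clips: List[Tuple[int, int]] = []
    start: Optional[int] = None
    prev: Center = None
    for idx, center in enumerate(list(centers) + [None]):  # sentinel ends the last clip
        jumped = (center is not None and prev is not None
                  and math.dist(center, prev) > max_jump_px)
        if start is not None and (center is None or jumped):
            if idx - start >= min_len:  # discard clips that are too short
                clips.append((start, idx))
            start = None
        if center is not None and start is None:
            start = idx  # begin a new clip at the first usable frame
        prev = center
    return clips


centers = [(10, 10), (11, 10), (11, 11), None,
           (12, 11), (40, 40), (41, 40), (41, 41), (42, 41)]
clips = split_usable_clips(centers, max_jump_px=5.0, min_len=3)
```

In the example the invisible frame ends the first clip, and the large jump at frame 5 discards the one-frame fragment before it and opens a second clip.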

TPQI exemplifies a non-learning temporal screening metric. LGN and V1 perceptual features are reduced with PCA to low-dimensional vectors $\mathbf{x}_i\in\mathbb{R}^d$, arranged as a trajectory, and analyzed through local curvature and compactness. Because it is completely blind and requires no reference video, it functions as a screening score for temporal instability, frame-to-frame inconsistency, or motion-related degradation in naturalistic videos (Liao et al., 2022).

In ASD screening, temporal modeling occurs after 3D feature extraction. DECA-based features are fed to GRU or LSTM models, and the resulting hidden states are aggregated by temporal max-pooling over time steps before dropout and softmax classification. The use of max-pooling is explicitly motivated by the need to capture salient brief behavioral moments such as sudden gaze or head shifts, fleeting atypical expressions, and transient repetitive movements (Qadir et al., 3 Jun 2026). That design places TVS close to event-sensitive sequence classification.
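The motivation for max-pooling over time can be seen in a small sketch that contrasts it with mean-pooling on a sequence of recurrent hidden states. The `valid_length` argument, which excludes padded time steps, is an assumption added here; the function names and toy values are illustrative only.

```python
from typing import List, Optional, Sequence


def temporal_max_pool(hidden_states: Sequence[Sequence[float]],
                      valid_length: Optional[int] = None) -> List[float]:
    """Elementwise maximum over time steps, ignoring padded steps if given."""
    steps = hidden_states if valid_length is None else hidden_states[:valid_length]
    if not steps:
        raise ValueError("at least one valid time step is required")
    return [max(column) for column in zip(*steps)]


def temporal_mean_pool(hidden_states: Sequence[Sequence[float]],
                       valid_length: Optional[int] = None) -> List[float]:
    """Elementwise mean over the same valid time steps, for comparison."""
    steps = hidden_states if valid_length is None else hidden_states[:valid_length]
    return [sum(column) / len(steps) for column in zip(*steps)]


# Three real steps followed by one padded step with arbitrary values.
hidden = [[0.1, -0.2], [0.9, 0.0], [0.2, 0.3], [5.0, 5.0]]
pooled_max = temporal_max_pool(hidden, valid_length=3)
pooled_mean = temporal_mean_pool(hidden, valid_length=3)
```

The brief activation of 0.9 at the second step survives max-pooling unchanged but is diluted by the mean. Without the length mask, the padded step would dominate the maximum.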

Gaze-based PD screening does not use end-to-end sequence models, but it remains temporally structured. Classic oculomotor features and HDA-derived features are extracted for each exploration and each eye, then combined with SVM-RBF classifiers and a Mixture of Experts ensemble that fuses outputs across tests and eyes (Alcala-Durand et al., 1 Sep 2025). The temporal information is compressed into dwell, return, and scanpath-coverage summaries rather than preserved in raw sequence form.

The EEG/AR ocular timing framework is explicitly alignment-oriented. After RDWT-based denoising, temporal and spatial convolution, and Conv-LSTM decoding, sliding-window predictions are validated by Pearson correlation, and DTW is used with a 50-sample constraint window to quantify timing mismatch between EEG-derived features and target trajectory (Sarkar et al., 14 May 2026). Among the systems considered here, this one is the most direct instance of TVS as temporal alignment rather than classification.
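A constrained DTW of the kind described can be sketched with a Sakoe-Chiba band, which allows sample i of one series to match sample j of the other only when |i - j| stays within the window. The absolute-difference local cost, the mean-lag summary, and the function name `dtw_banded` are assumptions for illustration; both inputs are assumed non-empty.

```python
import math
from typing import List, Sequence, Tuple


def dtw_banded(a: Sequence[float], b: Sequence[float],
               window: int) -> Tuple[float, List[Tuple[int, int]]]:
    """Return (total cost, warping path) under a band constraint."""
    n, m = len(a), len(b)
    window = max(window, abs(n - m))  # the band must reach the end corner
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            best_prev = min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
            cost[i][j] = abs(a[i - 1] - b[j - 1]) + best_prev
    # Backtrack from the end corner along the cheapest predecessors.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


target = [0, 0, 1, 2, 1, 0, 0, 0]
response = [0, 0, 0, 0, 1, 2, 1, 0]  # same shape, delayed by two samples
dist_wide, path_wide = dtw_banded(target, response, window=3)
dist_narrow, _ = dtw_banded(target, response, window=1)
mean_lag = sum(j - i for i, j in path_wide) / len(path_wide)
```

With a band of three samples the delayed response aligns at zero cost and the path shows a positive mean lag. Narrowing the band to one sample forbids the two-sample shift, so the distance becomes positive. The window therefore caps how much timing mismatch the alignment is allowed to absorb.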

4. Evaluation paradigms and empirical evidence

Empirical results show that temporal modeling is often not merely an architectural preference but the dominant source of performance.

In event-based eye tracking, TDTracker reports state-of-the-art results on SEET with a frame-based representation, 3.248M parameters, and 318M FLOPs. On the real 3ET+ 2025 data, it runs with 1.7923 ms inference time on an RTX 4090; the paper also states that TDTracker secured third place in the CVPR 2025 Event-Based Eye Tracking Challenge (Ren et al., 31 Mar 2025). The ablations are especially revealing: removing Implicit-conv degrades both accuracy and MSE, and removing Mamba also lowers performance. This establishes short-term 3D convolutional temporal modeling as foundational and Mamba as a meaningful long-range gain.

In blind temporal quality assessment, TPQI alone achieves SRCC 0.556 on KoNViD-1k, 0.636 on LIVE-VQC, 0.413 on CVD2014, and 0.111 on YouTube-UGC. When fused with NIQE using product fusion, it reaches SRCC 0.693 on KoNViD-1k, 0.718 on LIVE-VQC, 0.524 on CVD2014, and 0.230 on YouTube-UGC (Liao et al., 2022). The method is notably stronger on LIVE-VQC, which the paper identifies as containing more temporal variation, and weaker on YouTube-UGC because many videos violate the natural-video trajectory assumption.

In 3D temporal ASD screening, the best unimodal model is 3D head pose with GRU at 83.9% accuracy, while the best overall configuration is 3D fusion with LSTM and PCA at 84.6% accuracy. The paper explicitly states that 3D head pose outperformed 2D head pose by 10.7% accuracy and 3D facial features outperformed 2D facial features by 7.5% (Qadir et al., 3 Jun 2026). These results support the view that translational head displacement and temporally pooled 3D dynamics carry more screening information than static or 2D descriptors.

In gaze-based PD screening, the best single exploration model is Expl. 2, and the selected MoE ensemble is evaluated by cross-validated patient-level AUC (Alcala-Durand et al., 1 Sep 2025). The held-out test performance is reported inconsistently in the paper: the abstract and discussion report AUC 0.95, whereas the validation section reports held-out AUC 0.93 with F1 0.71, sensitivity 0.56, and specificity 1.00. What is stable across sections is the claim that ensemble models outperform single-test models and that visual exploration carries substantial screening signal.

The EEG/AR timing framework does not report disease classification, but it supplies evidence that temporal alignment metrics can distinguish subjects. Sliding-window predictions are retained only when Pearson correlation is at least 0.5; DTW-derived metrics show significant inter-subject differences across all four VOMS tasks by Mann-Whitney U tests; and pursuit tasks yield post-alignment cross-correlation peaks around 0.6–0.7, with reactive lag behavior, whereas saccades show more anticipatory responses (Sarkar et al., 14 May 2026). This implies that not all temporally structured visual tasks are equally suitable for screening.
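The correlation gate on sliding-window predictions can be sketched as below. Only the 0.5 threshold comes from the text. The function names, the toy windows, and the choice to treat a constant window as uninformative are assumptions.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant window
    return cov / math.sqrt(var_x * var_y)


def retain_windows(predicted: List[Sequence[float]],
                   target: List[Sequence[float]],
                   threshold: float = 0.5) -> List[int]:
    """Indices of windows whose prediction correlates with the target."""
    return [i for i, (p, t) in enumerate(zip(predicted, target))
            if pearson(p, t) >= threshold]


target_windows = [[1, 2, 3, 4]] * 3
predicted_windows = [[1, 2, 3, 4], [4, 3, 2, 1], [1, 3, 2, 4]]
kept = retain_windows(predicted_windows, target_windows)
```

Here the first window matches exactly, the second is anti-correlated and is rejected, and the third correlates at 0.8 and is kept. Downstream alignment metrics would then be computed only on the retained windows.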

Retinal stabilization work provides a different type of evidence. The objective evaluation uses variance of optical flow as a proxy for residual jitter, and the proposed method is reported to achieve the least variance among compared methods. The subjective study recruited 25 subjects with varying expertise levels from four clinics and found that the stabilized videos produced by the method were preferred more often for SVP observation (Sheng et al., 2023). The claim here is not superior diagnosis per se, but improved observability of a temporally subtle diagnostic event.

5. Relation to adjacent areas and terminology

Temporal Visual Screening overlaps with several adjacent research areas but is not identical to them. Event-based foundation-model work such as TGVFM integrates VFMs with Long-Range Temporal Attention, Dual Spatiotemporal Attention, and Deep Feature Guidance Mechanism, and reports relative improvements of 16% for semantic segmentation, 21% for depth estimation, and 16% for object detection in event-based vision (Xia et al., 9 Nov 2025). This is not a screening system, but it demonstrates a reusable pattern for TVS: pretrained spatial semantics augmented by explicit temporal fusion. Vi-ST similarly combines a frozen DINOv2 prior with causal multiscale temporal modules to align dynamic natural scenes with retinal ganglion cell responses, and shows markedly better cross-video generalization than I3D+MSTCN or DINOv2+MSTCN baselines (Wu et al., 2024). Such systems are best regarded as temporal representation engines that can support downstream screening.

Video synthesis and view synthesis are also adjacent rather than equivalent. Event-driven Video Frame Synthesis reconstructs a latent high-framerate video tensor from low-speed frames and event streams through a differentiable forward model and residual denoising, supporting interpolation, prediction, and motion deblur (Wang et al., 2019). DeCOMPnet for temporal view synthesis of dynamic scenes decomposes motion into global camera motion and local object motion in multi-plane-image space, then extrapolates local 3D motion to future views (Somraj et al., 2022). Translation-based Video Synthesis surveys domain translation methods whose main challenge is temporal coherence under appearance transformation (Saha et al., 2024). These fields matter to TVS because they supply temporal reconstruction, stabilization, and consistency mechanisms, but their primary objective is content generation rather than screening or triage.

Community retinal screening systems such as ECVS are clinically closer to screening but currently non-temporal. ECVS uses RETQA, PVI, EDD, and VLR on single-visit retinal photographs, achieving AUC 0.98 for RETQA, 0.95 for PVI, 0.90 for EDD, and Dice 0.48 for VLR (Lei et al., 2024). The work is directly relevant as a modular baseline for longitudinal TVS, but it does not model progression or repeat-visit temporal change.

The acronym “TVS” is also heavily overloaded in the literature and must be disambiguated. In astrophysical spectroscopy, TVS denotes Temporal Variance Spectrum and smTVS denotes smoothed Temporal Variance Spectrum (Kholtygin et al., 2016). In pulsed-power engineering, TVS denotes Triggered Vacuum Switch (Park et al., 2016). In high-dimensional statistics, TVS denotes two-stage variable selection (Wang et al., 2017). In graphics and video generation, TVS can denote temporal view synthesis (Somraj et al., 2022) or, more broadly, Translation-based Video Synthesis (Saha et al., 2024). These meanings are terminologically distinct from Temporal Visual Screening and should not be conflated.

6. Limitations and open directions

Current TVS-related work is technically strong but fragmented. Several systems expose implementation ambiguities that complicate replication or deployment. TDTracker does not fully specify Binary Map construction parameters, exact clip duration, augmentation policy, or a formal branch-fusion equation, and its Frequency-aware Module depends on sequence length strongly enough that it was removed in the competition setting (Ren et al., 31 Mar 2025). TPQI is primarily a whole-video metric, so segment localization of temporal degradation remains an adaptation rather than a built-in capability, and performance drops on unnatural content such as animation and gaming videos (Liao et al., 2022). The retinal stabilization paper gives procedural descriptions of frame rejection and template selection but omits exact threshold equations, runtime measurements, and quantitative verification that SVP amplitude is preserved after stabilization (Sheng et al., 2023).

Clinical and behavioral TVS systems are limited by cohort size, external validation, and uncertainty handling. The ASD screening framework uses only 39 participants, has no external validation, and leaves several methodological details unspecified, including PCA settings, optimizer, and masking of padded time steps (Qadir et al., 3 Jun 2026). The PD screening study claims a strong held-out AUC, but its more detailed validation section reports a sensitivity of only 0.56, and the held-out AUC itself is reported inconsistently as 0.93 and 0.95 (Alcala-Durand et al., 1 Sep 2025). The EEG-based ocular timing framework currently includes only two healthy controls, no mTBI cohort, and no direct eye-tracker latency ground truth, so its DTW metrics are better interpreted as proof-of-concept neuro-ocular alignment measures than validated diagnostic biomarkers (Sarkar et al., 14 May 2026).

A broader limitation is that some stream-based screening systems remain temporal only operationally, not algorithmically. Automated thermal COVID screening processes thermal video streams in public spaces, but its inference pipeline is framewise, with no explicit tracking or temporal aggregation despite the continuous deployment setting (Katte et al., 2022). Conversely, some temporal-modeling papers provide strong methodology without explicit screening tasks, as in Vi-ST or TGVFM (Wu et al., 2024, Xia et al., 9 Nov 2025). The field therefore contains both screening systems that underuse temporal structure and temporal systems that stop short of screening.

These limitations collectively suggest several open directions: variable-length and sequence-agnostic temporal modules; stronger subject-disjoint and longitudinal evaluation; explicit temporal localization rather than only global scores; calibrated confidence and uncertainty-aware rejection; robust multimodal fusion across vision, gaze, and neurophysiology; and better integration of preprocessing, temporal inference, and decision support into end-to-end screening workflows. Across the current literature, the most durable conclusion is that screening quality improves when time is treated as a primary signal dimension rather than a nuisance variable.