YT360-EyeTracking Dataset for 360° Saliency Research

Updated 14 September 2025
  • YT360-EyeTracking Dataset is a comprehensive 360° video resource featuring controlled audio and color conditions designed to study visual saliency.
  • It comprises 81 omnidirectional clips across nine semantic classes, captured with high-fidelity eye-tracking in immersive VR environments.
  • The dataset underpins advanced models like SalViT360 and SalViT360-AV, demonstrating improved fixation prediction through multimodal integration.

The YT360-EyeTracking Dataset is a large-scale, multi-condition eye-tracking resource designed specifically for research in visual saliency prediction within omnidirectional (360-degree) video environments. Developed to address the scarcity of comprehensive datasets supporting 360-degree audio-visual saliency modeling, it enables evaluation and development of advanced algorithms that account for spherical distortions and multi-modal (audio-visual) attention drivers.

1. Dataset Composition and Structure

The YT360-EyeTracking Dataset comprises 81 omnidirectional video (ODV) clips sourced from the YouTube-360 collection, organized into nine audio-visual semantic classes (nine videos per class). Each clip is 30 seconds long, formatted in equirectangular projection (ERP) at 3840×1920 resolution, and captured at 24–30 frames per second. Videos are provided under three audio conditions (mute, mono, and full spatial ambisonic audio) and in both colored and grayscale versions, enabling controlled studies of color and auditory attention cues.
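
For illustration, the stimulus matrix can be enumerated as in the sketch below. The identifiers, category indices, and the assumption of a full audio × color cross are placeholders, not the dataset's actual file layout or condition pairing.

```python
from itertools import product

# Hypothetical enumeration of the YT360-EyeTracking stimulus matrix.
# Names and the full audio x color cross are illustrative assumptions.
N_CATEGORIES = 9          # audio-visual semantic classes
VIDEOS_PER_CATEGORY = 9   # 9 x 9 = 81 omnidirectional clips
AUDIO_CONDITIONS = ("mute", "mono", "ambisonic")
COLOR_CONDITIONS = ("color", "grayscale")

stimuli = []
for cat, vid, audio, color in product(
    range(N_CATEGORIES), range(VIDEOS_PER_CATEGORY),
    AUDIO_CONDITIONS, COLOR_CONDITIONS,
):
    stimuli.append({
        "category": cat,
        "clip_id": f"cat{cat:02d}_vid{vid:02d}",  # placeholder identifier
        "audio": audio,
        "color": color,
        "resolution": (3840, 1920),               # ERP width x height
        "duration_s": 30,
    })

print(len(stimuli))  # 81 clips x 3 audio x 2 color = 486 stimulus conditions
```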

Each clip is viewed by at least 15 participants per condition, for a total of 102 unique, gender-balanced subjects. Ground-truth fixations are recorded under immersive VR conditions using high-fidelity hardware (an HTC Vive Pro-Eye with integrated Tobii eye tracking), capturing naturalistic gaze behavior. Gaze events are annotated with a standardized fixation detection protocol (I-DT, dispersion threshold of 1.5 degrees of visual angle, minimum duration 0.1 s), followed by resolution normalization.
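
A minimal sketch of I-DT fixation detection with these parameters is shown below. Gaze samples are assumed to be pre-converted to visual degrees, and the great-circle handling needed at the horizontal wrap-around of 360° content is omitted for brevity.

```python
import numpy as np

def idt_fixations(gaze_deg, timestamps, dispersion_thresh=1.5, min_duration=0.1):
    """Detect fixations with the I-DT (dispersion threshold) algorithm.

    gaze_deg   : (N, 2) array of gaze positions in visual degrees
    timestamps : (N,) array of sample times in seconds
    Returns a list of (t_start, t_end, centroid) tuples.
    """
    fixations = []
    n = len(timestamps)
    start = 0
    while start < n:
        # Grow an initial window spanning at least the minimum duration.
        end = start
        while end < n and timestamps[end] - timestamps[start] < min_duration:
            end += 1
        if end >= n:
            break
        window = gaze_deg[start:end + 1]
        dispersion = np.ptp(window[:, 0]) + np.ptp(window[:, 1])
        if dispersion <= dispersion_thresh:
            # Extend the window while dispersion stays under the threshold.
            while end + 1 < n:
                cand = gaze_deg[start:end + 2]
                if np.ptp(cand[:, 0]) + np.ptp(cand[:, 1]) > dispersion_thresh:
                    break
                end += 1
            window = gaze_deg[start:end + 1]
            fixations.append((timestamps[start], timestamps[end],
                              window.mean(axis=0)))
            start = end + 1
        else:
            start += 1
    return fixations
```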

2. Data Collection Methodology

Participants followed a structured free-viewing protocol in a virtual reality environment. Before each session, personalized eye calibration ensured tracking accuracy. Each participant viewed a pseudo-randomized sequence of 27 video stimuli (30 seconds each), with 10-second black screens interleaved to minimize inter-trial carry-over. Sessions were organized into three blocks separated by 5-minute breaks to prevent fatigue, and the experimental design ensured that each subject saw each video only once and under a single audio-visual configuration.
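
For illustration, a hypothetical per-participant assignment routine along these lines might look as follows; the concrete counterbalancing scheme used in the study is not reproduced here.

```python
import random

def assign_session(participant_id, all_videos, seed=0):
    """Hypothetical sketch of per-participant stimulus assignment.

    Each participant views 27 of the 81 clips exactly once, split into
    three blocks of nine trials with a 10 s black screen between trials.
    The condition rotation below is a placeholder, not the study's
    actual counterbalancing scheme.
    """
    rng = random.Random(seed + participant_id)
    clips = rng.sample(all_videos, 27)            # each clip at most once
    audio_cycle = ("mute", "mono", "ambisonic")
    color_cycle = ("color", "grayscale")
    trials = []
    for i, clip in enumerate(clips):
        trials.append({
            "clip": clip,
            "audio": audio_cycle[(participant_id + i) % 3],
            "color": color_cycle[(participant_id + i) % 2],
            "block": i // 9,                      # three blocks of nine
            "inter_trial_blank_s": 10,
        })
    return trials
```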

Pre-processing procedures included bicubic interpolation to ensure consistent resolution across videos, and in-pipeline generation of grayscale stimuli to examine chromatic contributions to saliency. All stimuli and corresponding eye-tracking data were synchronized to the frame level.
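
A minimal sketch of this per-frame preprocessing is given below, assuming an OpenCV-based pipeline; the actual tooling used for the dataset is not specified here.

```python
import cv2

def preprocess_frame(frame_bgr, target_size=(3840, 1920), to_gray=False):
    """Resize a frame to the common ERP resolution and optionally derive
    the grayscale stimulus variant from the same frame, so that only
    chromatic information differs between the two versions."""
    frame = cv2.resize(frame_bgr, target_size, interpolation=cv2.INTER_CUBIC)
    if to_gray:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Replicate to three channels so color and grayscale stimuli
        # share the same tensor shape downstream.
        frame = cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)
    return frame
```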

3. Audio-Visual Conditions and Annotations

A distinctive feature of YT360-EyeTracking is its systematic manipulation of audio modality and color for each video scene:

  • Audio: Three settings—no audio (mute), mono audio, and first-order ambisonics (spatial audio).
  • Color: Two settings—colored and grayscale.
  • Semantic Categories: Nine categories, each reflecting different real-world attention dynamics.

The resulting matrix enables factorial analyses of auditory and visual features on attention allocation. For every video, at least 15 participants contribute fixation data, which are processed to create per-frame, per-condition ground-truth saliency maps.
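
As a sketch of how such per-frame ground-truth maps are commonly derived from pooled fixations, the snippet below accumulates fixation positions and applies a Gaussian blur that wraps around the horizontal (longitude) axis of the equirectangular frame; the Gaussian bandwidth and any spherical weighting used for this dataset are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_density_map(fixations_xy, width=3840, height=1920, sigma_deg=3.5):
    """Build a ground-truth saliency map for one frame and one condition.

    fixations_xy : iterable of (x, y) pixel coordinates of all observers'
                   fixations on this frame.
    sigma_deg    : Gaussian width in visual degrees (illustrative value).
    """
    fix_map = np.zeros((height, width), dtype=np.float32)
    for x, y in fixations_xy:
        fix_map[int(y) % height, int(x) % width] += 1.0
    sigma_px = sigma_deg / 360.0 * width          # degrees -> ERP pixels
    sal = gaussian_filter(fix_map, sigma=sigma_px, mode=("nearest", "wrap"))
    return sal / (sal.max() + 1e-8)               # normalize to [0, 1]
```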

4. Saliency Prediction Models and Methodological Advances

The dataset directly supports the development and evaluation of 360° saliency prediction architectures. The accompanying paper introduces SalViT360 (visual-only) and SalViT360-AV (audio-visual), both leveraging vision transformer backbones adapted for spherical geometry.

  • SalViT360: Employs a gnomonic projection to produce tangent (locally undistorted) viewports, which are encoded and processed by a transformer with spherical geometry-aware positional encodings.
  • Viewport Spatio-Temporal Attention (VSTA): Implements a two-phase self-attention block—first aggregating temporal context within each viewport, then spatial context between viewports:

$\text{VSTA}(z^{(l)}_{t,f}) = \text{VSA}(\text{VTA}(z^{(l)}_{t,f}))$

where $\text{VTA}(\cdot)$ denotes viewport-temporal attention and $\text{VSA}(\cdot)$ denotes viewport-spatial attention.
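
A minimal PyTorch sketch of such a two-phase block is shown below; the tensor layout, head count, and normalization placement are illustrative assumptions rather than the published implementation.

```python
import torch
import torch.nn as nn

class VSTA(nn.Module):
    """Two-phase viewport spatio-temporal attention (illustrative sketch).

    Input shape: (B, T, V, N, D) = batch, frames, viewports,
    patch tokens per viewport, embedding dimension.
    """

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.vta = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)

    def forward(self, z):
        B, T, V, N, D = z.shape
        # VTA: attend over time within each viewport (per patch position).
        zt = z.permute(0, 2, 3, 1, 4).reshape(B * V * N, T, D)
        q = self.norm_t(zt)
        zt = zt + self.vta(q, q, q, need_weights=False)[0]
        z = zt.reshape(B, V, N, T, D).permute(0, 3, 1, 2, 4)
        # VSA: attend across viewports at each time step and patch position.
        zs = z.permute(0, 1, 3, 2, 4).reshape(B * T * N, V, D)
        q = self.norm_s(zs)
        zs = zs + self.vsa(q, q, q, need_weights=False)[0]
        return zs.reshape(B, T, N, V, D).permute(0, 1, 3, 2, 4)
```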

  • SalViT360-AV: Introduces a modality fusion pipeline by associating each viewport with a rotated first-order ambisonic (FOA) audio stream, then encoding audio features using a pre-trained backbone (PaSST), and merging via transformer adapters.

Fusion between visual and audio tokens is realized through a parameter-efficient adapter structure:

$\hat{z}_{av,t} = \text{ReLU}(\text{LN}(\bar{z}_{av,t})W_{down})W_{up}$

$z_{av,t} = \text{MLP}(\bar{z}_{av,t}) + s \cdot \hat{z}_{av,t} + \bar{z}_{av,t}$

where $W_{down}$ and $W_{up}$ are down- and up-projection matrices, $s$ is a scaling constant, LN denotes layer normalization, and MLP is a position-wise feed-forward network.
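
A compact PyTorch sketch of this adapter, following the two equations above, is given below; the embedding dimension, bottleneck width, and scaling constant are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioVisualAdapter(nn.Module):
    """Parameter-efficient adapter fusion (illustrative sketch).

    Implements  z_hat = ReLU(LN(z_bar) W_down) W_up
                z_out = MLP(z_bar) + s * z_hat + z_bar
    for the fused audio-visual tokens z_bar.
    """

    def __init__(self, dim=768, bottleneck=64, scale=0.5):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.down = nn.Linear(dim, bottleneck)      # W_down
        self.up = nn.Linear(bottleneck, dim)        # W_up
        self.mlp = nn.Sequential(                   # position-wise feed-forward
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.scale = scale                          # s

    def forward(self, z_bar):
        z_hat = self.up(torch.relu(self.down(self.norm(z_bar))))
        return self.mlp(z_bar) + self.scale * z_hat + z_bar
```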

5. Benchmarking and Quantitative Evaluation

Multiple metrics are used for model benchmarking, including Normalized Scanpath Saliency (NSS), KL-Divergence (KLD), Pearson’s Correlation Coefficient (CC), and Similarity Metric (SIM) computed against ground-truth fixation density maps. The dataset itself is split into canonical training and test splits for reproducible evaluation.
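
The minimal NumPy implementations below follow the standard definitions of these metrics in saliency benchmarking; the dataset's official evaluation code may differ in details such as equator-biased weighting for equirectangular frames.

```python
import numpy as np

def nss(sal_map, fix_map):
    """Normalized Scanpath Saliency: mean z-scored saliency at fixated pixels."""
    s = (sal_map - sal_map.mean()) / (sal_map.std() + 1e-8)
    return s[fix_map > 0].mean()

def kld(pred, gt, eps=1e-8):
    """KL divergence of the ground-truth density from the predicted density."""
    p = pred / (pred.sum() + eps)
    q = gt / (gt.sum() + eps)
    return np.sum(q * np.log(eps + q / (p + eps)))

def cc(pred, gt):
    """Pearson's correlation coefficient between two saliency maps."""
    return np.corrcoef(pred.ravel(), gt.ravel())[0, 1]

def sim(pred, gt, eps=1e-8):
    """Similarity metric: histogram intersection of the normalized maps."""
    p = pred / (pred.sum() + eps)
    q = gt / (gt.sum() + eps)
    return np.minimum(p, q).sum()
```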

Results indicate that SalViT360 and SalViT360-AV consistently outperform both prior visual-only and 2D audio-visual saliency models. SalViT360-AV in particular yields higher NSS and CC and lower KLD when spatial audio is present, underlining the importance of audio-visual coupling in attention prediction for 360° content. A plausible implication is that auditory information is indispensable for accurately localizing human fixations in complex omnidirectional environments.

6. Spatial Audio Integration and Implications

Spatial (ambisonic) audio is integrated by rotating each FOA channel to simulate the direction-specific audio heard from distinct viewports, followed by decoding to the mono signals used in the audio fusion pipeline. This mechanism enables the study of attention shifts evoked by directionally localized sound sources, providing an empirical basis for cross-modal saliency research.
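
As an illustration of this step, the sketch below rotates an FOA stream about the vertical axis and decodes a front-facing mono signal for a viewport at a given yaw. Channel ordering (ACN: W, Y, Z, X), normalization, rotation sign, and decoding gains are assumptions, since the exact conventions are not specified here.

```python
import numpy as np

def rotate_foa_yaw(foa, yaw_rad):
    """Rotate a first-order ambisonic signal about the vertical axis.

    foa : (4, N) array in ACN channel order (W, Y, Z, X); SN3D assumed.
    Rotating by the viewport's yaw brings that direction to the front;
    sign conventions vary between toolchains, so treat this as a sketch.
    """
    w, y, z, x = foa
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    x_rot = c * x + s * y
    y_rot = -s * x + c * y
    return np.stack([w, y_rot, z, x_rot])

def decode_front_mono(foa_rotated):
    """Decode the rotated FOA stream with a front-facing virtual cardioid,
    giving the mono signal 'heard' from that viewport (gains are a common
    choice, not necessarily those used for the dataset)."""
    w, _, _, x = foa_rotated
    return 0.5 * (w + x)
```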

The observed improvements in audio-visual models highlight the role of spatial audio cues in directing attention, suggesting that a multimodal approach is necessary for realistic modeling of perceptual prioritization in VR.

7. Applications and Future Directions

The YT360-EyeTracking Dataset and its associated models are foundational assets in several domains:

  • Foveated rendering and adaptive streaming: Saliency predictions enable perceptual optimization of computational resources.
  • Video compression and quality assessment: Perceptually salient regions can be prioritized in codec resource allocation.
  • Immersive analytics and interface design: Data-driven attention maps can inform layout and interaction strategies in VR.
  • Cross-modal research: The controlled multimodal structure allows for hypothesis testing on the interplay of visual and auditory stimuli in attention.

By providing controlled permutations over audio and visual cues, high subject diversity, richly annotated ground truth, and a protocolized acquisition methodology, YT360-EyeTracking defines a comprehensive benchmark for the study and development of advanced saliency models, promoting reproducibility and rigorous comparison within the research community (Cokelek et al., 27 Aug 2025).
