YT360-EyeTracking: 360° Saliency Dataset & Models
- YT360-EyeTracking is a comprehensive multimodal benchmark featuring a high-resolution dataset and deep learning models tailored for saliency prediction in immersive 360° videos.
- It employs a novel transformer architecture with spherical geometry-aware embeddings and a Viewport Augmentation Consistency (VAC) loss to address equirectangular projection (ERP) distortions and enforce consistency across overlapping viewports.
- The system integrates spatial audio via lightweight adapter modules, enhancing cross-modal fusion and improving overall predictive accuracy in immersive environments.
YT360-EyeTracking refers to both a cutting-edge multimodal dataset and a set of deep learning models designed to advance audio-visual saliency prediction in 360-degree (omnidirectional) video environments (Cokelek et al., 27 Aug 2025). These systems address the complexities of spherical video geometry and the integration of spatial audio, establishing new benchmarks and architectures for understanding and predicting human attention in immersive VR scenes.
1. Dataset Construction and Structure
The YT360-EyeTracking dataset comprises 81 omnidirectional videos (ODVs), each 30 seconds in duration and formatted in high-resolution ERP (3840×1920). Sourced from YouTube-360, the selection ensures compositional diversity and technical consistency. For each video, three audio conditions are provided: mute, mono, and first-order ambisonics (spatial audio), alongside two color conditions (color and grayscale) to enable controlled analysis of chromatic and luminance cues.
Eye-tracking data were acquired from 102 participants, each viewing clips under randomized audio-visual configurations. Every video is observed by at least 15 subjects, yielding robust fixation maps for benchmarking cross-subject attentional consistency. Spatial audio is decoded and rotated per viewport using spherical harmonic matrices to align audio features with the corresponding visual field. This design provides fine granularity for studying the interaction of audio and visual cues in directing viewer gaze.
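For context, the viewport-dependent rotation of first-order ambisonics (FOA) is straightforward to sketch: the zeroth-order W channel is rotation-invariant, while the three first-order channels transform like a Cartesian direction vector. The snippet below is a minimal illustration (not the dataset's own tooling) and assumes a [W, X, Y, Z] channel ordering; ACN/SN3D recordings would need the channels permuted accordingly.

```python
import numpy as np

def rotation_matrix(yaw: float, pitch: float) -> np.ndarray:
    """3x3 rotation: yaw about the vertical (z) axis, then pitch about the y axis."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    r_yaw = np.array([[cy, -sy, 0.0],
                      [sy,  cy, 0.0],
                      [0.0, 0.0, 1.0]])
    r_pitch = np.array([[cp, 0.0, sp],
                        [0.0, 1.0, 0.0],
                        [-sp, 0.0, cp]])
    return r_pitch @ r_yaw

def rotate_foa(foa: np.ndarray, yaw: float, pitch: float) -> np.ndarray:
    """Rotate a first-order ambisonic signal toward a viewport centred at (yaw, pitch).

    foa: array of shape (4, num_samples) with channels ordered [W, X, Y, Z].
    W (order 0) is unchanged by rotation; the order-1 channels (X, Y, Z)
    transform with the same 3x3 matrix as a Cartesian direction vector.
    """
    out = foa.copy()
    out[1:4] = rotation_matrix(-yaw, -pitch) @ foa[1:4]   # counter-rotate the sound field
    return out

# Example: steer the sound field for a viewport centred 90 degrees to the right.
foa = np.random.randn(4, 48000).astype(np.float32)        # 1 s of FOA audio at 48 kHz
foa_viewport = rotate_foa(foa, yaw=np.pi / 2, pitch=0.0)
```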
2. Model Architectures: SalViT360 and SalViT360-AV
SalViT360 is a vision transformer tailored for audio-visual saliency prediction in omnidirectional video:
- Input ERP frames are gnomonic-projected onto tangent-plane viewports, yielding locally undistorted visual fields (a projection sketch follows this list).
- An encoder–transformer–decoder structure processes temporal and spatial dependencies.
- Spherical geometry–aware positional embeddings encode the 3D structure of equirectangular video, modulating the spatial attention computations to account for inherent ERP distortions.
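To make the tangent-viewport idea concrete, the following is a minimal sketch (not the released SalViT360 code) of inverse gnomonic projection: a square tangent-plane grid centred at a chosen longitude/latitude is mapped back to spherical coordinates and sampled from the ERP frame with nearest-neighbour lookup. The function name, field of view, and output size are illustrative choices.

```python
import numpy as np

def gnomonic_viewport(erp: np.ndarray, lon0: float, lat0: float,
                      fov: float = np.deg2rad(80.0), size: int = 224) -> np.ndarray:
    """Sample a locally undistorted tangent-plane viewport from an ERP frame.

    erp:        equirectangular image, shape (H, W, C)
    lon0, lat0: viewport centre in radians (lon in [-pi, pi], lat in [-pi/2, pi/2])
    fov:        field of view of the square viewport
    size:       output resolution (size x size)
    """
    H, W = erp.shape[:2]
    half = np.tan(fov / 2.0)                      # half-extent of the tangent plane
    x, y = np.meshgrid(np.linspace(-half, half, size),
                       np.linspace(half, -half, size))

    # Inverse gnomonic projection: tangent-plane coordinates -> spherical coordinates.
    rho = np.sqrt(x ** 2 + y ** 2)
    c = np.arctan(rho)
    sin_c, cos_c = np.sin(c), np.cos(c)
    with np.errstate(invalid="ignore", divide="ignore"):
        y_term = np.where(rho > 0, y * sin_c * np.cos(lat0) / rho, 0.0)
    lat = np.arcsin(np.clip(cos_c * np.sin(lat0) + y_term, -1.0, 1.0))
    lon = lon0 + np.arctan2(x * sin_c,
                            rho * np.cos(lat0) * cos_c - y * np.sin(lat0) * sin_c)

    # Spherical coordinates -> ERP pixel indices (nearest-neighbour sampling).
    u = ((lon / (2 * np.pi) + 0.5) % 1.0) * (W - 1)
    v = (0.5 - lat / np.pi) * (H - 1)
    return erp[np.round(v).astype(int), np.round(u).astype(int)]

# Example: a 224x224 viewport centred 45 degrees east of the image centre, on the equator.
frame = np.random.rand(1920, 3840, 3)
patch = gnomonic_viewport(frame, lon0=np.deg2rad(45.0), lat0=0.0)
```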
The Viewport Spatio-Temporal Attention (VSTA) module applies attention in a fixed order: first, viewport temporal attention (VTA) among the frames of a given viewport; then, viewport spatial attention (VSA) across viewports at a given time step:

$\operatorname{VSTA}(\mathbf{z}) = \operatorname{VSA}\!\big(\operatorname{VTA}(\mathbf{z})\big),$

where VTA and VSA are defined via the standard query-key-value attention mechanism, modulated by the spherical geometry-aware embeddings.
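In PyTorch-style pseudocode, this temporal-then-spatial factorisation can be sketched roughly as follows (a simplified illustration rather than the released implementation: each viewport is collapsed to a single token, and the spherical positional embedding is abstracted as an additive term):

```python
import torch
import torch.nn as nn

class ViewportSpatioTemporalAttention(nn.Module):
    """Factorised attention over tokens shaped (batch, time, viewport, dim):
    temporal attention within each viewport (VTA), then spatial attention
    across viewports at each time step (VSA)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor, sphere_pos: torch.Tensor) -> torch.Tensor:
        B, T, V, D = tokens.shape
        x = tokens + sphere_pos                       # spherical geometry-aware positional term

        # VTA: attend over the T frames of each viewport independently.
        xt = x.permute(0, 2, 1, 3).reshape(B * V, T, D)
        xt = xt + self.temporal_attn(xt, xt, xt, need_weights=False)[0]
        x = xt.reshape(B, V, T, D).permute(0, 2, 1, 3)

        # VSA: attend over the V viewports of each frame independently.
        xs = x.reshape(B * T, V, D)
        xs = xs + self.spatial_attn(xs, xs, xs, need_weights=False)[0]
        return xs.reshape(B, T, V, D)

# Example: 8 frames, 10 tangent viewports, 768-dimensional tokens.
vsta = ViewportSpatioTemporalAttention(dim=768)
z = torch.randn(2, 8, 10, 768)
pos = torch.randn(1, 1, 10, 768)                      # broadcast over batch and time
out = vsta(z, pos)                                    # (2, 8, 10, 768)
```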
Viewports are further regularized by the unsupervised Viewport Augmentation Consistency (VAC) loss, which promotes consistency in saliency map predictions from overlapping tangent projections.
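One plausible reading of such a consistency term, sketched below under the assumption that two differently augmented viewport sets are re-projected to a common ERP grid, is a masked squared difference over the overlapping region (the paper's exact formulation may differ):

```python
import torch

def viewport_consistency_loss(pred_a: torch.Tensor,
                              pred_b: torch.Tensor,
                              overlap_mask: torch.Tensor) -> torch.Tensor:
    """Penalise disagreement between two saliency maps of the same frame predicted
    from differently augmented (e.g. rotated or offset) tangent-viewport sets.

    pred_a, pred_b: predictions re-projected to the ERP grid, shape (B, 1, H, W)
    overlap_mask:   1 where both viewport sets cover the pixel, 0 elsewhere
    """
    diff = (pred_a - pred_b) ** 2 * overlap_mask
    return diff.sum() / overlap_mask.sum().clamp(min=1.0)

# Example with dummy tensors.
a, b = torch.rand(2, 1, 240, 480), torch.rand(2, 1, 240, 480)
mask = (torch.rand(2, 1, 240, 480) > 0.3).float()
loss = viewport_consistency_loss(a, b, mask)
```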
SalViT360-AV extends this architecture with transformer adapters that fuse spatial audio features. Ambisonic waveforms, rotated according to the viewport geometry, are encoded with a self-attention audio backbone (PaSST), and the resulting audio tokens are fused with the visual tokens through lightweight adapter layers. The adapters modulate only a small number of parameters and preserve visual-only predictions in the absence of audio.
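A hypothetical bottleneck-adapter layout with cross-attention from visual to audio tokens is sketched below; the zero-initialised up-projection makes the module start as an identity, which is one way to preserve visual-only behaviour when audio is absent. The class name, bottleneck width, and fusion pattern are illustrative assumptions, not the paper's exact design.

```python
from typing import Optional

import torch
import torch.nn as nn

class AudioVisualAdapter(nn.Module):
    """Bottleneck adapter that lets visual tokens attend to audio tokens.

    The up-projection is zero-initialised, so the module starts as an identity
    mapping and the visual-only prediction path is unchanged until audio is
    injected and the adapter is trained."""

    def __init__(self, dim: int, bottleneck: int = 64, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, visual: torch.Tensor, audio: Optional[torch.Tensor]) -> torch.Tensor:
        if audio is None:                                     # no audio stream: act as identity
            return visual
        attended, _ = self.cross_attn(visual, audio, audio)   # visual queries attend to audio
        return visual + self.up(self.act(self.down(attended)))

# Example: 196 visual tokens fused with 32 audio tokens of the same width.
adapter = AudioVisualAdapter(dim=768)
v, a = torch.randn(2, 196, 768), torch.randn(2, 32, 768)
fused = adapter(v, a)          # (2, 196, 768); adapter(v, None) returns v unchanged
```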
3. Experimental Evaluation and Quantitative Results
The models are validated on the YT360-EyeTracking dataset, as well as on other large benchmarks such as VR-EyeTracking, PVS-HMEM, and AVS-ODV. Evaluation metrics include the following (a minimal computation sketch is given after the list):
- Normalized Scanpath Saliency (NSS)
- Pearson’s Correlation Coefficient (CC)
- Kullback-Leibler Divergence (KLD)
- Similarity Metric (SIM)
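For reference, the four metrics can be computed from a predicted map, a ground-truth fixation density, and a binary fixation mask roughly as follows (standard saliency-benchmark definitions; dataset-specific implementations may normalise differently):

```python
import numpy as np

def nss(pred: np.ndarray, fixations: np.ndarray) -> float:
    """Normalized Scanpath Saliency: mean of the standardised prediction at fixated pixels."""
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    return float(p[fixations > 0].mean())

def cc(pred: np.ndarray, gt: np.ndarray) -> float:
    """Pearson's correlation coefficient between prediction and ground-truth density."""
    return float(np.corrcoef(pred.ravel(), gt.ravel())[0, 1])

def kld(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Kullback-Leibler divergence of the prediction from the ground truth (as distributions)."""
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    return float((g * np.log(g / (p + eps) + eps)).sum())

def sim(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Similarity: histogram intersection of the two normalised maps."""
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    return float(np.minimum(p, g).sum())

# Example with random maps and a sparse fixation mask.
pred = np.random.rand(240, 480)
gt_density = np.random.rand(240, 480)
fix_mask = (np.random.rand(240, 480) > 0.999).astype(np.float32)
print(nss(pred, fix_mask), cc(pred, gt_density), kld(pred, gt_density), sim(pred, gt_density))
```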
SalViT360 and SalViT360-AV achieve state-of-the-art results, consistently outperforming previous saliency predictors (including ATSal, Two-Stream, and spherical ConvLSTM variants) under multiple conditions. The integration of spatial audio yields further performance gains in ambisonic scenarios.
Predicted saliency maps display reduced center and equator biases. Qualitative analyses confirm that these maps align well with audibly and visually salient regions, especially in complex panoramic scenes.
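One simple way to quantify such biases is to examine how predicted saliency mass is distributed over latitude, with a cos(latitude) correction for the ERP oversampling of high latitudes; the sketch below is an illustrative diagnostic, not an evaluation protocol from the paper.

```python
import numpy as np

def latitude_profile(saliency: np.ndarray):
    """Per-row fraction of total saliency mass in an ERP map, weighted by
    cos(latitude) to correct for ERP over-representation of high latitudes."""
    H, _ = saliency.shape
    lat = (0.5 - (np.arange(H) + 0.5) / H) * np.pi      # +pi/2 (top row) .. -pi/2 (bottom row)
    row_mass = saliency.sum(axis=1) * np.cos(lat)       # solid-angle-corrected row mass
    return lat, row_mass / (row_mass.sum() + 1e-8)

# A profile strongly peaked around latitude 0 indicates equator bias.
lat, profile = latitude_profile(np.random.rand(960, 1920))
print("saliency mass within +/-15 deg of the equator:",
      profile[np.abs(lat) < np.deg2rad(15.0)].sum())
```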
4. Impact of Audio-Visual Fusion on Attention Prediction
Empirical results show that the incorporation of spatial audio cues in SalViT360-AV increases both predictive accuracy and subject-to-subject fixation map consistency, compared with visual-only methods. Under ambisonic audio, viewer fixations concentrate near sound sources, a phenomenon robustly captured by the model.
Rotational alignment between the audio and visual fields, accomplished via spherical harmonic transformations, is essential for meaningful cross-modal fusion. The adapters allow SalViT360-AV to efficiently leverage audio when present and gracefully revert to visual cues otherwise.
A plausible implication is that the audio modality is necessary for accurate saliency prediction in realistic immersive environments, and that models lacking such integration may systematically mislocalize attention in multi-source scenarios.
5. Spherical Geometry and Multi-Viewport Handling
Gnomonic projection and spherical geometry-aware embeddings are central to effective saliency modeling in omnidirectional geometry. The handling of tangent viewports and the associated spatial transformations overcomes the severe distortion issues typical of ERP images.
VAC loss leverages the redundancy in overlapping tangent projections to regularize predictions and minimize inconsistencies arising from geometric projection artifacts.
This approach allows saliency models to generalize better across camera motions and scene compositions, suggesting that the methodological trend is toward fully geometry-aware, multimodal learning for 360-degree attention tasks.
6. Applications, Availability, and Future Directions
Immediate applications include adaptive viewport generation for VR/AR navigation, perceptually-driven video streaming, content summarization, and subjective video quality assessment.
Code and dataset for SalViT360 and SalViT360-AV will be released at https://cyberiada.github.io/SalViT360, providing a resource for further research in 360-degree audio-visual analysis.
Potential future directions include:
- Exploring alternative multi-modal fusion strategies, including joint end-to-end training of audio and visual backbones.
- Quantitative investigation into how spatial audio affects not only saliency prediction accuracy, but also the temporal dynamics of gaze behavior in immersive scenes.
- Application of these models for optimizing user experience metrics in next-generation VR systems and for automatic content reformatting in multi-view or multichannel environments.
This work establishes the YT360-EyeTracking framework as a comprehensive benchmark and solution for understanding and predicting attention in highly immersive, multimodal virtual environments, with technical innovations spanning dataset design, spherical transformer architectures, and efficient cross-modal fusion.