SalViT360: Transformer for 360° Saliency
- SalViT360 is a family of transformer-based models for omnidirectional saliency prediction that mitigates geometric distortions using tangent image projections and spherical position embeddings.
- The model employs an encoder–transformer–decoder pipeline with decomposed spatio-temporal attention and spatial audio adapters to fuse visual and auditory cues.
- Evaluations show that SalViT360 achieves higher NSS, lower KLD, and improved CC/SIM on benchmarks, enabling applications in VR streaming, video compression, and quality assessment.
SalViT360 is a family of transformer-based models for visual and audio-visual saliency prediction in omnidirectional (360°) videos, designed to address the unique geometric and perceptual challenges of spherical content. SalViT360 models leverage tangent image projections, spherical geometry-aware attention, and unsupervised regularization to produce saliency maps that closely align with human gaze behavior in immersive environments. The most recent extension includes audio-visual fusion via spatial audio-aware transformer adapters, establishing state-of-the-art results on newly introduced and established benchmarks (Cokelek et al., 27 Aug 2025, Cokelek et al., 2023, Yun et al., 2022).
1. Model Architecture and Geometric Rationales
SalViT360 employs an encoder–transformer–decoder pipeline that processes 360° video frames in both visual and audio-visual domains. Each frame, represented in equirectangular projection (ERP), is re-projected via gnomonic projection into a set of tangent images or "viewports" covering the sphere. These tangent images serve as locally undistorted representations, enabling effective feature extraction and aggregation.
Key architectural components:
- Encoder: A pretrained 2D CNN (e.g., ResNet-18, ResNet-50) extracts high-dimensional feature tensors from each tangent view. Spherical positional embeddings, derived from the per-pixel angular coordinates $(\theta, \phi)$, are mapped to the feature space by a learned function and fused additively with the visual features.
- Transformer: The main transformer block employs a decomposed spatio-temporal attention mechanism, termed Viewport Spatio-Temporal Attention (VSTA). VSTA first applies temporal attention among tangent views across consecutive frames (Viewport Temporal Attention, VTA), then spatial attention among all tangent views within the same frame (Viewport Spatial Attention, VSA):
$$\mathrm{VSTA}(z) = \mathrm{VSA}\big(\mathrm{VTA}(z)\big), \qquad \mathrm{Attn}(Q, K, V) = \sigma\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$$
where $Q$, $K$, $V$ are token projections and $\sigma$ is the softmax function (a short code sketch of this decomposition follows below).
- Decoder: A multi-layer CNN reconstructs dense per-pixel saliency predictions, which are inverse-projected to the ERP domain, resolving any overlap via averaging.
This architecture is geometrically principled, allowing precise modeling of global context while mitigating spherical distortions and supporting transfer learning from perspective image models.
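The decomposed attention can be made concrete with a brief PyTorch-style sketch. This is a minimal illustration under simplifying assumptions (one token per tangent view, `nn.MultiheadAttention` standing in for the exact attention implementation, arbitrary dimensions and viewport counts), not the released code.

```python
import torch
import torch.nn as nn

class VSTABlock(nn.Module):
    """Minimal sketch of Viewport Spatio-Temporal Attention (VSTA):
    temporal attention across frames per viewport (VTA), followed by
    spatial attention across viewports within each frame (VSA)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.vta = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vsa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_s = nn.LayerNorm(dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch B, frames T, viewports N, dim D), one token per tangent view
        b, t, n, d = z.shape
        # VTA: attend over time independently for each viewport
        zt = z.permute(0, 2, 1, 3).reshape(b * n, t, d)
        q = self.norm_t(zt)
        zt = zt + self.vta(q, q, q, need_weights=False)[0]
        z = zt.reshape(b, n, t, d).permute(0, 2, 1, 3)
        # VSA: attend over viewports independently for each frame
        zs = z.reshape(b * t, n, d)
        q = self.norm_s(zs)
        zs = zs + self.vsa(q, q, q, need_weights=False)[0]
        return zs.reshape(b, t, n, d)

# Example: 2 clips, 8 frames, 18 tangent viewports, 512-d tokens
block = VSTABlock(dim=512)
out = block(torch.randn(2, 8, 18, 512))   # -> (2, 8, 18, 512)
```

Factorizing attention this way keeps the cost linear in the product of the temporal and spatial sequence lengths, rather than quadratic in their product as joint spatio-temporal attention would be.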
2. Audio-Visual Fusion via Spatial Audio Adapters
SalViT360-AV extends SalViT360 by integrating spatial audio cues through transformer adapter modules:
- Spatial Audio Processing: ODVs with first-order ambisonics (FOA) audio are processed so that, for each tangent viewport, the FOA channels are rotated in the spherical-harmonics domain:
$$a^{(i)} = R(\theta_i, \phi_i)\, a,$$
where $R(\theta_i, \phi_i)$ is the rotation matrix determined by the viewport's central direction $(\theta_i, \phi_i)$ and $a$ denotes the FOA channel vector.
- Audio Feature Extraction and Fusion: The rotated, viewport-specific audio is decoded to a mono signal using standard FOA decoding and passed through an audio backbone (e.g., PaSST). The extracted audio features are fused with the visual tokens within each transformer block by lightweight adapters:
$$z' = z + s \cdot \mathrm{Adapter}(z_a),$$
where $z$ are visual tokens, $z_a$ are audio features, and $s$ is a scaling factor.
Directional sound cues influence gaze prediction, enabling the model to prioritize regions of visual interest that also contain salient audio events.
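As an illustration of both steps, the sketch below rotates FOA channels toward a viewport's central direction and injects pooled audio features into visual tokens through a bottleneck adapter. The channel ordering (W, X, Y, Z), axis conventions, bottleneck width, and scaling constant are assumptions for illustration rather than details from the cited papers.

```python
import numpy as np
import torch
import torch.nn as nn

def foa_rotation_matrix(yaw: float, pitch: float) -> np.ndarray:
    """3x3 rotation applied to the first-order (X, Y, Z) channels so that a
    viewport centred at (yaw, pitch) faces the front axis.
    Assumed convention: x = front, y = left, z = up (right-handed)."""
    cy, sy = np.cos(-yaw), np.sin(-yaw)       # undo yaw about the z (up) axis
    cp, sp = np.cos(-pitch), np.sin(-pitch)   # undo pitch about the y axis
    rot_z = np.array([[cy, -sy, 0.0],
                      [sy,  cy, 0.0],
                      [0.0, 0.0, 1.0]])
    rot_y = np.array([[cp, 0.0, sp],
                      [0.0, 1.0, 0.0],
                      [-sp, 0.0, cp]])
    return rot_y @ rot_z

def rotate_foa(foa: np.ndarray, yaw: float, pitch: float) -> np.ndarray:
    """foa: (4, samples) with channels (W, X, Y, Z) (assumed ordering).
    W is rotation-invariant; the directional channels rotate as a 3-vector."""
    out = foa.copy()
    out[1:4] = foa_rotation_matrix(yaw, pitch) @ foa[1:4]
    return out

class AudioAdapter(nn.Module):
    """Bottleneck adapter injecting audio features into visual tokens:
    z' = z + s * Adapter(z_audio)  (bottleneck form assumed)."""
    def __init__(self, dim: int, audio_dim: int, bottleneck: int = 64, s: float = 0.1):
        super().__init__()
        self.down = nn.Linear(audio_dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        self.s = s

    def forward(self, z: torch.Tensor, z_audio: torch.Tensor) -> torch.Tensor:
        # z: (B, tokens, dim) visual tokens; z_audio: (B, 1, audio_dim) pooled audio feature
        return z + self.s * self.up(self.act(self.down(z_audio)))
```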
3. Unsupervised Regularization for Tangent Consistency
To address artefacts resulting from overlapping tangent views in projection and inverse-projection steps, SalViT360 introduces Viewport Augmentation Consistency (VAC), a regularization that enforces agreement between predictions from different tangent tilings:
- VAC Loss: Given predictions $\hat{y}_1$ and $\hat{y}_2$ obtained from the original and augmented tangent sets, and a weighting matrix $M$ emphasizing overlap regions, the loss penalizes their disagreement:
$$\mathcal{L}_{\mathrm{VAC}} = \left\lVert M \odot \left(\hat{y}_1 - \hat{y}_2\right)\right\rVert_2^2 .$$
- This term suppresses discontinuities and encourages robust, spatially consistent saliency across the sphere. At inference time only a single tangent set is used, so VAC adds no runtime overhead.
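Because only the high-level form of the term is described above, the following is a minimal sketch of a weighted agreement loss between the two predictions, with the overlap weighting `M` assumed to be a precomputed per-pixel mask:

```python
import torch

def vac_loss(pred_orig: torch.Tensor, pred_aug: torch.Tensor,
             overlap_weight: torch.Tensor) -> torch.Tensor:
    """Viewport Augmentation Consistency loss (sketch).
    pred_orig, pred_aug: ERP saliency maps (B, 1, H, W) predicted from the
                         original and augmented tangent tilings.
    overlap_weight:      (1, 1, H, W) weights emphasising regions covered by
                         overlapping tangent views (assumed precomputed)."""
    sq_diff = (pred_orig - pred_aug) ** 2
    return (overlap_weight * sq_diff).sum() / overlap_weight.sum().clamp_min(1e-8)
```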
4. Datasets, Benchmarks, and Evaluation Protocols
Performance validation for SalViT360 and its variants utilizes several established and new datasets:
| Dataset | Modality | Subjects | Audio Conditions | Resolution / Duration |
|---|---|---|---|---|
| VR-EyeTracking | Visual | 50+ | Mute | 4K equirectangular |
| PVS-HMEM | Visual | — | Mute | — |
| 360AV-HM | Visual + Audio | — | FOA / mono / mute | — |
| YT360-EyeTracking | Visual + Audio | 100+ | Mute / mono / FOA | 4K, 30 s clips |
Evaluation employs metrics including Normalized Scanpath Saliency (NSS), Kullback–Leibler Divergence (KLD), Pearson’s Correlation Coefficient (CC), and Similarity (SIM).
SalViT360 consistently achieves higher NSS, lower KLD, and higher CC/SIM than previous methods; in cross-dataset evaluation it surpasses prior state-of-the-art scores (e.g., earlier methods reaching NSS ≈ 2.63) on most splits. The audio-visual variant SalViT360-AV further improves results, particularly when spatial audio is available.
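The four metrics follow their standard saliency-benchmark definitions; the compact NumPy sketch below uses those generic formulations, not the exact evaluation scripts of the cited works.

```python
import numpy as np

def nss(sal: np.ndarray, fixations: np.ndarray) -> float:
    """Normalized Scanpath Saliency: mean of the z-scored map at fixated pixels."""
    z = (sal - sal.mean()) / (sal.std() + 1e-8)
    return float(z[fixations > 0].mean())

def kld(gt: np.ndarray, pred: np.ndarray, eps: float = 1e-8) -> float:
    """KL divergence from the predicted density to the ground-truth density."""
    gt = gt / (gt.sum() + eps)
    pred = pred / (pred.sum() + eps)
    return float(np.sum(gt * np.log(eps + gt / (pred + eps))))

def cc(gt: np.ndarray, pred: np.ndarray) -> float:
    """Pearson correlation coefficient between two saliency maps."""
    return float(np.corrcoef(gt.ravel(), pred.ravel())[0, 1])

def sim(gt: np.ndarray, pred: np.ndarray, eps: float = 1e-8) -> float:
    """Similarity (histogram intersection) between normalized maps."""
    gt = gt / (gt.sum() + eps)
    pred = pred / (pred.sum() + eps)
    return float(np.minimum(gt, pred).sum())
```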
5. Spherical Geometry-Aware Attention Mechanisms
Central to the SalViT360 design are geometry-aware attentional processes:
- Tangent Image Representation: Gnomonic projection produces undistorted viewports such that conventional convolutional and transformer architectures can be utilized without modification. This retains local photometric and spatial alignment, overcoming the limitations of equirectangular or cubemap projections.
- Spherical Position Embedding: Each token receives a learnable embedding corresponding to its angular coordinates. The embedding is fused with the visual features, ensuring that the attention mechanisms respect the sphere's geometry, especially near the poles where distortion is maximal.
- Attention Decomposition: Rather than joint spatio-temporal attention (which is computationally prohibitive for high-dimensional video data), SalViT360 employs two-stage VSTA—first aggregating temporal dependencies (VTA), then spatial dependencies (VSA)—efficiently capturing both dynamic and spatial saliency cues.
This attention mechanism is particularly suited for omnidirectional vision, enabling context-aware modeling across the entire 360° field.
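To make the geometric machinery concrete, the sketch below computes the inverse gnomonic sampling grid for one viewport and adds a learned spherical position embedding to token features. The field of view, grid size, and MLP form of the embedding are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn

def tangent_grid(lat0: float, lon0: float, fov: float = np.pi / 3, size: int = 224):
    """Inverse gnomonic projection: for each pixel of a (size x size) tangent
    image centred at (lat0, lon0) with the given field of view, return the
    spherical coordinates (lat, lon) at which to sample the ERP frame."""
    half = np.tan(fov / 2)
    x, y = np.meshgrid(np.linspace(-half, half, size),
                       np.linspace(-half, half, size))
    rho = np.sqrt(x ** 2 + y ** 2) + 1e-12
    c = np.arctan(rho)
    lat = np.arcsin(np.clip(np.cos(c) * np.sin(lat0)
                            + y * np.sin(c) * np.cos(lat0) / rho, -1.0, 1.0))
    lon = lon0 + np.arctan2(x * np.sin(c),
                            rho * np.cos(lat0) * np.cos(c)
                            - y * np.sin(lat0) * np.sin(c))
    return lat, lon   # use these to bilinearly sample the ERP frame

class SphericalPosEmbed(nn.Module):
    """Maps per-token angular coordinates (lat, lon) to the feature dimension
    and adds them to visual features (learned MLP form assumed)."""
    def __init__(self, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, feats: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, dim) token features; angles: (B, N, 2) with (lat, lon) per token
        return feats + self.mlp(angles)
```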
6. Applications and Research Implications
SalViT360 models have substantial utility across VR and multimedia disciplines:
- Saliency-Guided Video Compression: Foveated encoding schemes prioritize transmission and rendering quality for regions likely to be observed, using predicted saliency maps as weighting factors.
- Immersive Video Streaming and Rendering: SalViT360’s saliency maps inform the allocation of resources in real-time rendering, especially for head-mounted display environments.
- Omnidirectional Video Quality Assessment: Saliency maps act as perceptual weights in PSNR, WS-PSNR, and S-PSNR metrics, improving alignment with subjective human ratings (DMOS); a weighting sketch is given at the end of this section.
- Perceptual Studies: The YT360-EyeTracking dataset enables systematic analysis of audio-visual gaze behavior; integration of spatial audio is shown to influence viewer attention, especially in complex, multi-source scenarios.
A plausible implication is that the fusion of geometry-aware and multi-modal attentional processes may generalize to segmentation, depth estimation, and other dense prediction tasks on spherical domains.
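As a worked example of the quality-assessment use case mentioned above, the sketch below combines the latitude weighting used by WS-PSNR with a predicted saliency map to form a saliency-weighted PSNR; the exact weighting scheme used in the cited studies may differ.

```python
import numpy as np

def saliency_weighted_psnr(ref: np.ndarray, dist: np.ndarray,
                           sal: np.ndarray, max_val: float = 255.0) -> float:
    """Saliency-weighted, WS-PSNR-style score for single-channel ERP frames
    (illustrative form). Errors in regions that are both geometrically
    significant (cos-latitude weight) and likely to be viewed (saliency)
    contribute more to the score."""
    h, w = ref.shape[:2]
    lat = (np.arange(h) + 0.5) / h * np.pi - np.pi / 2           # per-row latitude
    w_lat = np.cos(lat)[:, None] * np.ones((1, w))               # WS-PSNR weight
    weight = w_lat * (sal / (sal.max() + 1e-8))                  # fuse with saliency
    err = (ref.astype(np.float64) - dist.astype(np.float64)) ** 2
    mse = np.sum(weight * err) / np.sum(weight)
    return 10.0 * np.log10(max_val ** 2 / (mse + 1e-12))
```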
7. Future Directions
Suggestions for future exploration include:
- Multi-modal Fusion Enhancement: Investigation of advanced transformer fusion modules—e.g., cross-attention between audio and visual branches or dynamic modality weighting based on content.
- Long-Range Temporal Modeling: Extending VSTA to capture dependencies over longer temporal windows while maintaining computational feasibility.
- Rich Spherical Audio Representation: Utilization of higher-order ambisonics, more granular spatial audio features, or integrating audio source localization.
- Integration with Downstream Tasks: Fine-tuning SalViT360 for tasks like salient object segmentation, VR-based activity recognition, or interactive content adaptation.
These avenues reflect the model’s versatility for immersive analysis and its ongoing influence in the development of robust omnidirectional multimedia understanding models.