SalViT360-AV: 360° Audiovisual Saliency Framework

Updated 1 September 2025
  • The paper presents a vision-transformer framework that fuses locally tangent viewports with spatial audio cues to predict saliency in omnidirectional videos.
  • It employs dual-branch processing with audio-conditioned adapters and separate spatio-temporal attention modules to overcome distortion and fusion challenges.
  • Empirical results on the YT360-EyeTracking dataset demonstrate superior performance across NSS, KLD, CC, and SIM metrics using cross-modal fusion.

SalViT360-AV is a vision-transformer-based framework for predicting audiovisual saliency in omnidirectional (360°) videos, explicitly integrating directional spatial audio cues into the transformer architecture via audio-conditioned adapters. Developed to address the limitations of visual-only saliency prediction in immersive environments, SalViT360-AV leverages locally undistorted tangent viewports, transformer-based spatio-temporal mixing, and ambisonic spatial audio alignment to achieve state-of-the-art accuracy on multiple saliency benchmarks, notably with the introduction and utilization of the YT360-EyeTracking dataset (Cokelek et al., 27 Aug 2025).

1. Architectural Foundations and Spherical Geometry Handling

SalViT360-AV is based on an extension of the SalViT360 model, which employs locally tangent viewport extraction via gnomonic projection to minimize equirectangular distortion in 360° video. Each frame is decomposed into $T$ tangent images, with $F$ frames per clip. These viewports, each covering an approximately $80^\circ$ field of view, are encoded using a convolutional neural network (CNN) pretrained on standard 2D visual datasets. This stage yields feature maps that capture local textures and objects, unaffected by projection-induced artifacts.
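A minimal sketch of the inverse gnomonic mapping used to extract one such tangent viewport is shown below. The projection formulas are the standard gnomonic equations; the function name, grid resolution, and angle conventions are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def gnomonic_sampling_grid(phi0, theta0, fov_deg=80.0, size=224):
    """Build (lat, lon) sampling coordinates for one tangent viewport.

    phi0, theta0 : tangent-point longitude and latitude in radians.
    fov_deg      : viewport field of view (~80 degrees in the paper).
    size         : output resolution of the tangent image (assumed).
    """
    # Planar coordinates of the tangent image, spanning the requested FoV.
    half = np.tan(np.radians(fov_deg) / 2.0)
    x, y = np.meshgrid(np.linspace(-half, half, size),
                       np.linspace(-half, half, size))

    # Inverse gnomonic projection: plane (x, y) -> sphere (lat, lon).
    rho = np.sqrt(x**2 + y**2)
    c = np.arctan(rho)
    rho = np.where(rho == 0, 1e-9, rho)          # avoid division by zero
    lat = np.arcsin(np.cos(c) * np.sin(theta0)
                    + y * np.sin(c) * np.cos(theta0) / rho)
    lon = phi0 + np.arctan2(x * np.sin(c),
                            rho * np.cos(theta0) * np.cos(c)
                            - y * np.sin(theta0) * np.sin(c))
    return lat, lon  # convert to ERP pixel indices to sample the viewport
```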

Spatial context is preserved by augmenting feature tokens with spherical geometry-aware positional embeddings computed from the angular coordinates $(\phi, \theta)$ of each tangent viewport. These embeddings serve two roles: informing the transformer of the spatial location of each token on the omnidirectional sphere, and facilitating geometric consistency when aggregating multi-viewport predictions.
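As an illustration only, one simple way to realize such a geometry-aware embedding is to encode the 3-D unit vector of each viewport centre at several sinusoidal frequencies; the paper does not specify this exact parameterisation, so the sketch below is an assumption.

```python
import math
import torch

def spherical_positional_embedding(phi: float, theta: float, dim: int = 768) -> torch.Tensor:
    """Illustrative sinusoidal embedding of a viewport centre (phi, theta).

    Nearby directions on the sphere map to nearby unit vectors and therefore
    receive similar embeddings; the exact scheme in SalViT360-AV may differ.
    """
    # Unit vector of the viewport centre (phi = longitude, theta = latitude).
    xyz = torch.tensor([math.cos(theta) * math.cos(phi),
                        math.cos(theta) * math.sin(phi),
                        math.sin(theta)])
    n_freq = dim // (2 * 3)                                    # sin/cos per coordinate
    freqs = 2.0 ** torch.arange(n_freq, dtype=torch.float32)   # frequency ladder
    angles = xyz[:, None] * freqs[None, :]                     # (3, n_freq)
    emb = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1).flatten()
    return torch.nn.functional.pad(emb, (0, dim - emb.numel()))  # pad to dim
```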

A transformer module implements the Viewport Spatio-Temporal Attention (VSTA) mechanism. VSTA decomposes self-attention into:

  • Temporal attention (VTA): computes attention among tokens corresponding to a given viewport across multiple frames,
  • Spatial attention (VSA): aggregates information from different viewports within a single frame.

This separation maintains computational efficiency and respects the underlying geometry. Mathematically, for token $z^{(\ell)}_{t,f}$ at layer $\ell$, viewport $t$, and frame $f$:

\text{VSTA}(z^{(\ell)}_{t,f}) = \text{VSA}\left(\text{VTA}(z^{(\ell)}_{t,f})\right)

In each case, attention weights and updates are computed via standard query–key–value softmax attention.
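A minimal sketch of this factorised attention is given below, assuming a single token per (viewport, frame) pair and omitting the per-patch tokens, layer norms, and MLPs of the full SalViT360 block.

```python
import torch
import torch.nn as nn

class VSTABlock(nn.Module):
    """Sketch of factorised viewport spatio-temporal attention (VTA then VSA).

    Tokens have shape (B, T, F, D): T tangent viewports, F frames, D channels.
    """

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        B, T, F, D = z.shape

        # VTA: attention across frames, separately for each viewport.
        zt = z.reshape(B * T, F, D)
        zt = zt + self.temporal(zt, zt, zt, need_weights=False)[0]
        z = zt.reshape(B, T, F, D)

        # VSA: attention across viewports, separately for each frame.
        zs = z.permute(0, 2, 1, 3).reshape(B * F, T, D)
        zs = zs + self.spatial(zs, zs, zs, need_weights=False)[0]
        return zs.reshape(B, F, T, D).permute(0, 2, 1, 3)
```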

2. Audio–Visual Integration via Transformer Adapters

SalViT360-AV introduces a dual-branch architecture with explicit spatial audio processing. The audio input consists of 360° ambisonic signals (typically FOA 4-channel B-format), aligned at each viewport by rotating the ambisonic channels according to each tangent's geometric center. Directional decoding transforms the rotated ambisonics into a mono waveform per tangent, approximating what a human listener would perceive in that direction.
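One way to realize this per-tangent alignment is a virtual first-order microphone pointed at the viewport centre, which is equivalent to rotating the sound field and taking the frontal decode. The sketch below assumes FuMa-style (W, X, Y, Z) channel ordering and gains; the dataset's actual ambisonic convention may differ.

```python
import numpy as np

def decode_towards(foa: np.ndarray, az: float, el: float, a: float = 0.5) -> np.ndarray:
    """Decode a mono signal 'looking' at one tangent centre from FOA audio.

    foa    : (4, n_samples) B-format channels (W, X, Y, Z); check whether your
             data uses FuMa or ACN/SN3D conventions before applying this.
    az, el : azimuth and elevation of the viewport centre in radians.
    a      : directivity factor (0.5 yields a virtual cardioid microphone).
    """
    w, x, y, z = foa
    # Projection of the directional channels onto the viewport direction.
    direction = (np.cos(az) * np.cos(el) * x
                 + np.sin(az) * np.cos(el) * y
                 + np.sin(el) * z)
    return a * np.sqrt(2.0) * w + (1.0 - a) * direction
```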

These audio samples are converted to mel-spectrograms and encoded via a dedicated audio backbone (e.g., PaSST, a transformer-based spectrogram encoder), yielding robust audio feature tokens per viewport.
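An illustrative front end for this step is sketched below; the sample rate and mel settings are assumptions rather than the paper's exact configuration.

```python
import torch
import torchaudio

# The per-tangent mono waveform is turned into a log-mel spectrogram before
# entering the audio backbone (e.g., PaSST).
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=32000, n_fft=1024, hop_length=320, n_mels=128)

waveform = torch.randn(1, 32000)            # 1 s of per-tangent mono audio
log_mel = torch.log(mel(waveform) + 1e-6)   # (1, 128, n_frames) audio input
```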

Transformer adapters—implemented as small bottleneck MLP modules—are inserted at each layer of the visual transformer branch. Adapters receive both the visual token and the audio feature for the corresponding tangent, combining them via additive scaling. The typical operation is:

\begin{align*}
\Delta z_{av,t} &= \text{ReLU}(\text{LN}(\bar{z}_{av,t}) W_{down}) W_{up} \\
z_{av,t} &= \text{MLP}(\bar{z}_{av,t}) + s \cdot \Delta z_{av,t} + \bar{z}_{av,t}
\end{align*}

where $s$ is a learnable scale, $\bar{z}_{av,t}$ is the audio-conditioned token for tangent $t$, and $W_{down}, W_{up}$ are the adapter's projection matrices. This mechanism lets the model fuse visual and spatial audio cues efficiently while fine-tuning only a small subset of parameters.
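A hedged sketch of such an adapter follows. Forming $\bar{z}_{av,t}$ by adding the audio feature to the visual token is one plausible reading of the additive combination, and folding the block MLP into the same module is a simplification for illustration.

```python
import torch
import torch.nn as nn

class AudioConditionedAdapter(nn.Module):
    """Bottleneck adapter implementing the fusion rule above (sketch)."""

    def __init__(self, dim: int = 768, bottleneck: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.down = nn.Linear(dim, bottleneck, bias=False)    # W_down
        self.up = nn.Linear(bottleneck, dim, bias=False)      # W_up
        self.scale = nn.Parameter(torch.tensor(0.1))          # learnable s
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim),     # block MLP (frozen in practice)
                                 nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, z_vis: torch.Tensor, z_aud: torch.Tensor) -> torch.Tensor:
        z_bar = z_vis + z_aud                                  # audio conditioning (assumed additive)
        delta = self.up(torch.relu(self.down(self.norm(z_bar))))
        return self.mlp(z_bar) + self.scale * delta + z_bar    # residual fusion
```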

3. Training, Regularization, and Inverse Projection

To produce the final 360° saliency map, SalViT360-AV decodes each tangent's output with a lightweight CNN decoder and projects it back to the equirectangular domain via an inverse gnomonic transformation. To address prediction inconsistencies in overlapping tangent regions, an unsupervised Viewport Augmentation Consistency (VAC) loss is used during training. VAC enforces agreement between predictions made on alternative tangent viewport selections and is formulated as the sum of weighted Kullback–Leibler divergence and correlation coefficient terms over the overlapping regions:

\mathcal{L}_\text{VAC}(P, P') = \mathcal{L}_\text{KLD}^{weighted}(P, P') + \mathcal{L}_\text{CC}^{weighted}(P, P')

This regularization is applied only during training and adds no inference cost.
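A sketch of this consistency objective is given below, assuming the overlap mask itself provides the weighting; the paper's exact weighting scheme may differ.

```python
import torch

def vac_loss(p: torch.Tensor, p_alt: torch.Tensor, overlap: torch.Tensor) -> torch.Tensor:
    """Viewport Augmentation Consistency loss (sketch).

    p, p_alt : saliency maps from two alternative tangent selections, (B, H, W).
    overlap  : binary mask of regions covered by both selections.
    """
    eps = 1e-7
    p_m, q_m = p * overlap, p_alt * overlap

    # KLD term over the overlap, with both maps renormalized to distributions.
    p_n = p_m / (p_m.sum(dim=(1, 2), keepdim=True) + eps)
    q_n = q_m / (q_m.sum(dim=(1, 2), keepdim=True) + eps)
    kld = (q_n * torch.log(q_n / (p_n + eps) + eps)).sum(dim=(1, 2)).mean()

    # Correlation coefficient over the overlap, turned into a loss (1 - CC).
    pc = p_m - p_m.mean(dim=(1, 2), keepdim=True)
    qc = q_m - q_m.mean(dim=(1, 2), keepdim=True)
    cc = (pc * qc).sum(dim=(1, 2)) / (
        pc.pow(2).sum(dim=(1, 2)).sqrt() * qc.pow(2).sum(dim=(1, 2)).sqrt() + eps)
    return kld + (1.0 - cc).mean()
```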

4. Datasets and Evaluation Protocols

A major methodological advancement in SalViT360-AV is the curation of YT360-EyeTracking, comprising 81 ERP omnidirectional videos (3840×1920 px, 24–30 fps, 30 seconds per clip) annotated under three audio conditions (mute, mono, and ambisonics) across color and grayscale versions. Eye-tracking data from 102 HMD-equipped participants (VR free-viewing) enables precise ground-truth attention map creation (Cokelek et al., 27 Aug 2025).

Benchmark evaluations additionally reference VR-EyeTracking, PVS-HMEM, and 360AV-HM datasets. Saliency metrics include Normalized Scanpath Saliency (NSS), KL-divergence (KLD), Pearson’s Correlation Coefficient (CC), and Similarity Metric (SIM), all calculated against ground-truth fixation maps derived from experiment participants.
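For reference, the standard definitions of NSS, CC, and SIM used in saliency benchmarking can be computed as follows (KLD follows the same pattern over normalized maps):

```python
import numpy as np

def nss(saliency: np.ndarray, fixations: np.ndarray) -> float:
    """Normalized Scanpath Saliency: mean z-scored saliency at fixated pixels."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-7)
    return float(s[fixations > 0].mean())

def cc(saliency: np.ndarray, gt_map: np.ndarray) -> float:
    """Pearson correlation between predicted and ground-truth saliency maps."""
    s = (saliency - saliency.mean()) / (saliency.std() + 1e-7)
    g = (gt_map - gt_map.mean()) / (gt_map.std() + 1e-7)
    return float((s * g).mean())

def sim(saliency: np.ndarray, gt_map: np.ndarray) -> float:
    """Similarity: histogram intersection of the two normalized distributions."""
    s = saliency / (saliency.sum() + 1e-7)
    g = gt_map / (gt_map.sum() + 1e-7)
    return float(np.minimum(s, g).sum())
```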

5. Empirical Results and Key Findings

Quantitative and ablation studies demonstrate that SalViT360-AV outperforms all previous visual-only and bi-modal saliency models on NSS, KLD, CC, and SIM. Performance gains are most pronounced with spatial audio integration; transformer adapters conditioned on spatial audio yield higher predictive accuracy compared to the basic visual branch or mono audio modalities.

Qualitative inspection reveals more robust and localized highlighting of salient regions, especially those corresponding to dynamic or contextually relevant sound sources (e.g., people speaking, vehicles moving). Integration of directional audio resolves ambiguities in visually cluttered or ambiguous scenes, closely matching actual viewer gaze allocation.

6. Applications, Broader Implications, and Resources

Accurate 360° audio-visual saliency prediction is foundational for adaptive video streaming, saliency-guided compression, and user-centric quality assessment in VR. By capturing multisensory guidance of human attention, SalViT360-AV enables more efficient scene understanding, resource allocation, and personalized media experiences in panoramic environments.

The codebase, pretrained weights, and dataset are publicly available, facilitating reproducibility and further research at https://cyberiada.github.io/SalViT360.

The empirical superiority of spatial audio integration in omnidirectional saliency prediction underscores the importance of transformer-based cross-modal fusion, spherical geometry-aware modeling, and parameter-efficient adaptation frameworks for immersive media analysis.

7. Prospects and Future Research Trajectories

SalViT360-AV exemplifies a scalable paradigm for transformer-based multi-modal learning with parameter-efficient fine-tuning. Immediate future directions include:

  • Incorporation of higher-order ambisonics for improved spatial resolution in audio branch processing;
  • Exploration of alternative cross-modal fusion modules (e.g., gated fusion, attention-based adapters);
  • Extension to other 360° tasks such as embodied navigation, panoramic question answering, and saliency-conditioned video generation;
  • Expansion and diversification of benchmark datasets with richer annotations and more naturalistic scenes.

In summary, SalViT360-AV establishes a technical reference for omnidirectional audiovisual saliency modeling, offering reproducible resources and opening future investigations into holistic attention prediction in immersive environments.
