
SalViT360: Transformer for 360° Saliency

Updated 1 September 2025
  • SalViT360 is a family of transformer-based models for omnidirectional saliency prediction that mitigates geometric distortions using tangent image projections and spherical position embeddings.
  • The model employs an encoder–transformer–decoder pipeline with decomposed spatio-temporal attention and spatial audio adapters to fuse visual and auditory cues.
  • Evaluations show that SalViT360 achieves higher NSS, lower KLD, and improved CC/SIM on benchmarks, enabling applications in VR streaming, video compression, and quality assessment.

SalViT360 is a family of transformer-based models for visual and audio-visual saliency prediction in omnidirectional (360°) videos, designed to address the unique geometric and perceptual challenges of spherical content. SalViT360 models leverage tangent image projections, spherical geometry-aware attention, and unsupervised regularization to produce saliency maps that closely align with human gaze behavior in immersive environments. The most recent extension includes audio-visual fusion via spatial audio-aware transformer adapters, establishing state-of-the-art results on newly introduced and established benchmarks (Cokelek et al., 27 Aug 2025, Cokelek et al., 2023, Yun et al., 2022).

1. Model Architecture and Geometric Rationales

SalViT360 employs an encoder–transformer–decoder pipeline that processes 360° video frames in both visual and audio-visual domains. Each frame, represented in equirectangular projection (ERP), is re-projected via gnomonic projection into a set of tangent images or "viewports" covering the sphere. These tangent images serve as locally undistorted representations, enabling effective feature extraction and aggregation.
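As a minimal sketch of this re-projection step, the following samples one tangent viewport from an ERP frame via the inverse gnomonic projection. The field of view, viewport size, tiling layout, and nearest-neighbour sampling are illustrative assumptions rather than the configuration used in the papers.

```python
import numpy as np

def tangent_viewport(erp, center_lon, center_lat, fov_deg=80.0, size=224):
    """Sample one tangent (gnomonic) viewport from an equirectangular frame.

    erp: H x W x C equirectangular image; center_lon/center_lat give the viewport
    centre in radians. Illustrative sketch only.
    """
    H, W = erp.shape[:2]
    # Tangent-plane grid spanning the chosen field of view.
    half = np.tan(np.deg2rad(fov_deg) / 2.0)
    x, y = np.meshgrid(np.linspace(-half, half, size), np.linspace(half, -half, size))

    # Inverse gnomonic projection: tangent-plane (x, y) -> spherical (lat, lon).
    rho = np.sqrt(x**2 + y**2)
    c = np.arctan(rho)
    sin_c, cos_c = np.sin(c), np.cos(c)
    rho = np.where(rho == 0, 1e-9, rho)  # avoid division by zero at the viewport centre
    lat = np.arcsin(cos_c * np.sin(center_lat) + y * sin_c * np.cos(center_lat) / rho)
    lon = center_lon + np.arctan2(
        x * sin_c, rho * np.cos(center_lat) * cos_c - y * np.sin(center_lat) * sin_c
    )

    # Spherical coordinates -> ERP pixel indices (nearest neighbour for brevity).
    u = ((lon / (2 * np.pi) + 0.5) % 1.0) * (W - 1)
    v = (0.5 - lat / np.pi) * (H - 1)
    return erp[v.round().astype(int), u.round().astype(int)]

# Hypothetical tiling: viewport centres spread over three latitude rings.
centers = [(np.deg2rad(lon), np.deg2rad(lat)) for lat in (-45, 0, 45) for lon in range(0, 360, 60)]
```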

Key architectural components:

  • Encoder: A pretrained 2D CNN (e.g., ResNet-18, ResNet-50) extracts high-dimensional feature tensors from each tangent view. Spherical positional embeddings, derived from per-pixel angular coordinates $(\phi, \theta)$, are mapped to the feature space using a learned function $\mathcal{F}(\phi, \theta)$ and fused additively with the visual features.
  • Transformer: The main transformer block employs a decomposed spatio-temporal attention mechanism, termed Viewport Spatio-Temporal Attention (VSTA). VSTA first applies temporal attention among tangent views across consecutive frames (Viewport Temporal Attention, VTA), then spatial attention among all tangent views within the same frame (Viewport Spatial Attention, VSA):

\text{VSTA}\left(z_{t,f}^{(l)}\right) = \text{VSA}\left(\text{VTA}\left(z_{t,f}^{(l)}\right)\right)

where $q$, $k$, $v$ are token projections and $\text{SM}$ denotes the softmax:

\text{VTA}\left(z_{t,f}^{(l)}\right) = \text{SM}\left(q_{t,f}^{(l)} \cdot \left\{k_{t,f'}^{(l)\,T}\right\}\right) \times \left\{v_{t,f'}^{(l)}\right\}, \quad f' = 1, \ldots, F

\text{VSA}\left(z_{t,f}^{(l)}\right) = \text{SM}\left(q_{t,f}^{(l)} \cdot \left\{k_{t',f}^{(l)\,T}\right\}\right) \times \left\{v_{t',f}^{(l)}\right\}, \quad t' = 1, \ldots, T

  • Decoder: A multi-layer CNN reconstructs dense per-pixel saliency predictions, which are inverse-projected to the ERP domain, resolving any overlap via averaging.

This architecture is geometrically principled, allowing precise modeling of global context while mitigating spherical distortions and supporting transfer learning from perspective image models.
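A minimal PyTorch-style sketch of the decomposed VSTA attention described above is given below. For brevity it treats each tangent view as a single token per frame and uses standard multi-head attention modules; dimensions and module names are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class VSTABlock(nn.Module):
    """Decomposed viewport spatio-temporal attention: temporal (VTA) then spatial (VSA)."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # VTA
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # VSA
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z):
        # z: (batch B, frames F, tangent views T, embedding dim D)
        B, F, T, D = z.shape

        # VTA: each tangent view attends over the F frames (sequence length F).
        zt = z.permute(0, 2, 1, 3).reshape(B * T, F, D)
        n = self.norm1(zt)
        zt = zt + self.temporal_attn(n, n, n)[0]
        z = zt.reshape(B, T, F, D).permute(0, 2, 1, 3)

        # VSA: within each frame, tokens attend over the T tangent views (sequence length T).
        zs = z.reshape(B * F, T, D)
        n = self.norm2(zs)
        zs = zs + self.spatial_attn(n, n, n)[0]
        z = zs.reshape(B, F, T, D)

        return z + self.mlp(self.norm3(z))  # standard transformer feed-forward

# Usage: out = VSTABlock()(torch.randn(2, 8, 18, 768))
```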

2. Audio-Visual Fusion via Spatial Audio Adapters

SalViT360-AV extends SalViT360 by integrating spatial audio cues through transformer adapter modules:

  • Spatial Audio Processing: ODVs with first-order ambisonics (FOA) audio are processed so that, for each tangent viewport, FOA channels are rotated using spherical harmonics:

\alpha'_N(t) = R \cdot \alpha_N(t)

where $R$ is the rotation matrix determined by the viewport's central direction.

  • Audio Feature Extraction and Fusion: The rotated, viewport-specific audio is decoded to mono via FOA formulas (e.g., $F = (\sqrt{2}\,W + X) \times 2$) and passed through an audio backbone (e.g., PaSST). Extracted audio features are fused with visual tokens within each transformer block using lightweight adapters:

\hat{z}_{av,t} = \text{ReLU}\left(\text{LN}\left(\tilde{z}_{av,t}\right) \cdot W_{\text{down}}\right) \cdot W_{\text{up}}

z_{av,t} = \text{MLP}(\tilde{z}_{av,t}) + s \cdot \hat{z}_{av,t} + \tilde{z}_{av,t}

where $s$ is a scaling factor.
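A minimal sketch of such a bottleneck adapter is shown below; only the down-/up-projection structure and the scaled residual follow the equations above, while dimensions, bottleneck width, and module names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioVisualAdapter(nn.Module):
    """Lightweight adapter fusing audio-conditioned tokens inside a transformer block.

    Implements z_hat = ReLU(LN(z_tilde) W_down) W_up and
    z = MLP(z_tilde) + s * z_hat + z_tilde, per the equations above.
    """

    def __init__(self, dim=768, bottleneck=64, scale=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.down = nn.Linear(dim, bottleneck)  # W_down
        self.up = nn.Linear(bottleneck, dim)    # W_up
        self.mlp = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.scale = scale                      # s

    def forward(self, z_tilde):
        # z_tilde: (..., dim) audio-visual tokens after the attention sub-layer.
        z_hat = self.up(torch.relu(self.down(self.norm(z_tilde))))
        return self.mlp(z_tilde) + self.scale * z_hat + z_tilde

# Usage: fused = AudioVisualAdapter()(torch.randn(2, 18, 768))
```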

Directional sound cues influence gaze prediction, enabling the model to prioritize regions of visual interest that also contain salient audio events.
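To illustrate the viewport-wise rotation of the ambisonic sound field and the mono decode, the sketch below assumes ACN channel ordering (W, Y, Z, X) and a simple yaw-pitch rotation toward the viewport centre; the exact rotation and channel conventions used in the papers may differ.

```python
import numpy as np

def rotate_foa_to_viewport(foa, yaw, pitch):
    """Rotate a first-order ambisonics (FOA) signal so a viewport centre faces front.

    foa: (4, n_samples) array in ACN order (W, Y, Z, X).
    yaw, pitch: viewport centre direction in radians. Sketch only; conventions vary.
    """
    cy, sy = np.cos(-yaw), np.sin(-yaw)      # undo yaw
    cp, sp = np.cos(-pitch), np.sin(-pitch)  # undo pitch
    Rz = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])  # rotation about z (yaw)
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])  # rotation about y (pitch)
    R = Ry @ Rz

    W, Y, Z, X = foa
    Xr, Yr, Zr = R @ np.stack([X, Y, Z])     # first-order channels transform like a vector
    rotated = np.stack([W, Yr, Zr, Xr])      # the omnidirectional W channel is unchanged

    # Mono decode towards the front, cf. F = (sqrt(2) * W + X) * 2 in the text.
    mono = (np.sqrt(2) * W + Xr) * 2
    return rotated, mono
```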

3. Unsupervised Regularization for Tangent Consistency

To address artefacts resulting from overlapping tangent views in projection and inverse-projection steps, SalViT360 introduces Viewport Augmentation Consistency (VAC), a regularization that enforces agreement between predictions from different tangent tilings:

  • VAC Loss: Given predictions $P$ and $P'$ from the original and augmented tangent sets, and a weighting matrix $w_{i,j}$ over overlap regions:

\mathcal{L}_{\text{VAC}}(P, P') = \mathcal{L}_{\text{KLD}}^{\text{weighted}}(P, P') + \mathcal{L}_{\text{CC}}^{\text{weighted}}(P, P')

\mathcal{L}_{\text{KLD}}^{\text{weighted}}(P, P') = \sum_{i,j} P_{i,j} \log\left(\epsilon + \frac{P_{i,j}}{P'_{i,j} + \epsilon}\right) \cdot w_{i,j}

\mathcal{L}_{\text{CC}}^{\text{weighted}}(P, P') = 1 - \frac{\sum \left(P \odot P'\right) \cdot w}{\sqrt{\sum \left(P \odot P\right) \, \sum \left(P' \odot P'\right)}}

  • This term effectively suppresses discontinuities and encourages robust, spatially consistent saliency across the sphere. Testing uses only a single tangent set, incurring no runtime overhead.
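A minimal sketch of this consistency loss is given below; tensor shapes and the computation of the overlap weights are assumptions, and only the weighted KLD and CC terms follow the formulas above.

```python
import torch

def vac_loss(p, p_prime, w, eps=1e-7):
    """Viewport Augmentation Consistency loss between two saliency predictions.

    p, p_prime: (H, W) saliency maps from the original and augmented tangent tilings,
                assumed normalised to sum to 1.
    w: (H, W) weights emphasising regions where tangent views overlap.
    """
    # Weighted KL divergence term.
    kld = (p * torch.log(eps + p / (p_prime + eps)) * w).sum()

    # Weighted correlation-style term.
    cc = 1.0 - (p * p_prime * w).sum() / torch.sqrt((p * p).sum() * (p_prime * p_prime).sum() + eps)

    return kld + cc

# Usage (hypothetical): loss = task_loss + vac_loss(pred_a, pred_b, overlap_weights)
```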

4. Datasets, Benchmarks, and Evaluation Protocols

Performance validation for SalViT360 and its variants utilizes several established and new datasets:

| Dataset | Modality | Subjects | Audio Conditions | Resolution |
| --- | --- | --- | --- | --- |
| VR-EyeTracking | Visual | 50+ | Mute | 4K equirectangular |
| PVS-HMEM | Visual | | Mute | |
| 360AV-HM | Visual+Audio | | FOA/mono/mute | |
| YT360-EyeTracking | Visual+Audio | 100+ | Mute/mono/FOA | 4K/30s |

Evaluation employs metrics including Normalized Scanpath Saliency (NSS), Kullback–Leibler Divergence (KLD), Pearson’s Correlation Coefficient (CC), and Similarity (SIM).
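For reference, minimal NumPy sketches of these four metrics are shown below, following their commonly used definitions; normalisation and fixation-map conventions vary across benchmarks.

```python
import numpy as np

def nss(sal, fixations):
    """Normalized Scanpath Saliency: mean of the standardised map at fixated pixels."""
    s = (sal - sal.mean()) / (sal.std() + 1e-7)
    return s[fixations > 0].mean()

def kld(pred, gt, eps=1e-7):
    """Kullback-Leibler divergence from the ground-truth to the predicted distribution."""
    p, q = pred / (pred.sum() + eps), gt / (gt.sum() + eps)
    return (q * np.log(eps + q / (p + eps))).sum()

def cc(pred, gt):
    """Pearson's correlation coefficient between two saliency maps."""
    return np.corrcoef(pred.ravel(), gt.ravel())[0, 1]

def sim(pred, gt, eps=1e-7):
    """Similarity: histogram intersection of the two normalised maps."""
    p, q = pred / (pred.sum() + eps), gt / (gt.sum() + eps)
    return np.minimum(p, q).sum()
```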

SalViT360 consistently achieves higher NSS, lower KLD, and higher CC/SIM than previous methods; in cross-dataset evaluation, it surpasses the prior best NSS scores (approximately 2.63 for competing methods) on most splits. The audio-visual model SalViT360-AV further improves results, especially in the presence of spatial audio.

5. Spherical Geometry-Aware Attention Mechanisms

Central to the SalViT360 design are geometry-aware attentional processes:

  • Tangent Image Representation: Gnomonic projection produces locally undistorted viewports, so conventional convolutional and transformer architectures can be used without modification. This preserves local photometric and spatial alignment, overcoming the limitations of equirectangular or cubemap projections.
  • Spherical Position Embedding: Each token receives a learnable embedding corresponding to its angular coordinates. The embedding is fused with the visual features, ensuring that the attention mechanism respects the sphere's geometry, especially near the poles where distortion is greatest (a minimal sketch appears at the end of this section).
  • Attention Decomposition: Rather than joint spatio-temporal attention (which is computationally prohibitive for high-dimensional video data), SalViT360 employs two-stage VSTA—first aggregating temporal dependencies (VTA), then spatial dependencies (VSA)—efficiently capturing both dynamic and spatial saliency cues.

This attention mechanism is particularly suited for omnidirectional vision, enabling context-aware modeling across the entire 360° field.
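Returning to the spherical position embedding, the sketch below maps per-token angular coordinates to the feature space and adds them to the visual features; the two-layer mapping and the sine/cosine angle encoding are illustrative choices, not the exact parameterisation of $\mathcal{F}(\phi, \theta)$.

```python
import torch
import torch.nn as nn

class SphericalPositionEmbedding(nn.Module):
    """Maps per-token angular coordinates (phi, theta) to the feature space and adds them."""

    def __init__(self, dim=768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(4, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, feats, phi, theta):
        # Encode angles with sin/cos so the embedding is continuous across the longitude seam.
        ang = torch.stack(
            [torch.sin(phi), torch.cos(phi), torch.sin(theta), torch.cos(theta)], dim=-1
        )
        return feats + self.mlp(ang)  # additive fusion with the visual features

# Usage: feats (B, N, 768); phi, theta (B, N) per-token angular coordinates.
```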

6. Applications and Research Implications

SalViT360 models have substantial utility across VR and multimedia disciplines:

  • Saliency-Guided Video Compression: Foveated encoding schemes prioritize transmission and rendering quality for regions likely to be observed, using predicted saliency maps as weighting factors.
  • Immersive Video Streaming and Rendering: SalViT360’s saliency maps inform the allocation of resources in real-time rendering, especially for head-mounted display environments.
  • Omnidirectional Video Quality Assessment: Saliency maps act as perceptual weights in PSNR, WS-PSNR, and S-PSNR metrics, improving alignment with subjective human ratings (DMOS).
  • Perceptual Studies: The YT360-EyeTracking dataset enables systematic analysis of audio-visual gaze behavior; integration of spatial audio is shown to influence viewer attention, especially in complex, multi-source scenarios.
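As one concrete illustration of the quality-assessment use above, a predicted saliency map can serve as a per-pixel weight in a PSNR-style metric. The sketch below is generic; WS-PSNR and S-PSNR additionally apply spherical area weights defined in their respective standards.

```python
import numpy as np

def saliency_weighted_psnr(ref, dist, sal, max_val=255.0, eps=1e-12):
    """PSNR with per-pixel weights taken from a predicted saliency map.

    ref, dist: reference and distorted frames (H, W[, C]); sal: (H, W) saliency map.
    """
    w = sal / (sal.sum() + eps)            # normalise weights to sum to 1
    if ref.ndim == 3:
        w = w[..., None] / ref.shape[-1]   # spread weights evenly over colour channels
    wmse = (w * (ref.astype(float) - dist.astype(float)) ** 2).sum()
    return 10.0 * np.log10(max_val ** 2 / (wmse + eps))
```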

A plausible implication is that the fusion of geometry-aware and multi-modal attentional processes may generalize to segmentation, depth estimation, and other dense prediction tasks on spherical domains.

7. Future Directions

Suggestions for future exploration include:

  • Multi-modal Fusion Enhancement: Investigation of advanced transformer fusion modules—e.g., cross-attention between audio and visual branches or dynamic modality weighting based on content.
  • Long-Range Temporal Modeling: Extending VSTA to capture dependencies over longer temporal windows while maintaining computational feasibility.
  • Rich Spherical Audio Representation: Utilization of higher-order ambisonics, more granular spatial audio features, or integrating audio source localization.
  • Integration with Downstream Tasks: Fine-tuning SalViT360 for tasks like salient object segmentation, VR-based activity recognition, or interactive content adaptation.

These avenues reflect the model’s versatility for immersive analysis and its ongoing influence in the development of robust omnidirectional multimedia understanding models.
