Cube Padding for Weakly-Supervised Saliency Prediction in 360° Videos (1806.01320v1)

Published 4 Jun 2018 in cs.CV

Abstract: Automatic saliency prediction in 360° videos is critical for viewpoint guidance applications (e.g., Facebook 360 Guide). We propose a spatial-temporal network which is (1) weakly-supervised trained and (2) tailor-made for 360° viewing sphere. Note that most existing methods are less scalable since they rely on annotated saliency map for training. Most importantly, they convert 360° sphere to 2D images (e.g., a single equirectangular image or multiple separate Normal Field-of-View (NFoV) images) which introduces distortion and image boundaries. In contrast, we propose a simple and effective Cube Padding (CP) technique as follows. Firstly, we render the 360° view on six faces of a cube using perspective projection. Thus, it introduces very little distortion. Then, we concatenate all six faces while utilizing the connectivity between faces on the cube for image padding (i.e., Cube Padding) in convolution, pooling, convolutional LSTM layers. In this way, CP introduces no image boundary while being applicable to almost all Convolutional Neural Network (CNN) structures. To evaluate our method, we propose Wild-360, a new 360° video saliency dataset, containing challenging videos with saliency heatmap annotations. In experiments, our method outperforms baseline methods in both speed and quality.

Citations (175)

Summary

  • The paper introduces a novel Cube Padding technique that maps spherical content onto cube faces to minimize distortion in 360° video saliency prediction.
  • It integrates Cube Padding into convolutional, pooling, and ConvLSTM layers to efficiently aggregate spatiotemporal features under weak supervision.
  • Experimental results using the Wild-360 dataset demonstrate improved prediction quality and computational speed, benefiting VR and autonomous systems.

Cube Padding for Weakly-Supervised Saliency Prediction in 360° Videos

The research paper "Cube Padding for Weakly-Supervised Saliency Prediction in 360° Videos" addresses saliency prediction in 360° videos, a task increasingly important for virtual reality and viewpoint guidance applications. Recognizing the limitations of existing methods, the paper introduces a spatial-temporal network, trained under weak supervision, that is tailored to the 360° viewing sphere.

A significant issue in conventional approaches is their reliance on equirectangular projections, which introduce heavy distortion near the poles and artificial image boundaries, complicating saliency prediction. The proposed method circumvents these problems with a "Cube Padding" (CP) technique. The spherical video content is first rendered onto the six faces of a cube using perspective projection, which introduces very little distortion. Each face is then padded not with zeros but with content from its neighboring faces on the cube, so convolutional features remain spatially continuous across faces and no artificial image boundary is introduced.
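To make the padding step concrete, the sketch below (PyTorch, not the authors' code) pads a single cube face with strips taken from its neighboring faces before a convolution; the face ordering, the helper name `cube_pad`, and the simplified edge handling (no per-edge rotation of the strips, zero-filled corners) are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal cube-padding sketch, assuming six H x W cube faces stored as a
# tensor of shape (6, C, H, W) ordered [front, right, back, left, top, down].
# A faithful implementation must also rotate/flip the strips per edge;
# that bookkeeping is omitted here for clarity.
import torch
import torch.nn.functional as F

def cube_pad(faces: torch.Tensor, p: int) -> torch.Tensor:
    """Pad the front face with p pixels taken from neighboring faces
    instead of zeros, so convolutions see a seamless spherical signal."""
    front, right, back, left, top, down = faces  # each (C, H, W)

    # Horizontal neighbors: right edge of the left face, left edge of the right face.
    padded = torch.cat([left[:, :, -p:], front, right[:, :, :p]], dim=2)
    # Vertical neighbors: bottom strip of the top face, top strip of the down face,
    # zero-padded at the corners to match the new width.
    top_strip = F.pad(top[:, -p:, :], (p, p))
    down_strip = F.pad(down[:, :p, :], (p, p))
    padded = torch.cat([top_strip, padded, down_strip], dim=1)
    return padded  # (C, H + 2p, W + 2p)

# Usage: cube-pad first, then convolve with padding=0 so the receptive field
# crosses face boundaries instead of hitting zeros.
faces = torch.randn(6, 64, 32, 32)
conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=0)
out_front = conv(cube_pad(faces, p=1).unsqueeze(0))  # (1, 64, 32, 32)
```

The same substitution of neighbor padding for zero padding applies to pooling and ConvLSTM layers, which is why the technique drops into existing CNN architectures without structural changes.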

The core technical contribution is the integration of CP into convolution, pooling, and convolutional LSTM layers, making it compatible with almost any CNN architecture. A ConvLSTM module aggregates temporal information, and an unsupervised loss based on optical flow enforces temporal consistency between consecutive saliency maps, making the predictions motion-aware.
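The temporal consistency idea can be illustrated with a short, hedged sketch: the previous saliency map is warped toward the current frame using optical flow, and its disagreement with the current prediction is penalized. The backward-flow convention, the grid construction, and the L1 penalty below are assumptions chosen for illustration, not necessarily the paper's exact loss.

```python
# Sketch of a flow-based temporal consistency loss between consecutive
# saliency maps. Assumes backward optical flow (frame t -> t-1) in pixels.
import torch
import torch.nn.functional as F

def temporal_consistency_loss(sal_prev, sal_curr, flow):
    """sal_prev, sal_curr: (B, 1, H, W) saliency maps; flow: (B, 2, H, W)."""
    B, _, H, W = sal_prev.shape
    # Base sampling grid in normalized [-1, 1] coordinates for grid_sample.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H, device=flow.device),
        torch.linspace(-1, 1, W, device=flow.device),
        indexing="ij",
    )
    base = torch.stack([xs, ys], dim=-1).expand(B, H, W, 2)
    # Convert the pixel-space flow to normalized offsets and displace the grid.
    offsets = torch.stack(
        [flow[:, 0] * 2 / (W - 1), flow[:, 1] * 2 / (H - 1)], dim=-1
    )
    # Backward-warp the previous map so it aligns with the current frame.
    warped_prev = F.grid_sample(sal_prev, base + offsets, align_corners=True)
    # Penalize disagreement between the warped previous map and the current one.
    return F.l1_loss(warped_prev, sal_curr)
```

In training, a term of this kind can be added to the weakly-supervised objective so that predicted saliency follows scene motion rather than flickering between frames.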

Experimental evaluation is supported by "Wild-360", a new dataset designed for saliency evaluation in 360° videos. It contains challenging video sequences annotated with saliency heatmaps, enabling robust evaluation of 360° saliency models.

The paper demonstrates that the proposed network significantly outperforms existing baselines in both prediction quality and computational speed. Its efficiency is particularly noteworthy: the speed advantage over baselines holds across varying video resolutions, making it a practical option for real-time applications.

Implications and Future Directions:

This research opens avenues for more effective real-time processing of 360° content, with applications in VR and in the training of autonomous systems. Practically, integration into VR platforms could yield smoother transitions and a better user experience, with saliency predictions guiding viewpoint adjustments more accurately. Theoretically, it points toward models that rely less on exhaustive manual annotation, encouraging further investigation of weakly-supervised learning paradigms.

Looking ahead, richer interconnections between layers could extend the CP mechanism, enabling models that capture more nuanced content relationships in spherical videos. Exploring additional unsupervised temporal-consistency techniques is another promising direction, potentially yielding models that adapt smoothly to dynamically changing input, which is critical in fluctuating real-world environments.

This paper represents a notable stride in saliency prediction for 360° videos, advancing both the practical deployment and the theoretical understanding of spatial-temporal networks in complex visual domains.