iSQoE: No-Reference Stereoscopic SQoE Metric

Updated 18 December 2025
  • iSQoE is a data-driven, no-reference metric that quantifies perceptual comfort and stereo fusion fidelity in VR displays using learned human preference judgments.
  • It employs a Siamese vision transformer with cross-attention fusion to process rectified stereo pairs, enhancing robustness across various distortion types.
  • Evaluation on real-world datasets shows iSQoE correlates strongly with human VR assessments, outperforming traditional pixel-wise and geometry-centric metrics.

iSQoE is a data-driven, no-reference stereoscopic Quality of Experience (SQoE) metric intended to quantify perceptual comfort and stereo fusion fidelity as experienced during immersive stereo image viewing, particularly in head-mounted VR displays. It is realized as a learned predictor trained on human VR preference judgments. iSQoE outputs a single scalar for a rectified stereo pair, with lower values indicating higher predicted user comfort. It serves as a downstream-relevant benchmark for evaluating monocular-to-stereo synthesis, 2D-to-3D conversion, and related tasks, complementing geometry-centric metrics by directly targeting perceptual plausibility rather than pixel-wise fidelity or geometric correctness (Behrens et al., 11 Dec 2025, Tamir et al., 30 Dec 2024).

1. Motivation and Theoretical Foundations

The rapid advancement of stereo generation algorithms and VR display technology has exposed the limitations of legacy stereo image quality metrics, which typically focus on either photometric similarity (e.g., PSNR, SSIM) or local distortion severity, neglecting emergent phenomena such as visual discomfort, stereo rivalry, or fusion breakdown. These factors critically affect the immersive experience: plausible geometric content may induce vergence-accommodation conflicts, while over-smoothed or artifact-laden images may disrupt stereo fusion despite low pixel-wise error.

iSQoE addresses these challenges by learning directly from VR headset-based human preference data, yielding an SQoE metric that reflects the latent perceptual space in which discomfort, fusion, and realism are evaluated by end users. Mathematically, iSQoE is a learned function $F_{\mathrm{iSQoE}}$ mapping a rectified stereo pair $(I_\ell, I_r)$ to a scalar:

\mathrm{iSQoE}(I_\ell, I_r) = F_{\mathrm{iSQoE}}(I_\ell, I_r)

where lower values denote higher predicted comfort and better stereo quality as judged by human observers (Behrens et al., 11 Dec 2025, Tamir et al., 30 Dec 2024).

2. Dataset and Human Preference Annotation Protocol

The development and training of iSQoE rely on the SCOPE (Stereoscopic COntent Preference Evaluation) dataset, comprising 2,400 stereoscopic samples with broad coverage of perceptual distortions:

  • Source Distribution: 2,000 samples derive from real-world stereo captures (Holopix50k-HD), and 400 from synthetic multi-view reconstructions rendered via 3D Gaussian splatting.
  • Distortion Coverage: Nineteen distortion types are included, spanning novel-view synthesis artifacts, spatial transformations, photometric deformations, compression, blur, noise, and generative diffusion editing.
  • Annotation Protocol: For each sample, two distorted variants of the same stereo image undergo pairwise VR-based evaluation on an Apple Vision Pro headset. One hundred and three participants provide two-alternative forced choice (2AFC) judgments, with 5 annotators per sample, yielding fine-grained splits for unanimous, strong-majority, and ambiguous preference cases. The dataset is partitioned into 80% train, 10% validation, and 10% test splits (Tamir et al., 30 Dec 2024).
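
A minimal sketch of how such 2AFC votes could be turned into training pairs and agreement splits is given below (Python). The class and function names, and the exact vote counts taken to define the unanimous, strong-majority, and ambiguous splits, are illustrative assumptions rather than the published protocol.

from dataclasses import dataclass

@dataclass
class PreferencePair:
    """One annotated 2AFC sample: two distorted variants of the same stereo image."""
    variant_a: str    # identifier of distorted variant A
    variant_b: str    # identifier of distorted variant B
    votes_for_a: int  # number of the 5 annotators preferring variant A

def label_pair(pair: PreferencePair, n_annotators: int = 5):
    """Return (preferred, non_preferred, agreement_split) for one 2AFC sample."""
    votes_b = n_annotators - pair.votes_for_a
    if pair.votes_for_a > votes_b:
        preferred, non_preferred = pair.variant_a, pair.variant_b
    else:
        preferred, non_preferred = pair.variant_b, pair.variant_a
    majority = max(pair.votes_for_a, votes_b)
    if majority == n_annotators:
        split = "unanimous"        # 5-0
    elif majority == n_annotators - 1:
        split = "strong_majority"  # 4-1
    else:
        split = "ambiguous"        # 3-2
    return preferred, non_preferred, split

For example, label_pair(PreferencePair("a", "b", votes_for_a=4)) returns ("a", "b", "strong_majority").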

3. Model Architecture and Training Methodology

iSQoE adopts a Siamese vision transformer (ViT, DINOv2 S/14 backbone) architecture with early cross-attention fusion:

  • Dual Input Streams: Left and right images are processed in parallel through transformer stacks.
  • Cross-Attention Fusion: At layers $\ell \in \{2, 5, 8, 11\}$, key/value tensors from the two views are concatenated, enabling direct binocular interaction.
  • Feature Pooling: At each fusion layer, fused tokens are spatially pooled to produce per-view feature vectors. The final feature vector $f \in \mathbb{R}^{8d}$ is constructed via concatenation.
  • Prediction Head: A two-layer MLP with ReLU nonlinearity and sigmoid output computes the final scalar:

Q(x) = \sigma\left(W_2\,\mathrm{ReLU}(W_1 f + b_1) + b_2\right) \in (0, 1)

Lower $Q$ values are preferable.

  • Parameter Tuning: Only low-rank (LoRA) adapters within attention layers are trainable; ViT backbone weights remain frozen (Tamir et al., 30 Dec 2024).
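
The fusion-and-scoring path described above can be illustrated with a simplified PyTorch sketch. This is a schematic under stated assumptions, not the released implementation: it applies stand-alone cross-attention blocks to token features already extracted from each view (the paper instead concatenates key/value tensors inside the backbone's own attention layers), and it omits the frozen DINOv2 backbone and the LoRA adapters. The module names CrossViewFusion and ISQoEHead are hypothetical.

import torch
import torch.nn as nn

class CrossViewFusion(nn.Module):
    """Cross-attention: queries from one view attend to key/value tokens concatenated from both views."""
    def __init__(self, dim: int, heads: int = 6):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_tokens, kv_tokens):
        fused, _ = self.attn(query_tokens, kv_tokens, kv_tokens)
        return fused

class ISQoEHead(nn.Module):
    """Pools fused tokens at each fusion depth and scores the concatenated features with an MLP."""
    def __init__(self, dim: int, n_fusion_layers: int = 4):
        super().__init__()
        self.fusions = nn.ModuleList([CrossViewFusion(dim) for _ in range(n_fusion_layers)])
        self.mlp = nn.Sequential(
            nn.Linear(2 * n_fusion_layers * dim, dim),  # f has size 8*dim: 4 fusion depths x 2 views
            nn.ReLU(),
            nn.Linear(dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, left_feats, right_feats):
        # left_feats / right_feats: one (B, N, dim) token tensor per fusion depth
        pooled = []
        for fuse, tl, tr in zip(self.fusions, left_feats, right_feats):
            both = torch.cat([tl, tr], dim=1)          # key/value tokens from both views
            pooled.append(fuse(tl, both).mean(dim=1))  # spatially pooled left-view features
            pooled.append(fuse(tr, both).mean(dim=1))  # spatially pooled right-view features
        f = torch.cat(pooled, dim=-1)
        return self.mlp(f).squeeze(-1)                 # Q(x) in (0, 1); lower is better

With a DINOv2 S/14 backbone, dim would be 384 and the four fusion depths would correspond to layers 2, 5, 8, and 11, so the concatenated feature matches the $f \in \mathbb{R}^{8d}$ vector above.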

Training Objective: The loss is a pairwise hinge ranking function, enforcing that the preferred sample scores lower (better) than the non-preferred, with a margin $m_0 = 0.05$:

\mathcal{L}_{\mathrm{rank}} = \max\left(0,\; m_0 + Q(x^m) - Q(x^n)\right)

where $x^m$ is the preferred sample and $x^n$ the non-preferred sample in each annotated pair.
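
A minimal PyTorch sketch of this objective, assuming batched scores for the preferred and non-preferred variant of each pair (the function name and the mean reduction over the batch are assumptions):

import torch

def pairwise_hinge_rank_loss(q_preferred, q_nonpreferred, margin: float = 0.05):
    """Zero loss once the preferred sample scores lower than the non-preferred one by at least `margin`."""
    return torch.clamp(margin + q_preferred - q_nonpreferred, min=0).mean()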

4. Evaluation Protocol and Practical Usage

In applied stereo evaluation—such as in the StereoSpace monocular-to-stereo framework—iSQoE is employed as a no-leak, end-to-end metric for perceptual quality:

  • Preprocessing: Each generated stereo pair is rectified and center-cropped as appropriate, then resized so the short side matches the iSQoE model’s native resolution (e.g., 512 × 512).
  • Model Inference: The resized images are stacked and passed through the frozen, pretrained $F_{\mathrm{iSQoE}}$ model. The output scalar (without further normalization or scaling) is recorded for each test scene.
  • Reporting: Scene-wise alignment of stereo baselines (or depth scaling) is performed before metric computation, applying equally to all compared methods.
  • Summary Metric: The mean iSQoE across all test scenes constitutes the reported score, with the convention $\downarrow$ (lower is better).
  • Complementarity: iSQoE is coupled with geometry-focused metrics such as MEt3R, enabling complementary assessment: iSQoE penalizes outputs that induce discomfort or poor fusion, even if photometrically plausible, whereas MEt3R targets geometric consistency (Behrens et al., 11 Dec 2025).

Pseudocode for Evaluation:

for each test scene do
  (I_left, I_right) ← M.generate(input_image, baseline, …)
  (L, R) ← resize_to_512×512(I_left, I_right)
  score ← F_iSQoE.predict(L, R)
  record score
end for

report mean_iSQoE = average(score over all test scenes)
Here, F_iSQoE is the frozen network released by Tamir et al. (Behrens et al., 11 Dec 2025).
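
For concreteness, a minimal Python rendering of the same loop; generator, isqoe_model, and resize_short_side are hypothetical wrappers around a stereo-synthesis method and the frozen iSQoE network, not published APIs.

import torch

@torch.no_grad()
def mean_isqoe(scenes, generator, isqoe_model, resize_short_side):
    """Average raw iSQoE over the test scenes (lower is better); no rescaling is applied."""
    scores = []
    for scene in scenes:
        left, right = generator.generate(scene)         # monocular-to-stereo synthesis
        left, right = resize_short_side(left), resize_short_side(right)
        scores.append(float(isqoe_model(left, right)))  # one scalar per stereo pair
    return sum(scores) / len(scores)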

5. Benchmark Performance and Comparative Analysis

On four real-world datasets (Middlebury 2014, DrivingStereo, Booster, LayeredFlow), iSQoE demonstrates robust discrimination of stereo comfort quality:

  • Mean iSQoE Values:
    • Middlebury 2014: StereoSpace 0.6829, GenStereo 0.6933
    • DrivingStereo: StereoSpace 0.7829, GenStereo 0.7850
    • Booster: StereoSpace 0.6764, GenStereo 0.6901
    • LayeredFlow: StereoSpace 0.7489, GenStereo 0.7678
  • Consistently, StereoSpace achieves mean iSQoE values ≈0.01–0.02 lower than the strongest baselines, interpreted as a significant gain in ease of fusion and viewing comfort (Behrens et al., 11 Dec 2025).
  • Metric Fidelity: On SCOPE’s held-out test set, iSQoE achieves overall human-aligned accuracy of 73.1% (unanimous splits 84.8%), outperforming legacy stereo IQA metrics (StereoQA-Net 68.1%, MANIQA 59.5%, others 50–60%). On mono-to-stereo 2D→3D conversions, iSQoE’s rankings correlate more strongly with VR user preferences than competing metrics (Tamir et al., 30 Dec 2024).

6. Strengths, Limitations, and Future Directions

Strengths:

  • Holistic Comfort Prediction: iSQoE reflects both photometric quality and perceptual comfort, aligned with immersive VR experience.
  • Robust Generalization: The model generalizes to previously unseen distortion types and strengths, as well as to real-world 2D-to-3D conversion outputs.
  • No-Reference Design: Operates directly on stereo pairs without requiring ground truth, depth, or display-specific annotations.

Limitations:

  • Dataset Biases: SCOPE is ~41% Holopix50k-HD, restricting disparity range and potentially biasing comfort priors.
  • Synthetic Coverage: Synthetic NVS subset limited to 13 scenes, not exhaustive across content types.
  • Backbone Biases: DINOv2 pretraining may induce overemphasis on texture or fail to capture fine geometric cues relevant to stereo fusion.
  • Generalization Beyond VR: While VR annotation ensures relevance for head-mounted display use cases, absolute score calibration may be less valid for passive 3D or autostereoscopic displays.

Prospective Improvements: Expanding VR-annotated datasets to enhance disparity/diversity coverage, integrating temporal SQoE modeling for video content, explicitly modeling vergence-accommodation conflicts, and combining discomfort predictions with geometry estimators are identified as key pathways (Tamir et al., 30 Dec 2024).

7. Contextualization Within the Metric Landscape

iSQoE represents the first data-driven, end-to-end, no-reference SQoE predictor specifically optimized on VR headset–based human annotations. It addresses the inherent deficiencies of traditional image quality and stereo discomfort metrics by modeling the downstream human factors that govern immersive visual experience. Its deployment in benchmarking cutting-edge diffusion-based 2D-to-3D or monocular-to-stereo generation pipelines reflects a broader paradigm shift toward perceptually grounded evaluation standards in VR-adaptive content creation (Behrens et al., 11 Dec 2025, Tamir et al., 30 Dec 2024).
