
Semantic Affinity Quality Index (SAQI)

Updated 27 April 2026
  • SAQI is a zero-shot video quality metric that quantifies the semantic alignment between video content and quality prompts using CLIP models.
  • It computes differential affinities between positive and negative prompts to capture high-level perceptual cues beyond pixel-level distortions.
  • SAQI is integrated with classical spatial and temporal indices to create unified metrics like BUONA-VISTA and BVQI, demonstrating robust in-the-wild performance.

The Semantic Affinity Quality Index (SAQI) is an opinion-unaware (zero-shot) video quality assessment metric that quantifies the alignment between a video's visual content and human-interpretable, quality-related semantic prompts via contrastive language-image pre-training (CLIP) models. Unlike traditional no-reference video quality metrics, which focus primarily on technical distortions in pixel or frequency space, SAQI directly models high-level semantic factors (such as “pleasantness” or “aesthetic value”) by estimating how closely a video matches descriptions like “high quality” or “a beautiful landscape” in the shared CLIP embedding space. Fused with classical spatial and temporal naturalness metrics, SAQI forms the basis of unified VQA indices (e.g., BUONA-VISTA, BVQI) that assess “in-the-wild” videos robustly, surpassing previous zero-shot methods and rivaling supervised counterparts without training on human opinion scores (Wu et al., 2023, Wu et al., 2023).

1. Mathematical Definition and Computation

SAQI leverages the joint vision-text embedding capabilities of CLIP. Given a video $V$ downsampled to $N$ frames $\{V_i\}_{i=0}^{N-1}$ at a canonical spatial scale (typically $224 \times 224$ pixels), each frame is encoded into a $d$-dimensional feature $f_{v,i} = E_v(V_i)$ using the CLIP visual encoder. For a given text prompt $T$, the text encoder yields $f_t^T = E_t(T)$.

The core of SAQI is the computation of the average cosine similarity between frame and text embeddings:

$$A(V, T) = \frac{1}{N} \sum_{i=0}^{N-1} \frac{f_{v,i} \cdot f_t^T}{\| f_{v,i} \| \, \| f_t^T \|}$$

To reflect quality, a differential affinity is computed between a positive prompt $T_+$ and a negative prompt $T_-$:

$$\mathrm{DA}(V, T_+, T_-) = A(V, T_+) - A(V, T_-)$$

A positive $\mathrm{DA}$ indicates that the video content is semantically closer to “good quality” than “bad quality” with respect to the specified prompts.
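The affinity and its positive-minus-negative difference can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the 3-dimensional vectors are toy stand-ins for real CLIP embeddings produced by $E_v$ and $E_t$.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def affinity(frame_feats, text_feat):
    """Mean cosine similarity between every frame embedding and one text embedding."""
    return sum(cosine(f, text_feat) for f in frame_feats) / len(frame_feats)

def differential_affinity(frame_feats, pos_feat, neg_feat):
    """Affinity to the positive prompt minus affinity to the negative prompt."""
    return affinity(frame_feats, pos_feat) - affinity(frame_feats, neg_feat)

# Toy stand-ins (assumed values, not real CLIP outputs).
frames = [[1.0, 0.0, 0.0], [0.8, 0.6, 0.0]]
pos = [1.0, 0.0, 0.0]  # stand-in for the embedding of "high quality"
neg = [0.0, 1.0, 0.0]  # stand-in for the embedding of "low quality"

da = differential_affinity(frames, pos, neg)
```

With these toy vectors the frames sit closer to the positive prompt, so `da` is positive.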

For each of several prompt pairs, DA is mapped onto $[0, 1]$ via a sigmoid:

$$\sigma(\mathrm{DA}) = \frac{1}{1 + e^{-\mathrm{DA}}}$$

Empirically, prompts such as (“high quality”, “low quality”) and (“a good photo”, “a bad photo”) are used to capture distinct aspects of perceived quality (Wu et al., 2023).
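As a rough sketch of the per-pair sigmoid step, the snippet below applies a standard logistic function to one differential affinity per prompt pair. The numeric affinities are made-up illustrative values, and how the per-pair scores are then pooled is deliberately left out.

```python
import math

def sigmoid(x):
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical differential affinities, one per prompt pair (illustrative values).
da_by_pair = {
    ("high quality", "low quality"): 0.04,
    ("a good photo", "a bad photo"): -0.02,
}

# A positive difference lands above 0.5, a negative one below 0.5.
score_by_pair = {pair: sigmoid(da) for pair, da in da_by_pair.items()}
```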

In alternative formulations (Wu et al., 2023), SAQI is computed as a mean-pooled, prompt-averaged affinity across all frames:

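$$\mathrm{SAQI}(V) = \frac{1}{N} \sum_{i=0}^{N-1} \frac{1}{K} \sum_{k=1}^{K} \hat{f}_t^{T_k} \cdot \hat{f}_{v,i}$$

where $\hat{f}_t^{T_k}$ and $\hat{f}_{v,i}$ are $\ell_2$-normalized CLIP text and image vectors, respectively, and $K$ is the number of prompts.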
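A minimal sketch of this mean-pooled variant follows, assuming that "prompt-averaged" means a plain mean over prompts and that normalized vectors are compared by dot product. The helper names and the 2-dimensional toy vectors are hypothetical.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def mean_prompt_affinity(frame_feats, prompt_feats):
    """Average dot product of normalized vectors over all frames and all prompts."""
    frames = [l2_normalize(f) for f in frame_feats]
    prompts = [l2_normalize(t) for t in prompt_feats]
    total = 0.0
    for f in frames:
        for t in prompts:
            total += sum(a * b for a, b in zip(f, t))
    return total / (len(frames) * len(prompts))

# Toy stand-ins for image and text embeddings (assumed values).
score = mean_prompt_affinity([[3.0, 4.0], [0.0, 2.0]], [[1.0, 0.0], [0.0, 1.0]])
```

Because both sides are unit length, each dot product here equals the cosine similarity of the original vectors.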

2. Prompt Design, Frame Sampling, and CLIP Encoding

Precise selection of text prompts is crucial, as different prompts are sensitive to different quality attributes—for example, “high quality” vs “low quality” is sensitive to fine technical distortions, while “a good photo” vs “a bad photo” reflects overall aesthetic or compositional factors. Multiple prompts can be used and their affinities pooled via mean or summation (Wu et al., 2023, Wu et al., 2023).

Videos are uniformly sampled to $N$ frames and resized to match the input scale of CLIP’s pre-training. The visual encoder (ResNet-50 backbone in CLIP) processes each frame to generate embedding vectors, while textual prompts are encoded via CLIP’s text transformer. All feature vectors are normalized so that their dot products equal cosine similarities, which lie in $[-1, 1]$.
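Uniform sampling can be illustrated with a small index helper. This is a sketch under assumptions: the function name is hypothetical, the frame counts are arbitrary examples, and real pipelines would decode and resize the selected frames before CLIP encoding.

```python
def uniform_frame_indices(total_frames, num_samples):
    """Return frame indices spread evenly from the first to the last frame."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    if num_samples >= total_frames:
        return list(range(total_frames))  # short clip: keep every frame
    if num_samples == 1:
        return [0]
    step = (total_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]

# Example: pick 4 frames from a 10-frame clip (illustrative sizes).
indices = uniform_frame_indices(10, 4)
```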

The method is resilient to spatial resolution, as semantic content and global composition persist under downsampling.

3. Localized SAQI (SAQI-Local)

SAQI-Local extends the global affinity measure to spatial subregions within each frame. Patch-level features are extracted from intermediate layers of the CLIP visual encoder ($f_{v,i,j}$ for patch $j$ of frame $i$), enabling patch-prompt affinities:

$$A_{i,j}(T) = \frac{f_{v,i,j} \cdot f_t^T}{\| f_{v,i,j} \| \, \| f_t^T \|}$$

For each frame, prompt-wise averages over the $K$ prompts are computed, and then spatial maxima (or means) are taken to aggregate semantic salience:

$$S_i = \max_{j} \frac{1}{K} \sum_{k=1}^{K} A_{i,j}(T_k)$$

The overall video-level score is the temporal average:

$$\mathrm{SAQI}_{\mathrm{Local}}(V) = \frac{1}{N} \sum_{i=0}^{N-1} S_i$$

This local version facilitates sensitivity to semantically rich regions, such as a “beautiful subject” present in only part of a frame (Wu et al., 2023).
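The three aggregation stages (prompt average per patch, spatial pooling per frame, temporal mean per video) can be sketched as follows. The function names are hypothetical, the pooling operator is passed in as a parameter, and the 2-dimensional patch vectors are toy stand-ins for intermediate CLIP features.

```python
import math
from statistics import fmean

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def local_score(patches_per_frame, prompt_feats, spatial_pool=max):
    """Prompt-average each patch, pool across patches, then average across frames."""
    frame_scores = []
    for patches in patches_per_frame:
        per_patch = [fmean(cosine(p, t) for t in prompt_feats) for p in patches]
        frame_scores.append(spatial_pool(per_patch))
    return fmean(frame_scores)

# Two frames with two patches each (assumed toy values).
video = [
    [[1.0, 0.0], [0.0, 1.0]],  # one patch matches the prompt, one does not
    [[0.0, 1.0], [0.0, 1.0]],  # no patch matches the prompt
]
prompts = [[1.0, 0.0]]

score_max = local_score(video, prompts)                       # spatial maximum
score_mean = local_score(video, prompts, spatial_pool=fmean)  # spatial mean
```

Max pooling rewards the single matching patch in the first frame more strongly than mean pooling does, which is the behavior the text attributes to the local variant.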

4. Aggregation with Low-Level Quality Indices

SAQI is designed to capture high-level semantic factors and is typically aggregated with established technical indices—specifically, spatial (NIQE) and temporal (TPQI or TLVQM) no-reference quality metrics—into a single unified quality index.

Aggregation workflow:

  • NIQE and TPQI raw scores are first normalized by subtracting their empirical means and dividing by their standard deviations (Gaussian normalization). As these indices assign lower values to higher quality, a sigmoid rescaling (e.g., $1 / (1 + e^{z})$ for a normalized score $z$) maps scores onto $[0, 1]$ with the correct orientation (Wu et al., 2023).
  • Each index is thus re-mapped into $[0, 1]$, with higher values indicating better quality.
  • Fusion for the BUONA-VISTA index is simple additive:

$$Q = Q_A + Q_S + Q_T$$

where $Q_A$ is SAQI, $Q_S$ is the remapped NIQE (spatial index), and $Q_T$ is the remapped TPQI (temporal index). The final score is an unweighted sum, so all factors are weighted equally. In the alternative BVQI, a weighted linear combination with tuned weights is used (Wu et al., 2023).
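The aggregation workflow above can be sketched as below. Several details are assumptions for illustration: normalization statistics are taken over the batch being scored, the population standard deviation is used, and all raw values are invented. The function and variable names are hypothetical.

```python
import math
from statistics import mean, pstdev

def remap_lower_is_better(raw_scores):
    """Standardize scores, then apply a flipped sigmoid so lower raw values map higher."""
    mu = mean(raw_scores)
    sd = pstdev(raw_scores)
    if sd == 0:
        return [0.5 for _ in raw_scores]  # no spread: every item is average
    return [1.0 / (1.0 + math.exp((s - mu) / sd)) for s in raw_scores]

def fuse_additive(q_a, q_s, q_t):
    """Equal-weight additive fusion of semantic, spatial, and temporal scores."""
    return [a + s + t for a, s, t in zip(q_a, q_s, q_t)]

# Illustrative raw values for three videos (not measured data).
niqe_raw = [3.0, 5.0, 7.0]   # lower is better
tpqi_raw = [0.2, 0.1, 0.3]   # lower is better
q_a = [0.6, 0.5, 0.4]        # semantic scores already in [0, 1]

q_s = remap_lower_is_better(niqe_raw)
q_t = remap_lower_is_better(tpqi_raw)
fused = fuse_additive(q_a, q_s, q_t)
```

A weighted variant would replace the plain sum in `fuse_additive` with a weighted one.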

5. Robustness, Fine-Tuning, and Model Variants

The zero-shot nature of SAQI, BUONA-VISTA, and BVQI means they do not rely on mean opinion scores (MOS) or subjective database-specific labels, directly addressing limitations of supervised VQA methods with respect to generalization and cost.

However, BVQI-Local can be further fine-tuned on a small, labeled validation set by:

  • Jointly optimizing (a) prompt token embeddings and (b) fusion weights and sigmoid parameters via an MSE loss between predicted BVQI-Local scores and MOS, using gradient-based optimization with the Adam optimizer (Wu et al., 2023).
  • This retains the efficiency and generalizability of the zero-shot index while accommodating domain- or dataset-specific semantic distributions.
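To make the fusion-weight part of this fine-tuning concrete, here is a deliberately reduced sketch. It is not the procedure described above: it uses plain batch gradient descent instead of Adam, fits only linear fusion weights and a bias (no prompt embeddings, no sigmoid parameters), and trains on synthetic targets generated from made-up weights. All names are hypothetical.

```python
def predict(weights, bias, comps):
    """Weighted linear fusion of one video's component scores."""
    return sum(w * c for w, c in zip(weights, comps)) + bias

def mse(weights, bias, data, targets):
    """Mean squared error between fused predictions and target scores."""
    return sum((predict(weights, bias, x) - y) ** 2 for x, y in zip(data, targets)) / len(targets)

def fit_fusion_weights(data, targets, lr=0.1, steps=500):
    """Minimize MSE over fusion weights and bias with batch gradient descent."""
    weights, bias = [1.0, 1.0, 1.0], 0.0  # start from equal-weight fusion
    n = len(targets)
    for _ in range(steps):
        grad_w, grad_b = [0.0, 0.0, 0.0], 0.0
        for x, y in zip(data, targets):
            err = predict(weights, bias, x) - y
            for k in range(3):
                grad_w[k] += 2.0 * err * x[k] / n
            grad_b += 2.0 * err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

# Synthetic (semantic, spatial, temporal) scores and targets from assumed weights.
data = [(0.2, 0.5, 0.7), (0.9, 0.4, 0.1), (0.5, 0.5, 0.5), (0.1, 0.9, 0.3), (0.8, 0.2, 0.6)]
targets = [2.0 * a + 1.0 * s + 0.5 * t for a, s, t in data]

initial_error = mse([1.0, 1.0, 1.0], 0.0, data, targets)
weights, bias = fit_fusion_weights(data, targets)
final_error = mse(weights, bias, data, targets)
```

Starting from equal weights means the zero-shot additive fusion is the initial point, and the labeled set only nudges it.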

6. Empirical Validation and Comparative Performance

On “in-the-wild” VQA datasets (LIVE-VQC, KoNViD-1k, YouTube-UGC, CVD2014), aggregation indices incorporating SAQI demonstrate state-of-the-art zero-shot performance:

  • BUONA-VISTA improves Spearman rank-order (SRCC) and Pearson linear (PLCC) correlation coefficients by at least 20% (and up to 80%) relative to the previous best opinion-unaware methods. For example, on KoNViD-1k, SRCC improves from 0.556 (TPQI) to 0.760 (BUONA-VISTA), and on YouTube-UGC, from ≈0.29 to 0.525 (Wu et al., 2023).
  • BVQI (with SAQI) achieves SRCC ≈ 0.72 on KoNViD-1k (compared to 0.64 for TLVQM) and ≈ 0.76 on LIVE-VQC (compared to ≈0.59), with further fine-tuning via BVQI-Local boosting scores as high as 0.80 (Wu et al., 2023).
  • The semantic branch contributes significant performance gains, particularly on datasets dominated by aesthetic or authentic distortions; ablation reduces SRCC by 0.05–0.08.
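For readers unfamiliar with the two reported statistics, the sketch below computes PLCC as the Pearson correlation and SRCC as the Pearson correlation of average ranks. The predicted scores and MOS values are invented for illustration and are unrelated to the results quoted above.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank-order correlation coefficient (SRCC)."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical predicted quality scores and mean opinion scores for four videos.
predicted = [0.2, 0.5, 0.9, 0.4]
mos = [1.5, 3.0, 4.5, 3.2]
srcc = spearman(predicted, mos)
plcc = pearson(predicted, mos)
```

SRCC depends only on ordering, so any monotonic rescaling of the predictions leaves it unchanged, whereas PLCC is sensitive to the shape of the mapping.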

Cross-dataset testing indicates superior robustness: the zero-shot indices show no degradation across datasets, whereas domain-specific supervised methods lose accuracy when tested outside their training distribution.

7. Advantages, Limitations, and Applications

By leveraging text-prompted semantics in a feed-forward CLIP architecture, SAQI and its variants:

  • Accurately capture contributions to visual quality from high-level semantic factors, highly correlated with human perception in naturalistic video scenarios.
  • Operate without dependence on human opinion scores or retraining, enabling plug-and-play deployment for quality monitoring, camera benchmarking, and perceptually tuned streaming.
  • Are limited by the domain scope of CLIP; scenarios with unusual modalities or content outside CLIP’s training distribution (e.g., animation, niche topics) can yield degraded correspondence.
  • Incur greater computational overhead than hand-crafted statistical indices, as CLIP encoding must be performed for the sampled frames (and patches, for localized variants), trading efficiency for increased semantic sensitivity.

SAQI, as part of aggregated indices like BUONA-VISTA and BVQI, enables high-fidelity, dataset-robust, zero-shot video quality assessment, bridging the semantic gap in traditional technical VQA and approaching the reliability of opinion-driven VQA without requiring human annotation (Wu et al., 2023, Wu et al., 2023).
