Semantic Affinity Quality Index (SAQI)

Updated 27 April 2026

SAQI is a zero-shot video quality metric that quantifies the semantic alignment between video content and quality prompts using CLIP models.
It computes differential affinities between positive and negative prompts to capture high-level perceptual cues beyond pixel-level distortions.
SAQI is integrated with classical spatial and temporal indices to create unified metrics like BUONA-VISTA and BVQI, demonstrating robust in-the-wild performance.

The Semantic Affinity Quality Index (SAQI) is an opinion-unaware (zero-shot) video quality assessment metric that quantifies the alignment between a video's visual content and human-interpretable, quality-related semantic prompts via contrastive language-image pre-training (CLIP) models. Unlike traditional no-reference video quality metrics, which focus primarily on technical distortions in pixel or frequency space, SAQI directly models high-level semantic factors, such as “pleasantness” or “aesthetic value”, by estimating how closely a video matches descriptions like “high quality” or “a beautiful landscape” in the shared CLIP embedding space. Fused with classical spatial and temporal naturalness metrics, SAQI forms the basis of unified VQA indices (e.g., BUONA-VISTA, BVQI) that assess “in-the-wild” videos robustly, surpassing previous zero-shot methods and rivaling supervised counterparts without training on human opinion scores (Wu et al., 2023, Wu et al., 2023).

1. Mathematical Definition and Computation

SAQI leverages the joint vision-text embedding capabilities of CLIP. Given a video $V$ downsampled to $N$ frames $\{V_i\}_{i=0}^{N-1}$ at a canonical spatial scale (typically $224 \times 224$ pixels), each frame is encoded into a $d$-dimensional feature $f_{v,i} = E_v(V_i)$ using the CLIP visual encoder. For a given text prompt $T$, the text encoder yields $f_t^T = E_t(T)$.

The core of SAQI is the computation of the average cosine similarity between frame and text embeddings:

$A(V, T) = \frac{1}{N} \sum_{i=0}^{N-1} \frac{f_{v,i} \cdot f_t^T}{ \| f_{v,i} \| \| f_t^T \| }$

To reflect quality, a differential affinity is computed between a positive prompt $T_+$ and a negative prompt $T_-$:

$DA(V, T_+, T_-) = A(V, T_+) - A(V, T_-)$

A positive $DA$ indicates that the video content is semantically closer to “good quality” than “bad quality” with respect to the specified prompts.

For each of several prompt pairs, DA is mapped onto $[0, 1]$ via a sigmoid:

$\sigma(DA) = \frac{1}{1 + e^{-DA}}$

Empirically, prompts such as (“high quality”, “low quality”) and (“a good photo”, “a bad photo”) are used to capture distinct aspects of perceived quality (Wu et al., 2023).
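The computation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the toy 3-dimensional vectors stand in for real CLIP embeddings, the function names are hypothetical, and averaging the sigmoid outputs over prompt pairs is an assumption about how the pairs are pooled.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def affinity(frame_feats, text_feat):
    """A(V, T): cosine similarity to the prompt, averaged over frames."""
    return sum(cosine(f, text_feat) for f in frame_feats) / len(frame_feats)

def differential_affinity(frame_feats, pos_feat, neg_feat):
    """Affinity to the positive prompt minus affinity to the negative prompt."""
    return affinity(frame_feats, pos_feat) - affinity(frame_feats, neg_feat)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def saqi(frame_feats, prompt_pairs):
    """Sigmoid-mapped differential affinity, averaged over prompt pairs.

    Mean pooling over pairs is an assumption made for this sketch.
    """
    scores = [
        sigmoid(differential_affinity(frame_feats, pos, neg))
        for pos, neg in prompt_pairs
    ]
    return sum(scores) / len(scores)

# Toy stand-ins for CLIP embeddings of two frames and one prompt pair.
frames = [[1.0, 0.2, 0.0], [0.9, 0.1, 0.1]]
pairs = [([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])]  # (positive, negative)
score = saqi(frames, pairs)  # above 0.5 because the frames lean positive
```

In practice the frame and prompt vectors would come from a CLIP visual and text encoder; only the arithmetic after encoding is shown here.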

In alternative formulations (Wu et al., 2023), SAQI is computed as a mean-pooled, prompt-averaged affinity across all frames:

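$\mathrm{SAQI}(V) = \frac{1}{N K} \sum_{i=0}^{N-1} \sum_{k=1}^{K} \hat{f}_{t,k} \cdot \hat{f}_{v,i}$

where $\hat{f}_{t,k}$ and $\hat{f}_{v,i}$ are $\ell_2$-normalized CLIP text and image vectors, respectively, and $K$ is the number of prompts.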

2. Prompt Design, Frame Sampling, and CLIP Encoding

Precise selection of text prompts is crucial, as different prompts are sensitive to different quality attributes—for example, “high quality” vs “low quality” is sensitive to fine technical distortions, while “a good photo” vs “a bad photo” reflects overall aesthetic or compositional factors. Multiple prompts can be used and their affinities pooled via mean or summation (Wu et al., 2023, Wu et al., 2023).

Videos are uniformly sampled to a fixed number of frames $N$ and resized to match the input scale of CLIP’s pre-training. The visual encoder (ResNet-50 backbone in CLIP) processes each frame to generate embedding vectors, while textual prompts are encoded via CLIP’s text transformer. All feature vectors are normalized to unit length, so their dot products are cosine similarities and lie in $[-1, 1]$.
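A rough sketch of the two preprocessing steps, uniform frame sampling and unit-length normalization, is given below. The helper names and the choice to include both the first and the last frame are assumptions made for illustration; the source does not specify the exact sampling rule.

```python
import math

def uniform_frame_indices(total_frames, num_samples):
    """Evenly spaced frame indices spanning the first through the last frame."""
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("both counts must be positive")
    if num_samples >= total_frames:
        return list(range(total_frames))  # short video: keep every frame
    if num_samples == 1:
        return [0]
    step = (total_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

indices = uniform_frame_indices(10, 4)
unit = l2_normalize([3.0, 4.0])
```

Once both the frame and prompt vectors have unit length, a plain dot product gives the cosine similarity directly.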

The method is resilient to spatial resolution, as semantic content and global composition persist under downsampling.

3. Localized SAQI (SAQI-Local)

SAQI-Local extends the global affinity measure to spatial subregions within each frame. Patch-level features are extracted from intermediate layers of the CLIP visual encoder ($f_{v,i,p}$ for patch $p$ of frame $i$), enabling patch-prompt affinities:

$A_{i,p}(T) = \frac{f_{v,i,p} \cdot f_t^T}{\| f_{v,i,p} \| \| f_t^T \|}$

For each frame, prompt-wise averages over the $K$ prompts $T_1, \dots, T_K$ are computed, and then spatial maxima (or means) are taken to aggregate semantic salience:

$S_i = \max_{p} \frac{1}{K} \sum_{k=1}^{K} A_{i,p}(T_k)$

The overall video-level score is the temporal average:

$\mathrm{SAQI\text{-}Local}(V) = \frac{1}{N} \sum_{i=0}^{N-1} S_i$

This local version facilitates sensitivity to semantically rich regions, such as a “beautiful subject” present in only part of a frame (Wu et al., 2023).
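The three pooling stages of the local variant (average over prompts, pool over patches, average over frames) might look as follows. This is a sketch under stated assumptions: the nested-list layout `patch_feats[frame][patch]`, the function names, and the use of raw patch-prompt cosine affinities are illustrative choices, and the toy 2-dimensional vectors stand in for real CLIP patch features.

```python
import math
import statistics

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def saqi_local(patch_feats, prompt_feats, spatial_pool=max):
    """Video-level local score.

    patch_feats[i][p] is the feature of patch p in frame i (hypothetical layout);
    spatial_pool is max by default, or statistics.fmean for mean pooling.
    """
    frame_scores = []
    for patches in patch_feats:
        # Stage 1: prompt-wise average affinity for every patch.
        per_patch = [
            sum(cosine(patch, t) for t in prompt_feats) / len(prompt_feats)
            for patch in patches
        ]
        # Stage 2: pool over the spatial positions of this frame.
        frame_scores.append(spatial_pool(per_patch))
    # Stage 3: temporal average over frames.
    return sum(frame_scores) / len(frame_scores)

# Two frames with two patches each; one prompt embedding.
patch_feats = [
    [[1.0, 0.0], [0.0, 1.0]],  # one patch matches the prompt
    [[0.0, 1.0], [0.0, 2.0]],  # no patch matches
]
prompts = [[1.0, 0.0]]
max_pooled = saqi_local(patch_feats, prompts)
mean_pooled = saqi_local(patch_feats, prompts, spatial_pool=statistics.fmean)
```

Max pooling lets a single well-matched region dominate the frame score, which is the behavior the text attributes to the local variant; mean pooling dilutes that region across the whole frame.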

4. Aggregation with Low-Level Quality Indices

SAQI is designed to capture high-level semantic factors and is typically aggregated with established technical indices—specifically, spatial (NIQE) and temporal (TPQI or TLVQM) no-reference quality metrics—into a single unified quality index.

Aggregation workflow:

NIQE and TPQI raw scores are first normalized by subtracting their empirical means and dividing by standard deviation (Gaussian normalization). As these indices assign lower values to higher quality, a sigmoid rescaling (e.g., $\frac{1}{1 + e^{z}}$ for a normalized score $z$) maps scores onto $[0, 1]$ with the correct orientation (Wu et al., 2023).
Each index is thus re-mapped into $[0, 1]$, with higher values indicating better quality.
Fusion for the BUONA-VISTA index is simple additive:

$Q_{\text{BUONA-VISTA}} = Q_A + Q_S + Q_T$

where $Q_A$ is SAQI, $Q_S$ is the remapped NIQE (spatial index), and $Q_T$ is the remapped TPQI (temporal index). The final score spans $[0, 3]$ and weights all factors equally. In the alternative BVQI, a weighted linear combination with tuned weights is used (Wu et al., 2023).
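The aggregation workflow can be illustrated with a short sketch. The raw values below are made up, the function names are hypothetical, and computing the mean and standard deviation over the batch of videos being scored is an assumption about where the "empirical" statistics come from.

```python
import math
import statistics

def remap_lower_is_better(raw_scores):
    """Gaussian-normalize raw scores, then flip and squash them with a sigmoid.

    Lower raw values (better quality for NIQE-style indices) map to values
    closer to 1; higher raw values map closer to 0.
    """
    mean = statistics.fmean(raw_scores)
    std = statistics.pstdev(raw_scores)
    if std == 0:
        return [0.5] * len(raw_scores)  # all videos identical on this index
    return [1.0 / (1.0 + math.exp((s - mean) / std)) for s in raw_scores]

def fuse_additive(semantic, spatial, temporal):
    """Equal-weight sum of the three remapped indices, one value per video."""
    return [a + s + t for a, s, t in zip(semantic, spatial, temporal)]

# Hypothetical per-video values for three videos.
saqi_scores = [0.7, 0.5, 0.3]   # already in [0, 1], higher is better
niqe_raw = [3.0, 5.0, 7.0]      # lower is better
tpqi_raw = [0.2, 0.4, 0.9]      # lower is better

spatial = remap_lower_is_better(niqe_raw)
temporal = remap_lower_is_better(tpqi_raw)
unified = fuse_additive(saqi_scores, spatial, temporal)
```

Because each term lies between 0 and 1, the fused score stays between 0 and 3, and no single index can dominate the sum by virtue of its raw scale.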

5. Robustness, Fine-Tuning, and Model Variants

The zero-shot nature of SAQI, BUONA-VISTA, and BVQI means they do not rely on mean opinion scores (MOS) or subjective database-specific labels, directly addressing limitations of supervised VQA methods with respect to generalization and cost.

However, BVQI-Local can be further fine-tuned on a small, labeled validation set by:

Jointly optimizing (a) prompt token embeddings and (b) fusion weights and sigmoid parameters via MSE loss between predicted BVQI-Local scores and MOS, using gradient-based optimization with the Adam optimizer (Wu et al., 2023).
This retains the efficiency and generalizability of the zero-shot index while accommodating domain- or dataset-specific semantic distributions.
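To make the fusion-weight part of this fine-tuning concrete, the sketch below fits linear fusion weights by minimizing MSE against labels. It is deliberately simplified and is not the procedure from the source: it uses plain full-batch gradient descent instead of Adam, it tunes only the fusion weights and a bias (not prompt embeddings or sigmoid parameters), and the feature values and labels are invented for the example.

```python
def predict(x, weights, bias):
    """Weighted linear fusion of the per-video indices."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def mse(features, labels, weights, bias):
    errors = [(predict(x, weights, bias) - y) ** 2 for x, y in zip(features, labels)]
    return sum(errors) / len(errors)

def fit_fusion_weights(features, labels, lr=0.1, epochs=2000):
    """Full-batch gradient descent on MSE (a stand-in for Adam in this sketch)."""
    n = len(features)
    n_feat = len(features[0])
    weights = [1.0] * n_feat  # start from the equal-weight zero-shot fusion
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_feat
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = predict(x, weights, bias) - y
            for k in range(n_feat):
                grad_w[k] += 2.0 * err * x[k] / n
            grad_b += 2.0 * err / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias

# Hypothetical (semantic, spatial, temporal) indices and opinion labels.
features = [(0.8, 0.6, 0.7), (0.3, 0.4, 0.2), (0.5, 0.9, 0.4), (0.1, 0.2, 0.3)]
labels = [2.0 * a + 0.5 * s + 1.0 * t for a, s, t in features]

before = mse(features, labels, [1.0, 1.0, 1.0], 0.0)
weights, bias = fit_fusion_weights(features, labels)
after = mse(features, labels, weights, bias)
```

Starting from equal weights means the untuned model coincides with the additive zero-shot fusion, so any improvement in `after` over `before` comes only from the labeled data.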

6. Empirical Validation and Comparative Performance

On “in-the-wild” VQA datasets (LIVE-VQC, KoNViD-1k, YouTube-UGC, CVD2014), aggregation indices incorporating SAQI demonstrate state-of-the-art zero-shot performance:

BUONA-VISTA improves the Spearman rank-order (SRCC) and Pearson linear (PLCC) correlation coefficients by at least 20% (and up to 80%) relative to the previous best opinion-unaware methods. For example, on KoNViD-1k, SRCC improves from 0.556 (TPQI) to 0.760 (BUONA-VISTA), and on YouTube-UGC, from ≈0.29 to 0.525 (Wu et al., 2023).
BVQI (with SAQI) achieves SRCC ≈ 0.72 on KoNViD-1k (compared to 0.64 for TLVQM) and ≈ 0.76 on LIVE-VQC (compared to ≈0.59), with further fine-tuning via BVQI-Local boosting scores as high as 0.80 (Wu et al., 2023).
The semantic branch contributes significant performance gains, particularly on datasets dominated by aesthetic or authentic distortions; ablation reduces SRCC by 0.05–0.08.
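For readers unfamiliar with the two correlation measures reported above, the following sketch computes PLCC and SRCC between predicted scores and opinion scores. The data values are invented, and the helper names are hypothetical; SRCC is obtained here as the Pearson correlation of average ranks, which handles ties.

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank-order correlation coefficient (SRCC)."""
    return pearson(ranks(x), ranks(y))

# Hypothetical predicted quality scores and mean opinion scores.
predicted = [0.2, 0.5, 0.9, 0.4]
mos = [1.0, 3.0, 4.5, 2.0]
srcc = spearman(predicted, mos)
plcc = pearson(predicted, mos)
```

SRCC depends only on how the videos are ordered, so it is unaffected by any monotonic rescaling of the predicted scores, whereas PLCC is sensitive to how linearly the predictions track the opinion scores.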

Cross-dataset testing indicates superior robustness: because the indices are not trained on any particular dataset, their performance does not degrade across datasets, whereas domain-specific supervised methods generalize poorly when tested outside their training distribution.

7. Advantages, Limitations, and Applications

By leveraging text-prompted semantics in a feed-forward CLIP architecture, SAQI and its variants:

Accurately capture contributions to visual quality from high-level semantic factors, highly correlated with human perception in naturalistic video scenarios.
Operate without dependence on human opinion scores or retraining, enabling plug-and-play deployment for quality monitoring, camera benchmarking, and perceptually tuned streaming.
Are limited by the domain scope of CLIP; scenarios with unusual modalities or content outside CLIP’s training distribution (e.g., animation, niche topics) can yield degraded correspondence.
Incur greater computational overhead than hand-crafted statistical indices, as CLIP encoding must be performed for the sampled frames (and patches, for localized variants), trading efficiency for increased semantic sensitivity.

SAQI, as part of aggregated indices like BUONA-VISTA and BVQI, enables high-fidelity, dataset-robust, zero-shot video quality assessment, bridging the semantic gap in traditional technical VQA and approaching the reliability of opinion-driven VQA without requiring human annotation (Wu et al., 2023, Wu et al., 2023).

References (2)

Exploring Opinion-unaware Video Quality Assessment with Semantic Affinity Criterion (2023)

Towards Robust Text-Prompted Semantic Criterion for In-the-Wild Video Quality Assessment (2023)
