Encoding, Spatial & Temporal Quality Scores

Updated 8 September 2025
  • Encoding, spatial, and temporal scores are metrics that quantify video quality by evaluating spatial resolution, frame rate, and quantization effects.
  • The Q-STAR model uses inverse-exponential functions to merge the three dimensions, achieving high prediction accuracy against subjective scores.
  • The framework guides video quality optimization by balancing codec settings, reducing parameter complexity, and enabling scalable streaming adaptation.

Encoding, spatial, and temporal scores are central components in the objective and subjective analysis of video quality, video compression, and video-based perceptual evaluation frameworks. These scores quantify the effects of encoding parameters—principally spatial resolution, temporal resolution, and quantization (or amplitude) resolution—on the perceived quality of video, and underpin models used in rate-quality optimization and scalable video adaptation.

1. STAR Dimensions: Definitions and Quantitative Characterization

Encoding in video systems fundamentally refers to the transformation of raw video data via spatial downsampling, temporal sub-sampling, and quantization. The STAR model (Spatial, Temporal, and Amplitude Resolution) provides a comprehensive abstraction by linking three orthogonal dimensions:

  • Spatial Resolution (SR): number of pixels per frame ($s$), typically parameterized as the fraction $s/s_{\max}$, with $s_{\max}$ the highest available resolution.
  • Temporal Resolution (TR): frame rate ($t$), represented as $t/t_{\max}$, where $t_{\max}$ is the reference maximum frame rate.
  • Quantization Stepsize (QS): amplitude resolution controlled via the encoder quantization stepsize ($q$), related to QP in H.264 by $QP(q) = 4 + 6\log_2 q$.
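As a concrete illustration of these parameterizations, the QP/stepsize mapping and the normalized STAR coordinates can be sketched in a few lines of Python (function names here are illustrative, not from any library):

```python
import math

def qp_from_stepsize(q: float) -> float:
    """H.264 mapping from quantization stepsize q to QP: QP(q) = 4 + 6*log2(q)."""
    return 4 + 6 * math.log2(q)

def stepsize_from_qp(qp: float) -> float:
    """Inverse mapping: q = 2**((QP - 4) / 6)."""
    return 2 ** ((qp - 4) / 6)

def star_coordinates(s, s_max, t, t_max):
    """Normalized spatial and temporal coordinates relative to the reference encoding."""
    return s / s_max, t / t_max

print(qp_from_stepsize(16))                              # → 28.0
print(star_coordinates(960 * 540, 1920 * 1080, 15, 30))  # → (0.25, 0.5)
```

Doubling the stepsize raises QP by 6, which is why QP sweeps in the validation datasets translate directly into geometric stepsize sweeps.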

Normalized objective or subjective perceptual quality (commonly denoted MOS, mean opinion score) is modeled as the product of three functions, each reflecting the drop in perceived quality caused by decreasing one STAR dimension while holding others constant. These functions are empirically fit to normalized subjective data.

2. Q-STAR Model Formulation and Parameterization

The Q-STAR model (Ou et al., 2012) defines normalized quality factors for each dimension (here expressed for spatial, but similar for temporal and quantization):

$$\textrm{NQS}(s; t, q) = \frac{\textrm{MOS}(s, t, q)}{\textrm{MOS}(s_{\max}, t, q)}$$

A generalized “inverse exponential” form is used for each, e.g., for spatial:

$$\textrm{MNQS}(s; q) = \frac{1 - \exp\left(-\alpha_s(q)\,(s/s_{\max})^{\beta_s}\right)}{1 - \exp(-\alpha_s(q))}$$

  • $\beta_s$ is a fixed shaping parameter (empirically $\beta_s = 0.74$).
  • $\alpha_s(q)$ is a decay parameter, linearly dependent on the quantization parameter $QP$ above a threshold ($QP \geq 28$): $\alpha_s(q) = \hat{\alpha}_s (\nu_1 QP + \nu_2)$, where $\nu_1 = -0.037$ and $\nu_2 = 2.25$.
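Under these reported constants, the spatial factor can be sketched as follows (the content parameter passed in is hypothetical, and the linear QP dependence is only claimed above the threshold):

```python
import math

BETA_S = 0.74             # fixed shape parameter reported for the spatial term
NU1, NU2 = -0.037, 2.25   # linear QP dependence of the spatial decay rate

def alpha_s(qp: float, alpha_hat_s: float) -> float:
    """Spatial decay rate; the linear form is reported for QP >= 28."""
    return alpha_hat_s * (NU1 * qp + NU2)

def mnqs(s: float, s_max: float, qp: float, alpha_hat_s: float) -> float:
    """Normalized spatial quality factor MNQS(s; q)."""
    a = alpha_s(qp, alpha_hat_s)
    x = (s / s_max) ** BETA_S
    return (1 - math.exp(-a * x)) / (1 - math.exp(-a))

# The factor is 1 at full resolution and falls below 1 as s shrinks
print(mnqs(1920 * 1080, 1920 * 1080, qp=30, alpha_hat_s=5.0))  # → 1.0
```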

Similarly, the temporal quality term $\textrm{MNQT}(t)$ and quantization term $\textrm{MNQQ}(q; s)$ use analogous forms, with their own decay and shaping parameters.

The complete Q-STAR model for predicted video quality is:

$$\textrm{Q-STAR}(s, t, q) = \textrm{MNQQ}(q; s_{\max}) \cdot \textrm{MNQS}(s; q) \cdot \textrm{MNQT}(t)$$

with only three content-dependent parameters ($\hat{\alpha}_s$, $\alpha_q$, $\alpha_t$) and empirically fixed shape parameters.
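The product form above can be sketched end to end. For brevity this sketch reuses one inverse-exponential template for all three axes; the paper's temporal and quantization terms differ in detail, and the decay parameters below are hypothetical:

```python
import math

def inv_exp(x: float, alpha: float, beta: float = 1.0) -> float:
    """Generic inverse-exponential quality factor; equals 1 at x = 1."""
    return (1 - math.exp(-alpha * x ** beta)) / (1 - math.exp(-alpha))

def q_star(s, t, q, s_max, t_max, q_min,
           alpha_s, alpha_t, alpha_q, beta_s=0.74):
    """Product-form Q-STAR prediction (sketch). q_min/q maps the stepsize
    to (0, 1], so coarser quantization lowers the quantization factor."""
    return (inv_exp(q_min / q, alpha_q)
            * inv_exp(s / s_max, alpha_s, beta_s)
            * inv_exp(t / t_max, alpha_t))

# Quality is 1 at the reference encoding (s_max, t_max, q_min) by construction
print(q_star(1.0, 1.0, 16.0, 1.0, 1.0, 16.0, 4.0, 3.0, 2.5))  # → 1.0
```

Because each factor is normalized to 1 at the reference setting, the product directly expresses the fractional quality retained after degrading any combination of axes.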

3. Independence and Interaction of Quality Decay Rates

A critical empirical finding is that, to first order, the decay rates with respect to each STAR axis are mutually independent:

  • The spatial ($\alpha_s$) and quantization ($\alpha_q$) decay rates do not depend on temporal resolution.
  • The temporal decay rate ($\alpha_t$) is independent of both the spatial and quantization axes.

However, the decay rate of spatial quality ($\alpha_s$) does depend on QS, introducing an interaction: the degradation in perceived quality with increased quantization becomes more severe at lower spatial resolutions. This is notably prominent for codecs such as H.264/SVC, where deblocking filters can turn high-QS artifacts into blurring that is especially objectionable at low resolutions.

Parameter independence, validated by ANOVA across multiple datasets and configuration sweeps, substantially reduces model complexity: only three parameters need to be fit per content, making the framework operationally efficient.
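Because only the decay rates are content-dependent, fitting reduces to a handful of one-dimensional estimates. A minimal sketch for the temporal parameter, using synthetic normalized scores and a placeholder shape value (neither taken from the paper):

```python
import math

def mnqt(t_norm: float, alpha_t: float, beta_t: float = 0.63) -> float:
    """Temporal quality factor; beta_t here is an illustrative placeholder."""
    return (1 - math.exp(-alpha_t * t_norm ** beta_t)) / (1 - math.exp(-alpha_t))

# Synthetic normalized MOS measurements at t/t_max in {0.25, 0.5, 1.0}
data = [(0.25, 0.55), (0.5, 0.78), (1.0, 1.0)]

def sse(alpha: float) -> float:
    """Sum of squared errors of the model against the measurements."""
    return sum((mnqt(x, alpha) - y) ** 2 for x, y in data)

# One-dimensional grid search over the single content-dependent parameter
best_alpha = min((a / 100 for a in range(50, 1001)), key=sse)
print(round(best_alpha, 2))
```

In practice a nonlinear least-squares routine would replace the grid search, but the point stands: one scalar per axis, three scalars per content.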

4. Experimental Validation and Model Accuracy

The Q-STAR model was validated using seven source sequences, each encoded at 3 spatial resolutions, 3 temporal resolutions, and 3 QP levels (yielding 189 processed video sequences, PVSs), as well as five additional independent datasets. Performance metrics include:

  • Pearson correlation coefficient (PCC) with subjective MOS: $0.991$ (highly accurate).
  • Root mean squared error (RMSE) of model predictions with respect to observer scores is acceptably low across all datasets.
  • Individual sub-models (MNQS, MNQT, MNQQ) maintained high PCC ($\geq 0.85$) and low RMSE even when only two STAR axes varied.

Statistical rigor was ensured via a two-step observer consistency refinement (removal of outliers, confirmation of normality) and analysis of confidence intervals for all model-data correlations.
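Both headline metrics are standard and easy to reproduce; a self-contained sketch with toy numbers (not the paper's data):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rmse(xs, ys):
    """Root mean squared error between predictions and observed scores."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(xs, ys)) / len(xs))

# Toy example: model predictions vs. normalized subjective MOS
pred = [0.42, 0.61, 0.78, 0.90, 1.00]
mos  = [0.40, 0.63, 0.75, 0.92, 0.99]
print(round(pearson(pred, mos), 3), round(rmse(pred, mos), 3))
```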

5. Trade-offs and Encoder/Adaptation Guidance

The Q-STAR model provides practical guidance for system- and codec-level decision making by quantifying trade-offs:

  • QS/Spatial Interaction: simultaneously lowering spatial resolution and increasing quantization should be avoided; the quality losses are supra-additive due to the interaction effect.
  • Parameter Reduction: Only three content-dependent parameters per source clip are required, streamlining real-time adaptation.
  • Rate–Quality Optimization: for any target bitrate, an optimal configuration $(s, t, q)$ can be computed to maximize predicted perceptual quality, given the model's closed functional form.
  • Scalability: The functional form accommodates both scalable video coding (e.g. H.264/SVC) and single-layer streams.

In practice, the model allows joint adaptation strategies—beyond single-axis scaling—by exploiting the near-independence of temporal, spatial, and quantization decay rates.
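The joint-adaptation idea can be sketched as a constrained grid search over all three axes. The rate model below is a made-up placeholder, and the decay parameters are hypothetical; a real system would substitute measured rate–configuration curves:

```python
import itertools
import math

def quality(s_n, t_n, q_n, a_s=4.0, a_t=3.0, a_q=2.5, b_s=0.74):
    """Product-form quality sketch over normalized (s, t, q) coordinates."""
    f = lambda x, a, b=1.0: (1 - math.exp(-a * x ** b)) / (1 - math.exp(-a))
    return f(q_n, a_q) * f(s_n, a_s, b_s) * f(t_n, a_t)

def bitrate(s_n, t_n, q_n, r_max=8000.0):
    """Toy rate model (kbps): grows with resolution, frame rate, and fidelity."""
    return r_max * s_n * t_n * q_n

def best_config(budget_kbps):
    """Exhaustive search for the feasible configuration of highest quality."""
    grid = [0.25, 0.5, 1.0]
    feasible = [(s, t, q) for s, t, q in itertools.product(grid, grid, grid)
                if bitrate(s, t, q) <= budget_kbps]
    return max(feasible, key=lambda c: quality(*c))

print(best_config(2000.0))
```

Even in this toy setting, the winner at a tight budget is typically a joint trade (reduced resolution at high fidelity, or balanced reductions) rather than degrading a single axis to its floor.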

6. Model Limitations and Extensibility

The Q-STAR model, while highly accurate and robust for perceptual video quality on mobile and standard displays, exhibits limitations inherent to parametric, content-dependent modeling:

  • It may require per-content parameter estimation, especially for extreme or highly dynamic video content outside the span of training data.
  • The interaction between spatial and quantization axes, while captured linearly in QP above threshold, may be nonlinear in edge cases (e.g., extreme compression artifacts).
  • The model presumes that perceptual impact of each STAR axis is well characterized by the inverse-exponential form; this may not generalize to highly sophisticated codecs employing advanced error resilience or temporal filtering.

Nevertheless, the core approach—separate, invertible, low-parameter functional fits for each major quality axis—remains adaptable as perceptual coding and display technologies evolve.

7. Implications for Video Quality Assessment and Standardization

By decomposing perceptual quality into encoding, spatial, and temporal components, and providing a rigorous data-driven framework for their interaction, the Q-STAR model offers a bridge between low-level encoder design, user experience measurement, and rate-distortion theory. The strong statistical validation across multiple datasets, the practical parameterizability, and the high correlation with human scores position it as a foundation for adaptive streaming optimization and future objective video quality metric design.
