DAGR-VQA: Global Registers for Video Quality

Updated 23 January 2026
  • The paper introduces global register-tokens that inject scene-level context into a 3D-UNet, enabling dynamic attention for temporally adaptive saliency maps.
  • It integrates a lightweight temporal transformer with convex saliency fusion to efficiently predict mean opinion scores in real time.
  • Comprehensive ablations confirm that dynamic attention and register-tokens significantly boost saliency metrics and overall VQA performance compared to static baselines.

Dynamic Attention with Global Registers for Video Quality Assessment (DAGR-VQA) is a no-reference video quality assessment framework that introduces learnable global register-tokens directly into a convolutional backbone, enabling dynamic, spatio-temporal attention mechanisms inspired by the human visual system (HVS). DAGR-VQA produces temporally adaptive saliency maps, integrates them with raw video frames, and uses a lightweight temporal transformer to deliver perceptually consistent, state-of-the-art video quality predictions at real-time inference speeds (Mithila et al., 16 Jan 2026).

1. Architectural Foundations

The core architecture of DAGR-VQA is built on a 3D-UNet encoder–decoder structure that operates on video clips comprising $T$ frames, each of spatial size $H \times W$ with $C=3$ RGB channels. A principal innovation is the injection of $N$ learnable register-tokens at the very first convolutional layer. These tokens serve as global context carriers, integrating scene-level priors into the convolutional feature extraction process.

Formally, let $N$ denote the number of register-tokens (default $N=4$), each of embedding dimension $d$ matching the initial channel width. The register-token tensor $R \in \mathbb{R}^{1\times N \times d \times 1 \times 1}$ is initialized as $R \sim \mathcal{N}(0,1)$ and broadcast via a learned 3D convolution to align with the input video volume:

$$R' = T(R) = \mathrm{Conv3D}(R) \in \mathbb{R}^{N \times T \times H \times W}$$

The input is then augmented as

$$V_{\text{aug}} = \text{Concat}[V, R']$$

with $V$ the original video tensor. The subsequent convolutional layers operate over both pixel and global register-token channels, embedding dynamic scene context at each hierarchical layer.
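The augmentation step can be sketched with plain Python lists on a tiny clip. This is a minimal sketch, not the authors' implementation: the learned `Conv3D` broadcast is replaced by a hypothetical stand-in that averages each token's $d$-dimensional embedding to one scalar and tiles it over $T \times H \times W$, and all function names are illustrative.

```python
import random

def broadcast_tokens(tokens, T, H, W):
    """Stand-in for the learned Conv3D: one constant T x H x W volume per token.

    Assumption: each token's d-dim embedding is averaged to a scalar;
    the real model learns this mapping.
    """
    volumes = []
    for emb in tokens:
        value = sum(emb) / len(emb)
        volumes.append([[[value] * W for _ in range(H)] for _ in range(T)])
    return volumes  # shape: N x T x H x W

def augment(video, volumes):
    """Channel-wise concatenation: (C x T x H x W) with (N x T x H x W)."""
    return video + volumes

def shape(x):
    """Return the nested-list shape as a tuple."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return tuple(dims)

random.seed(0)
C, N, d, T, H, W = 3, 4, 8, 2, 3, 5  # tiny illustrative sizes
video = [[[[0.0] * W for _ in range(H)] for _ in range(T)] for _ in range(C)]
tokens = [[random.gauss(0.0, 1.0) for _ in range(d)] for _ in range(N)]  # R ~ N(0, 1)
v_aug = augment(video, broadcast_tokens(tokens, T, H, W))
print(shape(v_aug))  # (7, 2, 3, 5): C + N channels
```

The only point the sketch makes is the bookkeeping: after concatenation the first layer sees $C + N$ input channels, so every later convolution mixes pixel and register-token information.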

2. Dynamic Attention and Saliency Prediction

After feature encoding by the UNet3D encoder $E(\cdot)$, a bottleneck representation $Z = E(V_{\text{aug}})$ is obtained. An attention mask $A(Z) = \sigma(\mathrm{Conv3D}(Z))$, where $\sigma$ is the sigmoid function, gates the output of a small 3D convolutional block $B(Z)$ through element-wise multiplication. This yields refined bottleneck features:

$$Z' = B(Z) \odot A(Z)$$

The decoder $D(\cdot)$ reconstructs per-frame dynamic saliency maps as

$$\hat{S} = D(Z'), \quad S_t = \hat{S}[:,t,:,:], \quad t=1,\dots,T$$

Here, the global register-tokens, through gradient-based training, serve as a compact 'scene memory' that is broadcast through each forward pass, dynamically biasing saliency computation without explicit motion estimation.
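The gating step $Z' = B(Z) \odot A(Z)$ can be illustrated on a flattened feature vector. This is a minimal sketch under stated assumptions: the values standing in for $B(Z)$ and for the pre-sigmoid output of the attention convolution are hypothetical, and no convolution is actually computed.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate(features, mask_logits):
    """Z' = B(Z) * A(Z) element-wise, with A(Z) = sigmoid(mask_logits)."""
    if len(features) != len(mask_logits):
        raise ValueError("features and mask logits must have the same length")
    return [b * sigmoid(a) for b, a in zip(features, mask_logits)]

b_of_z = [2.0, -1.0, 0.5, 4.0]         # hypothetical outputs of B(Z)
mask_logits = [10.0, 0.0, -10.0, 0.0]  # hypothetical pre-sigmoid attention outputs
z_prime = gate(b_of_z, mask_logits)
print([round(v, 3) for v in z_prime])  # [2.0, -0.5, 0.0, 2.0]
```

Large positive logits pass a feature almost unchanged, large negative logits suppress it, and a logit of zero halves it, which is how the mask re-weights the bottleneck without discarding its sign.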

3. Saliency Integration and Video Quality Assessment Pipeline

Saliency prediction is integrated within the overall video quality assessment pipeline via spatial fusion and temporal regression:

  • For each input frame $I_t \in \mathbb{R}^{3\times H \times W}$, the predicted saliency $S_t$ is fused using a convex combination:

$$F_{\text{raw}} = (1-\alpha) I_t + \alpha (I_t \odot S_t), \qquad \alpha \in [0,1]$$

  • Spatial features are extracted by a ResNet-50 backbone $f_\varphi$ with global average pooling (GAP):

$$F_t = \mathrm{GAP}[f_\varphi(F_{\text{raw}})] \in \mathbb{R}^d$$

  • Temporal structure is encoded by adding sinusoidal positional encodings $P_t$ and passing $\tilde F_t = F_t + P_t$ through $L=2$ temporal transformer encoder layers:

$$Z_t = \mathrm{MHSA}(\tilde F_t, \{\tilde F\}_{1}^{T}) + \tilde F_t$$

$$Y_t = \mathrm{LN}[\mathrm{FFN}(Z_t) + Z_t]$$

  • The aggregate spatial and transformer features,

$$y_s = \frac{1}{T} \sum_{t=1}^T F_t, \quad y_t = \frac{1}{T} \sum_{t=1}^T Y_t$$

are concatenated and regressed to the mean opinion score (MOS):

$$\hat{y} = f_\omega([y_s; y_t])$$
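The non-learned parts of this pipeline (convex fusion, positional encoding, and temporal averaging) can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the ResNet-50 backbone and the transformer layers are omitted, the positional encoding uses the common sine/cosine form with base 10000 (an assumption, since the exact variant is not specified here), positions are 0-based, and the feature vectors are hypothetical.

```python
import math

def fuse(pixel, saliency, alpha=0.5):
    """Convex fusion for one pixel value: (1 - alpha) * I + alpha * (I * S)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * pixel + alpha * (pixel * saliency)

def positional_encoding(t, d):
    """Sinusoidal encoding for position t and even dimension d (assumed form)."""
    pe = []
    for i in range(0, d, 2):
        angle = t / (10000.0 ** (i / d))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def temporal_mean(vectors):
    """Average a list of T equal-length vectors over time."""
    T = len(vectors)
    return [sum(col) / T for col in zip(*vectors)]

print(fuse(0.8, 1.0))  # 0.8: fully salient pixels are unchanged
print(fuse(0.8, 0.0))  # 0.4: non-salient pixels are scaled by (1 - alpha)

frame_features = [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0]]  # hypothetical F_t, d = 4
encoded = [[f + p for f, p in zip(F, positional_encoding(t, 4))]
           for t, F in enumerate(frame_features)]  # F_t + P_t, input to the transformer
y_s = temporal_mean(frame_features)
print(y_s)  # [2.0, 2.0, 2.0, 2.0]
```

With $\alpha = 0.5$ the fusion never zeroes out a region: a pixel with saliency 0 still keeps half its intensity, so the backbone sees the whole frame with salient regions emphasized.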

4. Supervised Training Protocols

4.1. Saliency Pre-training

Saliency predictors are pre-trained on ground-truth maps using a combined Kullback–Leibler (KL) divergence and Pearson correlation objective:

$$L_\text{sal} = \gamma L_\text{KL} + L_\text{CC}, \quad \gamma = 0.01$$

where

$$L_\text{KL} = \frac{1}{N} \sum_j S_j \log \frac{S_j}{\hat{S}_j}$$

$$L_\text{CC} = - \frac{\sum_j (\hat{S}_j - \mu_{\hat{S}})(S_j - \mu_S)}{\sqrt{\sum_j (\hat{S}_j - \mu_{\hat{S}})^2 \cdot \sum_j (S_j - \mu_S)^2}}$$

The optimizer is Adam with learning rate $5 \times 10^{-3}$ and batch size $4$ for $180$ epochs on DHF1K.
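The combined saliency objective can be written out directly from the two formulas. This is a minimal sketch on flattened maps stored as Python lists; the small `eps` term is an added numerical-stability assumption, not part of the stated loss, and the example maps are hypothetical.

```python
import math

def kl_loss(target, pred, eps=1e-8):
    """(1/N) * sum_j S_j * log(S_j / S_hat_j); eps is an added stability assumption."""
    n = len(target)
    return sum(s * math.log((s + eps) / (p + eps)) for s, p in zip(target, pred)) / n

def cc_loss(target, pred):
    """Negative Pearson correlation between predicted and ground-truth maps.

    Undefined (division by zero) if either map is constant.
    """
    n = len(target)
    mu_s, mu_p = sum(target) / n, sum(pred) / n
    cov = sum((p - mu_p) * (s - mu_s) for s, p in zip(target, pred))
    var_p = sum((p - mu_p) ** 2 for p in pred)
    var_s = sum((s - mu_s) ** 2 for s in target)
    return -cov / math.sqrt(var_p * var_s)

def saliency_loss(target, pred, gamma=0.01):
    """L_sal = gamma * L_KL + L_CC."""
    return gamma * kl_loss(target, pred) + cc_loss(target, pred)

target = [0.1, 0.2, 0.3, 0.4]               # hypothetical flattened ground-truth map
print(saliency_loss(target, target))        # close to -1: perfect prediction
print(saliency_loss(target, target[::-1]))  # close to +1: anti-correlated prediction
```

Because $\gamma = 0.01$, the correlation term dominates the total, and the KL term acts as a small penalty on mismatched mass.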

4.2. VQA Fine-tuning

Fine-tuning for MOS prediction employs a loss combining $\ell_1$ regression and Spearman rank correlation:

$$L_\text{VQA} = L_1 + \beta L_\text{corr}, \qquad \beta = 0.1$$

where

$$L_1 = |\hat{y} - y|, \qquad L_\text{corr} = 1 - \rho(\hat{y}, y)$$

Optimization uses Adam with learning rate $1\times 10^{-5}$ (cosine annealing), batch size $5$, and $300$ training epochs. The fusion weight $\alpha$ is fixed at $0.5$.
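The value of the fine-tuning loss on one batch can be computed as below. This is a sketch for evaluating the formula only: it uses exact ranks, which are not differentiable, so actual training would need a differentiable surrogate for $\rho$ that is not described here. Averaging the $\ell_1$ term over the batch is also an assumption, and the scores are hypothetical.

```python
def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def vqa_loss(pred, mos, beta=0.1):
    """Mean absolute error plus beta * (1 - Spearman rho) over a batch."""
    l1 = sum(abs(p - y) for p, y in zip(pred, mos)) / len(mos)
    rho = pearson(ranks(pred), ranks(mos))  # Spearman = Pearson on ranks
    return l1 + beta * (1.0 - rho)

mos = [3.1, 4.2, 2.0, 4.8, 3.6]   # hypothetical ground-truth scores (batch of 5)
pred = [3.0, 4.0, 2.5, 4.9, 3.3]  # hypothetical predictions with the same ordering
print(round(vqa_loss(pred, mos), 3))  # 0.24: rank term vanishes, only L1 remains
```

When the predicted ordering matches the ground truth, the correlation term is zero and the loss reduces to the mean absolute error; the rank term only penalizes mis-ordered videos.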

5. Empirical Performance and Computational Analysis

DAGR-VQA demonstrates state-of-the-art accuracy on four large-scale user-generated content (UGC) benchmarks, substantially outperforming static-attention and non-saliency baselines. The following table summarizes performance (PLCC/SRCC):

| Dataset | PLCC | SRCC |
|---|---|---|
| LSVQ | 0.892 | 0.907 |
| KoNViD-1k | 0.863 | 0.896 |
| LIVE-VQC | 0.915 | 0.886 |
| YouTube-UGC | 0.913 | 0.910 |
| Average | 0.896 | 0.900 |

Computational efficiency is a key strength, with DAGR-VQA requiring $59$ GFLOPs for an eight-frame $224 \times 398$ clip (versus $141$ GFLOPs for ViViT) and achieving $387.7$ FPS at 1080p resolution (RTX A5000, $2400$-frame normalization). The complexity scales as $\mathcal{O}(T N d + T^2 d)$, with spatial terms dominating for $T \ll N$.
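A quick calculation shows how the two terms of the complexity expression trade off. This is a sketch only: the $(T, N, d)$ triples are hypothetical values chosen to show both regimes, and the counts are raw products of the symbols, not measured FLOPs.

```python
def complexity_terms(T, N, d):
    """Return the two terms of O(T*N*d + T^2*d) as raw operation counts."""
    return T * N * d, T * T * d

# Hypothetical settings: N much larger than T, N smaller than T, and long clips.
for T, N, d in [(8, 196, 256), (8, 4, 256), (64, 4, 256)]:
    linear, quadratic = complexity_terms(T, N, d)
    dominant = "T*N*d" if linear > quadratic else "T^2*d"
    print(f"T={T:3d} N={N:3d}: T*N*d={linear:8d}  T^2*d={quadratic:8d}  dominant: {dominant}")

print(round(141 / 59, 2))  # 2.39: ViViT-to-DAGR-VQA GFLOPs ratio from the figures above
```

The ratio of the two terms is simply $N / T$, so the $T N d$ term dominates exactly when $T < N$, and the quadratic temporal term only takes over for clips with more frames than $N$.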

6. Ablations and Mechanistic Insights

Ablation studies empirically validate the contributions of register-tokens and dynamic saliency mechanisms:

  • Removing register-tokens (i.e., static attention only) degrades saliency metrics (e.g., NSS drops from $3.683$ to $2.945$, CC from $0.704$ to $0.640$, AUC-J from $0.942$ to $0.904$).
  • Adding $N=4$ tokens yields up to $25\%$ relative improvement in standard saliency scores.
  • For VQA, the dynamic saliency + register-token configuration exceeds both static and non-saliency models on every reported SRCC value.
  • Qualitative tracking shows temporally consistent saliency following moving objects.
  • In cross-database transfer (e.g., LSVQ $\rightarrow$ KoNViD), DAGR-VQA achieves the highest median SRCC among five strong baselines, indicating robust generalization.
  • Performance peaks at $N=4$ register-tokens; excessive tokens yield diminishing returns.
  • When the fusion weight $\alpha$ in the spatial combination step is varied, $\alpha = 0.5$ gives the best accuracy.
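The "up to $25\%$" figure can be checked against the ablation numbers quoted above. The short script below computes the relative gain for each saliency metric; the function name is illustrative.

```python
def relative_gain(without_tokens, with_tokens):
    """Relative improvement in percent when register-tokens are added."""
    return 100.0 * (with_tokens - without_tokens) / without_tokens

# (without register-tokens, with register-tokens), as quoted in the ablation list
ablation = {"NSS": (2.945, 3.683), "CC": (0.640, 0.704), "AUC-J": (0.904, 0.942)}
for metric, (static, dynamic) in ablation.items():
    print(f"{metric}: {relative_gain(static, dynamic):.1f}%")
# NSS: 25.1%
# CC: 10.0%
# AUC-J: 4.2%
```

The $25\%$ upper bound corresponds to NSS; the gains on CC and AUC-J are smaller, which is expected for metrics that are already close to their ceiling.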

7. Significance and Comparative Context

DAGR-VQA is the first no-reference VQA method to embed register-tokens as global memory directly into convolutional feature extraction, rather than using saliency maps as auxiliary static inputs. This approach enables fast, HVS-inspired, temporally adaptive saliency prediction and provides a unified pipeline from spatio-temporal feature extraction to transformer-based quality regression. The method achieves $\sim 0.900$ average SRCC on four benchmarks, with computational demands suitable for real-time deployment in multimedia streaming contexts (Mithila et al., 16 Jan 2026). A plausible implication is that the register-token approach may generalize to other vision tasks requiring stable, adaptive attention mechanisms across temporal sequences.
