DAGR-VQA: Global Registers for Video Quality

Updated 23 January 2026

The paper introduces global register-tokens that inject scene-level context into a 3D-UNet, enabling dynamic attention for temporally adaptive saliency maps.
It integrates a lightweight temporal transformer with convex saliency fusion to predict mean opinion scores efficiently in real time.
Comprehensive ablations confirm that dynamic attention and register-tokens significantly boost saliency metrics and overall VQA performance compared to static baselines.

Dynamic Attention with Global Registers for Video Quality Assessment (DAGR-VQA) is a no-reference video quality assessment framework that introduces learnable global register-tokens directly into a convolutional backbone, enabling dynamic, spatio-temporal attention mechanisms inspired by the human visual system (HVS). DAGR-VQA produces temporally adaptive saliency maps, integrates them with raw video frames, and uses a lightweight temporal transformer to deliver perceptually consistent, state-of-the-art video quality predictions at real-time inference speeds (Mithila et al., 16 Jan 2026).

1. Architectural Foundations

The core architecture of DAGR-VQA is built on a 3D-UNet encoder–decoder structure that operates on video clips comprising $T$ frames, each of spatial size $H \times W$ and $C=3$ RGB channels. A principal innovation is the injection of $N$ learnable register-tokens at the very first convolutional layer. These tokens serve as global context carriers, fundamentally integrating scene-level priors within the convolutional feature extraction process.

Formally, let $N$ denote the number of register-tokens (default $N=4$), each of embedding dimension $d$ matching the initial channel width. The register-token tensor $R \in \mathbb{R}^{1\times N \times d \times 1 \times 1}$ is initialized as $R \sim \mathcal{N}(0,1)$ and broadcast via a learned 3D convolution to align with the input video volume:

$R' = T(R) = \mathrm{Conv3D}(R) \in \mathbb{R}^{N \times T \times H \times W}$

The input is then augmented as

$V_{\text{aug}} = \text{Concat}[V, R']$

with $V$ the original video tensor. The subsequent convolutional layers operate over both pixel and global register-token channels, embedding dynamic scene context at each hierarchical layer.
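The augmentation step above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the learned Conv3D is replaced by a hypothetical per-token scale and bias followed by plain broadcasting, the embedding dimension $d$ is collapsed to one value per token, and the clip size is made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not the paper's settings).
C, T, H, W = 3, 8, 16, 16   # RGB channels, frames, height, width
N = 4                        # number of register-tokens (default in the text)

V = rng.random((C, T, H, W))             # video tensor V
R = rng.standard_normal((N, 1, 1, 1))    # register-tokens, R ~ N(0, 1)

# Stand-in for the learned Conv3D (assumption): a per-token scale and bias,
# then broadcasting each token over time and space.
scale = np.ones((N, 1, 1, 1))
bias = np.zeros((N, 1, 1, 1))
R_prime = np.broadcast_to(scale * R + bias, (N, T, H, W))

# V_aug = Concat[V, R'] along the channel axis.
V_aug = np.concatenate([V, R_prime], axis=0)
print(V_aug.shape)
```

Each register channel is constant over $(t, h, w)$ in this sketch, so every later convolution sees the same global value at every location, which is one way to read "global context carriers".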

2. Dynamic Attention and Saliency Prediction

After feature encoding by the UNet3D encoder $E(\cdot)$, a bottleneck representation $Z = E(V_{\text{aug}})$ is obtained. An attention mask $A(Z) = \sigma(\mathrm{Conv3D}(Z))$, where $\sigma$ is the sigmoid function, gates the output of a small 3D convolutional block $B(Z)$ through element-wise multiplication. This yields refined bottleneck features:

$Z' = B(Z) \odot A(Z)$

The decoder $D(\cdot)$ reconstructs per-frame dynamic saliency maps as

$\hat{S} = D(Z'), \quad S_t = \hat{S}[:,t,:,:], \quad t=1,\dots,T$

Here, the global register-tokens, through gradient-based training, serve as a compact 'scene memory' that is broadcast through each forward pass, dynamically biasing saliency computation without explicit motion estimation.
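The gating $Z' = B(Z) \odot A(Z)$ can be sketched as follows. The kernel sizes and the internal structure of $B$ are not given in the text, so this sketch assumes 1x1x1 convolutions (pure channel mixing) and a conv + ReLU for $B$; `pointwise_conv3d` and all weights are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pointwise_conv3d(z, weight, bias):
    """1x1x1 3D convolution: mixes channels at every (t, h, w) location."""
    # z: (C_in, T, H, W), weight: (C_out, C_in), bias: (C_out,)
    return np.einsum("oc,cthw->othw", weight, z) + bias[:, None, None, None]

rng = np.random.default_rng(0)
C_z, T, H, W = 8, 4, 6, 6                      # illustrative bottleneck size
Z = rng.standard_normal((C_z, T, H, W))        # stand-in for Z = E(V_aug)

# Hypothetical weights for the attention branch and the block B.
W_a, b_a = 0.1 * rng.standard_normal((C_z, C_z)), np.zeros(C_z)
W_b, b_b = 0.1 * rng.standard_normal((C_z, C_z)), np.zeros(C_z)

A = sigmoid(pointwise_conv3d(Z, W_a, b_a))          # mask values in (0, 1)
B = np.maximum(pointwise_conv3d(Z, W_b, b_b), 0.0)  # assumed conv + ReLU block
Z_prime = B * A                                     # Z' = B(Z) ⊙ A(Z)
```

Because the mask lies strictly between 0 and 1, gating can only attenuate the features of $B(Z)$, never amplify them.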

3. Saliency Integration and Video Quality Assessment Pipeline

Saliency prediction is integrated within the overall video quality assessment pipeline via spatial fusion and temporal regression:

For each input frame $I_t \in \mathbb{R}^{3\times H \times W}$, the predicted saliency $S_t$ is fused using a convex combination:

$F_{\text{raw}} = (1-\alpha) I_t + \alpha (I_t \odot S_t), \qquad \alpha \in [0,1]$
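The fusion rule above is simple enough to check numerically. The sketch below assumes a saliency map with values in $[0, 1]$ that is broadcast over the three colour channels; the function name `fuse` and the toy values are illustrative.

```python
import numpy as np

def fuse(frame, saliency, alpha=0.5):
    """Convex saliency fusion: (1 - alpha) * I_t + alpha * (I_t * S_t)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # frame: (3, H, W); saliency: (H, W), broadcast over the channel axis.
    return (1.0 - alpha) * frame + alpha * (frame * saliency[None, :, :])

frame = np.full((3, 2, 2), 0.8)                 # toy frame, constant intensity
saliency = np.array([[1.0, 0.0], [0.5, 0.25]])  # toy saliency map
fused = fuse(frame, saliency, alpha=0.5)
print(fused[0])
```

The rule is equivalent to scaling each pixel by $1 - \alpha + \alpha S_t$, so with $\alpha = 0.5$ a fully salient pixel is unchanged while a non-salient pixel keeps half its intensity rather than being zeroed out.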

Spatial features are extracted by a ResNet-50 backbone $f_\varphi$ with global average pooling (GAP):

$F_t = \mathrm{GAP}[f_\varphi(F_{\text{raw}})] \in \mathbb{R}^d$

Temporal structure is encoded by adding sinusoidal positional encodings $P_t$ and passing $\tilde F_t = F_t + P_t$ through $L=2$ temporal transformer encoder layers:

$Z_t = \mathrm{MHSA}(\tilde F_t, \{\tilde F\}_{1}^{T}) + \tilde F_t$

$Y_t = \mathrm{LN}[\mathrm{FFN}(Z_t) + Z_t]$

The aggregate spatial and transformer features,

$y_s = \frac{1}{T} \sum_{t=1}^T F_t, \quad y_t = \frac{1}{T} \sum_{t=1}^T Y_t$

are concatenated and regressed to the mean opinion score (MOS):

$\hat{y} = f_\omega([y_s; y_t])$
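The temporal stage above can be sketched end to end in NumPy. Several simplifications are assumptions of this sketch rather than details from the source: a single attention head stands in for MHSA, layer normalization has no learnable gain or bias, the regressor $f_\omega$ is a single linear map, random vectors replace the ResNet-50 features $F_t$, and all weights are random.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def sinusoidal_positions(T, d):
    """Standard sinusoidal positional encodings P_t (d must be even)."""
    pos = np.arange(T)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000.0 ** (2 * i / d))
    P = np.zeros((T, d))
    P[:, 0::2] = np.sin(angles)
    P[:, 1::2] = np.cos(angles)
    return P

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_layer(rng, d, hidden):
    shapes = {"Wq": (d, d), "Wk": (d, d), "Wv": (d, d),
              "W1": (d, hidden), "W2": (hidden, d)}
    return {name: 0.1 * rng.standard_normal(s) for name, s in shapes.items()}

def encoder_layer(X, p):
    # Z_t = Attention(X_t, {X}) + X_t   (single head, assumption)
    Q, K, Vv = X @ p["Wq"], X @ p["Wk"], X @ p["Wv"]
    attn = softmax(Q @ K.T / np.sqrt(X.shape[1]))
    Z = attn @ Vv + X
    # Y_t = LN[FFN(Z_t) + Z_t]
    ffn = np.maximum(Z @ p["W1"], 0.0) @ p["W2"]
    return layer_norm(ffn + Z)

rng = np.random.default_rng(0)
T, d = 8, 16
F = rng.standard_normal((T, d))          # stand-in for GAP[ResNet-50] features
Y = F + sinusoidal_positions(T, d)       # F~_t = F_t + P_t
for params in [init_layer(rng, d, 4 * d) for _ in range(2)]:   # L = 2 layers
    Y = encoder_layer(Y, params)

y_s, y_t = F.mean(axis=0), Y.mean(axis=0)      # temporal averages
features = np.concatenate([y_s, y_t])          # [y_s; y_t]
w = 0.1 * rng.standard_normal(2 * d)           # hypothetical linear f_omega
mos_hat = float(features @ w)
```

Note that the attention matrix is $T \times T$, which is where the $T^2 d$ term in the complexity estimate comes from.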

4. Supervised Training Protocols

4.1. Saliency Pre-training

Saliency predictors are pre-trained on ground-truth maps using a combined Kullback–Leibler (KL) divergence and Pearson correlation objective:

$L_\text{sal} = \gamma L_\text{KL} + L_\text{CC}, \quad \gamma = 0.01$

where

$L_\text{KL} = \frac{1}{N} \sum_j S_j \log \frac{S_j}{\hat{S}_j}$

$L_\text{CC} = - \frac{\sum_j (\hat{S}_j - \mu_{\hat{S}})(S_j - \mu_S)}{\sqrt{\sum_j (\hat{S}_j - \mu_{\hat{S}})^2 \cdot \sum_j (S_j - \mu_S)^2}}$

The optimizer is Adam with learning rate $5 \times 10^{-3}$ and batch size $4$ for $180$ epochs on DHF1K.
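As a rough check of the combined objective, the sketch below evaluates $L_\text{sal}$ on toy maps. It assumes both maps are first normalized to sum to one before the KL term and adds a small `eps` for numerical stability; neither detail is stated in the text, and the function names are illustrative.

```python
import numpy as np

def kl_loss(s_true, s_pred, eps=1e-8):
    """(1/N) * sum_j S_j * log(S_j / S_hat_j) over the N pixels."""
    p = s_true / (s_true.sum() + eps)   # normalization is an assumption
    q = s_pred / (s_pred.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))) / p.size)

def cc_loss(s_true, s_pred, eps=1e-8):
    """Negative Pearson correlation between predicted and true maps."""
    a = s_pred - s_pred.mean()
    b = s_true - s_true.mean()
    return float(-np.sum(a * b) / (np.sqrt(np.sum(a ** 2) * np.sum(b ** 2)) + eps))

def saliency_loss(s_true, s_pred, gamma=0.01):
    """L_sal = gamma * L_KL + L_CC with gamma = 0.01 as in the text."""
    return gamma * kl_loss(s_true, s_pred) + cc_loss(s_true, s_pred)

rng = np.random.default_rng(0)
s = rng.random((8, 8)) + 0.1            # toy ground-truth map
perfect = saliency_loss(s, s)           # close to -1: KL = 0, CC = 1
inverted = saliency_loss(s, s.max() - s + 0.1)
```

With $\gamma = 0.01$ the correlation term dominates: a perfect prediction scores about $-1$, while an inverted map scores close to $+1$.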

4.2. VQA Fine-tuning

Fine-tuning for MOS prediction employs a loss combining $\ell_1$ regression and Spearman rank correlation:

$L_\text{VQA} = L_1 + \beta L_\text{corr}, \qquad \beta = 0.1$

where

$L_1 = |\hat{y} - y|, \qquad L_\text{corr} = 1 - \rho(\hat{y}, y)$

Optimization uses Adam with learning rate $1\times 10^{-5}$ (cosine annealing), batch size $5$, and $300$ training epochs. The fusion weight $\alpha$ is fixed at $0.5$.
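The value of $L_\text{VQA}$ on a batch can be computed as below. This sketch only evaluates the loss: hard ranking has zero gradient almost everywhere, so training would need a differentiable surrogate for $\rho$, and the text does not say which one is used. The batch-mean $\ell_1$ term, the tie-breaking in `ranks`, and the MOS values are assumptions.

```python
import numpy as np

def ranks(x):
    """Ordinal ranks 0..n-1 (ties broken by position; fine for a sketch)."""
    order = np.argsort(x)
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(len(x))
    return r

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    ra, rb = ranks(a), ranks(b)
    ra = ra - ra.mean()
    rb = rb - rb.mean()
    return float(np.sum(ra * rb) / np.sqrt(np.sum(ra ** 2) * np.sum(rb ** 2)))

def vqa_loss(y_hat, y, beta=0.1):
    """L_VQA = L_1 + beta * (1 - rho), with beta = 0.1 as in the text."""
    l1 = float(np.mean(np.abs(y_hat - y)))
    l_corr = 1.0 - spearman(y_hat, y)
    return l1 + beta * l_corr

# One batch of five clips (batch size 5); MOS values are made up.
y = np.array([3.1, 4.2, 2.5, 3.8, 1.9])
y_hat = np.array([3.0, 4.0, 2.9, 3.5, 2.0])
loss = vqa_loss(y_hat, y)
```

In this toy batch the predictions preserve the ordering of the labels, so the rank term vanishes and the loss reduces to the mean absolute error.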

5. Empirical Performance and Computational Analysis

DAGR-VQA demonstrates state-of-the-art accuracy on four large-scale user-generated content (UGC) benchmarks, substantially outperforming static-attention and non-saliency baselines. The following table summarizes performance (PLCC/SRCC):

Dataset	PLCC	SRCC
LSVQ	0.892	0.907
KoNViD-1k	0.863	0.896
LIVE-VQC	0.915	0.886
YouTube-UGC	0.913	0.910
Average	0.896	0.900
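The "Average" row can be reproduced from the four per-dataset rows with an unweighted mean, as this short check shows. The unweighted mean is an assumption; the text does not say how the average was computed.

```python
# (PLCC, SRCC) per dataset, copied from the table above.
results = {
    "LSVQ": (0.892, 0.907),
    "KoNViD-1k": (0.863, 0.896),
    "LIVE-VQC": (0.915, 0.886),
    "YouTube-UGC": (0.913, 0.910),
}

plcc_avg = sum(plcc for plcc, _ in results.values()) / len(results)
srcc_avg = sum(srcc for _, srcc in results.values()) / len(results)
summary = f"{plcc_avg:.3f} {srcc_avg:.3f}"
print(summary)
```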

Computational efficiency is a key strength: DAGR-VQA requires $59$ GFLOPs for an eight-frame $224 \times 398$ clip (versus $141$ GFLOPs for ViViT) and achieves $387.7$ FPS at 1080p resolution (RTX A5000, $2400$-frame normalization). The complexity scales as $\mathcal{O}(T N d + T^2 d)$, with spatial terms dominating for $T \ll N$.
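The figures above translate into a few derived quantities, sketched below. The text does not say whether $N$ in the complexity expression is the register-token count, so `complexity_terms` treats it as a free parameter; the values passed to it are purely illustrative.

```python
def complexity_terms(T, N, d):
    """Return the two terms of O(T*N*d + T^2*d) as (spatial, temporal)."""
    return T * N * d, T * T * d

dagr_gflops, vivit_gflops = 59.0, 141.0
saving = 1.0 - dagr_gflops / vivit_gflops     # fraction of ViViT's FLOPs saved

fps = 387.7
ms_per_frame = 1000.0 / fps                   # average time per frame
seconds_for_2400 = 2400 / fps                 # time for a 2400-frame video

# The ratio of the two terms is N / T, so the first dominates when T << N.
spatial, temporal = complexity_terms(T=8, N=1000, d=64)

print(f"{saving:.1%} fewer GFLOPs, {ms_per_frame:.2f} ms/frame, "
      f"{seconds_for_2400:.2f} s per 2400 frames")
```

On these numbers DAGR-VQA uses roughly 58% fewer GFLOPs than ViViT and spends about 2.6 ms per frame.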

6. Ablations and Mechanistic Insights

Ablation studies empirically validate the contributions of register-tokens and dynamic saliency mechanisms:

Removing register-tokens (i.e., static attention only) degrades saliency metrics (e.g., NSS drops from $3.683$ to $2.945$, CC from $0.704$ to $0.640$, AUC-J from $0.942$ to $0.904$).
Adding $N=4$ tokens yields up to $25\%$ relative improvement in standard saliency scores.
For VQA, the dynamic saliency + register-token configuration consistently achieves higher SRCC than both the static-saliency and non-saliency models.
Qualitative tracking shows temporally consistent saliency following moving objects.
In cross-database transfer (e.g., LSVQ $\rightarrow$ KoNViD), DAGR-VQA achieves the highest median SRCC among five strong baselines, indicating robust generalization.
Performance peaks at $N=4$ register-tokens; excessive tokens yield diminishing returns.
When the fusion weight $\alpha$ in the spatial combination step is varied, $\alpha = 0.5$ gives the best accuracy.
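The "up to $25\%$" figure follows directly from the saliency metrics quoted in the ablation, as this short calculation shows. Taking the no-token score as the baseline of the relative change is an assumption of this sketch.

```python
# (without register-tokens, with N = 4 register-tokens), from the ablation above.
metrics = {
    "NSS": (2.945, 3.683),
    "CC": (0.640, 0.704),
    "AUC-J": (0.904, 0.942),
}

# Relative improvement in percent, measured against the no-token baseline.
relative_gain = {
    name: 100.0 * (with_tokens - without) / without
    for name, (without, with_tokens) in metrics.items()
}

for name, gain in relative_gain.items():
    print(f"{name}: +{gain:.1f}%")
```

NSS improves by about 25.1%, CC by 10.0%, and AUC-J by about 4.2%, so the 25% figure corresponds to NSS, the largest of the three gains.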

7. Significance and Comparative Context

DAGR-VQA is the first no-reference VQA method to embed register-tokens as global memory directly into convolutional feature extraction, rather than using saliency maps as auxiliary static inputs. This approach enables fast, HVS-inspired, temporally adaptive saliency prediction and provides a unified pipeline from spatio-temporal feature extraction to transformer-based quality regression. The method achieves an average SRCC of $\sim 0.900$ on four benchmarks, with computational demands suitable for real-time deployment in multimedia streaming contexts (Mithila et al., 16 Jan 2026). A plausible implication is that the register-token approach may generalize to other vision tasks requiring stable, adaptive attention mechanisms across temporal sequences.

References

Mithila et al. (2026). Convolutions Need Registers Too: HVS-Inspired Dynamic Attention for Video Quality Assessment.