DAGR-VQA: Global Registers for Video Quality
- The paper introduces global register-tokens that inject scene-level context into a 3D-UNet, enabling dynamic attention for temporally adaptive saliency maps.
- It integrates a lightweight temporal transformer with convex saliency fusion to efficiently predict mean opinion scores in real time.
- Comprehensive ablations confirm that dynamic attention and register-tokens significantly boost saliency metrics and overall VQA performance compared to static baselines.
Dynamic Attention with Global Registers for Video Quality Assessment (DAGR-VQA) is a no-reference video quality assessment framework that introduces learnable global register-tokens directly into a convolutional backbone, enabling dynamic, spatio-temporal attention mechanisms inspired by the human visual system (HVS). DAGR-VQA produces temporally adaptive saliency maps, integrates them with raw video frames, and uses a lightweight temporal transformer to deliver perceptually consistent, state-of-the-art video quality predictions at real-time inference speeds (Mithila et al., 16 Jan 2026).
1. Architectural Foundations
The core architecture of DAGR-VQA is built on a 3D-UNet encoder–decoder structure that operates on video clips comprising $T$ frames, each of spatial size $H \times W$ with three RGB channels. A principal innovation is the injection of learnable register-tokens at the very first convolutional layer. These tokens serve as global context carriers, integrating scene-level priors into the convolutional feature extraction process.
Formally, let $N$ denote the number of register-tokens, each with an embedding dimension matching the initial channel width. The register-token tensor is a learnable parameter that is broadcast via a learned 3D convolution to align with the input video volume, and the original video tensor is then augmented with the result. The subsequent convolutional layers operate over both pixel and global register-token channels, embedding dynamic scene context at each hierarchical layer.
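The injection step can be written compactly. The following is one plausible formalization consistent with the description above, not the paper's own notation. All symbols are assumptions introduced for illustration: $X$ is the input clip with $T$ frames of size $H \times W$, $R$ holds $N$ register-tokens of width $C_0$, $\mathrm{expand}(\cdot)$ replicates the tokens over the temporal and spatial axes, $C_R$ is the number of channels the learned convolution produces, and $[\,\cdot\,;\,\cdot\,]$ is channel-wise concatenation.

$$
\begin{aligned}
X &\in \mathbb{R}^{3 \times T \times H \times W}, \qquad R \in \mathbb{R}^{N \times C_0} \ \text{(learnable)}, \\
\tilde{R} &= \mathrm{Conv3D}\big(\mathrm{expand}(R)\big) \in \mathbb{R}^{C_R \times T \times H \times W}, \\
X' &= \big[\, X \,;\, \tilde{R} \,\big] \in \mathbb{R}^{(3 + C_R) \times T \times H \times W}.
\end{aligned}
$$

Under this reading, every later convolution sees the register channels alongside the pixel channels, which is how the tokens can influence features at all depths of the network.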
2. Dynamic Attention and Saliency Prediction
The 3D-UNet encoder maps the register-augmented input to a bottleneck representation. A small 3D convolutional block followed by a sigmoid function produces an attention mask, which gates the encoded features through element-wise multiplication to yield refined bottleneck features. The decoder then reconstructs per-frame dynamic saliency maps from these gated features.
Here, the global register-tokens, through gradient-based training, serve as a compact 'scene memory' that is broadcast through each forward pass, dynamically biasing saliency computation without explicit motion estimation.
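In symbols, the gating and decoding steps might be written as follows. The names are assumed here rather than taken from the paper: $X'$ is the register-augmented input, $E$ and $D$ are the encoder and decoder, $\phi$ is the small 3D convolutional block, $\sigma$ is the sigmoid, $\odot$ is element-wise multiplication, and $S_t$ is the saliency map for frame $t$.

$$
Z = E(X'), \qquad A = \sigma\big(\phi(Z)\big), \qquad \hat{Z} = A \odot Z, \qquad \{S_t\}_{t=1}^{T} = D(\hat{Z}).
$$

Because $\sigma$ maps into $(0, 1)$, the mask can only attenuate bottleneck features, never amplify them, so $A$ acts as a soft selector over the encoded representation.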
3. Saliency Integration and Video Quality Assessment Pipeline
Saliency prediction is integrated within the overall video quality assessment pipeline via spatial fusion and temporal regression:
- For each input frame, the predicted saliency map is fused with the frame using a convex combination controlled by a fusion weight.
- Spatial features are extracted by a ResNet-50 backbone with global average pooling (GAP).
- Temporal structure is encoded by adding sinusoidal positional encodings to the per-frame features and passing them through temporal transformer encoder layers.
- The aggregate spatial and transformer features are concatenated and regressed to the mean opinion score (MOS).
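The first and third steps of the list above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names are hypothetical, the fusion is assumed to blend each pixel with its saliency-weighted version, and the positional encoding is the standard sinusoidal form used in transformers.

```python
import math


def fuse_frame(frame, saliency, alpha=0.5):
    """Convex blend of each pixel with its saliency-weighted version.

    Assumed form: fused = alpha * (saliency * pixel) + (1 - alpha) * pixel.
    `frame` and `saliency` are flat lists of equal length.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1] for a convex combination")
    if len(frame) != len(saliency):
        raise ValueError("frame and saliency must have the same length")
    return [alpha * s * p + (1.0 - alpha) * p for p, s in zip(frame, saliency)]


def sinusoidal_encoding(num_frames, dim):
    """Standard sinusoidal positional encoding, one row per frame index."""
    table = []
    for pos in range(num_frames):
        row = []
        for i in range(dim):
            # Each sin/cos pair shares one frequency.
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(features):
    """Add the positional table to per-frame feature vectors (list of lists)."""
    pe = sinusoidal_encoding(len(features), len(features[0]))
    return [[f + p for f, p in zip(feat, enc)] for feat, enc in zip(features, pe)]
```

With `alpha=0.5`, a pixel with saliency 1 passes through unchanged, while a pixel with saliency 0 is halved, so non-salient regions are attenuated but not discarded.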
4. Supervised Training Protocols
4.1. Saliency Pre-training
Saliency predictors are pre-trained on ground-truth maps using an objective that combines Kullback–Leibler (KL) divergence with Pearson correlation.
The optimizer is Adam with learning rate and batch size $4$ for $180$ epochs on DHF1K.
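A combined KL and correlation objective of this kind can be sketched as below. The exact weighting and KL direction are not given here, so this sketch assumes KL(ground truth ‖ prediction) plus `1 - CC`, with hypothetical weights `w_kl` and `w_cc`. Maps are flattened to lists of non-negative values.

```python
import math

EPS = 1e-8  # guards against log(0) and division by zero


def normalize(values):
    """Scale a non-negative map so that it sums to (approximately) 1."""
    total = sum(values)
    return [v / (total + EPS) for v in values]


def kl_divergence(pred, target):
    """KL(target || pred) between the two maps treated as distributions."""
    p, g = normalize(pred), normalize(target)
    return sum(gi * math.log(gi / (pi + EPS) + EPS) for pi, gi in zip(p, g))


def pearson_cc(pred, target):
    """Pearson linear correlation coefficient between two flattened maps."""
    n = len(pred)
    mp, mt = sum(pred) / n, sum(target) / n
    cov = sum((a - mp) * (b - mt) for a, b in zip(pred, target))
    sp = math.sqrt(sum((a - mp) ** 2 for a in pred))
    st = math.sqrt(sum((b - mt) ** 2 for b in target))
    return cov / (sp * st + EPS)


def saliency_loss(pred, target, w_kl=1.0, w_cc=1.0):
    """Assumed objective: weighted KL term plus weighted (1 - CC) term."""
    return w_kl * kl_divergence(pred, target) + w_cc * (1.0 - pearson_cc(pred, target))
```

The two terms are complementary: KL penalizes misplaced probability mass, while the correlation term rewards matching the overall spatial pattern regardless of scale.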
4.2. VQA Fine-tuning
Fine-tuning for MOS prediction employs a loss that combines a regression term with a Spearman rank correlation term.
Optimization uses Adam with learning rate (cosine annealing), batch size $5$, and $300$ training epochs. The fusion weight is fixed at $0.5$.
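To make the structure of such a loss concrete, the sketch below combines mean squared error with `1 - SRCC`. It is an illustration under assumptions, not the paper's loss: the MSE choice, the weight `w_rank`, and all function names are hypothetical. Exact ranks are piecewise constant and therefore not differentiable, so gradient-based training would need a differentiable surrogate, which is not specified here.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / norm if norm > 0 else 0.0


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def vqa_loss(pred, mos, w_rank=1.0):
    """Assumed objective: MSE plus a weighted (1 - SRCC) ranking penalty."""
    mse = sum((p - m) ** 2 for p, m in zip(pred, mos)) / len(pred)
    return mse + w_rank * (1.0 - spearman(pred, mos))
```

The regression term anchors predictions to the MOS scale, while the rank term penalizes orderings that disagree with human judgments even when absolute errors are small.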
5. Empirical Performance and Computational Analysis
DAGR-VQA demonstrates state-of-the-art accuracy on four large-scale user-generated content (UGC) benchmarks, substantially outperforming static-attention and non-saliency baselines. The following table summarizes performance (PLCC/SRCC):
| Dataset | PLCC | SRCC |
|---|---|---|
| LSVQ | 0.892 | 0.907 |
| KoNViD-1k | 0.863 | 0.896 |
| LIVE-VQC | 0.915 | 0.886 |
| YouTube-UGC | 0.913 | 0.910 |
| Average | 0.896 | 0.900 |
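
The Average row can be checked against the per-dataset rows. The short script below assumes an unweighted mean over the four datasets, rounded half-up to three decimals. It uses `Decimal` so that the rounding is exact rather than subject to binary floating-point error.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-dataset scores in table order: LSVQ, KoNViD-1k, LIVE-VQC, YouTube-UGC.
PLCC = ["0.892", "0.863", "0.915", "0.913"]
SRCC = ["0.907", "0.896", "0.886", "0.910"]


def rounded_mean(scores):
    """Unweighted mean of decimal strings, rounded half-up to 3 places."""
    values = [Decimal(s) for s in scores]
    mean = sum(values) / len(values)
    return mean.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)


plcc_mean = rounded_mean(PLCC)
srcc_mean = rounded_mean(SRCC)
print(plcc_mean, srcc_mean)
```

Both values agree with the Average row, which suggests that row is a plain mean rather than one weighted by dataset size.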
Computational efficiency is a key strength, with DAGR-VQA requiring $59$ GFLOPs for an eight-frame clip (versus $141$ GFLOPs for ViViT) and achieving $387.7$ FPS at $1080p$ resolution (RTX A5000, $2400$-frame normalization). The complexity scales as , with spatial terms dominating for .
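The quoted efficiency figures imply a few derived quantities. The arithmetic below uses only the numbers stated above. The variable names are illustrative, and the throughput and FLOP counts may have been measured under different settings, so they are not combined.

```python
DAGR_GFLOPS = 59.0    # per eight-frame clip, as quoted
VIVIT_GFLOPS = 141.0  # per eight-frame clip, as quoted
FRAMES_PER_CLIP = 8
FPS = 387.7           # reported throughput at 1080p

# How many times more compute ViViT needs for the same clip.
flops_ratio = VIVIT_GFLOPS / DAGR_GFLOPS

# Average compute per frame within one clip.
per_frame_gflops = DAGR_GFLOPS / FRAMES_PER_CLIP

# Average wall-clock time per frame implied by the throughput.
ms_per_frame = 1000.0 / FPS

print(f"ViViT / DAGR-VQA compute ratio: {flops_ratio:.2f}x")
print(f"GFLOPs per frame: {per_frame_gflops:.3f}")
print(f"Milliseconds per frame: {ms_per_frame:.2f}")
```

At roughly 2.6 ms per frame, the reported throughput is well inside the per-frame budget of about 33 ms needed for 30 FPS playback.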
6. Ablations and Mechanistic Insights
Ablation studies empirically validate the contributions of register-tokens and dynamic saliency mechanisms:
- Removing register-tokens (i.e., static attention only) degrades saliency metrics (e.g., NSS drops from $3.683$ to $2.945$, CC from $0.704$ to $0.640$, AUC-J from $0.942$ to $0.904$).
- Adding tokens yields up to relative improvement in standard saliency scores.
- For VQA, the dynamic saliency + register-token configuration consistently exceeds both static and non-saliency models across all SRCC values.
- Qualitative tracking shows temporally consistent saliency following moving objects.
- In cross-database transfer (e.g., LSVQ $\rightarrow$ KoNViD), DAGR-VQA achieves the highest median SRCC among five strong baselines, indicating robust generalization.
- Performance peaks at register-tokens; excessive tokens yield diminishing returns.
- Among the fusion weights tried in the spatial combination step, $0.5$ (the value fixed during fine-tuning) offers optimal accuracy.
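The saliency figures quoted in the first ablation item can be converted to relative gains from adding register-tokens. The helper below is a small illustrative calculation using only those three metric pairs. The function and dictionary names are hypothetical.

```python
# (without register-tokens, with register-tokens), as quoted in the ablation.
METRICS = {
    "NSS": (2.945, 3.683),
    "CC": (0.640, 0.704),
    "AUC-J": (0.904, 0.942),
}


def relative_gain(without, with_tokens):
    """Percentage improvement relative to the model without register-tokens."""
    return (with_tokens - without) / without * 100.0


gains = {name: relative_gain(*pair) for name, pair in METRICS.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain:.1f}%")
```

The gains are uneven across metrics: NSS improves by about a quarter, CC by about a tenth, and AUC-J by a few percent, which is expected because AUC-J is already close to its ceiling of 1.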
7. Significance and Comparative Context
DAGR-VQA is the first no-reference VQA method to embed register-tokens as global memory directly into convolutional feature extraction, rather than using saliency maps as auxiliary static inputs. This approach enables fast, HVS-inspired, temporally adaptive saliency prediction and provides a unified pipeline from spatio-temporal feature extraction to transformer-based quality regression. The method achieves an average SRCC of approximately $0.900$ on four benchmarks, with computational demands suitable for real-time deployment in multimedia streaming contexts (Mithila et al., 16 Jan 2026). A plausible implication is that the register-token approach may generalize to other vision tasks requiring stable, adaptive attention mechanisms across temporal sequences.