Video Action Transformer Network (VATN)
- The paper introduces VATN, which integrates Transformer-based attention with a two-stage Faster R-CNN pipeline to aggregate features from spatiotemporal context.
- VATN employs an I3D trunk for feature extraction and a high-resolution Transformer head that uses multi-head attention for contextual reasoning and precise action localization.
- Experiments on the AVA benchmark show that VATN achieves 24.93 mAP, outperforming prior models and demonstrating effective emergent tracking and focus on key human regions.
The Video Action Transformer Network (VATN) is a model for spatiotemporal human action recognition and localization in video, integrating the Transformer attention mechanism with region-based video understanding. Developed as an Action Transformer, VATN adapts Transformer architectures to aggregate features from spatiotemporal context specifically centered around person proposals, enabling recognition and localization using only raw RGB video frames and supervised by bounding boxes and class labels. VATN advances the state-of-the-art on the Atomic Visual Actions (AVA) benchmark with significant gains over previous models using a Faster R-CNN-style pipeline (Girdhar et al., 2018).
1. Model Architecture and Overall Pipeline
VATN employs a two-stage Faster R-CNN-style pipeline for temporal action localization in video:
- Trunk Network: The input is a $T$-frame RGB clip of spatial resolution $(H, W)$, centered on a key frame. Feature extraction uses the I3D (Inflated 3D ConvNet) trunk up to the Mixed_4f block, pretrained on Kinetics-400. The output feature tensor has reduced temporal and spatial resolution:

$$T' = T/4, \qquad H' = H/16, \qquad W' = W/16.$$

The central temporal slice ($t = T'/2$, aligned with the key frame) is input to the Region Proposal Network (RPN).
- Region Proposal Network (RPN): The RPN identifies person proposals in the central frame, ranked by objectness score; at full scale, the top $R = 300$ proposals are used.
- Head Networks:
- I3D-Head (Baseline): Proposals are extended across time to form tubes, and spatiotemporal RoIPooling yields fixed-size tube features. These are processed by the remaining I3D layers (Mixed_5a–5c), followed by linear classification and bounding-box regression.
- Action Transformer Head (VATN): The query for each proposal is built from the central frame only, while the full spatiotemporal feature volume provides the keys and values for the Transformer. Multi-head, multi-layer attention aggregates contextual information for human action classification and localization.
- Outputs: For each proposal, the network produces multi-label classification scores (via sigmoid cross-entropy) for the 80 AVA action classes, alongside class-agnostic bounding-box regression (smooth-L1).
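As a shape walkthrough of the trunk stage, the downsampling described above can be sketched as follows. The 64-frame, 400 × 400 example input and the 832-channel depth (the Mixed_4f output width in the Inception-based I3D) are assumptions for illustration:

```python
import numpy as np

def trunk_output_shape(T, H, W, channels=832):
    """Shape of the I3D trunk output (up to Mixed_4f): temporal stride 4,
    spatial stride 16. The 832-channel depth is assumed from I3D."""
    return (T // 4, H // 16, W // 16, channels)

def central_slice(features):
    """Central temporal slice of the trunk features, fed to the RPN."""
    return features[features.shape[0] // 2]

# Example: a 64-frame, 400x400 clip (illustrative input size).
shape = trunk_output_shape(64, 400, 400)
feats = np.zeros(shape, dtype=np.float32)
print(shape)                       # (16, 25, 25, 832)
print(central_slice(feats).shape)  # (25, 25, 832)
```

The central slice retains only spatial and channel axes, which is why a standard 2D RPN can consume it directly.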
2. Transformer-Based Attention Mechanism
The core of the VATN head is the Transformer attention block, designed for contextual reasoning in video. For each proposal $r$:
- Input Variables:
- Query: $q^{(r)}$, the preprocessed feature of the person proposal (see Section 3).
- Keys: $k_x$, one per cell $x$ of the spatiotemporal feature volume.
- Values: $v_x$, likewise one per feature cell.
- Attention Computation:

$$a^{(r)}_x = \operatorname{softmax}_x\!\left(\frac{q^{(r)} \cdot k_x}{\sqrt{d}}\right), \qquad A^{(r)} = \sum_x a^{(r)}_x\, v_x,$$

where $d$ is the feature dimension. Multi-head attention utilizes learned per-head projections $W^{(h)}_k$, $W^{(h)}_v$, whose outputs are concatenated.
- Layering: Each Transformer unit applies multi-head attention, followed by add & layer normalization, a position-wise 2-layer MLP with ReLU, dropout, and a second normalization:

$$\tilde{q} = \operatorname{LayerNorm}\!\left(q + A\right), \qquad q' = \operatorname{LayerNorm}\!\left(\tilde{q} + \operatorname{Dropout}\!\left(\operatorname{FFN}(\tilde{q})\right)\right).$$

Stacking several such units, each with multiple heads, progressively enriches the query vector for subsequent prediction.
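The per-query attention and unit update can be sketched in NumPy. The 128-D query, 2 heads, weight scales, and omission of dropout are simplifications for illustration, not the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def tx_unit(q, feats, Wq, Wk, Wv, W1, b1, W2, b2, n_heads=2):
    """One Transformer unit: multi-head attention of a single query over all
    feature cells, then residual + LayerNorm and a 2-layer MLP (dropout omitted)."""
    d = q.shape[0]
    dh = d // n_heads
    cells = feats.reshape(-1, feats.shape[-1])      # (N, C): flattened volume
    heads = []
    for h in range(n_heads):
        qh = Wq[h] @ q                              # (dh,) projected query
        K = cells @ Wk[h].T                         # (N, dh) keys
        V = cells @ Wv[h].T                         # (N, dh) values
        a = softmax(K @ qh / np.sqrt(dh))           # attention over all cells
        heads.append(a @ V)                         # (dh,) attended values
    attn = np.concatenate(heads)                    # (d,) concat across heads
    q1 = layer_norm(q + attn)                       # add & norm
    ffn = W2 @ np.maximum(W1 @ q1 + b1, 0) + b2     # 2-layer MLP with ReLU
    return layer_norm(q1 + ffn)                     # add & norm again

d, C, n_heads = 128, 832, 2
feats = rng.standard_normal((16, 25, 25, C))        # trunk feature volume
q = rng.standard_normal(d)
Wq = rng.standard_normal((n_heads, d // n_heads, d)) * 0.1
Wk = rng.standard_normal((n_heads, d // n_heads, C)) * 0.01
Wv = rng.standard_normal((n_heads, d // n_heads, C)) * 0.01
W1 = rng.standard_normal((d, d)) * 0.1; b1 = np.zeros(d)
W2 = rng.standard_normal((d, d)) * 0.1; b2 = np.zeros(d)
q_new = tx_unit(q, feats, Wq, Wk, Wv, W1, b1, W2, b2)
print(q_new.shape)  # (128,)
```

Stacking units amounts to feeding `q_new` back in as the query while keys and values stay derived from the same feature volume.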
3. High-Resolution, Class-Agnostic Query Encoding
VATN's query representation for each proposal is constructed via a HighRes Query Preprocessor (QPr):
- Extract a 14 × 14 RoIPooled feature from the central frame.
- Apply a 1 × 1 convolution to reduce the channel depth.
- Flatten the 14 × 14 spatial grid to a single vector.
- Use a learned linear layer to obtain a 128-dimensional query vector for the Transformer.
Each query remains class-agnostic, representing the individual only. The model is compelled, via classification supervision alone, to learn body parts, track individuals, and focus on semantically important regions (hands, faces, and objects) across space-time, without instance- or part-level supervision.
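The preprocessing steps above can be sketched as follows, assuming a 14 × 14 RoI grid, a 128-D query, and an illustrative reduced depth of 32 channels (the intermediate width is not specified here):

```python
import numpy as np

rng = np.random.default_rng(1)

def highres_qpr(roi_feat, W_1x1, W_lin):
    """HighRes query preprocessor: 1x1 conv to reduce depth, flatten the
    spatial grid, then a linear layer down to the query dimension."""
    reduced = roi_feat @ W_1x1   # (14, 14, c_red): 1x1 conv == per-cell matmul
    flat = reduced.reshape(-1)   # length 14 * 14 * c_red, keeps spatial layout
    return W_lin @ flat          # (d,) query vector

C, c_red, d = 832, 32, 128       # c_red is an illustrative choice
roi_feat = rng.standard_normal((14, 14, C))  # RoIPooled central-frame feature
W_1x1 = rng.standard_normal((C, c_red)) * 0.01
W_lin = rng.standard_normal((d, 14 * 14 * c_red)) * 0.01
q = highres_qpr(roi_feat, W_1x1, W_lin)
print(q.shape)  # (128,)
```

Flattening before the linear layer is what makes this "HighRes": averaging the 14 × 14 grid instead would discard the spatial layout of the person.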
4. Spatiotemporal Positional Encoding
To mitigate the permutation invariance of the Transformer, VATN incorporates explicit position information. For each feature cell at location $(x, y, t)$, the system computes normalized coordinates $(\tilde{x}, \tilde{y}, \tilde{t})$ relative to the feature-map extent. Spatial and temporal positions are separately embedded via 2-layer MLPs:

$$p_{x,y,t} = \left[\operatorname{MLP}_s(\tilde{x}, \tilde{y});\ \operatorname{MLP}_t(\tilde{t})\right].$$

The concatenated positional embedding is appended to each feature cell, giving:

$$f'_{x,y,t} = \left[f_{x,y,t};\ p_{x,y,t}\right].$$
Keys and values for the Transformer are derived via linear projection from this augmented feature map, and queries inherit spatial cues accordingly.
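A minimal sketch of this augmentation, assuming normalization to $[-1, 1]$ and small illustrative tensor and embedding sizes (the exact MLP widths are not given here):

```python
import numpy as np

rng = np.random.default_rng(2)

def mlp2(x, W1, b1, W2, b2):
    """2-layer MLP with ReLU, used to embed positions."""
    return W2 @ np.maximum(W1 @ x + b1, 0) + b2

def add_positional_encoding(feats, params_s, params_t):
    """Append spatial and temporal position embeddings to every feature cell."""
    Tp, Hp, Wp, C = feats.shape
    out = []
    for t in range(Tp):
        for y in range(Hp):
            for x in range(Wp):
                # normalized coordinates in [-1, 1] (assumed convention)
                xy = np.array([2 * x / (Wp - 1) - 1, 2 * y / (Hp - 1) - 1])
                tt = np.array([2 * t / (Tp - 1) - 1])
                p = np.concatenate([mlp2(xy, *params_s), mlp2(tt, *params_t)])
                out.append(np.concatenate([feats[t, y, x], p]))
    return np.array(out).reshape(Tp, Hp, Wp, -1)

d_emb = 8  # illustrative embedding width
Ws1 = rng.standard_normal((16, 2)); bs1 = np.zeros(16)
Ws2 = rng.standard_normal((d_emb, 16)); bs2 = np.zeros(d_emb)
Wt1 = rng.standard_normal((16, 1)); bt1 = np.zeros(16)
Wt2 = rng.standard_normal((d_emb, 16)); bt2 = np.zeros(d_emb)
feats = rng.standard_normal((4, 5, 5, 8))  # tiny stand-in feature volume
aug = add_positional_encoding(feats, (Ws1, bs1, Ws2, bs2), (Wt1, bt1, Wt2, bt2))
print(aug.shape)  # (4, 5, 5, 24): 8 original + 8 spatial + 8 temporal channels
```

Because the embedding is concatenated rather than added, the subsequent key/value projections are free to weigh content and position independently.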
5. Loss Formulation
VATN uses the following multi-task loss for each proposal $r$:
- Multi-label Classification:

$$L_{\mathrm{cls}} = -\sum_{c=1}^{C} \left[\, y_c \log \sigma(s_c) + (1 - y_c) \log\!\left(1 - \sigma(s_c)\right) \right],$$

where $s_c$ are the class logits, $y_c \in \{0, 1\}$ are the ground-truth labels, $C$ is the number of action classes, and $\sigma$ is the sigmoid function.
- Bounding-Box Regression:

$$L_{\mathrm{reg}} = \sum_{i \in \{x, y, w, h\}} \operatorname{smooth}_{L1}\!\left(t_i - t_i^{*}\right),$$

where $t_i$ and $t_i^{*}$ are the predicted and target box offsets. Only positive proposals contribute to the regression loss.
- Combined Loss:

$$L = L_{\mathrm{cls}} + \lambda\, L_{\mathrm{reg}},$$

with $\lambda$ a fixed scalar weight in practice.
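The two loss terms can be sketched directly in NumPy; the example logits, targets, and the choice $\lambda = 1$ are illustrative values, not figures from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cls_loss(logits, labels):
    """Multi-label sigmoid cross-entropy over the action classes."""
    p = sigmoid(logits)
    return -np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))

def smooth_l1(x):
    """Smooth-L1 (Huber) penalty, elementwise."""
    return np.where(np.abs(x) < 1, 0.5 * x**2, np.abs(x) - 0.5)

def reg_loss(pred, target):
    """Class-agnostic box regression loss over (x, y, w, h) offsets."""
    return np.sum(smooth_l1(pred - target))

# Illustrative 3-class example with one positive label.
logits = np.array([2.0, -1.0, 0.0])
labels = np.array([1.0, 0.0, 0.0])
pred = np.array([0.1, -0.2, 0.05, 0.3])   # predicted box offsets
target = np.zeros(4)                      # target offsets
total = cls_loss(logits, labels) + reg_loss(pred, target)  # lambda = 1 assumed
print(round(float(total), 4))  # 1.2046
```

Because classification is per-class sigmoid rather than softmax, a person can simultaneously score high on multiple co-occurring actions (e.g., "stand" and "talk to").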
6. Training Procedures and Hyperparameters
- Initialization: I3D trunk pre-trained on Kinetics-400; all new layers initialized randomly. BatchNorm in I3D is frozen.
- Data Augmentation: Random horizontal flips and random spatial crops to counteract overfitting.
- Optimization: Synchronized SGD over 10 GPUs (effective batch size 30), initial learning rate 0.01 (warmup to 0.1, then cosine annealing over 500k iterations). Some experiments use shorter schedules (300k) with ground-truth boxes.
- Transformer Configuration: 128-dimensional features, dropout rate 0.3, typically 2 heads × 3 layers.
- Proposals: $R = 300$ at full scale; $R = 64$ for ablations with ground-truth boxes.
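The learning-rate schedule described above can be sketched as linear warmup followed by cosine annealing. The warmup length is an assumption; the text gives only the endpoints (0.01 → 0.1) and the 500k-iteration schedule:

```python
import math

def learning_rate(step, warmup_steps=1000, base_lr=0.01, peak_lr=0.1,
                  total_steps=500_000):
    """Linear warmup from base_lr to peak_lr, then cosine annealing to 0.
    warmup_steps is an illustrative choice, not a value from the text."""
    if step < warmup_steps:
        frac = step / warmup_steps
        return base_lr + frac * (peak_lr - base_lr)
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * frac))

print(learning_rate(0))                  # 0.01 (start of warmup)
print(learning_rate(1000))               # 0.1  (peak, end of warmup)
print(round(learning_rate(500_000), 6))  # 0.0  (end of schedule)
```

Warmup avoids destabilizing the Kinetics-pretrained trunk while the randomly initialized head layers are still settling.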
7. Performance and Ablation Results
Quantitative Outcomes on AVA (v2.1)
| Head/Setting | Action Classification mAP | Localization mAP (IoU ≥ 0.5) |
|---|---|---|
| I3D Head (GT boxes, 64 prop) | 23.4 | 92.9 |
| Transformer LowRes | 29.1 | 77.5 |
| Transformer HighRes | 27.6 | 87.7 |
| I3D Head (RPN, 300 prop) | 20.5 | — |
| Transformer HighRes (RPN) | 24.4 | — |
| Combined (reg/cls) | 24.9 | — |
Test set performance: VATN achieves 24.93 mAP (test), outperforming the prior best ensemble-free RGB+flow result (21.08 mAP) by 3.85 points.
Ablation Studies
- Regression: Switching from class-agnostic to class-specific regression reduces mAP (21.3 → 19.2).
- Data Augmentation: Removing augmentation lowers mAP (21.3 → 16.6).
- Pretraining: Training from scratch (no Kinetics) yields 19.1 mAP (vs. 21.3 with pretraining).
- Depth/Width Trade-off (GT boxes): Best results are with 6 layers × 2 heads (29.1 mAP).
Emergent Tracking and Context
Without explicit supervision, the action transformer head learns to:
- Track individuals over frames, with each query's attention clustering on the pixels of one person's body.
- Distinguish between nearby people as instance-specific keys emerge.
- Emphasize hands, faces, and manipulated objects in its attention, supporting fine-grained action classification.
These properties emerge from repeated attention of each query over the full spatiotemporal feature volume, combined with only final action classification supervision; tracking and body-part segmentation are not directly supervised (Girdhar et al., 2018).