Video Action Transformer Network (VATN)
- The paper introduces VATN, which integrates Transformer-based attention with a two-stage Faster R-CNN pipeline to aggregate features from spatiotemporal context.
- VATN employs an I3D trunk for feature extraction and a high-resolution Transformer head that uses multi-head attention for contextual reasoning and precise action localization.
- Experiments on the AVA benchmark show that VATN achieves 24.93 mAP, outperforming prior models and demonstrating effective emergent tracking and focus on key human regions.
The Video Action Transformer Network (VATN) is a model for spatiotemporal human action recognition and localization in video, integrating the Transformer attention mechanism with region-based video understanding. Developed as an Action Transformer, VATN adapts Transformer architectures to aggregate features from spatiotemporal context specifically centered around person proposals, enabling recognition and localization using only raw RGB video frames and supervised by bounding boxes and class labels. VATN advances the state-of-the-art on the Atomic Visual Actions (AVA) benchmark with significant gains over previous models using a Faster R-CNN-style pipeline (Girdhar et al., 2018).
1. Model Architecture and Overall Pipeline
VATN employs a two-stage Faster R-CNN-style pipeline for temporal action localization in video:
- Trunk Network: The input is a $T$-frame RGB clip of spatial resolution $(H, W)$, centered on a key frame. Feature extraction uses the I3D (Inflated 3D ConvNet) trunk up to the Mixed_4f block, pretrained on Kinetics-400. The output feature tensor has reduced temporal and spatial resolution:

$$T' = T/4, \qquad H' = H/16, \qquad W' = W/16.$$

The central temporal slice ($t = T'/2$, aligned with the key frame) is input to the Region Proposal Network (RPN).
- Region Proposal Network (RPN): The RPN identifies person proposals in the central frame, ranked by objectness score; at full scale, the top $R = 300$ proposals are used.
- Head Networks:
- I3D-Head (Baseline): Proposals are extended across time to form tubes, and spatiotemporal RoIPooling yields fixed-size tube features. These are processed by the remaining I3D layers (Mixed_5a–5c), followed by linear classification and bounding-box regression.
- Action Transformer Head (VATN): The query for each proposal is built from the central frame only, while the full spatiotemporal feature volume provides the keys and values for the Transformer. Multi-head, multi-layer attention aggregates contextual information for human action classification and localization.
- Outputs: For each proposal, the network produces multi-label classification scores (via sigmoid cross-entropy) for the 80 AVA action classes, alongside class-agnostic bounding-box regression (smooth-L1).
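As a shape walkthrough of the trunk stage, the downsampling described above can be sketched as follows. The 64-frame, 400 × 400 example input and the 832-channel depth (the Mixed_4f output width in the Inception-based I3D) are assumptions for illustration:

```python
import numpy as np

def trunk_output_shape(T, H, W, channels=832):
    """Shape of the I3D trunk output (up to Mixed_4f): temporal stride 4,
    spatial stride 16. The 832-channel depth is assumed from I3D."""
    return (T // 4, H // 16, W // 16, channels)

def central_slice(features):
    """Central temporal slice of the trunk features, fed to the RPN."""
    return features[features.shape[0] // 2]

# Example: a 64-frame, 400x400 clip (illustrative input size).
shape = trunk_output_shape(64, 400, 400)
feats = np.zeros(shape, dtype=np.float32)
print(shape)                       # (16, 25, 25, 832)
print(central_slice(feats).shape)  # (25, 25, 832)
```

The central slice retains only spatial and channel axes, which is why a standard 2D RPN can consume it directly.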
2. Transformer-Based Attention Mechanism
The core of the VATN head is the Transformer attention block, designed for contextual reasoning in video. For each proposal $r$:
- Input Variables:
- Query: $q^{(r)}$, the preprocessed feature of the person proposal (see Section 3).
- Keys: $k_x$, one per cell $x$ of the spatiotemporal feature volume.
- Values: $v_x$, likewise one per feature cell.
- Attention Computation:

$$a^{(r)}_x = \operatorname{softmax}_x\!\left(\frac{q^{(r)} \cdot k_x}{\sqrt{d}}\right), \qquad A^{(r)} = \sum_x a^{(r)}_x\, v_x,$$

where $d$ is the feature dimension. Multi-head attention utilizes learned per-head projections $W^{(h)}_k$, $W^{(h)}_v$, whose outputs are concatenated.
- Layering: Each Transformer unit applies multi-head attention, followed by add & layer normalization, a position-wise 2-layer MLP with ReLU, dropout, and a second normalization:

$$\tilde{q} = \operatorname{LayerNorm}\!\left(q + A\right), \qquad q' = \operatorname{LayerNorm}\!\left(\tilde{q} + \operatorname{Dropout}\!\left(\operatorname{FFN}(\tilde{q})\right)\right).$$

Stacking several such units, each with multiple heads, progressively enriches the query vector for subsequent prediction.
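The per-query attention and unit update can be sketched in NumPy. The 128-D query, 2 heads, weight scales, and omission of dropout are simplifications for illustration, not the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def tx_unit(q, feats, Wq, Wk, Wv, W1, b1, W2, b2, n_heads=2):
    """One Transformer unit: multi-head attention of a single query over all
    feature cells, then residual + LayerNorm and a 2-layer MLP (dropout omitted)."""
    d = q.shape[0]
    dh = d // n_heads
    cells = feats.reshape(-1, feats.shape[-1])      # (N, C): flattened volume
    heads = []
    for h in range(n_heads):
        qh = Wq[h] @ q                              # (dh,) projected query
        K = cells @ Wk[h].T                         # (N, dh) keys
        V = cells @ Wv[h].T                         # (N, dh) values
        a = softmax(K @ qh / np.sqrt(dh))           # attention over all cells
        heads.append(a @ V)                         # (dh,) attended values
    attn = np.concatenate(heads)                    # (d,) concat across heads
    q1 = layer_norm(q + attn)                       # add & norm
    ffn = W2 @ np.maximum(W1 @ q1 + b1, 0) + b2     # 2-layer MLP with ReLU
    return layer_norm(q1 + ffn)                     # add & norm again

d, C, n_heads = 128, 832, 2
feats = rng.standard_normal((16, 25, 25, C))        # trunk feature volume
q = rng.standard_normal(d)
Wq = rng.standard_normal((n_heads, d // n_heads, d)) * 0.1
Wk = rng.standard_normal((n_heads, d // n_heads, C)) * 0.01
Wv = rng.standard_normal((n_heads, d // n_heads, C)) * 0.01
W1 = rng.standard_normal((d, d)) * 0.1; b1 = np.zeros(d)
W2 = rng.standard_normal((d, d)) * 0.1; b2 = np.zeros(d)
q_new = tx_unit(q, feats, Wq, Wk, Wv, W1, b1, W2, b2)
print(q_new.shape)  # (128,)
```

Stacking units amounts to feeding `q_new` back in as the query while keys and values stay derived from the same feature volume.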
3. High-Resolution, Class-Agnostic Query Encoding
VATN's query representation for each proposal is constructed via a HighRes Query Preprocessor (QPr):
- Extract a 14 × 14 RoIPooled feature from the central frame.
- Apply a 1 × 1 convolution to reduce the channel depth.
- Flatten the 14 × 14 spatial grid to a single vector.
- Use a learned linear layer to obtain a 128-dimensional query vector for the Transformer.
Each query remains class-agnostic, representing the individual only. The model is compelled, via classification supervision alone, to learn body parts, track individuals, and focus on semantically important regions (hands, faces, and objects) across space-time, without instance- or part-level supervision.
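The preprocessing steps above can be sketched as follows, assuming a 14 × 14 RoI grid, a 128-D query, and an illustrative reduced depth of 32 channels (the intermediate width is not specified here):

```python
import numpy as np

rng = np.random.default_rng(1)

def highres_qpr(roi_feat, W_1x1, W_lin):
    """HighRes query preprocessor: 1x1 conv to reduce depth, flatten the
    spatial grid, then a linear layer down to the query dimension."""
    reduced = roi_feat @ W_1x1   # (14, 14, c_red): 1x1 conv == per-cell matmul
    flat = reduced.reshape(-1)   # length 14 * 14 * c_red, keeps spatial layout
    return W_lin @ flat          # (d,) query vector

C, c_red, d = 832, 32, 128       # c_red is an illustrative choice
roi_feat = rng.standard_normal((14, 14, C))  # RoIPooled central-frame feature
W_1x1 = rng.standard_normal((C, c_red)) * 0.01
W_lin = rng.standard_normal((d, 14 * 14 * c_red)) * 0.01
q = highres_qpr(roi_feat, W_1x1, W_lin)
print(q.shape)  # (128,)
```

Flattening before the linear layer is what makes this "HighRes": averaging the 14 × 14 grid instead would discard the spatial layout of the person.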
4. Spatiotemporal Positional Encoding
To mitigate the permutation invariance of the Transformer, VATN incorporates explicit position information. For each feature cell at location $(x, y, t)$, the system computes normalized coordinates $(\tilde{x}, \tilde{y}, \tilde{t})$ relative to the feature-map extent. Spatial and temporal positions are separately embedded via 2-layer MLPs:

$$p_{x,y,t} = \left[\operatorname{MLP}_s(\tilde{x}, \tilde{y});\ \operatorname{MLP}_t(\tilde{t})\right].$$

The concatenated positional embedding is appended to each feature cell, giving:

$$f'_{x,y,t} = \left[f_{x,y,t};\ p_{x,y,t}\right].$$
Keys and values for the Transformer are derived via linear projection from this augmented feature map, and queries inherit spatial cues accordingly.
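A minimal sketch of this augmentation, assuming normalization to $[-1, 1]$ and small illustrative tensor and embedding sizes (the exact MLP widths are not given here):

```python
import numpy as np

rng = np.random.default_rng(2)

def mlp2(x, W1, b1, W2, b2):
    """2-layer MLP with ReLU, used to embed positions."""
    return W2 @ np.maximum(W1 @ x + b1, 0) + b2

def add_positional_encoding(feats, params_s, params_t):
    """Append spatial and temporal position embeddings to every feature cell."""
    Tp, Hp, Wp, C = feats.shape
    out = []
    for t in range(Tp):
        for y in range(Hp):
            for x in range(Wp):
                # normalized coordinates in [-1, 1] (assumed convention)
                xy = np.array([2 * x / (Wp - 1) - 1, 2 * y / (Hp - 1) - 1])
                tt = np.array([2 * t / (Tp - 1) - 1])
                p = np.concatenate([mlp2(xy, *params_s), mlp2(tt, *params_t)])
                out.append(np.concatenate([feats[t, y, x], p]))
    return np.array(out).reshape(Tp, Hp, Wp, -1)

d_emb = 8  # illustrative embedding width
Ws1 = rng.standard_normal((16, 2)); bs1 = np.zeros(16)
Ws2 = rng.standard_normal((d_emb, 16)); bs2 = np.zeros(d_emb)
Wt1 = rng.standard_normal((16, 1)); bt1 = np.zeros(16)
Wt2 = rng.standard_normal((d_emb, 16)); bt2 = np.zeros(d_emb)
feats = rng.standard_normal((4, 5, 5, 8))  # tiny stand-in feature volume
aug = add_positional_encoding(feats, (Ws1, bs1, Ws2, bs2), (Wt1, bt1, Wt2, bt2))
print(aug.shape)  # (4, 5, 5, 24): 8 original + 8 spatial + 8 temporal channels
```

Because the embedding is concatenated rather than added, the subsequent key/value projections are free to weigh content and position independently.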
5. Loss Formulation
VATN uses the following multi-task loss for each proposal $r$:
- Multi-label Classification:

$$L_{\mathrm{cls}} = -\sum_{c=1}^{C} \left[\, y_c \log \sigma(s_c) + (1 - y_c) \log\!\left(1 - \sigma(s_c)\right) \right],$$

where $s_c$ are the class logits, $y_c \in \{0, 1\}$ are the ground-truth labels, $C$ is the number of action classes, and $\sigma$ is the sigmoid function.
- Bounding-Box Regression:

$$L_{\mathrm{reg}} = \sum_{i \in \{x, y, w, h\}} \operatorname{smooth}_{L1}\!\left(t_i - t_i^{*}\right),$$

where $t_i$ and $t_i^{*}$ are the predicted and target box offsets. Only positive proposals contribute to the regression loss.
- Combined Loss:

$$L = L_{\mathrm{cls}} + \lambda\, L_{\mathrm{reg}},$$

with $\lambda$ a fixed scalar weight in practice.
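The two loss terms can be sketched directly in NumPy; the example logits, targets, and the choice $\lambda = 1$ are illustrative values, not figures from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cls_loss(logits, labels):
    """Multi-label sigmoid cross-entropy over the action classes."""
    p = sigmoid(logits)
    return -np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))

def smooth_l1(x):
    """Smooth-L1 (Huber) penalty, elementwise."""
    return np.where(np.abs(x) < 1, 0.5 * x**2, np.abs(x) - 0.5)

def reg_loss(pred, target):
    """Class-agnostic box regression loss over (x, y, w, h) offsets."""
    return np.sum(smooth_l1(pred - target))

# Illustrative 3-class example with one positive label.
logits = np.array([2.0, -1.0, 0.0])
labels = np.array([1.0, 0.0, 0.0])
pred = np.array([0.1, -0.2, 0.05, 0.3])   # predicted box offsets
target = np.zeros(4)                      # target offsets
total = cls_loss(logits, labels) + reg_loss(pred, target)  # lambda = 1 assumed
print(round(float(total), 4))  # 1.2046
```

Because classification is per-class sigmoid rather than softmax, a person can simultaneously score high on multiple co-occurring actions (e.g., "stand" and "talk to").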
6. Training Procedures and Hyperparameters
- Initialization: I3D trunk pre-trained on Kinetics-400; all new layers initialized randomly. BatchNorm in I3D is frozen.
- Data Augmentation: Random horizontal flips and random spatial crops to counteract overfitting.
- Optimization: Synchronized SGD over 10 GPUs (effective batch size 30), initial learning rate 0.01 (warmup to 0.1, then cosine annealing over 500k iterations). Some experiments use shorter schedules (300k) with ground-truth boxes.
- Transformer Configuration: 128-dimensional features, dropout rate 0.3, typically 2 heads × 3 layers.
- Proposals: $R = 300$ at full scale; $R = 64$ for ablations with ground-truth boxes.
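The learning-rate schedule described above can be sketched as linear warmup followed by cosine annealing. The warmup length is an assumption; the text gives only the endpoints (0.01 → 0.1) and the 500k-iteration schedule:

```python
import math

def learning_rate(step, warmup_steps=1000, base_lr=0.01, peak_lr=0.1,
                  total_steps=500_000):
    """Linear warmup from base_lr to peak_lr, then cosine annealing to 0.
    warmup_steps is an illustrative choice, not a value from the text."""
    if step < warmup_steps:
        frac = step / warmup_steps
        return base_lr + frac * (peak_lr - base_lr)
    frac = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * frac))

print(learning_rate(0))                  # 0.01 (start of warmup)
print(learning_rate(1000))               # 0.1  (peak, end of warmup)
print(round(learning_rate(500_000), 6))  # 0.0  (end of schedule)
```

Warmup avoids destabilizing the Kinetics-pretrained trunk while the randomly initialized head layers are still settling.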
7. Performance and Ablation Results
Quantitative Outcomes on AVA (v2.1)
| Head/Setting | Action Classification mAP | Localization mAP (IoU ≥ 0.5) |
|---|---|---|
| I3D Head (GT boxes, 64 prop) | 23.4 | 92.9 |
| Transformer LowRes | 29.1 | 77.5 |
| Transformer HighRes | 27.6 | 87.7 |
| I3D Head (RPN, 300 prop) | 20.5 | — |
| Transformer HighRes (RPN) | 24.4 | — |
| Combined (reg/cls) | 24.9 | — |
Test set performance: VATN achieves 24.93 mAP (test), outperforming the prior best ensemble-free RGB+flow result (21.08 mAP) by 3.85 points.
Ablation Studies
- Regression: Switching from class-agnostic to class-specific regression reduces mAP (21.3 → 19.2).
- Data Augmentation: Removing augmentation lowers mAP (21.3 → 16.6).
- Pretraining: Training from scratch (no Kinetics) yields 19.1 mAP (vs. 21.3 with pretraining).
- Depth/Width Trade-off (GT boxes): Best results are with 6 layers × 2 heads (29.1 mAP).
Emergent Tracking and Context
Without explicit supervision, the action transformer head learns to:
- Track individuals over frames, with each query's attention clustering on the pixels of one person's body.
- Distinguish between nearby people as instance-specific keys emerge.
- Emphasize hands, faces, and manipulated objects in its attention, supporting fine-grained action classification.
These properties emerge from repeated attention of each query over the full spatiotemporal feature volume, combined with only final action classification supervision; tracking and body-part segmentation are not directly supervised (Girdhar et al., 2018).