TBT-Former: Temporal Boundary Transformer
- The paper introduces TBT-Former with a scaled Transformer backbone, cross-scale feature pyramid, and boundary distribution regression head to enhance temporal action localization.
- It employs a novel boundary distribution regression head that predicts discrete probability distributions for action boundaries, explicitly modeling uncertainty.
- Extensive evaluations on THUMOS14, ActivityNet, and EPIC-Kitchens confirm that TBT-Former outperforms predecessors through improved multi-scale fusion and robust localization.
The Temporal Boundary Transformer (TBT-Former) is a single-stage, anchor-free Transformer architecture designed for temporal action localization (TAL) in untrimmed videos. TAL addresses the identification of the start time, end time, and category for all action instances within a given video. TBT-Former targets two persistent issues in prior Transformer-based TAL frameworks: the imprecise localization of actions with ambiguous temporal boundaries and the limited fusion of multi-scale contextual information. It introduces three primary innovations—a scaled Transformer backbone, a cross-scale feature pyramid network (CS-FPN), and a boundary distribution regression head—enabling explicit modeling of boundary uncertainty and substantial empirical improvements over previous methods (Rathnayaka et al., 1 Dec 2025).
1. Model Architecture and Key Innovations
TBT-Former is constructed upon three architectural components:
a. Scaled Transformer Backbone:
The backbone processes a sequence of frame-level input features with stacked Transformer blocks. Each block combines multi-head self-attention (MHSA) with a feed-forward network (FFN), using more attention heads and a wider FFN hidden dimension than ActionFormer. Following the standard pre-norm formulation, the computational path per block is

$$X' = X + \mathrm{MHSA}(\mathrm{LN}(X)), \qquad X'' = X' + \mathrm{FFN}(\mathrm{LN}(X')),$$

where $\mathrm{LN}$ denotes layer normalization and $X$ is the input feature sequence.
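The pre-norm residual path above can be sketched in NumPy. This is an illustrative minimal implementation, not the paper's code; the head count, dimensions, and ReLU activation are placeholder assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token over the feature dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def mhsa(x, wq, wk, wv, wo, n_heads):
    # Multi-head self-attention over a (T, D) token sequence.
    T, D = x.shape
    dh = D // n_heads
    q = (x @ wq).reshape(T, n_heads, dh).transpose(1, 0, 2)
    k = (x @ wk).reshape(T, n_heads, dh).transpose(1, 0, 2)
    v = (x @ wv).reshape(T, n_heads, dh).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))
    return (attn @ v).transpose(1, 0, 2).reshape(T, D) @ wo

def ffn(x, w1, b1, w2, b2):
    # Two-layer feed-forward network (ReLU used here for simplicity).
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def transformer_block(x, params, n_heads=4):
    # Pre-norm residual path: x + MHSA(LN(x)), then + FFN(LN(.)).
    x = x + mhsa(layer_norm(x), *params["attn"], n_heads)
    x = x + ffn(layer_norm(x), *params["ffn"])
    return x
```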
b. Cross-Scale Feature Pyramid Network (CS-FPN):
To handle variable action durations, TBT-Former builds a feature pyramid by fusing outputs from multiple backbone stages. The pyramid is constructed via lateral 1×1 convolutions, upsampling along a top-down pathway, and 3×3 refinement convolutions:

$$P_l = \mathrm{Conv}_{3\times 3}\big(\mathrm{Conv}_{1\times 1}(C_l) + \mathrm{Up}(P_{l+1})\big),$$

where $C_l$ is the backbone feature at pyramid level $l$ and $\mathrm{Up}$ denotes temporal upsampling. This bidirectional approach contrasts with the unidirectional pyramids used previously, enhancing the fusion of coarse semantics and fine temporal detail.
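The lateral-plus-top-down fusion can be sketched for 1D (temporal) features as follows. This is a generic FPN sketch under assumed shapes (channels × time, stride 2 between levels), not the paper's implementation.

```python
import numpy as np

def conv1d(x, w):
    # 'Same'-padded 1D convolution: x is (C_in, T), w is (C_out, C_in, K).
    c_out, c_in, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    T = x.shape[1]
    out = np.zeros((c_out, T))
    for t in range(T):
        out[:, t] = np.tensordot(w, xp[:, t:t + k], axes=([1, 2], [0, 1]))
    return out

def upsample2(x):
    # Nearest-neighbor 2x temporal upsampling.
    return np.repeat(x, 2, axis=1)

def fpn_top_down(features, lateral_ws, smooth_ws):
    # features: backbone maps C_l ordered finest -> coarsest.
    # Lateral 1x1 convs, top-down addition, then 3x3 refinement convs.
    laterals = [conv1d(c, w) for c, w in zip(features, lateral_ws)]
    pyramid = [laterals[-1]]
    for lat in reversed(laterals[:-1]):
        pyramid.append(lat + upsample2(pyramid[-1]))
    pyramid = pyramid[::-1]  # back to finest-first order
    return [conv1d(p, w) for p, w in zip(pyramid, smooth_ws)]
```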
c. Boundary Distribution Regression (BDR) Head:
Instead of regressing scalar offsets for action boundaries, the BDR head predicts discrete probability distributions $p^{s}$ and $p^{e}$ over a fixed set of possible offset bins. The continuous boundary offsets are recovered by computing the expectation over bins:

$$\hat{d} = \sum_{i=0}^{n} i \cdot p_i.$$

This allows the model to represent and reason about boundary uncertainty directly.
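The expectation step above can be expressed in a few lines. A minimal sketch, assuming logits over integer-indexed bins (the bin count and spacing are illustrative, not taken from the paper):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def expected_offset(logits):
    # Convert bin logits to a discrete distribution, then take the
    # expectation over bin indices to recover a continuous offset.
    p = softmax(np.asarray(logits, dtype=float))
    bins = np.arange(len(p))
    return float((p * bins).sum())
```

A sharply peaked distribution yields an offset near the peak bin, while a spread-out distribution averages across its support, which is exactly how boundary uncertainty shows up in the prediction.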
2. Boundary Distribution Regression and Loss Formulation
The BDR head recasts boundary regression as distribution learning. For each anchor, the head outputs two distributions: $p^{s}$ for the start boundary and $p^{e}$ for the end boundary.
Distribution Focal Loss (DFL):
Given a ground-truth continuous offset $d$ lying between adjacent bins $i$ and $i+1$, the DFL concentrates the loss on the two nearest bins:

$$\mathcal{L}_{\mathrm{DFL}}(p) = -\big((i+1-d)\log p_i + (d-i)\log p_{i+1}\big).$$

An analogous term is used for the end boundary. The overall objective is

$$\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda_{\mathrm{reg}}\,\mathcal{L}_{\mathrm{DFL}},$$

where $\mathcal{L}_{\mathrm{cls}}$ is the standard Focal Loss for classification and $\lambda_{\mathrm{reg}}$ is a fixed regression weight.
Modeling full distributions enables explicit handling of cases with ambiguous or "fuzzy" boundaries, reflected in the sharpness or spread of the predicted distributions $p^{s}$ and $p^{e}$. This formulation supports more stable convergence and improved localization accuracy.
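The DFL term can be sketched directly from its definition: the loss is a cross-entropy against the two bins bracketing the continuous target, weighted by proximity. A minimal illustration, assuming unit-spaced bins and a target strictly inside the bin range:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def distribution_focal_loss(logits, target):
    # Push probability mass onto the two bins bracketing the continuous
    # target offset, weighted by how close the target is to each bin.
    # Assumes 0 <= target < len(logits) - 1.
    p = softmax(np.asarray(logits, dtype=float))
    lo = int(np.floor(target))
    hi = lo + 1
    w_lo = hi - target   # closer to a bin => larger weight on it
    w_hi = target - lo
    return float(-(w_lo * np.log(p[lo]) + w_hi * np.log(p[hi])))
```

When the target sits exactly on a bin, the loss reduces to plain cross-entropy on that bin; when it sits midway, the optimum splits mass evenly between the two neighbors, which is what lets the head encode boundary ambiguity.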
3. Multi-Scale and Local Attention Mechanisms
TBT-Former employs local self-attention, limiting each token's receptive field to a window of $W$ steps, thereby reducing computational complexity from $O(T^2)$ to $O(T \cdot W)$ and enabling the handling of long video sequences. Ablation studies on THUMOS14 identify the window size that yields the best average mAP.
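Local self-attention can be realized by masking the score matrix so each query only attends within its window. A dense-mask sketch (a production kernel would compute only the banded entries to actually achieve $O(T \cdot W)$ cost):

```python
import numpy as np

def local_attention_weights(scores, window):
    # Mask attention scores so each query attends only to keys within
    # +/- `window` steps, then softmax row-wise. Only O(T*W) entries
    # per row remain nonzero, versus O(T^2) for full attention.
    T = scores.shape[0]
    idx = np.arange(T)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    masked = np.where(mask, scores, -np.inf)
    e = np.exp(masked - masked.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```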
The CS-FPN structure features bidirectional fusion—both lateral and top-down—improving the representation of activities spanning multiple temporal scales. This multi-scale temporal encoding is essential for recognizing actions of varying lengths and dynamics.
4. Training Protocols and Implementation
Input Preparation:
Frame-wise features are pre-extracted at 8 fps using a two-stream I3D network, yielding a fixed-dimensional feature vector per frame. Each training sample is a fixed-length window of frames covering 125 s, uniformly sampled with overlap.
Optimization:
Training uses the AdamW optimizer with a batch size of 16. Learning-rate scheduling includes a 1k-iteration warm-up followed by cosine decay, and the regression loss weight is held fixed throughout training.
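The warm-up-plus-cosine schedule can be written as a simple step-to-rate function. A sketch under assumed step counts (the total-step budget here is a placeholder, not a value from the paper):

```python
import math

def lr_at(step, base_lr, warmup_steps=1000, total_steps=20000):
    # Linear warm-up over the first `warmup_steps` iterations,
    # then cosine decay to zero over the remaining steps.
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```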
Inference:
At inference, every pyramid level produces class scores and boundary offset estimates via the BDR head. Segment proposals are retained if their class score exceeds a threshold (0.1–0.5), and non-maximum suppression with a temporal IoU threshold of 0.4 removes duplicate detections.
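The score-threshold-plus-NMS step can be sketched as follows; this is a standard greedy temporal NMS, not the paper's exact implementation:

```python
def temporal_iou(a, b):
    # IoU between two (start, end) segments in seconds.
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(segments, scores, iou_thr=0.4, score_thr=0.1):
    # Greedily keep the highest-scoring segments, suppressing any
    # candidate whose temporal IoU with a kept segment is too high.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thr),
        key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(temporal_iou(segments[i], segments[j]) < iou_thr
               for j in keep):
            keep.append(i)
    return keep
```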
Implementation Overhead:
The CS-FPN adds under 5% runtime overhead. The BDR head requires only two lightweight 1D convolutional layers, one per boundary offset distribution.
5. Empirical Evaluation and Comparative Performance
TBT-Former was benchmarked on THUMOS14, ActivityNet-1.3, and EPIC-Kitchens 100, using mean Average Precision (mAP) at multiple temporal IoU thresholds.
Summary of TBT-Former Results:
| Model | Avg. mAP (THUMOS14) | mAP @0.5 | mAP @0.7 | ActivityNet Avg. | EPIC-Kitchens Verb Avg. | EPIC-Kitchens Noun Avg. |
|---|---|---|---|---|---|---|
| ActionFormer | 66.8 | 71.0 | 43.9 | 36.6 | 23.5 | 21.9 |
| TBT-Former | 68.0 | 72.4 | 45.3 | 36.8 | 24.5 | 23.1 |
Ablation Summary (THUMOS14):
| Configuration | Avg. mAP | Δ from Baseline |
|---|---|---|
| ActionFormer (baseline) | 66.8 | – |
| + Scaled Backbone | 67.2 | +0.4 |
| + Cross-Scale FPN | 67.1 | +0.3 |
| + Boundary Distribution Head | 67.6 | +0.8 |
| Full TBT-Former (all combined) | 68.0 | +1.2 |
These results show consistent gains from each architectural enhancement, with the full model outperforming the ActionFormer baseline on all principal benchmarks. TBT-Former achieves the highest single-stage mAP on THUMOS14 and EPIC-Kitchens 100 (Rathnayaka et al., 1 Dec 2025).
6. Relationship to Boundary-Based Transformers and Prior Work
Compared to earlier boundary Transformer models such as the Boundary Transformer module in TAPG Transformer (Wang et al., 2021), which predicts per-frame boundary probabilities via standard Transformer encoder–decoder blocks and a frame-wise binary logistic regression loss, TBT-Former introduces innovations in both architecture and output representation. The Boundary Transformer first produces per-frame probabilities, later refined by a separate proposal Transformer, with outputs ultimately combined through "fuzzy matching." In contrast, TBT-Former directly predicts distributions over boundary offsets and unifies multi-scale context within a single-stage design.
The probabilistic representation via BDR distinguishes TBT-Former from scalar regression (as in the original TAPG Transformer and ActionFormer paradigms) by supporting uncertainty modeling at the representation level. A plausible implication is that such distributional outputs are advantageous for understanding temporally ambiguous action events and could underpin further research in uncertainty-aware video understanding.
7. Significance and Future Directions
TBT-Former demonstrates empirically that architectural scaling, explicit multi-scale fusion, and distributional boundary modeling yield measurable advances in single-stage, anchor-free temporal action localization. It provides a blueprint for future architectures that benefit from modeling boundary uncertainty, especially in videos with ambiguous temporal structure, and sets a new performance standard on multiple video understanding benchmarks (Rathnayaka et al., 1 Dec 2025). As the field moves towards representations that are robust to ambiguity in annotation and inherent uncertainty in human-labeled boundaries, the principles embodied in TBT-Former's design are likely to inform subsequent temporal localization frameworks.