
TBT-Former: Temporal Boundary Transformer

Updated 8 December 2025
  • The paper introduces TBT-Former with a scaled Transformer backbone, cross-scale feature pyramid, and boundary distribution regression head to enhance temporal action localization.
  • It employs a novel boundary distribution regression head that predicts discrete probability distributions for action boundaries, explicitly modeling uncertainty.
  • Extensive evaluations on THUMOS14, ActivityNet, and EPIC-Kitchens confirm that TBT-Former outperforms predecessors through improved multi-scale fusion and robust localization.

The Temporal Boundary Transformer (TBT-Former) is a single-stage, anchor-free Transformer architecture designed for temporal action localization (TAL) in untrimmed videos. TAL addresses the identification of the start time, end time, and category for all action instances within a given video. TBT-Former targets two persistent issues in prior Transformer-based TAL frameworks: the imprecise localization of actions with ambiguous temporal boundaries and the limited fusion of multi-scale contextual information. It introduces three primary innovations—a scaled Transformer backbone, a cross-scale feature pyramid network (CS-FPN), and a boundary distribution regression head—enabling explicit modeling of boundary uncertainty and substantial empirical improvements over previous methods (Rathnayaka et al., 1 Dec 2025).

1. Model Architecture and Key Innovations

TBT-Former is constructed upon three architectural components:

a. Scaled Transformer Backbone:

The backbone processes a sequence of frame-level input features X ∈ ℝ^{L×D} with N = 6 stacked Transformer blocks. Each block has multi-head self-attention (MHSA) with H = 16 heads and a feed-forward network (FFN) with hidden dimension D_ff = 6D (compared to H = 8 and D_ff = 4D in ActionFormer). The computational path per block is:

Z̃ = LayerNorm(Z)
A = MHSA_{H,D}(Z̃) + Z
B = LayerNorm(A)
Z′ = FFN_{6D}(B) + A

where

MHSA(Z) = Σ_{h=1}^{H} Attention_h(Z),   FFN(x) = W₂ σ(W₁x + b₁) + b₂

with W₁ ∈ ℝ^{6D×D} and W₂ ∈ ℝ^{D×6D}.
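The pre-LayerNorm residual path above can be sketched in NumPy. This is a simplified single-block illustration: the per-head projection matrices of the real model are omitted, and heads are implemented as channel splits (the standard concatenation form) rather than the paper's summation form.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token over the feature dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mhsa(z, H):
    # Simplified multi-head self-attention: split the D channels into
    # H heads, attend within each head, and write the results back.
    L, D = z.shape
    d = D // H
    out = np.zeros_like(z)
    for h in range(H):
        q = k = v = z[:, h * d:(h + 1) * d]
        scores = q @ k.T / np.sqrt(d)
        attn = np.exp(scores - scores.max(-1, keepdims=True))
        attn /= attn.sum(-1, keepdims=True)
        out[:, h * d:(h + 1) * d] = attn @ v
    return out

def ffn(x, W1, b1, W2, b2):
    # Position-wise FFN with hidden width 6D (ReLU stands in for the
    # unspecified nonlinearity sigma).
    return np.maximum(x @ W1.T + b1, 0) @ W2.T + b2

def tbt_block(z, W1, b1, W2, b2, H=16):
    # Pre-LayerNorm residual block, following the equations above:
    # A = MHSA(LN(Z)) + Z ; Z' = FFN(LN(A)) + A.
    a = mhsa(layer_norm(z), H) + z
    return ffn(layer_norm(a), W1, b1, W2, b2) + a
```

Note that the residual stream keeps the (L, D) shape throughout, so N = 6 such blocks can be stacked directly.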

b. Cross-Scale Feature Pyramid Network (CS-FPN):

To handle variable action durations, TBT-Former implements a feature pyramid {P₂, P₃, P₄, P₅} by fusing outputs from backbone stages {C₂, C₃, C₄, C₅}. The pyramid is constructed via lateral 1×1 convolutions, upsampling along the top-down pathway, and 3×3 refinement convolutions:

P₅ = Conv_{1×1}(C₅)
P_i = Conv_{1×1}(C_i) + Upsample(P_{i+1}),   i = 4, 3, 2

This bidirectional approach contrasts with the unidirectional pyramids used previously, enhancing the fusion of semantic and fine temporal detail.
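The top-down recursion above can be sketched as follows. This is an illustrative 1-D NumPy version in which pointwise matmuls stand in for the 1×1 lateral and 3×3 refinement convolutions; the weight lists Wlat and Wref are placeholders for learned parameters.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour temporal upsampling: (T, C) -> (2T, C).
    return np.repeat(x, 2, axis=0)

def cs_fpn(c2, c3, c4, c5, Wlat, Wref):
    # c2..c5: backbone stage outputs, temporal length halving per stage.
    # Wlat / Wref: per-level weight matrices standing in for the 1x1
    # lateral and 3x3 refinement convolutions.
    p5 = c5 @ Wlat[3]
    p4 = c4 @ Wlat[2] + upsample2x(p5)   # P_i = lateral(C_i) + up(P_{i+1})
    p3 = c3 @ Wlat[1] + upsample2x(p4)
    p2 = c2 @ Wlat[0] + upsample2x(p3)
    return [p @ w for p, w in zip((p2, p3, p4, p5), Wref)]
```

Each output level P_i keeps the temporal resolution of its backbone stage C_i, so short actions are localized on P₂ and long ones on P₅.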

c. Boundary Distribution Regression (BDR) Head:

Instead of regressing scalar offsets (d^s, d^e) for action boundaries, the BDR head predicts discrete probability distributions P_s, P_e over W possible offset bins. The continuous boundary offsets are recovered by computing the expectation:

d̂^s = Σ_{i=0}^{W−1} i · p_s(i),   d̂^e = Σ_{i=0}^{W−1} i · p_e(i)

This allows the model to represent and reason about boundary uncertainty directly.
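Decoding a boundary from the predicted distribution is a two-step operation, sketched here in NumPy: softmax the head's raw logits into a distribution over the W bins, then take the expected bin index as the continuous offset.

```python
import numpy as np

def decode_offset(logits):
    # Softmax over the W offset bins, then the expectation
    # E[i] = sum_i i * p(i) gives the continuous boundary offset
    # (in feature-grid units).
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float((np.arange(len(p)) * p).sum())
```

A sharply peaked distribution decodes to an offset near its mode, while a spread-out distribution reflects boundary uncertainty in the decoded value.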

2. Boundary Distribution Regression and Loss Formulation

The BDR head recasts boundary regression as distribution learning. For each anchor, the head outputs P_s and P_e (the start and end boundary distributions).

Distribution Focal Loss (DFL):

Given a ground-truth continuous offset d^s_gt ∈ [i, i+1], the DFL concentrates the loss on the two nearest bins:

L_DFL(P_s, d^s_gt) = −[(i+1 − d^s_gt) log p_s(i) + (d^s_gt − i) log p_s(i+1)]

An analogous term is used for P_e. The overall objective is:

L = L_cls + λ[L_DFL(P_s, d^s_gt) + L_DFL(P_e, d^e_gt)]

where L_cls is the standard Focal Loss for classification and λ = 1.0.
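The DFL term above is a linear interpolation of the cross-entropy on the two bins bracketing the continuous target, which can be written compactly:

```python
import numpy as np

def dfl(p, d_gt):
    # Distribution Focal Loss: cross-entropy against the two bins
    # bracketing the continuous ground-truth offset d_gt in [i, i+1],
    # weighted by proximity to each bin. p is a probability vector
    # over the W offset bins.
    i = int(np.floor(d_gt))
    return -((i + 1 - d_gt) * np.log(p[i]) + (d_gt - i) * np.log(p[i + 1]))
```

When d_gt sits exactly on a bin the loss reduces to plain cross-entropy on that bin; when it sits between bins, probability mass is encouraged on both, which is what lets the head represent soft boundaries.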

Modeling full distributions enables explicit handling of cases with ambiguous or "fuzzy" boundaries, reflected in the sharpness or spread of PsP_s and PeP_e. This formulation supports more stable convergence and improved localization accuracy.

3. Multi-Scale and Local Attention Mechanisms

TBT-Former employs local self-attention, limiting each token's receptive field to a window of w steps, which reduces the computational complexity to O(L·w·D) and makes long video sequences tractable. Ablation studies show that w = 30 yields the best average mAP on THUMOS14.
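The windowing idea can be illustrated with a minimal single-head NumPy sketch (the real model applies this inside each MHSA layer with learned projections, omitted here):

```python
import numpy as np

def local_self_attention(z, w):
    # Each timestep attends only to tokens within +/- w//2 steps,
    # so the cost scales as O(L * w * D) rather than O(L^2 * D).
    L, D = z.shape
    half = w // 2
    out = np.zeros_like(z)
    for t in range(L):
        lo, hi = max(0, t - half), min(L, t + half + 1)
        scores = z[lo:hi] @ z[t] / np.sqrt(D)
        a = np.exp(scores - scores.max())
        a /= a.sum()
        out[t] = a @ z[lo:hi]
    return out
```

Because attention is restricted to the window, perturbing a distant token leaves a timestep's output unchanged; long-range context is still accumulated indirectly through the stacked layers and the feature pyramid.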

The CS-FPN structure features bidirectional fusion—both lateral and top-down—improving the representation of activities spanning multiple temporal scales. This multi-scale temporal encoding is essential for recognizing actions of varying lengths and dynamics.

4. Training Protocols and Implementation

Input Preparation:

Frame-wise features are pre-extracted at 8 fps using a Two-Stream I3D network, yielding D = 400-dimensional features. Each training sample consists of L = 1000 frames (≈125 s), uniformly sampled with overlap.

Optimization:

Training uses the AdamW optimizer (initial learning rate 10⁻⁴, weight decay 10⁻², batch size 16). The learning-rate schedule includes a 1k-iteration warm-up followed by cosine decay. The regression loss weight is fixed at λ_reg = 1.0.
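The warm-up-then-cosine schedule can be sketched as a pure function of the step count. The total_steps value is an assumed overall training length for illustration; the section does not specify it.

```python
import math

def lr_schedule(step, base_lr=1e-4, warmup=1000, total_steps=20000):
    # Linear warm-up over the first 1k iterations, then cosine decay
    # to zero over the remaining steps. total_steps is an assumption
    # made for this sketch, not a value stated in the paper.
    if step < warmup:
        return base_lr * (step + 1) / warmup
    t = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```

The schedule rises linearly to the base rate at the end of warm-up and then follows the half-cosine down to zero at total_steps.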

Inference:

At inference, all pyramid levels P_i provide class scores and boundary offset estimates (d̂^s, d̂^e) via the BDR head. Segment proposals are retained if their class score exceeds a threshold (0.1–0.5), and non-maximum suppression with a temporal IoU threshold of 0.4 is applied.
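The post-processing step above amounts to score filtering followed by greedy temporal NMS, a minimal version of which is:

```python
def temporal_iou(a, b):
    # IoU between two 1-D segments given as (start, end, ...).
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def filter_and_nms(proposals, score_thresh=0.1, iou_thresh=0.4):
    # proposals: list of (start, end, score). Drop low-scoring segments,
    # then greedily keep the highest-scoring one and suppress any later
    # proposal overlapping a kept segment at IoU >= iou_thresh.
    kept = []
    for p in sorted((p for p in proposals if p[2] >= score_thresh),
                    key=lambda p: -p[2]):
        if all(temporal_iou(p, k) < iou_thresh for k in kept):
            kept.append(p)
    return kept
```

With the section's settings (score threshold in 0.1–0.5, IoU 0.4), near-duplicate detections of the same action instance collapse to the single highest-scoring segment.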

Implementation Overhead:

The CS-FPN adds under 5% runtime overhead. The BDR head requires only two lightweight 1D convolution layers for each of the offset distributions (W = 64 bins).

5. Empirical Evaluation and Comparative Performance

TBT-Former was benchmarked on THUMOS14, ActivityNet-1.3, and EPIC-Kitchens 100, using mean Average Precision (mAP) at multiple temporal IoU thresholds.

Summary of TBT-Former Results:

Model          Avg. mAP (THUMOS14)   mAP@0.5   mAP@0.7   ActivityNet Avg.   EPIC-Kitchens Verb Avg.   EPIC-Kitchens Noun Avg.
ActionFormer   66.8                  71.0      43.9      36.6               23.5                      21.9
TBT-Former     68.0                  72.4      45.3      36.8               24.5                      23.1

Ablation Summary (THUMOS14):

Configuration                      Avg. mAP   Δ from Baseline
ActionFormer (baseline)            66.8       —
+ Scaled Backbone                  67.2       +0.4
+ Cross-Scale FPN                  67.1       +0.3
+ Boundary Distribution Head       67.6       +0.8
Full TBT-Former (all combined)     68.0       +1.2

These results show consistent gains from each architectural enhancement, with the full model outperforming the ActionFormer baseline on all principal benchmarks. The highest single-stage mAP is achieved on THUMOS14 and EPIC-Kitchens 100 (Rathnayaka et al., 1 Dec 2025).

6. Relationship to Boundary-Based Transformers and Prior Work

Compared to earlier boundary Transformer models such as the Boundary Transformer module in TAPG Transformer (Wang et al., 2021), which predicts per-frame boundary probabilities via standard Transformer encoder–decoder blocks and a frame-wise binary logistic regression loss, TBT-Former introduces protocol-level innovations in both architecture and outputs. The Boundary Transformer first produces per-frame probabilities, later refined by a separate proposal Transformer, ultimately combining outputs through "fuzzy matching." In contrast, TBT-Former directly predicts distributions over boundary offsets and unifies multi-scale context within a single-stage design.

The probabilistic representation via BDR distinguishes TBT-Former from scalar regression (as in the original TAPG Transformer and ActionFormer paradigms) by supporting uncertainty modeling at the representation level. A plausible implication is that such distributional outputs are advantageous for understanding temporally ambiguous action events and could underpin further research in uncertainty-aware video understanding.

7. Significance and Future Directions

TBT-Former demonstrates empirically that architectural scaling, explicit multi-scale fusion, and distributional boundary modeling yield measurable advances in single-stage, anchor-free temporal action localization. It provides a blueprint for future architectures that benefit from modeling boundary uncertainty, especially in videos with ambiguous temporal structure, and sets a new performance standard on multiple video understanding benchmarks (Rathnayaka et al., 1 Dec 2025). As the field moves towards representations that are robust to ambiguity in annotation and inherent uncertainty in human-labeled boundaries, the principles embodied in TBT-Former's design are likely to inform subsequent temporal localization frameworks.
