
TBT-Former: Temporal Boundary Transformer

Updated 8 December 2025
  • The paper introduces TBT-Former with a scaled Transformer backbone, cross-scale feature pyramid, and boundary distribution regression head to enhance temporal action localization.
  • It employs a novel boundary distribution regression head that predicts discrete probability distributions for action boundaries, explicitly modeling uncertainty.
  • Extensive evaluations on THUMOS14, ActivityNet, and EPIC-Kitchens confirm that TBT-Former outperforms predecessors through improved multi-scale fusion and robust localization.

The Temporal Boundary Transformer (TBT-Former) is a single-stage, anchor-free Transformer architecture designed for temporal action localization (TAL) in untrimmed videos. TAL addresses the identification of the start time, end time, and category for all action instances within a given video. TBT-Former targets two persistent issues in prior Transformer-based TAL frameworks: the imprecise localization of actions with ambiguous temporal boundaries and the limited fusion of multi-scale contextual information. It introduces three primary innovations—a scaled Transformer backbone, a cross-scale feature pyramid network (CS-FPN), and a boundary distribution regression head—enabling explicit modeling of boundary uncertainty and substantial empirical improvements over previous methods (Rathnayaka et al., 1 Dec 2025).

1. Model Architecture and Key Innovations

TBT-Former is constructed upon three architectural components:

a. Scaled Transformer Backbone:

The backbone processes a sequence of frame-level input features X ∈ ℝ^{L×D} with N = 6 stacked Transformer blocks. Each block has multi-head self-attention (MHSA) with H = 16 heads and a feed-forward network (FFN) with hidden dimension D_ff = 6D (compared to H = 8 and D_ff = 4D in ActionFormer). The computational path per block is:

Z̃ = LayerNorm(Z)
A = MHSA_{H,D}(Z̃) + Z
B = LayerNorm(A)
Z′ = FFN_{6D}(B) + A

where

MHSA(Z) = Σ_{h=1}^{H} Attention_h(Z),   FFN(x) = W₂ σ(W₁x + b₁) + b₂

with W₁ ∈ ℝ^{6D×D} and W₂ ∈ ℝ^{D×6D}.
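The pre-LayerNorm residual path above can be sketched in NumPy. This is a simplified single-block illustration: the per-head projection matrices of the real model are omitted, and heads are implemented as channel splits (the standard concatenation form) rather than the paper's summation form.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token over the feature dimension.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mhsa(z, H):
    # Simplified multi-head self-attention: split the D channels into
    # H heads, attend within each head, and write the results back.
    L, D = z.shape
    d = D // H
    out = np.zeros_like(z)
    for h in range(H):
        q = k = v = z[:, h * d:(h + 1) * d]
        scores = q @ k.T / np.sqrt(d)
        attn = np.exp(scores - scores.max(-1, keepdims=True))
        attn /= attn.sum(-1, keepdims=True)
        out[:, h * d:(h + 1) * d] = attn @ v
    return out

def ffn(x, W1, b1, W2, b2):
    # Position-wise FFN with hidden width 6D (ReLU stands in for the
    # unspecified nonlinearity sigma).
    return np.maximum(x @ W1.T + b1, 0) @ W2.T + b2

def tbt_block(z, W1, b1, W2, b2, H=16):
    # Pre-LayerNorm residual block, following the equations above:
    # A = MHSA(LN(Z)) + Z ; Z' = FFN(LN(A)) + A.
    a = mhsa(layer_norm(z), H) + z
    return ffn(layer_norm(a), W1, b1, W2, b2) + a
```

Note that the residual stream keeps the (L, D) shape throughout, so N = 6 such blocks can be stacked directly.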

b. Cross-Scale Feature Pyramid Network (CS-FPN):

To handle variable action durations, TBT-Former implements a feature pyramid {P₂, P₃, P₄, P₅} by fusing outputs from backbone stages {C₂, C₃, C₄, C₅}. The pyramid is constructed via lateral 1×1 convolutions, upsampling along the top-down pathway, and 3×3 refinement convolutions:

P₅ = Conv_{1×1}(C₅)
P_i = Conv_{1×1}(C_i) + Upsample(P_{i+1}),   i = 4, 3, 2

This bidirectional approach contrasts with the unidirectional pyramids used previously, enhancing the fusion of semantic and fine temporal detail.
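The top-down recursion above can be sketched as follows. This is an illustrative 1-D NumPy version in which pointwise matmuls stand in for the 1×1 lateral and 3×3 refinement convolutions; the weight lists Wlat and Wref are placeholders for learned parameters.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour temporal upsampling: (T, C) -> (2T, C).
    return np.repeat(x, 2, axis=0)

def cs_fpn(c2, c3, c4, c5, Wlat, Wref):
    # c2..c5: backbone stage outputs, temporal length halving per stage.
    # Wlat / Wref: per-level weight matrices standing in for the 1x1
    # lateral and 3x3 refinement convolutions.
    p5 = c5 @ Wlat[3]
    p4 = c4 @ Wlat[2] + upsample2x(p5)   # P_i = lateral(C_i) + up(P_{i+1})
    p3 = c3 @ Wlat[1] + upsample2x(p4)
    p2 = c2 @ Wlat[0] + upsample2x(p3)
    return [p @ w for p, w in zip((p2, p3, p4, p5), Wref)]
```

Each output level P_i keeps the temporal resolution of its backbone stage C_i, so short actions are localized on P₂ and long ones on P₅.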

c. Boundary Distribution Regression (BDR) Head:

Instead of regressing scalar offsets (d^s, d^e) for action boundaries, the BDR head predicts discrete probability distributions P_s, P_e over W possible offset bins. The continuous boundary offsets are recovered by computing the expectation:

d̂^s = Σ_{i=0}^{W−1} i · p_s(i),   d̂^e = Σ_{i=0}^{W−1} i · p_e(i)

This allows the model to represent and reason about boundary uncertainty directly.
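Decoding a boundary from the predicted distribution is a two-step operation, sketched here in NumPy: softmax the head's raw logits into a distribution over the W bins, then take the expected bin index as the continuous offset.

```python
import numpy as np

def decode_offset(logits):
    # Softmax over the W offset bins, then the expectation
    # E[i] = sum_i i * p(i) gives the continuous boundary offset
    # (in feature-grid units).
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float((np.arange(len(p)) * p).sum())
```

A sharply peaked distribution decodes to an offset near its mode, while a spread-out distribution reflects boundary uncertainty in the decoded value.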

2. Boundary Distribution Regression and Loss Formulation

The BDR head recasts boundary regression as distribution learning. For each anchor, the head outputs P_s and P_e (the start and end boundary distributions).

Distribution Focal Loss (DFL):

Given a ground-truth continuous offset d^s_gt ∈ [i, i+1], the DFL concentrates the loss on the two nearest bins:

L_DFL(P_s, d^s_gt) = −[(i+1 − d^s_gt) log p_s(i) + (d^s_gt − i) log p_s(i+1)]

An analogous term is used for P_e. The overall objective is:

L = L_cls + λ[L_DFL(P_s, d^s_gt) + L_DFL(P_e, d^e_gt)]

where L_cls is the standard Focal Loss for classification and λ = 1.0.
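The DFL term above is a linear interpolation of the cross-entropy on the two bins bracketing the continuous target, which can be written compactly:

```python
import numpy as np

def dfl(p, d_gt):
    # Distribution Focal Loss: cross-entropy against the two bins
    # bracketing the continuous ground-truth offset d_gt in [i, i+1],
    # weighted by proximity to each bin. p is a probability vector
    # over the W offset bins.
    i = int(np.floor(d_gt))
    return -((i + 1 - d_gt) * np.log(p[i]) + (d_gt - i) * np.log(p[i + 1]))
```

When d_gt sits exactly on a bin the loss reduces to plain cross-entropy on that bin; when it sits between bins, probability mass is encouraged on both, which is what lets the head represent soft boundaries.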

Modeling full distributions enables explicit handling of cases with ambiguous or "fuzzy" boundaries, reflected in the sharpness or spread of PsP_s and PeP_e. This formulation supports more stable convergence and improved localization accuracy.

3. Multi-Scale and Local Attention Mechanisms

TBT-Former employs local self-attention, limiting each token's receptive field to a window of w steps, which reduces the computational complexity to O(L·w·D) and makes long video sequences tractable. Ablation studies show that w = 30 yields the best average mAP on THUMOS14.
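The windowing idea can be illustrated with a minimal single-head NumPy sketch (the real model applies this inside each MHSA layer with learned projections, omitted here):

```python
import numpy as np

def local_self_attention(z, w):
    # Each timestep attends only to tokens within +/- w//2 steps,
    # so the cost scales as O(L * w * D) rather than O(L^2 * D).
    L, D = z.shape
    half = w // 2
    out = np.zeros_like(z)
    for t in range(L):
        lo, hi = max(0, t - half), min(L, t + half + 1)
        scores = z[lo:hi] @ z[t] / np.sqrt(D)
        a = np.exp(scores - scores.max())
        a /= a.sum()
        out[t] = a @ z[lo:hi]
    return out
```

Because attention is restricted to the window, perturbing a distant token leaves a timestep's output unchanged; long-range context is still accumulated indirectly through the stacked layers and the feature pyramid.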

The CS-FPN structure features bidirectional fusion—both lateral and top-down—improving the representation of activities spanning multiple temporal scales. This multi-scale temporal encoding is essential for recognizing actions of varying lengths and dynamics.

4. Training Protocols and Implementation

Input Preparation:

Frame-wise features are pre-extracted at 8 fps using a Two-Stream I3D network, yielding D = 400-dimensional features. Each training sample consists of L = 1000 frames (≈125 s), uniformly sampled with overlap.

Optimization:

Training uses the AdamW optimizer (initial learning rate 10⁻⁴, weight decay 10⁻², batch size 16). The learning-rate schedule includes a 1k-iteration warm-up followed by cosine decay. The regression loss weight is fixed at λ_reg = 1.0.
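The warm-up-then-cosine schedule can be sketched as a pure function of the step count. The total_steps value is an assumed overall training length for illustration; the section does not specify it.

```python
import math

def lr_schedule(step, base_lr=1e-4, warmup=1000, total_steps=20000):
    # Linear warm-up over the first 1k iterations, then cosine decay
    # to zero over the remaining steps. total_steps is an assumption
    # made for this sketch, not a value stated in the paper.
    if step < warmup:
        return base_lr * (step + 1) / warmup
    t = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))
```

The schedule rises linearly to the base rate at the end of warm-up and then follows the half-cosine down to zero at total_steps.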

Inference:

At inference, all pyramid levels P_i provide class scores and boundary offset estimates (d̂^s, d̂^e) via the BDR head. Segment proposals are retained if their class score exceeds a threshold (0.1–0.5), and non-maximum suppression with a temporal IoU threshold of 0.4 is applied.
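The post-processing step above amounts to score filtering followed by greedy temporal NMS, a minimal version of which is:

```python
def temporal_iou(a, b):
    # IoU between two 1-D segments given as (start, end, ...).
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def filter_and_nms(proposals, score_thresh=0.1, iou_thresh=0.4):
    # proposals: list of (start, end, score). Drop low-scoring segments,
    # then greedily keep the highest-scoring one and suppress any later
    # proposal overlapping a kept segment at IoU >= iou_thresh.
    kept = []
    for p in sorted((p for p in proposals if p[2] >= score_thresh),
                    key=lambda p: -p[2]):
        if all(temporal_iou(p, k) < iou_thresh for k in kept):
            kept.append(p)
    return kept
```

With the section's settings (score threshold in 0.1–0.5, IoU 0.4), near-duplicate detections of the same action instance collapse to the single highest-scoring segment.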

Implementation Overhead:

The CS-FPN adds under 5% runtime overhead. The BDR head requires only two lightweight 1D convolution layers for each of the offset distributions (W = 64 bins).

5. Empirical Evaluation and Comparative Performance

TBT-Former was benchmarked on THUMOS14, ActivityNet-1.3, and EPIC-Kitchens 100, using mean Average Precision (mAP) at multiple temporal IoU thresholds.

Summary of TBT-Former Results:

Model          Avg. mAP (THUMOS14)   mAP@0.5   mAP@0.7   ActivityNet Avg.   EPIC-Kitchens Verb Avg.   EPIC-Kitchens Noun Avg.
ActionFormer   66.8                  71.0      43.9      36.6               23.5                      21.9
TBT-Former     68.0                  72.4      45.3      36.8               24.5                      23.1

Ablation Summary (THUMOS14):

Configuration                      Avg. mAP   Δ from Baseline
ActionFormer (baseline)            66.8       —
+ Scaled Backbone                  67.2       +0.4
+ Cross-Scale FPN                  67.1       +0.3
+ Boundary Distribution Head       67.6       +0.8
Full TBT-Former (all combined)     68.0       +1.2

These results show consistent gains from each architectural enhancement, with the full model outperforming the ActionFormer baseline on all principal benchmarks. The highest single-stage mAP is achieved on THUMOS14 and EPIC-Kitchens 100 (Rathnayaka et al., 1 Dec 2025).

6. Relationship to Boundary-Based Transformers and Prior Work

Compared to earlier boundary Transformer models such as the Boundary Transformer module in TAPG Transformer (Wang et al., 2021), which predicts per-frame boundary probabilities via standard Transformer encoder–decoder blocks and a frame-wise binary logistic regression loss, TBT-Former introduces protocol-level innovations in both architecture and outputs. The Boundary Transformer first produces per-frame probabilities, later refined by a separate proposal Transformer, ultimately combining outputs through "fuzzy matching." In contrast, TBT-Former directly predicts distributions over boundary offsets and unifies multi-scale context within a single-stage design.

The probabilistic representation via BDR distinguishes TBT-Former from scalar regression (as in the original TAPG Transformer and ActionFormer paradigms) by supporting uncertainty modeling at the representation level. A plausible implication is that such distributional outputs are advantageous for understanding temporally ambiguous action events and could underpin further research in uncertainty-aware video understanding.

7. Significance and Future Directions

TBT-Former demonstrates empirically that architectural scaling, explicit multi-scale fusion, and distributional boundary modeling yield measurable advances in single-stage, anchor-free temporal action localization. It provides a blueprint for future architectures that benefit from modeling boundary uncertainty, especially in videos with ambiguous temporal structure, and sets a new performance standard on multiple video understanding benchmarks (Rathnayaka et al., 1 Dec 2025). As the field moves towards representations that are robust to ambiguity in annotation and inherent uncertainty in human-labeled boundaries, the principles embodied in TBT-Former's design are likely to inform subsequent temporal localization frameworks.
