Token Fusion and Boost Modules

Updated 5 February 2026

Token Fusion and Boost Modules are architectural enhancements that adaptively fuse and boost token features to improve saliency prediction and denoising in visual and multi-modal tasks.
LTEB and DLTFB modules refine spatial-temporal representations by weighting, aggregating, and cyclically shifting tokens to capture salient cues with minimal computational overhead.
TBM reinforces token robustness during self-supervised pre-training by denoising corrupted features, achieving 2–4% accuracy gains on corrupted benchmarks such as ImageNet-C.

Token fusion and token boost modules are architectural enhancements designed for deep neural models operating on sets of tokenized representations, particularly in visual transformers and multi-modal encoders. These modules support tasks requiring either robust representation learning in the presence of noisy or ambiguous inputs, or discriminative aggregation of salient information across space, time, or modality. Leading examples include the Learnable Token Enhancement Block (LTEB) and Dynamic Learnable Token Fusion Block (DLTFB) for audio-visual saliency prediction (Hooshanfar et al., 14 Apr 2025), and the Token Boosting Module (TBM) for robust masked autoencoding pre-training of vision transformers (Li et al., 2023). This entry surveys their objectives, architectures, mathematical formulations, integration strategies, computational tradeoffs, and empirical impact.

1. Functional Objectives of Token Fusion and Boost Modules

Token fusion modules target flexible, data-dependent token aggregation and enhancement in encoder architectures where salient information is spatially, temporally, or cross-modally distributed and not known a priori. LTEB, for instance, adaptively weights and fuses learned tokens against video frame features to emphasize salient cues for video saliency prediction.

Token boost modules are designed to improve the robustness of token representations, especially in self-supervised pre-training under noisy or corrupted data regimes. The TBM accomplishes this by denoising and “boosting” intermediate token features inside the transformer encoder, improving the resilience of masked autoencoding objectives to unreliable observations.

A comparative summary is below:

Module	Primary Objective	Typical Context
LTEB	Adaptive saliency cue enhancement	Video (audio-visual)
DLTFB	Dynamic spatial mixing of tokens	Video (spatio-temporal)
TBM	Robustifying via token denoising	Visual transformer pre-training

2. LTEB: Learnable Token Enhancement Block

LTEB operates on the refined feature map $F^R\in\mathbb{R}^{C^*\times T\times H\times W}$ generated by a multi-scale video encoder. Its design comprises:

Gating Branch: A 3D convolution ($1\times3\times3$ kernel) followed by sigmoid activation yields a soft “importance” map $G\in[0,1]^{1\times T\times H\times W}$ that encodes pixel-level saliency across space and time.
Global Embedding & Token Weighting: Spatial–temporal average pooling distills $G$ into a vector $\mathbf{E}$. A learnable linear layer $\mathbf{W}_{\text{lin}}\in\mathbb{R}^{N\times d}$ transforms $\mathbf{E}$ into softmax-normalized weights $\{w_i\}$, one for each of the $N$ learned tokens $\{\mathbf{P}_i\}$.
Token Aggregation: Token maps are weighted and summed:

$\mathbf{K}(x,y) = \sum_{i=1}^N w_i \mathbf{P}_i(x,y)$

Projection & Fusion: After upscaling and 3D convolution, the aggregated token feature $\mathbf{P}''$ is multiplicatively fused with $F^R$ and added residually.
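The weighting and aggregation steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `lteb_token_fusion` is hypothetical, the gate `G` is taken as given rather than computed by a 3D convolution, $\mathbf{E}$ is assumed to come from pooling the gated features (so $d = C$), the upscaling and projection convolution are skipped by assuming the tokens already match the shape of $F^R$, and "fused multiplicatively and added residually" is read as $F^R \cdot \mathbf{K} + F^R$.

```python
import numpy as np

def softmax(z):
    z = z - z.max()              # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum()

def lteb_token_fusion(F_R, G, W_lin, tokens):
    """Hypothetical LTEB-style sketch.

    F_R:    (C, T, H, W) refined features
    G:      (1, T, H, W) gate with values in [0, 1]
    W_lin:  (N, d) linear layer, with d == C here (assumption)
    tokens: (N, C, T, H, W) learned tokens, already at F_R's resolution
    """
    # Global embedding E: gated features averaged over T, H, W -> (C,)
    E = (G * F_R).mean(axis=(1, 2, 3))
    # One softmax-normalised weight per learned token -> (N,)
    w = softmax(W_lin @ E)
    # Weighted sum of token maps, K = sum_i w_i * P_i -> (C, T, H, W)
    K = np.tensordot(w, tokens, axes=1)
    # Multiplicative fusion plus residual (one reading of the text)
    return F_R * K + F_R, w
```

With an all-zero `W_lin`, every token receives weight $1/N$, so the aggregate reduces to the plain mean of the token maps; this is a convenient sanity check.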

LTEB is inserted after top-down feature fusion in each scale of the encoder, with the best empirical performance when applied at all four pyramid stages (e.g., CC improvement from $0.539 \rightarrow 0.560$, AUC-J from $0.911 \rightarrow 0.922$ on DHF1K) (Hooshanfar et al., 14 Apr 2025).

3. DLTFB: Dynamic Learnable Token Fusion Block

DLTFB dynamically intermixes token features by combining cyclic shift operations with lightweight convolutional processing:

Spatial Shifting: First, $F^R$ is circularly shifted along width (modulo $W$) to give $\mathbf{F}_{\text{shift}}$, which then passes through a pointwise $1\times1\times1$ convolution, GELU activation, and a $3\times3\times3$ convolution:

$Y = \mathrm{Conv}_{3\times3\times3}\bigl(\mathrm{GELU}(\mathrm{Conv}_{1\times1\times1}(\mathbf{F}_{\text{shift}}))\bigr)$

Second Shift and Residual: $Y$ is cyclically shifted along height, followed by LayerNorm and another $1\times1\times1$ convolution to yield $F^{\mathrm{Sh}}$. Finally, the output fuses shifted and LTEB-weighted features multiplicatively and additively:

$F^{\mathrm{DLTFB}} = F^R \cdot \mathbf{P}'' + F^{\mathrm{Sh}}$
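The shift-and-mix pipeline above can be sketched with `np.roll`, which performs exactly the modulo-wrapped cyclic shift described. This is an illustrative sketch, not the published implementation: the function names, the shift amount of one position, the tanh approximation of GELU, and normalising over the channel axis are all assumptions, and the $3\times3\times3$ convolution is omitted to keep the example short.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def pointwise_conv(x, w):
    # 1x1x1 convolution = per-position linear map over channels
    # x: (C_in, T, H, W), w: (C_out, C_in)
    return np.einsum("oc,cthw->othw", w, x)

def layer_norm_channels(x, eps=1e-5):
    # Normalise over the channel axis (assumed axis; no learned affine)
    mean = x.mean(axis=0, keepdims=True)
    var = x.var(axis=0, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def dltfb_sketch(F_R, P, w1, w2, shift=1):
    """F_R, P: (C, T, H, W); w1, w2: (C, C) pointwise-conv weights."""
    F_shift = np.roll(F_R, shift, axis=3)        # cyclic shift along width
    Y = gelu(pointwise_conv(F_shift, w1))        # 3x3x3 conv omitted here
    Y = np.roll(Y, shift, axis=2)                # cyclic shift along height
    F_sh = pointwise_conv(layer_norm_channels(Y), w2)
    return F_R * P + F_sh                        # F^DLTFB = F^R * P'' + F^Sh
```

Because the shifts only reindex memory, the learned cost sits entirely in the convolutions, which is consistent with the low overhead the text attributes to this block.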

DLTFB is typically introduced only at the final encoder stage, where it provides measurable performance improvements (e.g., CC from $0.556 \rightarrow 0.561$ on DHF1K) and captures spatial long-range dependencies at low computational cost (Hooshanfar et al., 14 Apr 2025).

4. Token Boosting Module (TBM)

TBM is inserted at multiple depths within a visual transformer encoder, where it “cleans” per-token features $f \in \mathbb{R}^D$ as follows (Li et al., 2023):

Synthetic Corruption: Gaussian noise $s\sim\mathcal{N}(0,I_D)$, scaled by a learnable per-dimension $\alpha$, yields $q = \alpha \odot s$.
Intermediate Feature: $I = f + q$ .
Reconstruction: An MLP-based autoencoder $g$ reconstructs $\hat{f} = g(I;\theta)$ .
Boosted Output: The boosted token is $\hat{r} = 2\hat{f} - I$, and may be combined additively with $f$ as a residual.
Loss Augmentation: An auxiliary $L_2$ loss penalizes the difference between visible original and reconstructed features:

$L_{\text{recon}}(f, \hat{f}) = \lambda \| f - \hat{f} \|_2^2$

This is summed with the main MAE pre-training objective.
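The five steps above map onto a few lines of NumPy. The following is a minimal sketch under stated assumptions rather than the reference implementation: `tbm_boost` is a hypothetical name, the autoencoder $g$ is passed in as an arbitrary callable instead of a trained MLP, `lam` stands in for $\lambda$, and $\alpha$ is a fixed array rather than a learned parameter.

```python
import numpy as np

def tbm_boost(f, alpha, g, rng, lam=1.0):
    """Hypothetical single-token TBM step.

    f:     (D,) input token feature
    alpha: (D,) per-dimension noise scale (learned in practice)
    g:     callable mapping (D,) -> (D,), stand-in for the MLP autoencoder
    rng:   numpy random Generator
    """
    s = rng.standard_normal(f.shape)        # s ~ N(0, I_D)
    q = alpha * s                           # synthetic corruption q = alpha ⊙ s
    I = f + q                               # intermediate feature
    f_hat = g(I)                            # reconstruction
    r_hat = 2.0 * f_hat - I                 # boosted token
    loss = lam * np.sum((f - f_hat) ** 2)   # lambda * ||f - f_hat||_2^2
    return r_hat, loss
```

Two edge cases help build intuition. With `alpha` set to zero and `g` the identity, the module is a no-op and the loss vanishes. With a `g` that recovers `f` exactly, the boosted token becomes $2f - I = f - q$.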

TBM provides a theoretically principled reduction in reconstruction variance in double-corruption settings and empirically delivers $+2$–$4\%$ gains on corrupted benchmarks such as ImageNet-C across ViT/DeiT/Swin backbones. Using TBM at three encoder depths provides the strongest results.

5. Computational Tradeoffs

Token fusion/boost modules are designed for parameter and runtime efficiency relative to attention-heavy baselines:

LTEB: Adds one $3\times3\times3$ convolution, a small linear + softmax, interpolation, and another small convolution per encoder scale. Aggregate additional parameters total $\approx 8$M over the base encoder (Hooshanfar et al., 14 Apr 2025).
DLTFB: Adds two $1\times1\times1$ convolutions, one $3\times3\times3$ convolution, and two constant-time shift operations. Total model GFLOPs increase by $\leq5\%$.
TBM: Requires per-token autoencoding (typically 3-layer MLPs) and maintains per-dimension noise-scale vectors. Integration is minimal, and the module is agnostic to backbone choice.

These strategies achieve strong accuracy increases at $<20\%$ total parameter overhead for full-scale video models, while preserving real-time and scalable operation (Hooshanfar et al., 14 Apr 2025, Li et al., 2023).

6. Empirical Impact and Comparison

Empirical evaluations demonstrate that:

In audio-visual saliency (DHF1K): LTEB increases CC by $\approx4\%$ and AUC-J by $\approx1\%$; DLTFB adds a further $\approx1\%$. DTFSal with these modules surpasses prior token-based methods such as DiffSal, attaining strong results with $49$M parameters compared to $76$M (Hooshanfar et al., 14 Apr 2025).
In corrupted image and sequence tasks: TBM yields consistent improvements ($+2$–$4\%$ accuracy) for self-supervised and fine-tuned settings, across corruption types (Gaussian, blur, snow, JPEG), backbones, and modalities (RGB, skeletons, depth) (Li et al., 2023).
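The approximate percentages quoted for DHF1K appear to be relative gains over the metric pairs reported in Sections 2 and 3. A short check, assuming that reading (the helper name `relative_gain` is illustrative):

```python
def relative_gain(before, after):
    """Relative improvement in percent."""
    return 100.0 * (after - before) / before

pairs = [
    ("LTEB CC", 0.539, 0.560),
    ("LTEB AUC-J", 0.911, 0.922),
    ("DLTFB CC", 0.556, 0.561),
]
for name, before, after in pairs:
    print(f"{name}: {relative_gain(before, after):+.1f}%")
# LTEB CC: +3.9%
# LTEB AUC-J: +1.2%
# DLTFB CC: +0.9%
```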

Ablation studies identify maximal utility when these modules are applied at multiple encoder depths or pyramid stages.

7. Integration Strategies and Broader Significance

These modules are designed as “plug-and-play” components: LTEB is placed atop each feature scale in a U-Net-like hierarchical encoder, with DLTFB following it (typically only at the final stage); TBM is interleaved between transformer sublayers at various encoder depths.
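The interleaving pattern can be sketched as a plain loop over encoder layers with an optional boost step at chosen depths. This is a schematic only: `run_encoder`, `boost`, and `boost_depths` are hypothetical names, and the layers are arbitrary callables rather than real transformer blocks.

```python
def run_encoder(tokens, layers, boost, boost_depths):
    """Apply each layer in order; call `boost` after the layers whose
    1-based depth is listed in `boost_depths`."""
    for depth, layer in enumerate(layers, start=1):
        tokens = layer(tokens)
        if depth in boost_depths:
            tokens = boost(tokens)
    return tokens
```

Passing an empty `boost_depths` recovers the unmodified encoder, which is what makes the module easy to ablate.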

By replacing dense self-attention with learnable, data-driven token reweighting, local mixing, and denoising autoencoding, token fusion and boost modules preserve or enhance discriminative power while keeping compute overhead low. The reported results show competitive accuracy and robustness on tasks ranging from video saliency to noisy self-supervised learning, which suggests broad applicability across vision and multi-modal domains (Hooshanfar et al., 14 Apr 2025, Li et al., 2023).


References

Hooshanfar et al. (2025). DTFSal: Audio-Visual Dynamic Token Fusion for Video Saliency Prediction.

Li et al. (2023). Token Boosting for Robust Self-Supervised Visual Transformer Pre-training.
