
Token Fusion and Boost Modules

Updated 5 February 2026
  • Token Fusion and Boost Modules are architectural enhancements that adaptively fuse and boost token features to improve saliency prediction and denoising in visual and multi-modal tasks.
  • LTEB and DLTFB modules refine spatial-temporal representations by weighting, aggregating, and cyclically shifting tokens to capture salient cues with minimal computational overhead.
  • The TBM module reinforces token robustness during self-supervised pre-training by denoising corrupted features, achieving 2–4% accuracy gains on challenging benchmarks.

Token fusion and token boost modules are architectural enhancements designed for deep neural models operating on sets of tokenized representations, particularly in visual transformers and multi-modal encoders. These modules support tasks requiring either robust representation learning in the presence of noisy or ambiguous inputs, or discriminative aggregation of salient information across space, time, or modality. Leading examples include the Learnable Token Enhancement Block (LTEB) and Dynamic Learnable Token Fusion Block (DLTFB) for audio-visual saliency prediction (Hooshanfar et al., 14 Apr 2025), and the Token Boosting Module (TBM) for robust masked autoencoding pre-training of vision transformers (Li et al., 2023). This entry surveys their objectives, architectures, mathematical formulations, integration strategies, computational tradeoffs, and empirical impact.

1. Functional Objectives of Token Fusion and Boost Modules

Token fusion modules target flexible, data-dependent token aggregation and enhancement in encoder architectures where salient information is spatially, temporally, or cross-modally distributed and not known a priori. LTEB, for instance, adaptively weights and fuses learned tokens against video frame features to emphasize salient cues for video saliency prediction.

Token boost modules are designed to improve the robustness of token representations, especially in self-supervised pre-training under noisy or corrupted data regimes. The TBM accomplishes this by denoising and “boosting” intermediate token features inside the transformer encoder, improving the resilience of masked autoencoding objectives to unreliable observations.

A comparative summary is below:

| Module | Primary Objective | Typical Context |
|---|---|---|
| LTEB | Adaptive saliency cue enhancement | Video (audio-visual) |
| DLTFB | Dynamic spatial mixing of tokens | Video (spatio-temporal) |
| TBM | Robustifying via token denoising | Visual transformer pre-training |

2. LTEB: Learnable Token Enhancement Block

LTEB operates on the refined feature map $F^R\in\mathbb{R}^{C^*\times T\times H\times W}$ generated by a multi-scale video encoder. Its design comprises:

  • Gating Branch: A 3D convolution ($1\times3\times3$ kernel) followed by sigmoid activation yields a soft “importance” map $G\in[0,1]^{1\times T\times H\times W}$ that encodes pixel-level saliency across space and time.
  • Global Embedding & Token Weighting: Spatial–temporal average pooling distills $G$ into a vector $\mathbf{E}$. A learnable linear layer $\mathbf{W}_{\text{lin}}\in\mathbb{R}^{N\times d}$ transforms $\mathbf{E}$ to softmax-normalized weights $\{w_i\}$ assigned to each of $N$ learned tokens $\{\mathbf{P}_i\}$.
  • Token Aggregation: Token maps are weighted and summed:

$$\mathbf{K}(x,y) = \sum_{i=1}^N w_i \mathbf{P}_i(x,y)$$

  • Projection & Fusion: After upscaling and 3D convolution, the aggregated token feature $\mathbf{P}''$ is multiplicatively fused with $F^R$ and added residually.
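
The weighting, aggregation, and fusion steps above can be sketched in dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: nested lists stand in for tensors, the function names (`token_weights`, `aggregate_tokens`, `fuse_residual`) are hypothetical, the embedding $\mathbf{E}$ is assumed to be a length-$d$ vector, the upscaling and 3D convolution that produce $\mathbf{P}''$ are omitted, and "fused multiplicatively and added residually" is read as $F^R \cdot \mathbf{P}'' + F^R$.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_weights(w_lin, embedding):
    """Map the pooled embedding E (length d) to N softmax weights via W_lin (N x d)."""
    logits = [sum(w * e for w, e in zip(row, embedding)) for row in w_lin]
    return softmax(logits)

def aggregate_tokens(weights, tokens):
    """Compute K(x, y) = sum_i w_i * P_i(x, y) for N token maps of shape H x W."""
    height, width = len(tokens[0]), len(tokens[0][0])
    return [
        [sum(w * p[y][x] for w, p in zip(weights, tokens)) for x in range(width)]
        for y in range(height)
    ]

def fuse_residual(features, token_map):
    """Assumed reading of the fusion step: F * P'' + F, element-wise."""
    return [
        [f * p + f for f, p in zip(f_row, p_row)]
        for f_row, p_row in zip(features, token_map)
    ]

# Toy example: N = 2 tokens, d = 2, maps of size 2 x 2 (all values hypothetical).
w_lin = [[1.0, 0.0], [0.0, 1.0]]
embedding = [0.0, 0.0]  # equal logits, so both tokens get the same weight
tokens = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
weights = token_weights(w_lin, embedding)
k_map = aggregate_tokens(weights, tokens)
# The projection that turns K into P'' is skipped here; K is used directly.
fused = fuse_residual([[1.0, 2.0], [3.0, 4.0]], k_map)
```

Because the weights come from a softmax, they sum to one, so the aggregated map is a convex combination of the learned token maps.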

LTEB is inserted after top-down feature fusion in each scale of the encoder, with maximal empirical performance when applied at all four pyramid stages (e.g., CC improvement from $0.539 \rightarrow 0.560$, AUC-J from $0.911 \rightarrow 0.922$ on DHF1K) (Hooshanfar et al., 14 Apr 2025).

3. DLTFB: Dynamic Learnable Token Fusion Block

DLTFB dynamically intermixes token features by combining cyclic shift operations with lightweight convolutional processing:

  • Spatial Shifting: First, $F^R$ is circularly shifted along width (modulo $W$), then subject to a pointwise $1\times1\times1$ convolution, GELU activation, and $3\times3\times3$ convolution:

$$Y = \mathrm{Conv3d}\bigl(\mathrm{GELU}(\mathrm{Conv}_{1\times1\times1}(\mathbf{F}_{\text{shift}}))\bigr)$$

  • Second Shift and Residual: $Y$ is cyclically shifted along height, followed by LayerNorm and another $1\times1\times1$ convolution to yield $F^{\mathrm{Sh}}$. Finally, the output fuses shifted and LTEB-weighted features multiplicatively and additively:

$$F^{\mathrm{DLTFB}} = F^R \cdot \mathbf{P}'' + F^{\mathrm{Sh}}$$
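
The two cyclic shifts and the final fusion can be illustrated on a single $H\times W$ slice. This is a rough sketch with several assumptions: the shift offset is not stated in the text and is taken to be 1 here, the learned layers (convolutions, GELU, LayerNorm) are replaced by the identity, and the helper names `shift_width`, `shift_height`, and `fuse` are hypothetical.

```python
def shift_width(fmap, offset=1):
    """Circularly shift each row of an H x W map by `offset` columns (modulo W)."""
    k = offset % len(fmap[0])
    return [row[-k:] + row[:-k] if k else list(row) for row in fmap]

def shift_height(fmap, offset=1):
    """Circularly shift the rows of an H x W map by `offset` positions (modulo H)."""
    k = offset % len(fmap)
    rows = fmap[-k:] + fmap[:-k] if k else fmap
    return [list(row) for row in rows]

def fuse(f_r, p_token, f_sh):
    """Element-wise F^R * P'' + F^Sh."""
    return [
        [a * b + c for a, b, c in zip(ra, rb, rc)]
        for ra, rb, rc in zip(f_r, p_token, f_sh)
    ]

# Toy 2 x 3 slice; P'' is set to all ones so the effect of the shifts is visible.
f_r = [[1, 2, 3], [4, 5, 6]]
p_token = [[1, 1, 1], [1, 1, 1]]
f_sh = shift_height(shift_width(f_r))  # learned layers between the shifts omitted
out = fuse(f_r, p_token, f_sh)
```

Each shift only permutes positions, so it adds no parameters; any mixing of neighbouring content comes from the convolutions applied between and after the shifts.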

DLTFB is typically introduced only at the final encoder stage, where it provides measurable performance improvements (e.g., CC from $0.556 \rightarrow 0.561$ on DHF1K) and captures spatial long-range dependencies at low computational cost (Hooshanfar et al., 14 Apr 2025).

4. Token Boosting Module (TBM)

TBM is triggered at multiple depths within a visual transformer encoder, “cleaning” per-token features $f \in \mathbb{R}^D$ as follows (Li et al., 2023):

  1. Synthetic Corruption: Gaussian noise $s\sim\mathcal{N}(0,I_D)$, scaled by learnable per-dimension $\alpha$, yields $q = \alpha \odot s$.
  2. Intermediate Feature: $I = f + q$.
  3. Reconstruction: An MLP-based autoencoder $g$ reconstructs $\hat{f} = g(I;\theta)$.
  4. Boosted Output: The boosted token is $\hat{r} = 2\hat{f} - I$, and may be combined additively with $f$ as a residual.
  5. Loss Augmentation: An auxiliary $L_2$ loss penalizes the difference between visible original and reconstructed features:

$$L_{\text{recon}}(F, \hat{f}) = \lambda \| F - \hat{f} \|_2^2$$

This is summed with the main MAE pre-training objective.
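
The five steps can be traced for a single token with a short sketch. It makes several assumptions: the autoencoder $g$ is passed in as a callable (`reconstruct`, with a hypothetical `shrink` placeholder standing in for the MLP), the target $F$ in the loss is taken to be the original token feature $f$, and the names `tbm_forward`, `lam`, and `rng` are illustrative rather than taken from the paper.

```python
import random

def tbm_forward(f, alpha, reconstruct, lam=1.0, rng=random):
    """Run one TBM pass on a single token feature f (a list of D floats).

    Returns (boosted, f_hat, intermediate, recon_loss).
    """
    s = [rng.gauss(0.0, 1.0) for _ in f]              # step 1: s ~ N(0, I_D)
    q = [a * n for a, n in zip(alpha, s)]             # step 1: q = alpha * s, element-wise
    intermediate = [x + n for x, n in zip(f, q)]      # step 2: I = f + q
    f_hat = reconstruct(intermediate)                 # step 3: f_hat = g(I)
    boosted = [2.0 * h - i for h, i in zip(f_hat, intermediate)]    # step 4: 2 f_hat - I
    recon_loss = lam * sum((x - h) ** 2 for x, h in zip(f, f_hat))  # step 5: L2 penalty
    return boosted, f_hat, intermediate, recon_loss

def shrink(values, factor=0.9):
    """Hypothetical placeholder for the MLP autoencoder g: a fixed linear shrinkage."""
    return [factor * v for v in values]

rng = random.Random(0)            # seeded generator so the run is repeatable
f = [1.0, -2.0, 0.5]
alpha = [0.1, 0.1, 0.1]           # per-dimension noise scale (learnable in TBM)
boosted, f_hat, intermediate, loss = tbm_forward(f, alpha, shrink, lam=0.5, rng=rng)
```

In training, `recon_loss` would be added to the main MAE objective, and `alpha` and the autoencoder parameters would be updated by backpropagation; neither is shown here.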

TBM provides a theoretically principled reduction in reconstruction variance in double-corruption settings and empirically delivers $+2$–$4\%$ gains on corrupted benchmarks such as ImageNet-C across ViT/DeiT/Swin backbones. Using TBM at three encoder depths provides the strongest results.

5. Computational Tradeoffs

Token fusion/boost modules are designed for parameter and runtime efficiency relative to attention-heavy baselines:

  • LTEB: Adds one $3\times3\times3$ convolution, a small linear + softmax, interpolation, and another small convolution per encoder scale. Aggregate additional parameters total $\approx 8$M over the base encoder (Hooshanfar et al., 14 Apr 2025).
  • DLTFB: Adds two $1\times1\times1$ convolutions, one $3\times3\times3$ convolution, and two constant-time shift operations. Total model GFLOPs increase by $\leq 5\%$.
  • TBM: Requires per-token autoencoding (typically 3-layer MLPs) and maintains per-dimension noise-scale vectors. Integration is minimal, and the module is agnostic to backbone choice.
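
To make the convolution costs listed above concrete, the standard parameter count of a 3D convolution is $C_{\text{in}} \cdot C_{\text{out}} \cdot k_t k_h k_w$ weights plus $C_{\text{out}}$ biases. The sketch below applies it to the three DLTFB convolutions. The channel width of 64 is purely hypothetical (the text does not give one), as is the helper name `conv3d_params`, so the resulting figure is illustrative and not a reported number.

```python
def conv3d_params(c_in, c_out, kernel, bias=True):
    """Weights plus optional biases of a dense 3D convolution."""
    kt, kh, kw = kernel
    return c_in * c_out * kt * kh * kw + (c_out if bias else 0)

channels = 64  # hypothetical channel width, not taken from the paper

pointwise = conv3d_params(channels, channels, (1, 1, 1))  # each 1x1x1 convolution
spatial = conv3d_params(channels, channels, (3, 3, 3))    # the 3x3x3 convolution
dltfb_conv_params = 2 * pointwise + spatial               # shifts add no parameters
```

At this assumed width the $3\times3\times3$ convolution dominates the count, which is why the pointwise layers and parameter-free shifts keep the block lightweight.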

These strategies achieve strong accuracy increases at $<20\%$ total parameter overhead for full-scale video models, while preserving real-time and scalable operation (Hooshanfar et al., 14 Apr 2025, Li et al., 2023).

6. Empirical Impact and Comparison

Empirical evaluations demonstrate that:

  • In audio-visual saliency (DHF1K): LTEB increases CC by $\approx 4\%$ and AUC-J by $\approx 1\%$; DLTFB adds a further $\approx 1\%$. DTFSal with these modules surpasses prior token-based methods such as DiffSal, attaining strong results with $49$M parameters compared to $76$M (Hooshanfar et al., 14 Apr 2025).
  • In corrupted image and sequence tasks: TBM yields consistent improvements ($+2$–$4\%$ accuracy) for self-supervised and fine-tuned settings, across corruption types (Gaussian, blur, snow, JPEG), backbones, and modalities (RGB, skeletons, depth) (Li et al., 2023).
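
The approximate percentages for the saliency modules can be cross-checked against the DHF1K scores quoted in Sections 2 and 3. The snippet below assumes the percentages are relative gains over the score without the module, which the text does not state explicitly; `relative_gain` is a hypothetical helper.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return 100.0 * (after - before) / before

lteb_cc = relative_gain(0.539, 0.560)    # LTEB, CC on DHF1K
lteb_auc = relative_gain(0.911, 0.922)   # LTEB, AUC-J on DHF1K
dltfb_cc = relative_gain(0.556, 0.561)   # DLTFB, CC on DHF1K
param_saving = 100.0 * (76 - 49) / 76    # 49M versus 76M parameters

for name, value in [("LTEB CC", lteb_cc), ("LTEB AUC-J", lteb_auc),
                    ("DLTFB CC", dltfb_cc), ("Parameter saving", param_saving)]:
    print(f"{name}: {value:.1f}%")
```

Under that reading the quoted scores give gains of just under 4%, just over 1%, and just under 1%, and the smaller model uses roughly a third fewer parameters.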

Ablation studies identify maximal utility when these modules are applied at multiple encoder depths or pyramid stages.

7. Integration Strategies and Broader Significance

These modules are designed as “plug-and-play” components: LTEB and DLTFB are placed in sequence atop each feature scale in a U-Net-like hierarchical encoder; TBM is interleaved between transformer sublayers at various encoder depths.

By replacing dense self-attention with learnable, data-driven token reweighting, local mixing, and denoising autoencoding, token fusion and boost modules preserve or enhance discriminative power while keeping compute overhead limited. This enables state-of-the-art accuracy and robustness for tasks ranging from video saliency to noisy self-supervised learning, with broad applicability across vision and multi-modal domains (Hooshanfar et al., 14 Apr 2025, Li et al., 2023).
