Token Fusion and Boost Modules
- Token Fusion and Boost Modules are architectural enhancements that adaptively fuse and boost token features to improve saliency prediction and denoising in visual and multi-modal tasks.
- LTEB and DLTFB modules refine spatial-temporal representations by weighting, aggregating, and cyclically shifting tokens to capture salient cues with minimal computational overhead.
- The TBM module reinforces token robustness during self-supervised pre-training by denoising corrupted features, achieving 2–4% accuracy gains on challenging benchmarks.
Token fusion and token boost modules are architectural enhancements designed for deep neural models operating on sets of tokenized representations, particularly in visual transformers and multi-modal encoders. These modules support tasks requiring either robust representation learning in the presence of noisy or ambiguous inputs, or discriminative aggregation of salient information across space, time, or modality. Leading examples include the Learnable Token Enhancement Block (LTEB) and Dynamic Learnable Token Fusion Block (DLTFB) for audio-visual saliency prediction (Hooshanfar et al., 14 Apr 2025), and the Token Boosting Module (TBM) for robust masked autoencoding pre-training of vision transformers (Li et al., 2023). This entry surveys their objectives, architectures, mathematical formulations, integration strategies, computational tradeoffs, and empirical impact.
1. Functional Objectives of Token Fusion and Boost Modules
Token fusion modules target flexible, data-dependent token aggregation and enhancement in encoder architectures where salient information is spatially, temporally, or cross-modally distributed and not known a priori. LTEB, for instance, adaptively weights and fuses learned tokens against video frame features to emphasize salient cues for video saliency prediction.
Token boost modules are designed to improve the robustness of token representations, especially in self-supervised pre-training under noisy or corrupted data regimes. The TBM accomplishes this by denoising and “boosting” intermediate token features inside the transformer encoder, improving the resilience of masked autoencoding objectives to unreliable observations.
A comparative summary is below:
| Module | Primary Objective | Typical Context |
|---|---|---|
| LTEB | Adaptive saliency cue enhancement | Video (audio-visual) |
| DLTFB | Dynamic spatial mixing of tokens | Video (spatio-temporal) |
| TBM | Robustifying via token denoising | Visual transformer pre-training |
2. LTEB: Learnable Token Enhancement Block
LTEB operates on the refined feature map generated by a multi-scale video encoder. Its design comprises:
- Gating Branch: A 3D convolution followed by sigmoid activation yields a soft “importance” map that encodes pixel-level saliency across space and time.
- Global Embedding & Token Weighting: Spatial–temporal average pooling distills the feature map into a global embedding vector. A learnable linear layer transforms this vector into softmax-normalized weights, one for each of the learned tokens.
- Token Aggregation: Token maps are scaled by their softmax weights and summed into a single aggregated token feature.
- Projection & Fusion: After upscaling and 3D convolution, the aggregated token feature is fused multiplicatively and then added back through a residual connection.
LTEB is inserted after top-down feature fusion in each scale of the encoder, with the best empirical performance when applied at all four pyramid stages, where both CC and AUC-J improve on DHF1K (Hooshanfar et al., 14 Apr 2025).
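The gating, pooling, token-weighting, and aggregation steps listed above can be sketched in dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the space-time grid is flattened to a list of positions, the gating convolution is replaced by a per-position linear projection, the learned tokens are plain vectors rather than token maps, upscaling is skipped, and the final fusion `x + gate * aggregated` is an assumed form. All names (`lteb_sketch`, `gate_w`, `token_w`, `tokens`) are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lteb_sketch(features, gate_w, token_w, tokens):
    """Illustrative LTEB-style block (assumed simplifications, see lead-in).

    features: N positions x C channels (flattened space-time grid).
    gate_w:   C weights, a pointwise stand-in for the gating convolution.
    token_w:  K x C weights of the linear layer that scores the tokens.
    tokens:   K learned token vectors with C channels each.
    """
    n, c = len(features), len(features[0])
    # Gating branch: one sigmoid importance value per position.
    gate = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(gate_w, x))))
            for x in features]
    # Global embedding: average pooling over all positions.
    pooled = [sum(x[j] for x in features) / n for j in range(c)]
    # Token weighting: linear layer followed by softmax over the K tokens.
    weights = softmax([sum(w * p for w, p in zip(row, pooled))
                       for row in token_w])
    # Token aggregation: softmax-weighted sum of the learned tokens.
    agg = [sum(wk * tok[j] for wk, tok in zip(weights, tokens))
           for j in range(c)]
    # Fusion (assumed form): gate the aggregated token, then add residually.
    fused = [[x[j] + g * agg[j] for j in range(c)]
             for x, g in zip(features, gate)]
    return fused, weights, gate
```

The softmax weights always sum to one, so the aggregated token is a convex combination of the learned tokens, and the gate decides how strongly that combination is injected at each position.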
3. DLTFB: Dynamic Learnable Token Fusion Block
DLTFB dynamically intermixes token features by combining cyclic shift operations with lightweight convolutional processing:
- Spatial Shifting: First, the feature map is circularly shifted along the width axis (with wraparound), then passed through a pointwise convolution, a GELU activation, and a further convolution.
- Second Shift and Residual: The result is cyclically shifted along the height axis, followed by LayerNorm and another convolution. Finally, the output fuses the shifted features and the LTEB-weighted features multiplicatively and additively.
DLTFB is typically introduced only at the final encoder stage, where it provides a measurable performance improvement (e.g., a higher CC on DHF1K) and captures long-range spatial dependencies at low computational cost (Hooshanfar et al., 14 Apr 2025).
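The two-stage cyclic shift can be illustrated on a single-channel H x W grid in plain Python. This is a sketch under stated assumptions rather than the published block: each convolution is reduced to a scalar weight (`w1`, `w2`, `w3`, all hypothetical), LayerNorm is omitted because it is degenerate for a single channel, and the final fusion `x * mixed + x` is one assumed reading of "multiplicatively and additively".

```python
import math


def roll_width(grid, shift):
    """Cyclically shift every row of an H x W grid to the right."""
    s = shift % len(grid[0])
    return [row[-s:] + row[:-s] if s else list(row) for row in grid]


def roll_height(grid, shift):
    """Cyclically shift the rows of an H x W grid downward."""
    s = shift % len(grid)
    return [list(row) for row in (grid[-s:] + grid[:-s] if s else grid)]


def gelu(x):
    """Exact GELU using the Gaussian error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def dltfb_sketch(lteb_out, shift=1, w1=1.0, w2=1.0, w3=1.0):
    """Illustrative DLTFB-style mixing on one channel (assumed form)."""
    # Width shift -> pointwise scale -> GELU -> second scale.
    y = [[w2 * gelu(w1 * v) for v in row]
         for row in roll_width(lteb_out, shift)]
    # Height shift -> third scale (LayerNorm omitted for a single channel).
    z = [[w3 * v for v in row] for row in roll_height(y, shift)]
    # Assumed fusion: multiply with the LTEB features, then add them back.
    return [[x * m + x for x, m in zip(row_x, row_m)]
            for row_x, row_m in zip(lteb_out, z)]
```

Because the shifts only reindex the grid, they add no parameters; each output position mixes its own value with one taken from a wrapped-around neighbour.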
4. Token Boosting Module (TBM)
TBM is triggered at multiple depths within a visual transformer encoder, “cleaning” per-token features as follows (Li et al., 2023):
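- Synthetic Corruption: Gaussian noise, scaled by a learnable per-dimension factor, is used to synthetically corrupt each token feature.
- Intermediate Feature: The corrupted token feature serves as the intermediate input to reconstruction.
- Reconstruction: An MLP-based autoencoder reconstructs the token feature from this intermediate input.
- Boosted Output: The boosted token is derived from the reconstruction and may be combined additively with the incoming feature as a residual.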
- Loss Augmentation: An auxiliary norm-based loss penalizes the difference between the visible original features and their reconstructions. This loss is summed with the main MAE pre-training objective.
TBM provides a theoretically principled reduction in reconstruction variance in double-corruption settings and empirically delivers 2–4% accuracy gains on corrupted benchmarks such as ImageNet-C across ViT/DeiT/Swin backbones. Using TBM at three encoder depths provides the strongest results.
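A per-token version of the corrupt, reconstruct, and penalize loop described above might look like the following plain-Python sketch. Several details are assumptions made for illustration only: the noise is applied additively, the autoencoder has a single hidden ReLU layer (shallower than the 3-layer MLPs mentioned later in this entry), the auxiliary loss is a mean squared error, and the boosted output is formed as `token + reconstruction`. The function and parameter names are hypothetical.

```python
import random


def corrupt(token, sigma, rng):
    """Add Gaussian noise scaled per dimension by sigma (assumed additive)."""
    return [v + s * rng.gauss(0.0, 1.0) for v, s in zip(token, sigma)]


def linear(vec, weight, bias):
    """Dense layer: weight is out_dim x in_dim, bias has out_dim entries."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def autoencode(vec, enc_w, enc_b, dec_w, dec_b):
    """Tiny MLP autoencoder with one hidden ReLU layer."""
    hidden = [max(0.0, h) for h in linear(vec, enc_w, enc_b)]
    return linear(hidden, dec_w, dec_b)


def tbm_sketch(token, sigma, params, rng):
    """Illustrative TBM-style step for one token (assumed forms)."""
    noisy = corrupt(token, sigma, rng)       # synthetic corruption
    recon = autoencode(noisy, *params)       # denoising reconstruction
    # Auxiliary loss (assumed MSE) between original and reconstruction.
    aux_loss = sum((r - t) ** 2 for r, t in zip(recon, token)) / len(token)
    # Boosted output (assumed): residual sum of input and reconstruction.
    boosted = [t + r for t, r in zip(token, recon)]
    return boosted, aux_loss
```

In training, `aux_loss` would be added to the masked-autoencoding objective so that `sigma` and the autoencoder weights are learned jointly with the backbone; here they are passed in as fixed values.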
5. Computational Tradeoffs
Token fusion/boost modules are designed for parameter and runtime efficiency relative to attention-heavy baselines:
- LTEB: Adds one 3D convolution, a small linear layer with softmax, interpolation, and another small convolution per encoder scale, for a modest increase in parameters over the base encoder (Hooshanfar et al., 14 Apr 2025).
- DLTFB: Adds three lightweight convolutions and two parameter-free shift operations, with only a small increase in total model GFLOPs.
- TBM: Requires per-token autoencoding (typically 3-layer MLPs) and maintains per-dimension noise-scale vectors. Integration is minimal, and the module is agnostic to backbone choice.
These strategies achieve strong accuracy increases at low total parameter overhead for full-scale video models, while preserving real-time and scalable operation (Hooshanfar et al., 14 Apr 2025, Li et al., 2023).
6. Empirical Impact and Comparison
Empirical evaluations demonstrate that:
- In audio-visual saliency (DHF1K): LTEB increases both CC and AUC-J, and DLTFB adds a further improvement. DTFSal with these modules surpasses prior token-based methods such as DiffSal, attaining strong results with 49M parameters compared to 76M (Hooshanfar et al., 14 Apr 2025).
- In corrupted image and sequence tasks: TBM yields consistent improvements (2–4% accuracy) for self-supervised and fine-tuned settings, across corruption types (Gaussian, blur, snow, JPEG), backbones, and modalities (RGB, skeletons, depth) (Li et al., 2023).
Ablation studies identify maximal utility when these modules are applied at multiple encoder depths or pyramid stages.
7. Integration Strategies and Broader Significance
These modules are designed as “plug-and-play” components: LTEB is placed at each feature scale of a U-Net-like hierarchical encoder, with DLTFB following it at the final stage; TBM is interleaved between transformer sublayers at various encoder depths.
By relying on learnable, data-driven token reweighting, local mixing, and denoising autoencoding rather than additional dense self-attention, token fusion and boost modules preserve or enhance discriminative power while keeping compute overhead low. This supports strong accuracy and robustness for tasks ranging from video saliency to noisy self-supervised learning, with broad applicability across vision and multi-modal domains (Hooshanfar et al., 14 Apr 2025, Li et al., 2023).