Temporal Attention Sharing Module (TASM)
- TASM is a neural network module that models temporal dependencies by sharing attention computations across multiple streams.
- It adapts diverse strategies—including LSTM-based routing, synchronized cross-attention, and multi-head mechanisms—to address domain-specific needs.
- Empirical studies show that TASM enhances performance and efficiency in reinforcement learning, human interaction analysis, and medical video segmentation.
A Temporal Attention Sharing Module (TASM) is a neural network component designed to model temporal dependencies across sequences by sharing attention computation or learned representations among multiple streams, tasks, or temporal segments. The TASM concept has emerged in several independent contexts—principally multi-task reinforcement learning, human–human interaction modeling, and medical video segmentation—each instantiating the module with unique architectural principles, attention mechanisms, and integration strategies to exploit shared temporal information for greater expressivity, robustness, and efficiency.
1. Core Motivations and Problem Domains
TASM has been proposed in response to the need for efficient and expressive temporal modeling where independent temporal encoding is suboptimal or standard temporal attention neglects critical dependencies:
- Multi-Task Reinforcement Learning: To address conflicts within tasks and between shared modules by enabling fine-grained, per-time-step routing of expert modules (Lan et al., 2023).
- Human-Human Interaction Analysis: To encode synchronous and asymmetric motion between individuals via tightly coupled temporal attention streams (Maeda et al., 15 Dec 2025).
- Medical Image Sequence Segmentation: To improve segmentation robustness by efficiently extracting temporal relations across video frames or volumetric slices with manageable computational costs (Hasan et al., 24 Jan 2025).
Each of these applications motivates a TASM variant that increases the capacity for context-aware, synchronized, or contrastive temporal feature integration.
2. Architectural Principles and Attention Mechanisms
Key architectural characteristics of TASM implementations are summarized in the following table:
| Paper/Domain | TASM Architecture Summary | Attention Type |
|---|---|---|
| (Lan et al., 2023) RL | K shared encoders; LSTM for temporal context; softmax over attention logits | Softmax, linear; per time |
| (Maeda et al., 15 Dec 2025) H2IAD | 2 parallel streams; synchronized positional encoding; stacked cross-attention TASUs | Scaled dot-product, shared |
| (Hasan et al., 24 Jan 2025) MedSeg | Multi-head cross-attention over frames; gating conv; concatenation and aggregation | Multi-head cross-attention |
Multi-Task RL (Lan et al., 2023):
- K shared expert modules encode the observation at each time step.
- A temporal encoder (single-layer LSTM) processes observation histories.
- Attention logits are produced by a linear transformation of the concatenated task embedding and LSTM state, followed by a softmax to yield mixture weights that combine expert outputs per step.
- Final features are supplied to policy and value networks; a contrastive loss regularizes the diversity among expert modules.
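The per-time-step routing described above can be sketched in PyTorch. This is a hedged illustration of the pattern (K shared experts, an LSTM over observation histories, and a softmax over logits computed from the concatenated task embedding and LSTM state); all layer sizes and names (`obs_dim`, `TemporalAttentionRouter`, etc.) are assumptions, not the authors' exact configuration, and the contrastive loss is omitted.

```python
import torch
import torch.nn as nn

class TemporalAttentionRouter(nn.Module):
    """Illustrative sketch: mixture of K shared experts with per-time-step
    attention weights from a task embedding and an LSTM temporal state."""
    def __init__(self, obs_dim, task_dim, hidden_dim, num_experts):
        super().__init__()
        # K shared expert encoders (stand-in: one linear layer each).
        self.experts = nn.ModuleList(
            [nn.Linear(obs_dim, hidden_dim) for _ in range(num_experts)]
        )
        # Single-layer LSTM over the observation history.
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        # Linear map from [task embedding; LSTM state] to K attention logits.
        self.to_logits = nn.Linear(task_dim + hidden_dim, num_experts)

    def forward(self, obs_seq, task_emb):
        # obs_seq: (B, T, obs_dim); task_emb: (B, task_dim)
        h_seq, _ = self.lstm(obs_seq)                       # (B, T, hidden)
        B, T, _ = h_seq.shape
        task = task_emb.unsqueeze(1).expand(B, T, -1)       # repeat over time
        logits = self.to_logits(torch.cat([task, h_seq], dim=-1))
        weights = torch.softmax(logits, dim=-1)             # (B, T, K)
        expert_out = torch.stack(
            [e(obs_seq) for e in self.experts], dim=-1      # (B, T, hidden, K)
        )
        # Per-time-step mixture of expert outputs.
        return (expert_out * weights.unsqueeze(2)).sum(-1)  # (B, T, hidden)

router = TemporalAttentionRouter(obs_dim=6, task_dim=4, hidden_dim=16, num_experts=3)
features = router(torch.randn(2, 5, 6), torch.randn(2, 4))  # (2, 5, 16)
```

In the full method these features would be concatenated with the task embedding and fed to the policy and value networks.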
Human-Human Interaction (Maeda et al., 15 Dec 2025):
- Temporal synchronization is enforced by a shared positional embedding added to both input streams (persons X and Y).

- Inside each Temporal Attention Sharing Unit (TASU): synchronized self-attention (per-individual), followed by motion cross-attention (queries from one, keys/values from the other), and distance cross-attention (injecting pairwise proximity embeddings).
- Full parameter sharing between streams enhances coupling and regularization. Stacking TASUs compounds these effects across temporal depth.
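A minimal sketch of one TASU, assuming the structure described above: synchronized self-attention per stream followed by motion cross-attention, with all weights shared between the two person streams. The class name, head count, and the omission of the distance cross-attention step are simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn

class TASU(nn.Module):
    """Illustrative Temporal Attention Sharing Unit: self-attention and
    cross-attention layers whose parameters are fully shared between the
    two person streams (distance cross-attention omitted for brevity)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, y):
        # Synchronized self-attention: the SAME module processes each stream.
        x = self.norm1(x + self.self_attn(x, x, x)[0])
        y = self.norm1(y + self.self_attn(y, y, y)[0])
        # Motion cross-attention: X queries Y's keys/values and vice versa.
        x2 = self.norm2(x + self.cross_attn(x, y, y)[0])
        y2 = self.norm2(y + self.cross_attn(y, x, x)[0])
        return x2, y2

# A shared positional embedding would be added to both streams once,
# before a stack of TASUs.
unit = TASU(dim=32, heads=4)
px, py = unit(torch.randn(2, 10, 32), torch.randn(2, 10, 32))
```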
Medical Video Segmentation (Hasan et al., 24 Jan 2025):
- Temporally ordered feature maps from backbones (UNet, FCN8s, UNetR, SwinUNetR, I²UNet) are projected to queries $Q$, keys $K$, and values $V$, which are split into attention heads.
- For each frame pair, cross-attention is computed with multi-head scaled dot-product attention, followed by channel-wise gating via a $1 \times 1$ convolution and a sigmoid.
- Gated attended features are concatenated with originals, processed via convolution, batch-norm, ReLU, then aggregated for temporal refinement.
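The cross-attention-plus-gating step can be sketched as follows. This is a hedged approximation of the described pipeline for a single frame pair; channel counts, head count, and the class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TemporalGatedCrossAttention(nn.Module):
    """Illustrative sketch: multi-head cross-attention between two frames'
    feature maps, channel-wise sigmoid gating via a 1x1 conv, then fusion
    of the gated attended features with the original frame."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.gate = nn.Conv2d(channels, channels, kernel_size=1)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_t, f_s):
        # f_t, f_s: (B, C, H, W) feature maps of the two frames.
        B, C, H, W = f_t.shape
        q = f_t.flatten(2).transpose(1, 2)    # (B, HW, C) queries from frame t
        kv = f_s.flatten(2).transpose(1, 2)   # keys/values from frame s
        a, _ = self.attn(q, kv, kv)           # cross-attend t -> s
        a = a.transpose(1, 2).reshape(B, C, H, W)
        a = torch.sigmoid(self.gate(a)) * a   # channel-wise gating
        # Concatenate with the original, then conv + BN + ReLU.
        return self.fuse(torch.cat([f_t, a], dim=1))

block = TemporalGatedCrossAttention(channels=16, heads=4)
refined = block(torch.randn(2, 16, 8, 8), torch.randn(2, 16, 8, 8))
```

In the full module, such pairwise refinements across the 2–3 frame window would be aggregated for the final temporal refinement.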
3. Mathematical Formulation
Multi-Task RL (Lan et al., 2023)
At each time step $t$:
- LSTM hidden state: $h_t = \mathrm{LSTM}(o_t, h_{t-1})$
- Task embedding: $z = \mathrm{Emb}(\tau)$ for task index $\tau$
- Attention logits: $\ell_t = W\,[z; h_t] + b$
- Weights: $w_t = \mathrm{softmax}(\ell_t) \in \Delta^{K-1}$
- Aggregated encoding: $\phi_t = \sum_{k=1}^{K} w_{t,k}\, f_k(o_t)$, where $f_k$ is the $k$-th shared expert encoder
- Concatenation to policy/critic: $[\phi_t; z]$
A contrastive regularization on the expert outputs ensures module diversity.
Human-Human Interaction (Maeda et al., 15 Dec 2025)
Given $T$ frames and a shared, learnable positional embedding $P \in \mathbb{R}^{T \times d}$:
- Input streams: $X, Y \in \mathbb{R}^{T \times d}$; $P$ is added to each.
- Synchronized self-attention per stream: $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(Q K^{\top} / \sqrt{d}\right) V$, with queries, keys, and values from the same stream and weights shared across streams.
- Motion cross-attention: person $X$'s queries attend to $Y$'s keys/values and vice versa, e.g. $X' = \mathrm{Attn}(X W^Q, Y W^K, Y W^V)$.
- Distance cross-attention: a spatial relation embedding computed from dynamic inter-person distances is injected via a further attention step.
- Layer normalization and residual connections; stack $L$ layers.
Medical Segmentation (Hasan et al., 24 Jan 2025)
- Per-frame projections: $Q_t = F_t W^Q$, $K_t = F_t W^K$, $V_t = F_t W^V$, each split into heads.
- Multi-head scaled dot-product cross-attention between frames $t$ and $t'$: $A_{t \to t'} = \mathrm{softmax}\!\left(Q_t K_{t'}^{\top} / \sqrt{d_k}\right) V_{t'}$
- Channel-wise gating: $\hat{A} = \sigma\!\left(\mathrm{Conv}_{1 \times 1}(A)\right) \odot A$
- Concatenate $\hat{A}$ with $F_t$, apply convolution, batch normalization, and ReLU; aggregate temporally.
4. Design Variants and Implementation Details
- (Lan et al., 2023): $K$ shared experts; single-layer LSTM temporal encoder; $K$-way softmax routing per time step; contrastive loss added to the RL objective.
- (Maeda et al., 15 Dec 2025): stack of 8 TASUs; synchronized, learnable positional embedding; all projection matrices and attention weights fully shared between streams.
- (Hasan et al., 24 Jan 2025): multi-head cross-attention with channel dimension $C$, typically 256 or 512; $1 \times 1$ gating convolution; module inserted at various levels of CNN or transformer backbones.
In all cases, the TASM has been designed as a general plug-in block, agnostic to most architectural details so long as input sequences share compatible temporal alignment.
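The plug-in pattern common to all three variants can be illustrated generically: a shared temporal attention block inserted between a per-frame encoder and a task head. This is a schematic sketch, not any paper's exact module; the backbone and head here are stand-in layers.

```python
import torch
import torch.nn as nn

class SharedTemporalAttention(nn.Module):
    """Generic plug-in sketch: shared multi-head self-attention over the
    temporal axis of per-frame features, with residual and layer norm."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):
        # feats: (B, T, dim) per-frame features from any backbone whose
        # temporal axis is aligned across the sequence.
        return self.norm(feats + self.attn(feats, feats, feats)[0])

# Insert between a per-frame backbone and a task head:
backbone = nn.Linear(32, 64)        # stand-in per-frame encoder
tasm = SharedTemporalAttention(64)
head = nn.Linear(64, 10)            # stand-in task head

frames = torch.randn(2, 8, 32)      # (batch, time, features)
out = head(tasm(backbone(frames)))  # (2, 8, 10)
```

The only hard requirement, as noted above, is that the input sequences share a compatible temporal alignment.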
5. Empirical Evaluation and Ablation Analyses
Multi-Task RL (Lan et al., 2023)
- Removing the LSTM’s hidden state from the attention combiner causes substantial drops in multi-task success rates, revealing temporal dynamic routing as critical for mitigating negative transfer within episodes.
- Disabling the contrastive loss collapses module diversity and harms final performance, especially in complex settings (MT50-Mixed).
- With proper TASM configuration, success rates reach 80% (MT10-Mixed) and 70% (MT50-Mixed), setting new benchmarks.
- An optimal intermediate expert count exists; both too few and too many experts degrade results.
Human-Human Interaction (Maeda et al., 15 Dec 2025)
- Parameter sharing across both person streams yields an 8.3% AUC improvement over independent streams.
- Synchronized positional embedding is necessary: switching to sinusoidal or non-shared embeddings drops AUC by 5–7%.
- Overall, the fully realized TASM (TASUs, parameter sharing, synchronized embedding) improves AUC by 17.5% on “Dance” and 13.5% on “Help up” versus ML-AAD.
Medical Segmentation (Hasan et al., 24 Jan 2025)
- TASM consistently improves Dice similarity (0.899 → 0.921 in FCN8s), reduces Hausdorff distance (6.38 mm → 3.31 mm), and suppresses spurious segment “islands” (PIA from 0.58% → 0.02%).
- Requires substantially fewer FLOPs and 30M fewer parameters than naïve Conv3D alternatives for temporal modeling.
- Compatible with diverse architectures (UNet, FCN8s, UNetR, SwinUNetR, I²UNet) and temporally flexible—optimal at 2–3 frames.
6. Limitations and Prospective Extensions
- Human-Human Interaction: Current implementations do not temporally localize anomalies within a sequence and lack modeling of human–object interactions. Possible extensions include multi-head (not single-head) cross-attention and hierarchies for modeling more than two agents (Maeda et al., 15 Dec 2025).
- Multi-Task RL: Excessive modularization ($K$ too large) leads to redundancy and slower training, while too few experts constrain expressivity (Lan et al., 2023).
- Medical Segmentation: TASM requires at least two temporally aligned frames. Further extensions to arbitrary temporal contexts, or very long sequences, may need hierarchical or memory-efficient attention scaling (Hasan et al., 24 Jan 2025).
7. Cross-Domain Significance and Generalization
Despite differences in implementation, all TASM variants share the objective of enabling information flow across temporal dimensions with parameter or representation sharing to address overfitting, negative transfer, or loss of crucial temporal context. The alignment of temporal embeddings, explicit cross-attention, and fine-grained sharing strategies offer robust performance improvements across reinforcement learning, video-based medical imaging, and complex interactive sequence tasks.
By abstracting temporal dependencies with shared attention structures, TASM bridges multi-stream temporal modeling, per-step modular routing, and context conditional refinement, providing a unifying design pattern for sequence modeling architectures in machine learning research.
References
- "Contrastive Modules with Temporal Attention for Multi-Task Reinforcement Learning" (Lan et al., 2023)
- "3D Human-Human Interaction Anomaly Detection" (Maeda et al., 15 Dec 2025)
- "Motion-enhancement to Echocardiography Segmentation via Inserting a Temporal Attention Module" (Hasan et al., 24 Jan 2025)