Temporal Attention Sharing Module (TASM)
- TASM is a neural network module that models temporal dependencies by sharing attention computations across multiple streams.
- It adapts diverse strategies—including LSTM-based routing, synchronized cross-attention, and multi-head mechanisms—to address domain-specific needs.
- Empirical studies show that TASM enhances performance and efficiency in reinforcement learning, human interaction analysis, and medical video segmentation.
A Temporal Attention Sharing Module (TASM) is a neural network component designed to model temporal dependencies across sequences by sharing attention computation or learned representations among multiple streams, tasks, or temporal segments. The TASM concept has emerged in several independent contexts—principally multi-task reinforcement learning, human–human interaction modeling, and medical video segmentation—each instantiating the module with unique architectural principles, attention mechanisms, and integration strategies to exploit shared temporal information for greater expressivity, robustness, and efficiency.
1. Core Motivations and Problem Domains
TASM has been proposed in response to the need for efficient and expressive temporal modeling where independent temporal encoding is suboptimal or standard temporal attention neglects critical dependencies:
- Multi-Task Reinforcement Learning: To address conflicts within tasks and between shared modules by enabling fine-grained, per-time-step routing of expert modules (Lan et al., 2023).
- Human-Human Interaction Analysis: To encode synchronous and asymmetric motion between individuals via tightly coupled temporal attention streams (Maeda et al., 15 Dec 2025).
- Medical Image Sequence Segmentation: To improve segmentation robustness by efficiently extracting temporal relations across video frames or volumetric slices with manageable computational costs (Hasan et al., 24 Jan 2025).
Each of these applications motivates a TASM variant that increases the capacity for context-aware, synchronized, or contrastive temporal feature integration.
2. Architectural Principles and Attention Mechanisms
Key architectural characteristics of TASM implementations are summarized in the following table:
| Paper/Domain | TASM Architecture Summary | Attention Type |
|---|---|---|
| (Lan et al., 2023) RL | K shared encoders; LSTM for temporal context; softmax over attention logits | Softmax, linear; per time |
| (Maeda et al., 15 Dec 2025) H2IAD | 2 parallel streams; synchronized positional encoding; stacked cross-attention TASUs | Scaled dot-product, shared |
| (Hasan et al., 24 Jan 2025) MedSeg | Multi-head cross-attention over frames; gating conv; concatenation and aggregation | Multi-head cross-attention |
Multi-Task RL (Lan et al., 2023):
- K shared expert modules encode the observation at each time step.
- A temporal encoder (single-layer LSTM) processes observation histories.
- Attention logits are produced by a linear transformation of the concatenated task embedding and LSTM state, followed by a softmax to yield mixture weights that combine expert outputs per step.
- Final features are supplied to policy and value networks; a contrastive loss regularizes the diversity among expert modules.
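The per-time-step routing described above can be sketched in PyTorch. This is a hedged illustration of the pattern (K shared experts, an LSTM over observation histories, and a softmax over logits computed from the concatenated task embedding and LSTM state); all layer sizes and names (`obs_dim`, `TemporalAttentionRouter`, etc.) are assumptions, not the authors' exact configuration, and the contrastive loss is omitted.

```python
import torch
import torch.nn as nn

class TemporalAttentionRouter(nn.Module):
    """Illustrative sketch: mixture of K shared experts with per-time-step
    attention weights from a task embedding and an LSTM temporal state."""
    def __init__(self, obs_dim, task_dim, hidden_dim, num_experts):
        super().__init__()
        # K shared expert encoders (stand-in: one linear layer each).
        self.experts = nn.ModuleList(
            [nn.Linear(obs_dim, hidden_dim) for _ in range(num_experts)]
        )
        # Single-layer LSTM over the observation history.
        self.lstm = nn.LSTM(obs_dim, hidden_dim, batch_first=True)
        # Linear map from [task embedding; LSTM state] to K attention logits.
        self.to_logits = nn.Linear(task_dim + hidden_dim, num_experts)

    def forward(self, obs_seq, task_emb):
        # obs_seq: (B, T, obs_dim); task_emb: (B, task_dim)
        h_seq, _ = self.lstm(obs_seq)                       # (B, T, hidden)
        B, T, _ = h_seq.shape
        task = task_emb.unsqueeze(1).expand(B, T, -1)       # repeat over time
        logits = self.to_logits(torch.cat([task, h_seq], dim=-1))
        weights = torch.softmax(logits, dim=-1)             # (B, T, K)
        expert_out = torch.stack(
            [e(obs_seq) for e in self.experts], dim=-1      # (B, T, hidden, K)
        )
        # Per-time-step mixture of expert outputs.
        return (expert_out * weights.unsqueeze(2)).sum(-1)  # (B, T, hidden)

router = TemporalAttentionRouter(obs_dim=6, task_dim=4, hidden_dim=16, num_experts=3)
features = router(torch.randn(2, 5, 6), torch.randn(2, 4))  # (2, 5, 16)
```

In the full method these features would be concatenated with the task embedding and fed to the policy and value networks.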
Human-Human Interaction (Maeda et al., 15 Dec 2025):
- Temporal synchronization is enforced by a shared positional embedding added to both input streams (persons X and Y).

- Inside each Temporal Attention Sharing Unit (TASU): synchronized self-attention (per-individual), followed by motion cross-attention (queries from one, keys/values from the other), and distance cross-attention (injecting pairwise proximity embeddings).
- Full parameter sharing between streams enhances coupling and regularization. Stacking TASUs compounds these effects across temporal depth.
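A minimal sketch of one TASU, assuming the structure described above: synchronized self-attention per stream followed by motion cross-attention, with all weights shared between the two person streams. The class name, head count, and the omission of the distance cross-attention step are simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn

class TASU(nn.Module):
    """Illustrative Temporal Attention Sharing Unit: self-attention and
    cross-attention layers whose parameters are fully shared between the
    two person streams (distance cross-attention omitted for brevity)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x, y):
        # Synchronized self-attention: the SAME module processes each stream.
        x = self.norm1(x + self.self_attn(x, x, x)[0])
        y = self.norm1(y + self.self_attn(y, y, y)[0])
        # Motion cross-attention: X queries Y's keys/values and vice versa.
        x2 = self.norm2(x + self.cross_attn(x, y, y)[0])
        y2 = self.norm2(y + self.cross_attn(y, x, x)[0])
        return x2, y2

# A shared positional embedding would be added to both streams once,
# before a stack of TASUs.
unit = TASU(dim=32, heads=4)
px, py = unit(torch.randn(2, 10, 32), torch.randn(2, 10, 32))
```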
Medical Video Segmentation (Hasan et al., 24 Jan 2025):
- Temporally ordered feature maps from backbones (UNet, FCN8s, UNetR, SwinUNetR, I²UNet) are projected to queries $Q$, keys $K$, and values $V$, which are split into attention heads.
- For each frame pair, cross-attention is computed with multi-head scaled dot-product attention, followed by channel-wise gating via a $1 \times 1$ convolution and a sigmoid.
- Gated attended features are concatenated with originals, processed via convolution, batch-norm, ReLU, then aggregated for temporal refinement.
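The cross-attention-plus-gating step can be sketched as follows. This is a hedged approximation of the described pipeline for a single frame pair; channel counts, head count, and the class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TemporalGatedCrossAttention(nn.Module):
    """Illustrative sketch: multi-head cross-attention between two frames'
    feature maps, channel-wise sigmoid gating via a 1x1 conv, then fusion
    of the gated attended features with the original frame."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.gate = nn.Conv2d(channels, channels, kernel_size=1)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_t, f_s):
        # f_t, f_s: (B, C, H, W) feature maps of the two frames.
        B, C, H, W = f_t.shape
        q = f_t.flatten(2).transpose(1, 2)    # (B, HW, C) queries from frame t
        kv = f_s.flatten(2).transpose(1, 2)   # keys/values from frame s
        a, _ = self.attn(q, kv, kv)           # cross-attend t -> s
        a = a.transpose(1, 2).reshape(B, C, H, W)
        a = torch.sigmoid(self.gate(a)) * a   # channel-wise gating
        # Concatenate with the original, then conv + BN + ReLU.
        return self.fuse(torch.cat([f_t, a], dim=1))

block = TemporalGatedCrossAttention(channels=16, heads=4)
refined = block(torch.randn(2, 16, 8, 8), torch.randn(2, 16, 8, 8))
```

In the full module, such pairwise refinements across the 2–3 frame window would be aggregated for the final temporal refinement.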
3. Mathematical Formulation
Multi-Task RL (Lan et al., 2023)
At each time step $t$:
- LSTM hidden state: $h_t = \mathrm{LSTM}(o_t, h_{t-1})$
- Task embedding: $z = \mathrm{Emb}(\tau)$ for task index $\tau$
- Attention logits: $\ell_t = W\,[z; h_t] + b$
- Weights: $w_t = \mathrm{softmax}(\ell_t) \in \Delta^{K-1}$
- Aggregated encoding: $\phi_t = \sum_{k=1}^{K} w_{t,k}\, f_k(o_t)$, where $f_k$ is the $k$-th shared expert encoder
- Concatenation to policy/critic: $[\phi_t; z]$
A contrastive regularization on the expert outputs ensures module diversity.
Human-Human Interaction (Maeda et al., 15 Dec 2025)
Given $T$ frames and a shared, learnable positional embedding $P \in \mathbb{R}^{T \times d}$:
- Input streams: $X, Y \in \mathbb{R}^{T \times d}$; $P$ is added to each.
- Synchronized self-attention per stream: $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(Q K^{\top} / \sqrt{d}\right) V$, with queries, keys, and values from the same stream and weights shared across streams.
- Motion cross-attention: person $X$'s queries attend to $Y$'s keys/values and vice versa, e.g. $X' = \mathrm{Attn}(X W^Q, Y W^K, Y W^V)$.
- Distance cross-attention: a spatial relation embedding computed from dynamic inter-person distances is injected via a further attention step.
- Layer normalization and residual connections; stack $L$ layers.
Medical Segmentation (Hasan et al., 24 Jan 2025)
- Per-frame projections: $Q_t = F_t W^Q$, $K_t = F_t W^K$, $V_t = F_t W^V$, each split into heads.
- Multi-head scaled dot-product cross-attention between frames $t$ and $t'$: $A_{t \to t'} = \mathrm{softmax}\!\left(Q_t K_{t'}^{\top} / \sqrt{d_k}\right) V_{t'}$
- Channel-wise gating: $\hat{A} = \sigma\!\left(\mathrm{Conv}_{1 \times 1}(A)\right) \odot A$
- Concatenate $\hat{A}$ with $F_t$, apply convolution, batch normalization, and ReLU; aggregate temporally.
4. Design Variants and Implementation Details
- (Lan et al., 2023): $K$ shared experts; single-layer LSTM temporal encoder; $K$-way softmax routing per time step; contrastive loss added to the RL objective.
- (Maeda et al., 15 Dec 2025): stack of 8 TASUs; synchronized, learnable positional embedding; all projection matrices and attention weights fully shared between streams.
- (Hasan et al., 24 Jan 2025): multi-head cross-attention with channel dimension $C$, typically 256 or 512; $1 \times 1$ gating convolution; module inserted at various levels of CNN or transformer backbones.
In all cases, the TASM has been designed as a general plug-in block, agnostic to most architectural details so long as input sequences share compatible temporal alignment.
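The plug-in pattern common to all three variants can be illustrated generically: a shared temporal attention block inserted between a per-frame encoder and a task head. This is a schematic sketch, not any paper's exact module; the backbone and head here are stand-in layers.

```python
import torch
import torch.nn as nn

class SharedTemporalAttention(nn.Module):
    """Generic plug-in sketch: shared multi-head self-attention over the
    temporal axis of per-frame features, with residual and layer norm."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):
        # feats: (B, T, dim) per-frame features from any backbone whose
        # temporal axis is aligned across the sequence.
        return self.norm(feats + self.attn(feats, feats, feats)[0])

# Insert between a per-frame backbone and a task head:
backbone = nn.Linear(32, 64)        # stand-in per-frame encoder
tasm = SharedTemporalAttention(64)
head = nn.Linear(64, 10)            # stand-in task head

frames = torch.randn(2, 8, 32)      # (batch, time, features)
out = head(tasm(backbone(frames)))  # (2, 8, 10)
```

The only hard requirement, as noted above, is that the input sequences share a compatible temporal alignment.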
5. Empirical Evaluation and Ablation Analyses
Multi-Task RL (Lan et al., 2023)
- Removing the LSTM’s hidden state from the attention combiner causes substantial drops in multi-task success rates, revealing temporal dynamic routing as critical for mitigating negative transfer within episodes.
- Disabling the contrastive loss collapses module diversity and harms final performance, especially in complex settings (MT50-Mixed).
- With proper TASM configuration, success rates reach 80% (MT10-Mixed) and 70% (MT50-Mixed), setting new benchmarks.
- An optimal intermediate expert count exists; both too few and too many experts degrade results.
Human-Human Interaction (Maeda et al., 15 Dec 2025)
- Parameter sharing across both person streams yields an 8.3% AUC improvement over independent streams.
- Synchronized positional embedding is necessary: switching to sinusoidal or non-shared embeddings drops AUC by 5–7%.
- Overall, the fully realized TASM (TASUs, parameter sharing, synchronized embedding) improves AUC by 17.5% on “Dance” and 13.5% on “Help up” versus ML-AAD.
Medical Segmentation (Hasan et al., 24 Jan 2025)
- TASM consistently improves Dice similarity (0.899 → 0.921 in FCN8s), reduces Hausdorff distance (6.38 mm → 3.31 mm), and suppresses spurious segment “islands” (PIA from 0.58% → 0.02%).
- Requires substantially fewer FLOPs and 30M fewer parameters than naïve Conv3D alternatives for temporal modeling.
- Compatible with diverse architectures (UNet, FCN8s, UNetR, SwinUNetR, I²UNet) and temporally flexible—optimal at 2–3 frames.
6. Limitations and Prospective Extensions
- Human-Human Interaction: Current implementations do not temporally localize anomalies within a sequence and lack modeling of human–object interactions. Possible extensions include multi-head (not single-head) cross-attention and hierarchies for modeling more than two agents (Maeda et al., 15 Dec 2025).
- Multi-Task RL: Excessive modularization ($K$ too large) leads to redundancy and slower training, while too few experts constrain expressivity (Lan et al., 2023).
- Medical Segmentation: TASM requires at least two temporally aligned frames. Further extensions to arbitrary temporal contexts, or very long sequences, may need hierarchical or memory-efficient attention scaling (Hasan et al., 24 Jan 2025).
7. Cross-Domain Significance and Generalization
Despite differences in implementation, all TASM variants share the objective of enabling information flow across temporal dimensions with parameter or representation sharing to address overfitting, negative transfer, or loss of crucial temporal context. The alignment of temporal embeddings, explicit cross-attention, and fine-grained sharing strategies offer robust performance improvements across reinforcement learning, video-based medical imaging, and complex interactive sequence tasks.
By abstracting temporal dependencies with shared attention structures, TASM bridges multi-stream temporal modeling, per-step modular routing, and context conditional refinement, providing a unifying design pattern for sequence modeling architectures in machine learning research.
References
- "Contrastive Modules with Temporal Attention for Multi-Task Reinforcement Learning" (Lan et al., 2023)
- "3D Human-Human Interaction Anomaly Detection" (Maeda et al., 15 Dec 2025)
- "Motion-enhancement to Echocardiography Segmentation via Inserting a Temporal Attention Module" (Hasan et al., 24 Jan 2025)