
Selective Interaction Module (SIM)

Updated 25 November 2025
  • Selective Interaction Module (SIM) is a differentiable neural component that dynamically selects key inputs based on data-dependent importance, improving overall model efficiency.
  • It leverages attention mechanisms, score-based ranking, and Gumbel-Softmax relaxation to perform hard masking of redundant or uninformative features.
  • SIM reduces computational overhead by filtering background content, thereby enhancing downstream performance in both trajectory forecasting and multi-modal object re-identification.

A Selective Interaction Module (SIM) is a differentiable neural architectural component designed to dynamically identify and focus on salient elements—whether they are agents in social interaction prediction tasks or informative tokens in multi-modal representations—in order to improve efficiency and discriminative capacity in downstream models. SIMs have been instantiated with distinct but analogous designs in both human trajectory prediction, as the "Importance Estimator" (Urano et al., 23 Jun 2025), and in multi-modal object re-identification pipelines (Liu et al., 22 Nov 2025). Core to SIM is the idea of learning to select or mask inputs based on data-dependent importance assessed in the context of the current task and scene, typically leveraging attention mechanisms, score-based ranking, and stochastic relaxation for differentiability.

1. Design Objectives and Problem Formulation

SIMs are motivated by the observation that many downstream tasks are computationally dominated by redundant or background content—e.g., uninformative neighbors in human trajectory forecasting, or background image patches in ReID. Their main objective is to estimate, from a set of candidates $X = \{x_i\}$ (trajectories, tokens, etc.), a saliency or importance value $s_i$ for each, and to select the subset contributing most to the task prediction.

For trajectory prediction, given a primary agent and $N-1$ neighbors, SIM outputs continuous scores $s_i \in [0,1]$ for each neighbor and applies a mask at inference time using a threshold $\tau$ (default 0.5) to select the subset provided to the trajectory predictor (Urano et al., 23 Jun 2025). For multi-modal re-identification, SIM selects the top-$k$ informative patch tokens per modality using both intra- and inter-modal attention-derived scores, propagating only these to the alignment and classification modules (Liu et al., 22 Nov 2025).

2. Algorithmic Architectures

Human Trajectory Prediction: Importance Estimator

SIM operates on person-specific embeddings $f_i \in \mathbb{R}^d$—output by an individual feature extractor—using a shallow MLP and a compact Transformer:

  1. Pre-processing: Each observed trajectory $X_i \in \mathbb{R}^{T_{obs}\times 2}$ is mapped to $f_i$.
  2. MLP Layers:
    • $h_i^{(1)} = \text{ReLU}(W_1 f_i + b_1)$, with $W_1 \in \mathbb{R}^{64\times d}$
    • $h_i^{(2)} = \text{ReLU}(W_2 h_i^{(1)} + b_2)$, with $W_2 \in \mathbb{R}^{64\times 64}$
  3. Self-attention: The set $\{h_i^{(2)}\}_{i=1}^N$ is processed by a shallow Transformer (1–2 layers, hidden size 64).
  4. Final linear + sigmoid:
    • $l_i = w_3^T h_i + b_3$
    • $s_i = \sigma(l_i)$

The outputs $\{s_i\}$ parameterize selection via a Binary Concrete (Gumbel-Softmax) relaxation for differentiability (Urano et al., 23 Jun 2025).
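As an illustration, the MLP scoring head described in the steps above can be sketched in NumPy. This is a hypothetical minimal sketch: the shallow Transformer of step 3 is omitted, and all weights are random stand-ins rather than trained parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def importance_scores(F, W1, b1, W2, b2, w3, b3):
    """Score N neighbor embeddings F (N x d) with values in (0, 1).

    Sketch of the MLP scoring head (steps 2 and 4 above); the
    shallow Transformer of step 3 is omitted for brevity.
    """
    h1 = np.maximum(0.0, F @ W1.T + b1)   # (N, 64), first ReLU layer
    h2 = np.maximum(0.0, h1 @ W2.T + b2)  # (N, 64), second ReLU layer
    logits = h2 @ w3 + b3                 # (N,), final linear layer l_i
    return sigmoid(logits)                # (N,), scores s_i

# Toy usage with random stand-in weights (d = 8, N = 5 neighbors).
rng = np.random.default_rng(0)
d = 8
F = rng.normal(size=(5, d))
W1, b1 = 0.1 * rng.normal(size=(64, d)), np.zeros(64)
W2, b2 = 0.1 * rng.normal(size=(64, 64)), np.zeros(64)
w3, b3 = 0.1 * rng.normal(size=64), 0.0
s = importance_scores(F, W1, b1, W2, b2, w3, b3)
```

The scores $s_i$ produced here would then be fed to the Binary Concrete relaxation for differentiable hard selection.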

Multi-Modal Re-Identification: Token Selector

For each modality $m \in \{R, N, T\}$ (RGB, NIR, TIR):

  1. Patch-tokenization: Image $I_m \in \mathbb{R}^{3\times H\times W}$ is split into $L$ non-overlapping patches; CLIP-ViT encodes $[f_m^{cls}; f_m^1, \ldots, f_m^L]$.
  2. Intra-modal scoring:
    • Compute attention to the class token: $S_m = \text{Softmax}\left( Q_m K_m^\top / \sqrt{D} \right)$, where $Q_m = f_m^c$ and $K_m = f_m^p$.
    • Select the top-$k_1$ patches per modality.
  3. Inter-modal scoring:
    • Cross-attend class tokens from all modalities to all patch tokens.
    • For each modality, select the top-$k_2$ patches most attended by the other modalities' class tokens.
  4. Union mask and selection: Final mask is union of intra- and inter-modal selections.
  5. Modal interaction: Stacked class tokens attend to all selected tokens from all modalities via multi-head cross-attention and residual FFN, yielding a fused tri-modal feature vector (Liu et al., 22 Nov 2025).
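Steps 2–4 above can be sketched in NumPy as follows. This is a hypothetical sketch on random toy data, assuming the learned projections $W_q$, $W_k$ are folded into the tokens; the cross-attention fusion of step 5 is not shown.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def select_tokens(cls_tokens, patch_tokens, k1, k2):
    """Union of intra- and inter-modal top-k patch selections (steps 2-4).

    cls_tokens: (M, D), one class token per modality.
    patch_tokens: (M, L, D), L patch tokens per modality.
    The learned projections W_q / W_k are folded into the tokens for brevity.
    """
    M, L, D = patch_tokens.shape
    keep = np.zeros((M, L), dtype=bool)
    for m in range(M):
        # Intra-modal: this modality's class token attends to its own patches.
        intra = softmax(patch_tokens[m] @ cls_tokens[m] / np.sqrt(D))
        keep[m, np.argsort(intra)[-k1:]] = True
        # Inter-modal: class tokens of the other modalities attend to these patches.
        others = np.delete(cls_tokens, m, axis=0)              # (M-1, D)
        inter = softmax(others @ patch_tokens[m].T / np.sqrt(D), axis=-1)
        keep[m, np.argsort(inter.mean(axis=0))[-k2:]] = True   # union mask
    return keep

# Toy usage: 3 modalities (R, N, T), L = 32 patches, D = 16.
rng = np.random.default_rng(1)
cls_tokens = rng.normal(size=(3, 16))
patch_tokens = rng.normal(size=(3, 32, 16))
mask = select_tokens(cls_tokens, patch_tokens, k1=8, k2=4)
```

Because the final mask is a union, each modality retains between $\max(k_1, k_2)$ and $k_1 + k_2$ patches, depending on how much the two selections overlap.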

3. Mathematical Formulation

Trajectory Prediction Selection

Final selection probabilities (after the Transformer): $l_i = w_3^T h_i + b_3, \quad s_i = \sigma(l_i)$

Sampling via Binary Concrete: $y_i = \sigma\left( \frac{\log s_i - \log(1 - s_i) + g_i - g_i'}{\tau} \right)$ with $g_i, g_i' \sim \text{Gumbel}(0,1)$; here $\tau$ denotes the relaxation temperature (distinct from the 0.5 inference threshold). Optionally discretize: $z_i = \mathbb{I}[y_i \geq 0.5]$. Structural masking is then applied: only the $X_i$ with $z_i = 1$ are passed to the downstream trajectory attention block (Urano et al., 23 Jun 2025).
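The Binary Concrete sampler can be sketched directly from the formula above, drawing Gumbel noise via the standard inverse-CDF trick $g = -\log(-\log U)$, $U \sim \text{Uniform}(0,1)$:

```python
import numpy as np

def binary_concrete(s, tau=1.0, rng=None, hard=True):
    """Sample (relaxed) binary gates from scores s in (0, 1).

    Sketch of the Binary Concrete relaxation: tau is the relaxation
    temperature; hard=True applies the optional 0.5 discretization.
    """
    if rng is None:
        rng = np.random.default_rng()
    # Two independent Gumbel(0,1) samples via -log(-log(U)).
    g  = -np.log(-np.log(rng.uniform(size=s.shape)))
    g2 = -np.log(-np.log(rng.uniform(size=s.shape)))
    logits = np.log(s) - np.log1p(-s) + g - g2
    y = 1.0 / (1.0 + np.exp(-logits / tau))     # relaxed gate y_i in (0, 1)
    return (y >= 0.5).astype(float) if hard else y

s = np.array([0.9, 0.1, 0.6])
z = binary_concrete(s, tau=0.5, rng=np.random.default_rng(0))          # hard gates
y = binary_concrete(s, tau=0.5, rng=np.random.default_rng(0), hard=False)  # relaxed
```

At training time the relaxed gates $y_i$ keep the selection differentiable; at inference the hard gates $z_i$ implement the structural mask.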

Multi-Modal Token Selection

Intra-modal score:

$S_m = \text{Softmax}\left( \frac{f_m^c W_q (f_m^p W_k)^\top}{\sqrt{D}} \right)$

Inter-modal score:

$S = \text{Softmax}\left( \frac{Q K^\top}{\sqrt{D}} \right), \quad Q = \mathcal{T}([f_R^c; f_N^c; f_T^c]), \quad K = \mathcal{C}([f_R^p; f_N^p; f_T^p])$

Selection is based on the union of the intra- and inter-modal masks.

4. Loss Functions and Regularization

  • Trajectory Prediction:
    • Trajectory MSE loss:

    $L_t = \frac{1}{N(T_{pred} - T_{obs})} \sum_{i=1}^{N} \sum_{t=T_{obs}+1}^{T_{pred}} \lVert \hat{Y}_i^t - Y_i^t \rVert^2$

    • Variance loss (prevents all $s_i$ from converging to 1): $L_v = -\log(\text{Var}_s + \epsilon)$
    • Total loss: $L_{total} = L_t + \alpha L_v$ with $\alpha = 1$ (Urano et al., 23 Jun 2025).
  • Multi-Modal ReID: the selected tokens feed an ID classification loss together with the global and local alignment objectives of the SIGNAL pipeline (Liu et al., 22 Nov 2025).
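The trajectory-prediction losses above can be sketched directly in NumPy:

```python
import numpy as np

def trajectory_mse(Y_hat, Y):
    """L_t: squared-error loss averaged over N agents and forecast steps.

    Y_hat, Y: (N, T_f, 2) predicted / ground-truth future positions,
    where T_f = T_pred - T_obs.
    """
    N, T_f, _ = Y.shape
    return np.sum((Y_hat - Y) ** 2) / (N * T_f)

def variance_loss(s, eps=1e-6):
    """L_v = -log(Var(s) + eps): large when all scores collapse to one value."""
    return -np.log(np.var(s) + eps)

def total_loss(Y_hat, Y, s, alpha=1.0):
    """L_total = L_t + alpha * L_v (alpha = 1 in the paper)."""
    return trajectory_mse(Y_hat, Y) + alpha * variance_loss(s)

# Collapsed scores incur a much larger variance penalty than spread scores,
# which is what pushes the estimator toward genuine selection.
collapsed = np.full(6, 1.0)
spread = np.array([0.05, 0.9, 0.4, 0.8, 0.2, 0.6])
```

Note how the variance term acts purely as a regularizer: it never rewards accuracy, only non-degenerate score distributions.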

5. Empirical Performance and Computational Analysis

Trajectory Prediction (JRDB Dataset)

  • Baseline Social-Trans: ADE = 0.376, FDE = 0.741
  • With SIM ("Importance Estimator"): ADE = 0.377 (+0.3%), FDE = 0.747 (+0.8%)
  • Efficiency: SIM with variance loss achieves an ~8.1% reduction in FLOPs on JRDB (1.49G to 1.37G on average), scaling favorably with scene density. When the variance loss is ablated, the estimator collapses ($s_i \approx 1$), yielding a net FLOPs increase due to overhead but no pruning (Urano et al., 23 Jun 2025).

Multi-Modal ReID

SIM, as part of SIGNAL (Liu et al., 22 Nov 2025), materially improves retrieval by filtering background patches. Hyperparameters include the embedding dimension $D$ (512 or 768) and $k_1 = 80$ (intra-modal top-$k$ selection); the module adds approximately 3.15M learnable parameters on top of the visual-encoder backbone.

6. Integration with Broader Pipelines

  • Human Trajectory: SIM is prepended to Social-Trans, replacing dense self-attention among all persons with attention only among the selected agents, reducing the $O(N^2)$ complexity.
  • Multi-Modal ReID: SIM sits ahead of global and local alignment modules, providing a pruned set of informative, object-centric features. The outputs directly feed to ID loss, global alignment (mean pooling), and local alignment (shift-aware deformable processing).
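For the trajectory pipeline, the structural masking that feeds Social-Trans can be sketched as follows (a hypothetical sketch; the convention that the primary agent sits at index 0 and is always retained is an assumption for illustration):

```python
import numpy as np

def prune_agents(X, z, primary=0):
    """Keep only agents whose SIM gate z_i = 1 before the O(N^2) attention.

    X: (N, T_obs, 2) observed trajectories; z: (N,) binary gates.
    `primary` is an assumed index for the agent being predicted; it is
    always retained regardless of its gate.
    """
    keep = z.astype(bool).copy()
    keep[primary] = True              # never drop the primary agent
    return X[keep]                    # (N_kept, T_obs, 2)

# Toy usage: 5 agents, 9 observed steps; gates keep agents 1 and 3.
X = np.zeros((5, 9, 2))
z = np.array([0.0, 1.0, 0.0, 1.0, 0.0])
X_kept = prune_agents(X, z)
```

Attention cost then scales with $N_{kept}^2$ rather than $N^2$, which is the source of the FLOPs savings reported in Section 5.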

7. Significance, Limitations, and Implementation

SIM enables fully differentiable hard selection of relevant context, preserving end-to-end gradient flow via Gumbel-Softmax relaxation or differentiable TopK. The variance loss term is essential to prevent trivial solutions where all elements are retained, ensuring true selection behavior. SIM’s ablations demonstrate that selection without explicit regularization degenerates to the non-selective baseline, underscoring the necessity of variance or similar diversity-promoting losses.

A plausible implication is that the general SIM scheme is extensible to other domains requiring hard masking or dynamic selection—provided the crucial regularization and relaxation mechanisms are retained.

References:

  • "Selective Social-Interaction via Individual Importance for Fast Human Trajectory Prediction" (Urano et al., 23 Jun 2025)
  • "Signal: Selective Interaction and Global-local Alignment for Multi-Modal Object Re-Identification" (Liu et al., 22 Nov 2025)