AIMformer: Motion-Centric Attention in Transformers

Updated 4 July 2026

AIMformer is a motion-centric attention methodology that explicitly selects and aggregates motion information across time, structure, and modalities.
It employs multi-level and locality-controlled mechanisms to enhance processing in applications like human motion prediction, video understanding, and vehicular platooning.
Empirical evaluations demonstrate that motion-aware attention designs lead to significant improvements in performance, interpretability, and robustness across diverse motion tasks.

Attention In Motion (AIMformer) denotes a family of motion-centric attention designs in which attention is used as the principal mechanism for selecting, aligning, aggregating, or conditioning motion information across time, body structure, visual space, or interacting agents. In the available literature, the name is used explicitly for transformer-based vehicular platooning security (Kalogiannis et al., 17 Dec 2025), while closely related works apply the same design logic to human motion prediction (Mao et al., 2020, Mao et al., 2021), visual motion perception (Sun et al., 2023), zero-shot motion transfer in diffusion models (Raab et al., 2024), video understanding via motion prompts (Chen et al., 2024), synchronous motion captioning (Radouane et al., 2024), and masked motion diffusion for reconstruction and in-betweening (Jiang et al., 8 Mar 2026). This suggests that AIMformer is best understood not as a single standardized architecture but as a research lineage in which attention is engineered to follow motion structure rather than to treat motion as a secondary by-product of generic sequence modeling.

1. Terminological scope and historical emergence

The earliest formulation in this lineage is motion attention for human motion prediction. "History Repeats Itself: Human Motion Prediction via Motion Attention" (Mao et al., 2020) and its multi-level extension "Multi-level Motion Attention for Human Motion Prediction" (Mao et al., 2021) replace frame-wise pose matching with attention over historical motion subsequences. Their central claim is that human motion tends to repeat itself, and that prediction should therefore retrieve relevant historical motion clips rather than compare isolated poses.

Subsequent work broadens the same principle into other motion domains. A biologically motivated two-stage model combines trainable motion energy sensing with recurrent self-attention for adaptive motion integration and segregation, explicitly targeting the V1-MT pathway of human visual motion processing (Sun et al., 2023). In motion generation and editing, "Monkey See, Monkey Do: Harnessing Self-attention in Motion Diffusion for Zero-shot Motion Transfer" treats pretrained self-attention as a latent correspondence engine and performs inference-time motion transfer by rerouting queries, keys, and values (Raab et al., 2024). In video understanding, "Motion meets Attention: Video Motion Prompts" inserts a lightweight motion prompt layer between video input and backbone model, using attention-like modulation of frame differencing maps to mitigate "blind motion extraction" (Chen et al., 2024).

The 2024 synchronous captioning model makes attention distributions explicitly controllable so that words are generated progressively in synchronization with human motion (Radouane et al., 2024). The 2025 platooning paper is the most literal use of the name, presenting "Attention in Motion: Secure Platooning via Transformer-based Misbehavior Detection" as a Transformer encoder for real-time detection in safety-critical V2X settings (Kalogiannis et al., 17 Dec 2025). The 2026 masked motion diffusion model further extends the pattern into reconstruction under occlusion, with Kinematic Attention Aggregation alternating joint-level and pose-level reasoning (Jiang et al., 8 Mar 2026).

Area	Attention role	Representative work
Human motion prediction	Retrieve repeated motion subsequences	(Mao et al., 2020, Mao et al., 2021)
Visual motion perception	Integrate and segregate motion globally	(Sun et al., 2023)
Motion diffusion editing	Latent correspondence and feature mixing	(Raab et al., 2024)
Video understanding	Motion-prompted input adaptation	(Chen et al., 2024)
Motion-language alignment	Controlled temporal synchronization	(Radouane et al., 2024)
Platooning security	Spatio-temporal anomaly modeling	(Kalogiannis et al., 17 Dec 2025)
Motion reconstruction	Joint/pose aggregation under masking	(Jiang et al., 8 Mar 2026)

2. Recurring technical principles

AIMformer-style systems share a motion-selective view of attention. Rather than using attention only as a generic sequence operator, they shape it so that the attended units correspond to motion phases, motion regions, motion subsequences, or motion-aligned cross-modal segments. In motion prediction, the query is the latest observed motion window, the keys are historical motion windows, and the values are future-containing historical subsequences (Mao et al., 2020). In MoMo, the zero-shot motion transfer method, the leader’s queries are paired with the follower’s keys and values so that temporal outline is inherited from the leader while motifs remain follower-specific (Raab et al., 2024). In synchronous captioning, decoder cross-attention is restricted to a learnable temporal window and regularized so that alignments move forward monotonically (Radouane et al., 2024).

A second recurring principle is explicit separation of temporal structure from stylistic or structural detail. MoMo reports that $Q$ is more dominated by motion outline and temporal structure, whereas $K$ is more dominated by motifs; the model therefore uses the leader’s query with the follower’s key and value (Raab et al., 2024). Multi-level motion attention separates full-pose, body-part, and joint-level similarity because different motions benefit from different granularity (Mao et al., 2021). Kinematic Attention Aggregation performs structural attention over joints within a pose and temporal attention only over pose tokens, then broadcasts the updated pose representation back to joint features (Jiang et al., 8 Mar 2026).

A third principle is locality control. The captioning Transformer uses a sliding-window self-attention mask $\Gamma_i=[i-r,i+r]$ and a learnable cross-attention window $\gamma_t=[m_t-D,m_t+D]$ to avoid undesired information mixing (Radouane et al., 2024). The video motion prompt layer enforces temporal smoothness with pair-wise temporal attention variation regularization, thereby suppressing noisy frame-difference spikes (Chen et al., 2024). The platooning model uses global positional encoding with vehicle-specific temporal offsets so that vehicles entering and leaving at different times align to a shared platoon timeline (Kalogiannis et al., 17 Dec 2025).
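The two windows described for the captioning Transformer can be pictured as boolean masks. The following is a minimal sketch, assuming integer frame indices, rounding of the window center, and clipping at the sequence boundaries; the function names are hypothetical and are not taken from the paper.

```python
def sliding_window_mask(n_frames, r):
    """mask[i][j] is True when frame j lies in Gamma_i = [i - r, i + r]."""
    return [[abs(i - j) <= r for j in range(n_frames)] for i in range(n_frames)]


def cross_attention_window(n_frames, center, d):
    """Boolean mask over frames for gamma_t = [m_t - D, m_t + D].

    `center` stands in for m_t. Rounding it to an integer and clipping
    the window to the valid frame range are assumptions of this sketch.
    """
    lo = max(0, round(center) - d)
    hi = min(n_frames - 1, round(center) + d)
    return [lo <= j <= hi for j in range(n_frames)]


# Self-attention over 5 frames with radius r = 1.
mask = sliding_window_mask(5, 1)
# Cross-attention window of half-width D = 1 around frame 2, out of 6 frames.
window = cross_attention_window(6, center=2.0, d=1)
```

Positions where the mask is False would typically be excluded from the softmax, so each frame (or word) only mixes information from its local neighborhood.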

A fourth principle is that attention may operate either inside the backbone or before it. MoMo edits self-attention inside a frozen pretrained diffusion backbone at inference time (Raab et al., 2024). The platooning detector is a Transformer encoder in which self-attention is the main modeling substrate (Kalogiannis et al., 17 Dec 2025). By contrast, Video Motion Prompts do not replace the backbone’s internal attention; they precondition the input frames before token embedding, functioning as a plug-and-play adapter (Chen et al., 2024). This distinction is important because AIMformer-style motion reasoning is not confined to one architectural insertion point.

3. Representative mathematical formulations

A defining formulation in motion prediction is subsequence-level motion attention. Given a query $\mathbf{q}$ from the latest observed motion clip and keys $\mathbf{k}_i$ from historical motion clips, attention weights are normalized by the sum of dot products rather than by softmax:

$a_i=\frac{\mathbf{q}\mathbf{k}_i^T}{\sum_{j=1}^{N-M-T+1}\mathbf{q}\mathbf{k}_j^T}, \qquad \mathbf{U}=\sum_{i=1}^{N-M-T+1} a_i \mathbf{V}_i .$

The models apply ReLU in $f_q^p$ and $f_k^p$ so attention scores stay non-negative, and the multi-level version performs this at joint, part, and full-pose levels (Mao et al., 2020, Mao et al., 2021).
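As a rough illustration of the sum-normalized attention above, the sketch below computes $a_i$ and $\mathbf{U}$ for toy vectors. It assumes that applying ReLU directly to small hand-written vectors can stand in for the learned mappings $f_q^p$ and $f_k^p$; the helper names and the small `eps` safeguard are additions for this example, not part of the published formula.

```python
def relu(x):
    return max(0.0, x)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def motion_attention(query, keys, values, eps=1e-12):
    """Sum-normalized attention: a_i = q.k_i / sum_j q.k_j, U = sum_i a_i V_i.

    query  : non-negative feature vector of the latest motion clip
    keys   : non-negative feature vectors of historical motion clips
    values : one vector per historical sub-sequence
    eps    : guards against division by zero (an assumption of this sketch)
    """
    scores = [dot(query, k) for k in keys]
    total = sum(scores) + eps
    weights = [s / total for s in scores]
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, out


# ReLU keeps every score non-negative, so the weights behave like a distribution.
q = [relu(x) for x in [1.0, -2.0]]
keys = [[relu(x) for x in k] for k in [[1.0, 0.0], [3.0, 5.0], [-1.0, 2.0]]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, U = motion_attention(q, keys, values)
```

In this toy case the second historical clip receives three times the weight of the first, and the third clip, whose key has no overlap with the query, receives none.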

MoMo’s central operation is mixed self-attention during denoising. With leader query and follower key/value, the output stream uses

$OH^{out}=IH^{out}+\text{softmax}\left(\frac{Q^{ldr}\cdot (K^{flw})^{T}}{\sqrt{IH_n}}\right)V^{flw}.$

The paper describes this as a latent semantic correspondence: the leader query identifies which temporal or semantic region should be matched, the follower key identifies the nearest semantically similar region, and the follower value supplies the content written into the output (Raab et al., 2024).
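The mixed attention step can be sketched in a few lines of plain Python. This is an illustrative sketch only: it assumes the scaling term under the square root is the key feature dimension, uses a single head, and omits the learned projections and the diffusion backbone entirely. The function names are hypothetical.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]


def mixed_attention(h_in, q_ldr, k_flw, v_flw):
    """Residual update h_in + softmax(Q_ldr K_flw^T / sqrt(d)) V_flw.

    Each argument is a list of row vectors. Queries come from the leader,
    keys and values from the follower. d is taken to be the key feature
    size, which is an assumption about the scaling term.
    """
    d = len(k_flw[0])
    n_out = len(v_flw[0])
    out = []
    for h_row, q in zip(h_in, q_ldr):
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_flw]
        w = softmax(logits)
        attended = [sum(wi * v[c] for wi, v in zip(w, v_flw)) for c in range(n_out)]
        out.append([h + a for h, a in zip(h_row, attended)])
    return out


k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
# A zero query attends uniformly and returns the mean of the follower values.
uniform = mixed_attention([[0.0, 0.0]], [[0.0, 0.0]], k, v)
# A sharp leader query picks out the matching follower row almost exclusively.
sharp = mixed_attention([[0.0, 0.0]], [[10.0, 0.0]], k, v)
```

The point of the sketch is the routing: the leader decides where to look, and everything written into the output comes from the follower.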

The video motion prompt layer converts frame differencing maps into attention maps using a modified Sigmoid with learnable slope and shift. This motion-guided modulation is followed by a temporal attention variation regularizer and then by a Hadamard product with the original frames to form Video Motion Prompts (Chen et al., 2024).
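The pipeline of frame differencing, Sigmoid modulation, temporal regularization, and Hadamard product can be sketched as follows. This is a hedged illustration, not the paper's formulation: the Sigmoid form `1 / (1 + exp(-slope * (x - shift)))`, the use of absolute differences, the choice to multiply each attention map with the later frame of the pair, and the mean-squared form of the regularizer are all assumptions, and the slope and shift are fixed numbers here rather than learned parameters.

```python
import math


def modulated_sigmoid(x, slope, shift):
    """Sigmoid with a slope and shift (assumed parameterization)."""
    return 1.0 / (1.0 + math.exp(-slope * (x - shift)))


def motion_prompts(frames, slope=5.0, shift=0.5):
    """frames: list of T frames, each a flat list of intensities in [0, 1]."""
    attn, prompts = [], []
    for prev, cur in zip(frames, frames[1:]):
        diff = [abs(c - p) for p, c in zip(prev, cur)]         # frame differencing
        a = [modulated_sigmoid(d, slope, shift) for d in diff]  # attention map
        attn.append(a)
        prompts.append([w * c for w, c in zip(a, cur)])         # Hadamard product
    return attn, prompts


def temporal_variation(attn):
    """Mean squared change between consecutive attention maps."""
    if len(attn) < 2:
        return 0.0
    total = 0.0
    for a, b in zip(attn, attn[1:]):
        total += sum((y - x) ** 2 for x, y in zip(a, b))
    return total / (len(attn) - 1)


# Two-pixel toy video: the first pixel switches on between frames 0 and 1.
frames = [[0.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
attn, prompts = motion_prompts(frames)
penalty = temporal_variation(attn)
```

The moving pixel gets a high attention weight and the static pixel a low one; the penalty is nonzero because the attention map changes abruptly once the motion stops, which is the kind of spike the regularizer is meant to smooth.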

Controlled synchronous captioning introduces an explicit alignment center computed from the cross-attention weights between the current word state and the motion frames. The accompanying losses push the attention center toward the sequence start and enforce forward motion over time (Radouane et al., 2024).
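One way to picture an alignment center and a forward-motion constraint is shown below. This sketch is an assumption-laden illustration rather than the paper's losses: it takes the center to be the attention-weighted mean frame index and penalizes backward steps with a hinge term. Both function names are hypothetical.

```python
def attention_center(weights):
    """Attention-weighted mean frame index for one generated word."""
    total = sum(weights)
    return sum(j * w for j, w in enumerate(weights)) / total


def monotonic_penalty(centers):
    """Sum of backward steps max(0, m_{t-1} - m_t) over consecutive words.

    The penalty is zero when the centers never move backwards in time.
    """
    return sum(max(0.0, prev - cur) for prev, cur in zip(centers, centers[1:]))


# Cross-attention rows for three words over four motion frames.
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
centers = [attention_center(row) for row in attn]
forward = monotonic_penalty(centers)
# Swapping the last two words makes the alignment jump back in time.
backward = monotonic_penalty([centers[0], centers[2], centers[1]])
```

A term of this kind rewards word-to-frame alignments that advance with the motion stream, which is the behavior the monotonicity objective is described as enforcing.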

The platooning AIMformer defines global positional encoding with vehicle-specific offsets so that asynchronous vehicle trajectories align to a common global platoon timeline. Its Precision-Focused BCE loss selectively increases the cost of confident false positives, decoupling false-positive suppression from positive-class weighting (Kalogiannis et al., 17 Dec 2025).
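The idea of aligning vehicles to a shared timeline can be sketched by indexing a positional encoding with a global step rather than a per-vehicle local step. The sketch assumes a standard sinusoidal encoding and an integer join step per vehicle; neither detail is confirmed by the text above, and the function names are hypothetical.

```python
import math


def sinusoidal_pe(position, d_model):
    """Standard sinusoidal positional encoding for a single position."""
    pe = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        if i + 1 < d_model:
            pe.append(math.cos(angle))
    return pe


def encode_vehicle(join_step, n_steps, d_model=4):
    """Encode a vehicle's local steps at their global platoon positions.

    join_step is the (assumed integer) offset at which the vehicle
    entered the platoon; local step t maps to global step join_step + t.
    """
    return [sinusoidal_pe(join_step + t, d_model) for t in range(n_steps)]


# Vehicle A is present from global step 0; vehicle B joins at global step 3.
vehicle_a = encode_vehicle(join_step=0, n_steps=5)
vehicle_b = encode_vehicle(join_step=3, n_steps=2)
```

With the offset applied, the first observation of vehicle B carries the same encoding as the fourth observation of vehicle A, so attention across vehicles compares messages from the same moment on the platoon timeline.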

4. Domain-specific instantiations

In human motion prediction, AIMformer-style models treat the motion history as a memory bank of subsequences. The attention module retrieves recurrent motion patterns, and a residual GCN with learnable adjacency refines the attended motion prior in DCT space before inverse transformation back to future poses (Mao et al., 2020). The multi-level extension shows that full-pose attention is especially useful for periodical motions such as walking, joint-level attention helps when different joints have different rhythms, and part-level attention resolves ambiguity when joint-level history is too local and noisy (Mao et al., 2021).
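To make the role of DCT space concrete, the sketch below transforms a single joint trajectory to DCT coefficients and back. It is a generic orthonormal DCT-II and its inverse written out directly, intended only to illustrate the round trip; it is not the prediction model, and the choice of the orthonormal scaling is an assumption.

```python
import math


def _scale(k, n):
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct(x):
    """Orthonormal DCT-II of a 1-D trajectory."""
    n = len(x)
    return [
        _scale(k, n) * sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        for k in range(n)
    ]


def idct(c):
    """Inverse of the orthonormal DCT-II above."""
    n = len(c)
    return [
        sum(_scale(k, n) * c[k] * math.cos(math.pi * (t + 0.5) * k / n) for k in range(n))
        for t in range(n)
    ]


# One coordinate of one joint over six frames.
trajectory = [0.0, 0.2, 0.5, 0.9, 1.0, 0.8]
coeffs = dct(trajectory)
recovered = idct(coeffs)
# A constant trajectory has energy only in the first (DC) coefficient.
flat = dct([1.0, 1.0, 1.0, 1.0])
```

Working on coefficients rather than raw frames means a network refines the whole temporal shape of a joint trajectory at once, with low-order coefficients describing smooth motion.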

In motion diffusion editing, MoMo addresses unpaired motion transfer rather than ordinary synthesis. The input consists of a leader motion and a follower motion, and the output should follow the leader’s rhythm, timing, and high-level structure while preserving the follower’s style, posture, and local gesture patterns. The method uses DDIM inversion for real motions and deterministic denoising for both real and generated motions, allowing editing at inference time without training or finetuning (Raab et al., 2024).

In video understanding, Video Motion Prompts target action recognition on HMDB-51, FineGym, and MPII Cooking 2. The method identifies a problem termed "blind motion extraction": traditional differencing-based pipelines capture whatever changes between frames, including camera motion and background noise, without motion-guided selection. The motion prompt layer acts as an adapter between raw video and backbones such as TimeSformer, SlowFast, and X3D by highlighting motion-relevant regions while retaining appearance cues (Chen et al., 2024).

In synchronous motion captioning, controlled attention converts motion-text alignment from a by-product into a training objective. A single-layer encoder-decoder Transformer uses masked self-attention over motion frames, one-head masked cross-attention from words to frames, and monotonicity losses so that the generated sentence progresses with the motion stream. The same controlled cross-attention is then used as a soft temporal locator for segmentation, localization, and aligned sign-language transcription scenarios (Radouane et al., 2024).

In vehicular platooning, AIMformer addresses a distinct but structurally analogous problem: authenticated vehicles may inject falsified position, speed, or acceleration messages into a safety-critical cooperative control loop. Here attention models intra-vehicle temporal dynamics and inter-vehicle spatial correlations, including join and exit maneuvers where topology changes and the system is already vulnerable (Kalogiannis et al., 17 Dec 2025).

In masked motion diffusion, MMDM specializes attention to incomplete or low-confidence 3D motion data. Kinematic Attention Aggregation alternates structural attention over joints and temporal attention over pose summaries, making it practical to learn context-adaptive motion priors for motion completion, refinement, and in-betweening without changing the architecture across tasks (Jiang et al., 8 Mar 2026).
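The alternation between structural and temporal attention can be sketched as a four-step routine. This is a schematic illustration under several assumptions: attention is unparameterized (no learned projections), each pose is summarized by mean pooling over its joints, and the updated pose token is broadcast back by addition. None of these specifics is confirmed by the text above, and the function names are hypothetical.

```python
import math


def self_attention(tokens):
    """Unparameterized scaled dot-product self-attention over row vectors."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        out.append([sum(e / z * v[c] for e, v in zip(exps, tokens)) for c in range(d)])
    return out


def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kinematic_aggregation(motion):
    """motion[t][j] is the feature vector of joint j at frame t."""
    # 1) Structural attention over the joints within each pose.
    joints = [self_attention(pose) for pose in motion]
    # 2) Summarize each pose as one token (mean pooling is an assumption).
    pose_tokens = [mean_vector(pose) for pose in joints]
    # 3) Temporal attention over pose tokens only.
    pose_tokens = self_attention(pose_tokens)
    # 4) Broadcast each updated pose token back to its joints (additive, assumed).
    return [
        [[f + p for f, p in zip(joint, token)] for joint in pose]
        for pose, token in zip(joints, pose_tokens)
    ]


# Three frames, two joints, two features per joint, all identical.
motion = [[[1.0, 2.0], [1.0, 2.0]] for _ in range(3)]
updated = kinematic_aggregation(motion)
```

Because temporal attention runs over one token per pose instead of one per joint per frame, the temporal step handles far fewer tokens than full joint-by-frame attention would.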

5. Empirical findings across the literature

The human motion prediction line reports a large gap between frame-wise attention and motion attention. On the reported H3.6M comparison, Motion Attention outperforms Frame-wise Attention, supporting the claim that subsequences capture direction and dynamics more effectively than isolated poses (Mao et al., 2021). The later paper also reports a lower error than LTD-10-10 on Walking (Mao et al., 2021).

MoMo reports the best overall quality among the compared transfer baselines on MTB, evaluating both the MoMo Gen. and MoMo Inv. variants with FID and R-precision. The paper states that naive nearest-neighbor methods can match follower similarity but produce jittery, unnatural motion and worse FID, that MoST stays too close to the leader and struggles with unseen styles, and that MDM inpainting is constrained by fixed masks (Raab et al., 2024).

The motion-prompt layer shows consistent gains across multiple backbones and datasets. TimeSformer + VMPs improves Top-1 accuracy on all three HMDB-51 splits and on FineGym. On MPII Cooking 2, SlowFast, X3D, and TimeSformer all gain Top-1 accuracy, and TimeSformer also gains Top-5 accuracy (Chen et al., 2024).

Controlled synchronous captioning attains competitive caption quality while improving synchronization. On KIT-ML and HumanML3D it reports BLEU@1, BLEU@4, ROUGE-L, CIDEr, and BERTScore. In the masking sweep on HumanML3D, full context reduces IoU, IoP, and Element of relative to a masked setting, and the paper concludes that one masked setting is a good tradeoff between caption quality and synchronization (Radouane et al., 2024).

The platooning AIMformer reports detection performance and AUC across controllers and deployment modes, and compares F1-scores for controllers C1–C4 between a global-attention variant and an alternative variant. The same paper reports near-perfect precision across controllers for one setup, sub-millisecond inference after optimization, and average inference times for TFLite int8 individual models and TFLite global models on a Jetson Orin Nano Super Developer Kit (Kalogiannis et al., 17 Dec 2025).

MMDM reports strong results on motion reconstruction and in-betweening. On BABEL-TEACH in-betweening it outperforms RMIB, CMIB, MDM, and GMD on L2-P, L2-Q, and NPSS. The paper also reports best average PCP on Shelf and Campus and strong results on BUMocap and BUMocap-X (Jiang et al., 8 Mar 2026). In visual motion processing, the recurrent self-attention stage produces the best partial correlations with human judgments on Sintel among the compared methods, while the model as a whole reproduces the aperture problem, barber-pole illusion, and missing-fundamental reversal (Sun et al., 2023).

6. Limitations, misconceptions, and open directions

A recurrent misconception is that attention maps are automatically interpretable. Several papers explicitly dispute this. The synchronous captioning work reports that without masking, attention becomes noisy and misaligned, and that with multiple layers synchronization gets worse because receptive fields expand and the final representation becomes too mixed (Radouane et al., 2024). The video motion prompt work similarly shows that without temporal regularization the attention maps are noisy, especially in background regions (Chen et al., 2024).

A second misconception is that AIMformer necessarily means a Transformer that performs all motion reasoning internally. The literature shows at least three regimes: attention as a retrieval mechanism over motion clips (Mao et al., 2020), attention as an input adapter before the backbone (Chen et al., 2024), and attention as inference-time routing inside a frozen diffusion model (Raab et al., 2024). This suggests that the defining property is not a single architecture class but motion-aware control of information flow.

Task-specific limitations remain substantial. MoMo cannot invent structure that the follower fundamentally lacks; for example, if the leader walks but the follower is stationary, transfer may fail. Because the implementation uses DDIM, a given leader-follower pair always produces the same output, with no diversity (Raab et al., 2024). Video Motion Prompts still depend on frame differencing, so very noisy videos remain challenging, and datasets with strong camera motion may cause the attention maps to emphasize background context as well as action motion (Chen et al., 2024). MMDM notes that diffusion is computationally expensive and that different tasks may still require separate retraining or adaptation (Jiang et al., 8 Mar 2026).

In safety-critical deployment, limitations become operational rather than purely representational. The platooning AIMformer is evaluated in simulation, focuses on known attack patterns, and does not fully quantify the operational impact of false alarms on platoon behavior; mixed autonomy and heterogeneous fleets remain open problems (Kalogiannis et al., 17 Dec 2025). In motion prediction, the literature shows that different attention granularities behave differently across actions and metrics, with post-fusion of pose-, part-, and joint-level attention performing best in the multi-level model (Mao et al., 2021).

Taken together, these works indicate that Attention In Motion is a general methodological stance: motion should be attended to at the level at which it is semantically and dynamically coherent. In some cases that level is a historical motion subsequence, in others a latent diffusion feature, a frame-difference-derived prompt, a word-aligned motion segment, a globally offset vehicle timeline, or a pose token that mediates between joints and time. The unifying implication is that performance and interpretability improve when attention is shaped by motion structure itself rather than left to emerge incidentally from generic sequence processing.