MotionPerceiver: Transformer Motion & Forecasting
- MotionPerceiver refers to two independent transformer-based architectures: one for biological motion recognition and one for autonomous occupancy forecasting.
- It employs patchwise optical flow extraction with competitive binding to generate invariant motion representations that generalize to out-of-distribution point-light stimuli.
- The occupancy forecasting model uses a recursive latent state and streaming inference, ensuring bounded compute and real-time performance on embedded automotive hardware.
MotionPerceiver denotes two independent, state-of-the-art transformer-based neural architectures, each addressing a distinct class of motion perception problems: (1) robust recognition and generalization of biological motion from video stimuli using only patchwise optical flow inputs (Han et al., 2024), and (2) real-time probabilistic occupancy forecasting of road scenes for autonomous-vehicle planning, targeting embedded hardware efficiency (Ferenczi et al., 2023). Both systems leverage modular attention mechanisms and compact latent states, but differ fundamentally in their input modalities, representational objectives, and practical domains.
1. Models and Domains
MotionPerceiver for biological motion perception (often abbreviated "MP" in its original literature) is designed for action recognition from raw video. Its distinguishing capability is generalizing human action recognition to out-of-distribution point-light stimuli, matching human judgment across psychophysically defined conditions without explicit retraining on such data (Han et al., 2024).
MotionPerceiver for occupancy forecasting targets real-time autonomous driving scenarios. It produces dynamic occupancy probabilities for any queried scene location, using a streaming, recursive latent-state approach. This system is engineered for bounded memory and FLOPs per step, enabling deployment on edge hardware such as Nvidia Xavier AGX (Ferenczi et al., 2023).
2. Architectural Principles
Biological Motion Perceiver
- Patchwise Optical Flow Extraction: Each video of $T$ frames is featurized by a frozen ViT (DINO). Patch features $F_t \in \mathbb{R}^{N \times C}$ are extracted, where $N$ is the number of patches and $C$ the channel dimension. Patch coordinates provide spatial locations.
- Affinity-Based Patchwise Flow: Optical flow between frames is computed via affinity matrices $A \in \mathbb{R}^{N \times N}$ over patch features, which softly project patch positions forward, resulting in per-patch flow vectors $v_t \in \mathbb{R}^{N \times 2}$.
- Competitive Binding: Six learnable "flow snapshot neurons" (slots) identify prototypical flow patterns via slot attention with contrastive diversity constraints.
- Invariant Motion Features: A parallel branch aggregates and encodes directionally pooled flows into a motion-invariant representation using self-attention.
- Multi-scale Fusion: Flows are computed at multiple temporal strides, concatenated, passed through slot fusion and temporal self-attention, and pooled for downstream action classification.
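The affinity-based flow step above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: `patchwise_flow` and its scaled-dot-product affinity are assumptions standing in for the model's actual matching operation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def patchwise_flow(feats_t, feats_t1, coords):
    """Affinity-based patch flow between two frames (illustrative sketch).

    feats_t, feats_t1: (N, C) frozen-ViT patch features for frames t, t+1.
    coords: (N, 2) patch-centre coordinates on the image grid.
    Returns (N, 2) flow vectors: soft-matched positions minus current positions.
    """
    # Affinity of each patch in frame t to every patch in frame t+1,
    # normalised into a soft correspondence matrix.
    affinity = softmax(feats_t @ feats_t1.T / np.sqrt(feats_t.shape[1]), axis=1)
    matched = affinity @ coords   # expected position of each patch in frame t+1
    return matched - coords       # per-patch displacement (flow)

# Toy usage: with highly distinctive, unchanged features, every patch matches
# itself, so the resulting flow is (numerically) zero.
feats = 100.0 * np.eye(16)
coords = np.stack(np.meshgrid(np.arange(4), np.arange(4)), -1).reshape(16, 2).astype(float)
flow = patchwise_flow(feats, feats, coords)
```

The soft correspondence keeps the operation differentiable, which is what lets flow extraction sit inside an end-to-end trained network.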
Occupancy Forecasting MotionPerceiver
- Data Tokenization: At each sensor timestep $t$, inputs are tokenized into agent tokens (pose, velocity, dimensions), signal tokens (traffic signals as position+state), and road-graph tokens (rasterized lane splines passed through a CNN).
- Recursive Latent State: The global scene state $s_t$ is updated recursively via (i) time propagation (multi-layer transformer self-attention, $s_t^- = \mathrm{Prop}(s_{t-1})$), and (ii) observation correction (cross-attention, $s_t = \mathrm{Obs}(s_t^-, o_t)$) integrating new agent, signal, and road tokens $o_t$.
- Occupancy Querying: At any timestep $t$, occupancy probability at arbitrary spatial points can be queried locally via cross-attention from learned sinusoidal embeddings of the query positions into the latent scene state, followed by an MLP.
- Streaming Inference: Input history is not re-encoded; the constant-size recurrent latent state guarantees bounded per-step compute and memory.
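The propagate/observe/query loop above can be sketched as a minimal streaming class. This is a structural illustration only: the class name, the linear maps standing in for the transformer self- and cross-attention blocks, and the dot-product readout replacing the MLP are all assumptions; only the fixed-size recurrent state and the per-step loop mirror the described architecture.

```python
import numpy as np

class StreamingScenePerceiver:
    """Sketch of a recursive latent-state scene model (hypothetical stand-in)."""

    def __init__(self, n_latents=64, dim=32, seed=0):
        rng = np.random.default_rng(seed)
        self.state = rng.normal(size=(n_latents, dim)) * 0.01  # fixed-size latent
        self.W_prop = rng.normal(size=(dim, dim)) * 0.01       # stand-in: self-attn
        self.W_obs = rng.normal(size=(dim, dim)) * 0.01        # stand-in: cross-attn

    def propagate(self):
        # Time propagation: latent-to-latent update (self-attention in the paper).
        self.state = self.state + np.tanh(self.state @ self.W_prop)

    def observe(self, tokens):
        # Observation correction: fold agent/signal/road tokens into the state
        # (cross-attention in the paper); state size never grows with input.
        attn = self._softmax(self.state @ tokens.T)
        self.state = self.state + np.tanh(attn @ tokens @ self.W_obs)

    def query(self, points):
        # Occupancy query: sinusoidal position embeddings cross-attend into the
        # state; a dot-product readout stands in for the paper's MLP head.
        emb = np.concatenate([np.sin(points), np.cos(points)], axis=-1)
        emb = np.pad(emb, ((0, 0), (0, self.state.shape[1] - emb.shape[1])))
        attn = self._softmax(emb @ self.state.T)
        logits = (attn @ self.state).sum(axis=-1)
        return 1.0 / (1.0 + np.exp(-logits))   # occupancy probabilities

    @staticmethod
    def _softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

# Streaming usage: one propagate + observe per timestep; queries at any point.
model = StreamingScenePerceiver()
rng = np.random.default_rng(1)
for t in range(5):
    model.propagate()
    model.observe(rng.normal(size=(10, 32)))   # 10 observation tokens this step
probs = model.query(rng.uniform(size=(4, 2)))  # occupancy at 4 query points
```

Note that the state shape stays `(64, 32)` no matter how many steps or tokens are consumed, which is exactly the property that bounds per-step compute.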
3. Learning Objectives and Training
Biological Motion Perceiver
MP is trained with the following objectives:
- Slot Diversity (Contrastive Walk Loss): A contrastive random-walk objective over the slots promotes diverse prototypical flow representations.
- Cross-Entropy Losses: Action-label prediction is supervised by flow-based, invariant, and fused cross-entropy losses; the total loss is a weighted sum of these terms and the slot-diversity loss.
- Ablations: Removing the motion-invariant branch or the flow-snapshot neurons substantially reduces generalization to point-light displays, most visibly on the information-limited "J-6P" (joint-6-point) condition.
Occupancy Forecasting MotionPerceiver
- Focal Loss: To address heavily imbalanced occupancy targets, a per-pixel focal loss is employed, down-weighting easy negatives.
- Gradient Detachment: Gradients are detached between timesteps to maintain the Markov property.
- Test-Time Calibration: Scaling of negative logits can be applied at test time for improved uncertainty calibration.
4. Evaluation Protocols and Performance
Biological Motion Perceiver
- Benchmark: 62,656 videos, 10 actions, 24 BMP (biological motion perception) conditions manipulating temporal order, resolution, dot count ("joints"), dot lifetime, and viewpoint.
- Top-1 Accuracy: On the most information-limited J-6P condition (6 joints), MP outperforms the best prior model (MViT), with a further margin on the SP-8P-1LT condition.
- Generalization: MP attains high performance on standard CV joint-based datasets (NTU RGB+D60-Joint and NW-UCLA-Joint, where comparison video models reach only 6–7% and 10%, respectively) without BMP-task retraining.
Occupancy Forecasting MotionPerceiver
- Waymo Open Motion Dataset (WOMD): Soft IoU mean over 8 s is $0.523$, outperforming VectorFlow ($0.488$), STrajNet ($0.491$), HOPE ($0.235$), LookAround ($0.234$) with only $10.7$M parameters.
- AUC: ROC-AUC over 8s horizon is $0.770$.
- Efficiency: All modules benchmarked on Nvidia Xavier AGX support streaming inference at real-time rates, with constant time and space complexity per timestep.
| Model | Params | Soft IoU | AUC |
|---|---|---|---|
| MotionPerceiver | 10.7M | 0.523 | 0.770 |
| VectorFlow | 17.1M | 0.488 | N/A |
| STrajNet | 14.5M | 0.491 | N/A |
| HOPE | 81M | 0.235 | N/A |
5. Psychophysics and Human Alignment
Biological Motion Perceiver was evaluated in extensive psychophysical regimes:
- Human Top-1 Accuracy: Human observers were tested under the same conditions, from heavily degraded J-6P displays up to full RGB videos, providing per-condition reference accuracies.
- Behavioral Concordance: Model and humans share error profiles: greater sensitivity to frame reversal over shuffle, robust performance down to 5-joint displays, and minimal dot lifetime dependency.
- Statistical Alignment: MP's per-condition accuracy shows the highest Pearson correlation with human accuracy across manipulation conditions (exceeding the next-best model), with a near-linear relationship in accuracy-per-condition plots.
- Ablations: Removal of core modules sharply degrades human-like performance and generalization.
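The alignment statistics above reduce to a correlation and a regression slope over per-condition accuracies. The sketch below uses hypothetical numbers (the real values are in Han et al., 2024) purely to show the computation.

```python
import numpy as np

# Hypothetical top-1 accuracies, one entry per BMP manipulation condition.
human = np.array([0.95, 0.88, 0.80, 0.72, 0.60, 0.55, 0.46])
model = np.array([0.93, 0.85, 0.78, 0.70, 0.63, 0.52, 0.44])

r = np.corrcoef(human, model)[0, 1]             # Pearson correlation
slope, intercept = np.polyfit(human, model, 1)  # accuracy-per-condition slope
```

A correlation near 1 with a slope near 1 means the model does not merely rank conditions like humans do, but degrades by roughly the same amount as each manipulation is applied.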
6. Embedded and Computational Considerations
The streaming architecture of the occupancy-forecasting model allows:
- Constant Compute: Both per-timestep memory and runtime scale independently of history length and agent count.
- Edge Device Efficiency: Profiling on Xavier AGX yields:
- StateInit: $1.202$ms
- Propagate: $0.905$ms
- Agent/Signal Obs: $0.379$ms/$0.460$ms
- RoadContext: $0.395$ms
  - OccupancyQuery: $9.260$ms
  - Total per-step latency for an 8 s forecast: the listed modules sum to roughly $12.6$ms, i.e. on the order of $80$Hz, dominated by the occupancy query
- Model Size: The latent state occupies $128$kB; model weights total $42$MB in FP16.
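The per-step latency budget implied by the profile above can be totalled directly. Treating StateInit as a once-per-sequence cost (an assumption; the profile does not say how often it runs) gives a rough steady-state rate:

```python
# Profiled module latencies on Xavier AGX, in milliseconds (from the text).
modules = {
    "StateInit": 1.202,       # assumed amortised: once per sequence, not per step
    "Propagate": 0.905,
    "AgentObs": 0.379,
    "SignalObs": 0.460,
    "RoadContext": 0.395,
    "OccupancyQuery": 9.260,
}

# Steady-state step = everything except the (assumed) one-off initialisation.
per_step_ms = sum(v for k, v in modules.items() if k != "StateInit")
rate_hz = 1000.0 / per_step_ms
```

The occupancy query dominates the budget, so applications that query fewer points (or at lower resolution) would stream proportionally faster.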
7. Significance and Distinguishing Features
MotionPerceiver (biological motion) is the first model to match human error patterning and top-level generalization across diverse, psychophysically relevant manipulations of point-light action stimuli, enabled by extracting flow-based invariant motion representations and learned prototypical "snapshot" neurons (Han et al., 2024).
MotionPerceiver (occupancy forecasting) demonstrates a general, efficient, streaming paradigm for scene-centric motion prediction, contrasting with fixed-history, agent-centric approaches. The system's local query-based emission and recursive transformer update afford strict resource bounds and real-time feasibility on embedded automotive platforms. Its architecture achieves state-of-the-art soft IoU accuracy on WOMD with reduced parameter counts (Ferenczi et al., 2023).