
MotionPerceiver: Transformer Motion & Forecasting

Updated 29 January 2026
  • MotionPerceiver is a dual-system architecture featuring transformer-based models for both biological motion recognition and autonomous occupancy forecasting.
  • It employs patchwise optical flow extraction with competitive binding to generate invariant motion representations that generalize to out-of-distribution point-light stimuli.
  • The occupancy forecasting model uses a recursive latent state and streaming inference, ensuring bounded compute and real-time performance on embedded automotive hardware.

MotionPerceiver denotes two independent, state-of-the-art transformer-based neural architectures, each addressing a distinct class of motion perception problems: (1) robust recognition and generalization of biological motion from video stimuli using only patchwise optical flow inputs (Han et al., 2024), and (2) real-time probabilistic occupancy forecasting of road scenes for autonomous-vehicle planning, targeting embedded hardware efficiency (Ferenczi et al., 2023). Both systems leverage modular attention mechanisms and compact latent states, but differ fundamentally in their input modalities, representational objectives, and practical domains.

1. Models and Domains

MotionPerceiver for biological motion perception (often abbreviated "MP" in its original literature) is designed for action recognition from raw video. Its distinguishing capability is generalizing human action recognition to out-of-distribution point-light stimuli, matching human judgment across psychophysically defined conditions without explicit retraining on such data (Han et al., 2024).

MotionPerceiver for occupancy forecasting targets real-time autonomous driving scenarios. It produces dynamic occupancy probabilities for any queried scene location, using a streaming, recursive latent-state approach. This system is engineered for bounded memory and FLOPs per step, enabling deployment on edge hardware such as Nvidia Xavier AGX (Ferenczi et al., 2023).

2. Architectural Principles

Biological Motion Perceiver

  • Patchwise Optical Flow Extraction: Each video of $T$ frames is featurized by a frozen ViT (DINO). Patch features $F_t \in \mathbb{R}^{N\times C}$ are extracted, where $N$ is the number of patches and $C$ the number of channels. Patch coordinates $G \in \mathbb{R}^{N\times 2}$ provide spatial locations.
  • Affinity-Based Patchwise Flow: Optical flow between frames $i \rightarrow j$ is computed via affinity matrices $Q_{(F_i,F_j)}$, projecting patch positions and resulting in $\hat{O} \in \mathbb{R}^{T\times N\times 2\times (T-1)}$.
  • Competitive Binding: Six learnable "flow snapshot neurons" (slots) $Z \in \mathbb{R}^{K\times D}$ (with $K=6$, $D=2\cdot(T-1)$) identify prototypical flow patterns via slot attention with contrastive diversity constraints.
  • Invariant Motion Features: A parallel branch aggregates and encodes directionally pooled flows into a motion-invariant representation using self-attention.
  • Multi-scale Fusion: Flows are computed at multiple temporal strides ($s\in\{1,2,4,8\}$), concatenated, passed through slot fusion and temporal self-attention, and pooled for downstream action classification.
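The affinity-based flow step above can be sketched in a few lines: an affinity matrix between the patch features of two frames is row-softmaxed and used to project patch coordinates, so each patch's flow is its expected target position minus its own. This is a minimal numpy sketch, not the paper's implementation; the temperature `tau` and cosine normalization are assumptions.

```python
import numpy as np

def patchwise_flow(F_i, F_j, G, tau=0.1):
    """Estimate per-patch flow from frame i to frame j.

    F_i, F_j: (N, C) patch features from a frozen ViT.
    G:        (N, 2) patch grid coordinates.
    Returns:  (N, 2) displacement of each patch.
    """
    # Affinity between every source and target patch (cosine similarity).
    Fi = F_i / np.linalg.norm(F_i, axis=1, keepdims=True)
    Fj = F_j / np.linalg.norm(F_j, axis=1, keepdims=True)
    logits = Fi @ Fj.T / tau                        # (N, N)
    A = np.exp(logits - logits.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)               # softmax over target patches
    # Expected target position minus source position = patchwise flow.
    return A @ G - G

rng = np.random.default_rng(0)
N, C = 16, 64
F_i = rng.normal(size=(N, C))
G = rng.normal(size=(N, 2))
# Identical features: affinity peaks on the same patch, so flow is near zero.
flow = patchwise_flow(F_i, F_i.copy(), G, tau=0.01)
print(np.abs(flow).max())
```

Stacking such flows over frame pairs at each temporal stride yields the $\hat{O}$ tensor consumed by the slot-binding stage.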

Occupancy Forecasting MotionPerceiver

  • Data Tokenization: At each sensor timestep $t$, inputs are tokenized into agent tokens (pose, velocity, dimensions), signal tokens (traffic signals as position + state), and road-graph tokens (rasterized lane splines passed through a CNN).
  • Recursive Latent State: The global scene state $\mathbf{S}_t \in \mathbb{R}^{N_L \times C_L}$ is updated recursively via (i) time propagation (multi-layer transformer self-attention, $\mathcal{F}$) and (ii) observation correction (cross-attention, $\mathcal{U}$) integrating new agent, signal, and road tokens.
  • Occupancy Querying: At any $t$, the occupancy probability at arbitrary points can be queried locally via cross-attention from learned sinusoidal embeddings of positions into $\mathbf{S}_{t|t}$, followed by an MLP.
  • Streaming Inference: Input history is not re-encoded; the constant-size recurrent latent state guarantees bounded per-step compute and memory.
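The propagate/observe/query loop above can be sketched with plain dot-product attention. This is an illustrative numpy skeleton under toy dimensions, not the published model: the real $\mathcal{F}$ and $\mathcal{U}$ are learned multi-layer transformer blocks and the read-out is an MLP, both replaced here by parameter-free stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    # Scaled dot-product attention.
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return A @ V

class StreamingScene:
    """Constant-size latent state, updated once per sensor timestep."""
    def __init__(self, n_latents=128, dim=256, seed=0):
        rng = np.random.default_rng(seed)
        self.S = rng.normal(size=(n_latents, dim)) * 0.02

    def propagate(self):
        # Time propagation F: residual latent self-attention.
        self.S = self.S + attend(self.S, self.S, self.S)

    def observe(self, tokens):
        # Observation correction U: latents cross-attend to new tokens.
        self.S = self.S + attend(self.S, tokens, tokens)

    def query(self, pos_emb):
        # Occupancy query: position embeddings cross-attend into the state;
        # a mean read-out stands in for the learned MLP head.
        return attend(pos_emb, self.S, self.S).mean(axis=-1)

rng = np.random.default_rng(1)
scene = StreamingScene(n_latents=8, dim=16)
for t in range(5):                           # history grows, state does not
    scene.propagate()
    scene.observe(rng.normal(size=(3, 16)))  # agent/signal/road tokens
logits = scene.query(rng.normal(size=(4, 16)))
print(scene.S.shape, logits.shape)
```

The key property the sketch preserves is that the state shape never changes, so per-step compute is independent of how many timesteps have been observed.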

3. Learning Objectives and Training

Biological Motion Perceiver

MP is trained with the following objectives:

  • Slot Diversity (Contrastive Walk Loss): $L_{slot} = \mathrm{CE}(Q_{(Z,\hat{O})} Q_{(\hat{O},Z)}, I_K)$, promoting diverse prototypical flow representations.
  • Cross-Entropy Losses: Action label prediction is supervised by flow-based, invariant, and fused cross-entropy losses; the total loss is $L = \alpha L_{slot} + L_{flow} + L_{invar} + L_{fuse}$ with $\alpha=10$.
  • Ablations: Removing the motion-invariant or flow-snapshot neurons reduces generalization to point-light displays (e.g., accuracy on the "J-6P" (joint-6-point) condition drops from $69.0\%$ to $42.5\%$).

Occupancy Forecasting MotionPerceiver

  • Focal Loss: To address imbalanced occupancy targets, the per-pixel focal loss $\mathcal{L}_f$ is employed with $\alpha=0.75$, $\gamma=2$.
  • Gradient Detachment: Gradients are detached between timesteps to maintain the Markov property.
  • Test-Time Calibration: Logit scaling ($\beta=2$ for negative logits) can be applied for improved uncertainty calibration.
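The two loss-side ingredients above fit in a few lines. This is a sketch, not the paper's code: the focal loss follows the standard binary formulation with the stated $\alpha=0.75$, $\gamma=2$, and the exact form of the $\beta=2$ negative-logit scaling (shown here as division) is an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def focal_loss(logits, targets, alpha=0.75, gamma=2.0):
    """Per-pixel binary focal loss: alpha up-weights the rare occupied
    class, gamma down-weights already-easy pixels."""
    p = sigmoid(logits)
    p_t = np.where(targets == 1, p, 1 - p)          # prob of the true class
    alpha_t = np.where(targets == 1, alpha, 1 - alpha)
    return -np.mean(alpha_t * (1 - p_t) ** gamma * np.log(p_t + 1e-9))

def calibrate(logits, beta=2.0):
    """Test-time tempering of negative ('unoccupied') logits."""
    return np.where(logits < 0, logits / beta, logits)

logits = np.array([4.0, -6.0, -1.0, 2.5])
targets = np.array([1.0, 0.0, 1.0, 0.0])
loss = focal_loss(logits, targets)
print(loss, calibrate(logits))
```

Note how the two confidently correct pixels (logits 4.0 and -6.0) contribute almost nothing to the loss; the hard positive at logit -1.0 and the false positive at 2.5 dominate it.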

4. Evaluation Protocols and Performance

Biological Motion Perceiver

  • Benchmark: 62,656 videos, 10 actions, 24 BMP (biological motion perception) conditions manipulating temporal order, resolution, dot count ("joints"), dot lifetime, and viewpoint.
  • Top-1 Accuracy: On the most information-limited J-6P condition (6 joints), MP achieves $69.0\%$ versus the best prior model (MViT) at $53.0\%$; the improvement on SP-8P-1LT is $+29.4\%$.
  • Generalization: MP attains high performance on standard CV joint-based datasets (NTU RGB+D60-Joint $20.38\%$ vs $\sim$6–7\%; NW-UCLA-Joint $42.83\%$ vs $\sim$10\%) without BMP-task retraining.

Occupancy Forecasting MotionPerceiver

  • Waymo Open Motion Dataset (WOMD): Mean soft IoU over the 8 s horizon is $0.523$, outperforming VectorFlow ($0.488$), STrajNet ($0.491$), HOPE ($0.235$), and LookAround ($0.234$) with only $10.7$M parameters.
  • AUC: ROC-AUC over the 8 s horizon is $0.770$.
  • Efficiency: All modules benchmarked on Nvidia Xavier AGX yield $>30$ Hz streaming inference, with constant time/space complexity per timestep.
| Model | Params | Soft IoU | AUC |
|---|---|---|---|
| MotionPerceiver | 10.7M | 0.523 | 0.770 |
| VectorFlow | 17.1M | 0.488 | N/A |
| STrajNet | 14.5M | 0.491 | N/A |
| HOPE | 81M | 0.235 | N/A |
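The soft IoU metric used in these comparisons replaces hard thresholding with products and sums of the predicted probabilities, so partial confidence is credited proportionally. A minimal sketch of the standard formulation (the benchmark's exact aggregation across timesteps may differ):

```python
import numpy as np

def soft_iou(pred, target, eps=1e-9):
    """Soft IoU between a predicted occupancy-probability grid and a
    binary target grid: intersection = sum(p * t),
    union = sum(p) + sum(t) - intersection."""
    inter = np.sum(pred * target)
    union = np.sum(pred) + np.sum(target) - inter
    return inter / (union + eps)

pred = np.array([[0.9, 0.1],
                 [0.8, 0.0]])
target = np.array([[1.0, 0.0],
                   [1.0, 0.0]])
print(soft_iou(pred, target))
```

A perfectly confident, perfectly placed prediction scores 1.0; hedged probabilities on occupied cells and any mass on empty cells both pull the score down.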

5. Psychophysics and Human Alignment

Biological Motion Perceiver was evaluated in extensive psychophysical regimes:

  • Human Top-1 Accuracy: $86\%$ on J-6P, nearly $98\%$ on RGB videos.
  • Behavioral Concordance: Model and humans share error profiles: greater sensitivity to frame reversal than to frame shuffle, robust performance down to 5-joint displays, and minimal dot-lifetime dependency.
  • Statistical Alignment: MP vs. human accuracy Pearson $r=0.85$ across manipulation conditions (next-best model $r=0.60$); slope $\approx 1$ in accuracy-per-condition plots.
  • Ablations: Removal of core modules sharply degrades human-like performance and generalization.
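The statistical-alignment numbers above come from correlating per-condition accuracies. The computation itself is simple; the accuracy vectors below are illustrative placeholders only, not the paper's data:

```python
import numpy as np

# Hypothetical per-condition top-1 accuracies (illustrative values,
# NOT the published measurements): model vs. human.
model = np.array([0.76, 0.80, 0.86, 0.90, 0.94])
human = np.array([0.80, 0.86, 0.90, 0.96, 0.98])

r = np.corrcoef(model, human)[0, 1]          # Pearson correlation
slope = np.polyfit(model, human, 1)[0]       # accuracy-per-condition slope
print(round(r, 3), round(slope, 3))
```

High $r$ with slope near 1 means the model does not merely rank conditions like humans do; it degrades by roughly the same amount per manipulation.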

6. Embedded and Computational Considerations

Streaming architecture in occupancy forecasting allows:

  • Constant Compute: Both per-timestep memory and runtime remain constant regardless of history length and agent count.
  • Edge Device Efficiency: Profiling on Xavier AGX yields:
    • StateInit: $1.202$ ms
    • Propagate: $0.905$ ms
    • Agent/Signal Obs: $0.379$ ms / $0.460$ ms
    • RoadContext: $0.395$ ms
    • OccupancyQuery ($200 \times 200$): $9.260$ ms
    • Total per-step latency for 8 s forecast: $\sim 10.4$ ms ($\sim 96$ Hz)
  • Model Size: Latent state of $128 \times 256$ floats ($128$ kB); weights $42$ MB in FP16.
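The latent-state footprint quoted above is simple arithmetic to verify, assuming 4-byte float32 storage:

```python
# 128 latents x 256 channels x 4 bytes (float32) = 131,072 bytes = 128 kB.
n_latents, dim, bytes_per_float = 128, 256, 4
state_bytes = n_latents * dim * bytes_per_float
print(state_bytes, state_bytes // 1024)
```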

7. Significance and Distinguishing Features

MotionPerceiver (biological motion) is the first model to match human error patterning and top-level generalization across diverse, psychophysically relevant manipulations of point-light action stimuli, enabled by extracting flow-based invariant motion representations and learned prototypical "snapshot" neurons (Han et al., 2024).

MotionPerceiver (occupancy forecasting) demonstrates a general, efficient, streaming paradigm for scene-centric motion prediction, contrasting with fixed-history, agent-centric approaches. The system's local query-based emission and recursive transformer update afford strict resource bounds and real-time feasibility on embedded automotive platforms. Its architecture achieves state-of-the-art soft IoU accuracy on WOMD with reduced parameter counts (Ferenczi et al., 2023).
