MotionPerceiver: Transformer Motion & Forecasting
- MotionPerceiver refers to two independent transformer-based architectures: one for biological motion recognition and one for autonomous occupancy forecasting.
- It employs patchwise optical flow extraction with competitive binding to generate invariant motion representations that generalize to out-of-distribution point-light stimuli.
- The occupancy forecasting model uses a recursive latent state and streaming inference, ensuring bounded compute and real-time performance on embedded automotive hardware.
MotionPerceiver denotes two independent, state-of-the-art transformer-based neural architectures, each addressing a distinct class of motion perception problems: (1) robust recognition and generalization of biological motion from video stimuli using only patchwise optical flow inputs (Han et al., 2024), and (2) real-time probabilistic occupancy forecasting of road scenes for autonomous-vehicle planning, targeting embedded hardware efficiency (Ferenczi et al., 2023). Both systems leverage modular attention mechanisms and compact latent states, but differ fundamentally in their input modalities, representational objectives, and practical domains.
1. Models and Domains
MotionPerceiver for biological motion perception (often abbreviated "MP" in its original literature) is designed for action recognition from raw video. Its distinguishing capability is generalizing human action recognition to out-of-distribution point-light stimuli, matching human judgment across psychophysically defined conditions without explicit retraining on such data (Han et al., 2024).
MotionPerceiver for occupancy forecasting targets real-time autonomous driving scenarios. It produces dynamic occupancy probabilities for any queried scene location, using a streaming, recursive latent-state approach. This system is engineered for bounded memory and FLOPs per step, enabling deployment on edge hardware such as Nvidia Xavier AGX (Ferenczi et al., 2023).
2. Architectural Principles
Biological Motion Perceiver
- Patchwise Optical Flow Extraction: Each video of $T$ frames is featurized by a frozen ViT (DINO). Patch features $F_t \in \mathbb{R}^{N \times C}$ are extracted, where $N$ is the number of patches and $C$ the channel dimension. Patch coordinates provide spatial locations.
- Affinity-Based Patchwise Flow: Optical flow between frames is computed via affinity matrices $A \in \mathbb{R}^{N \times N}$ over patch features, which softly project patch positions forward, resulting in per-patch flow vectors $v_t \in \mathbb{R}^{N \times 2}$.
- Competitive Binding: Six learnable "flow snapshot neurons" (slots) identify prototypical flow patterns via slot attention with contrastive diversity constraints.
- Invariant Motion Features: A parallel branch aggregates and encodes directionally pooled flows into a motion-invariant representation using self-attention.
- Multi-scale Fusion: Flows are computed at multiple temporal strides, concatenated, passed through slot fusion and temporal self-attention, and pooled for downstream action classification.
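The affinity-based flow step above can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: `patchwise_flow` and its scaled-dot-product affinity are assumptions standing in for the model's actual matching operation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def patchwise_flow(feats_t, feats_t1, coords):
    """Affinity-based patch flow between two frames (illustrative sketch).

    feats_t, feats_t1: (N, C) frozen-ViT patch features for frames t, t+1.
    coords: (N, 2) patch-centre coordinates on the image grid.
    Returns (N, 2) flow vectors: soft-matched positions minus current positions.
    """
    # Affinity of each patch in frame t to every patch in frame t+1,
    # normalised into a soft correspondence matrix.
    affinity = softmax(feats_t @ feats_t1.T / np.sqrt(feats_t.shape[1]), axis=1)
    matched = affinity @ coords   # expected position of each patch in frame t+1
    return matched - coords       # per-patch displacement (flow)

# Toy usage: with highly distinctive, unchanged features, every patch matches
# itself, so the resulting flow is (numerically) zero.
feats = 100.0 * np.eye(16)
coords = np.stack(np.meshgrid(np.arange(4), np.arange(4)), -1).reshape(16, 2).astype(float)
flow = patchwise_flow(feats, feats, coords)
```

The soft correspondence keeps the operation differentiable, which is what lets flow extraction sit inside an end-to-end trained network.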
Occupancy Forecasting MotionPerceiver
- Data Tokenization: At each sensor timestep $t$, inputs are tokenized into agent tokens (pose, velocity, dimensions), signal tokens (traffic signals as position+state), and road-graph tokens (rasterized lane splines passed through a CNN).
- Recursive Latent State: The global scene state $s_t$ is updated recursively via (i) time propagation (multi-layer transformer self-attention, $s_t^- = \mathrm{Prop}(s_{t-1})$), and (ii) observation correction (cross-attention, $s_t = \mathrm{Obs}(s_t^-, o_t)$) integrating new agent, signal, and road tokens $o_t$.
- Occupancy Querying: At any timestep $t$, occupancy probability at arbitrary spatial points can be queried locally via cross-attention from learned sinusoidal embeddings of the query positions into the latent scene state, followed by an MLP.
- Streaming Inference: Input history is not re-encoded; the constant-size recurrent latent state guarantees bounded per-step compute and memory.
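The propagate/observe/query loop above can be sketched as a minimal streaming class. This is a structural illustration only: the class name, the linear maps standing in for the transformer self- and cross-attention blocks, and the dot-product readout replacing the MLP are all assumptions; only the fixed-size recurrent state and the per-step loop mirror the described architecture.

```python
import numpy as np

class StreamingScenePerceiver:
    """Sketch of a recursive latent-state scene model (hypothetical stand-in)."""

    def __init__(self, n_latents=64, dim=32, seed=0):
        rng = np.random.default_rng(seed)
        self.state = rng.normal(size=(n_latents, dim)) * 0.01  # fixed-size latent
        self.W_prop = rng.normal(size=(dim, dim)) * 0.01       # stand-in: self-attn
        self.W_obs = rng.normal(size=(dim, dim)) * 0.01        # stand-in: cross-attn

    def propagate(self):
        # Time propagation: latent-to-latent update (self-attention in the paper).
        self.state = self.state + np.tanh(self.state @ self.W_prop)

    def observe(self, tokens):
        # Observation correction: fold agent/signal/road tokens into the state
        # (cross-attention in the paper); state size never grows with input.
        attn = self._softmax(self.state @ tokens.T)
        self.state = self.state + np.tanh(attn @ tokens @ self.W_obs)

    def query(self, points):
        # Occupancy query: sinusoidal position embeddings cross-attend into the
        # state; a dot-product readout stands in for the paper's MLP head.
        emb = np.concatenate([np.sin(points), np.cos(points)], axis=-1)
        emb = np.pad(emb, ((0, 0), (0, self.state.shape[1] - emb.shape[1])))
        attn = self._softmax(emb @ self.state.T)
        logits = (attn @ self.state).sum(axis=-1)
        return 1.0 / (1.0 + np.exp(-logits))   # occupancy probabilities

    @staticmethod
    def _softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

# Streaming usage: one propagate + observe per timestep; queries at any point.
model = StreamingScenePerceiver()
rng = np.random.default_rng(1)
for t in range(5):
    model.propagate()
    model.observe(rng.normal(size=(10, 32)))   # 10 observation tokens this step
probs = model.query(rng.uniform(size=(4, 2)))  # occupancy at 4 query points
```

Note that the state shape stays `(64, 32)` no matter how many steps or tokens are consumed, which is exactly the property that bounds per-step compute.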
3. Learning Objectives and Training
Biological Motion Perceiver
MP is trained with the following objectives:
- Slot Diversity (Contrastive Walk Loss): A contrastive random-walk objective over the slots promotes diverse prototypical flow representations.
- Cross-Entropy Losses: Action-label prediction is supervised by flow-based, invariant, and fused cross-entropy losses; the total loss is a weighted sum of these terms and the slot-diversity loss.
- Ablations: Removing the motion-invariant branch or the flow-snapshot neurons substantially reduces generalization to point-light displays, most visibly on the information-limited "J-6P" (joint-6-point) condition.
Occupancy Forecasting MotionPerceiver
- Focal Loss: To address heavily imbalanced occupancy targets, a per-pixel focal loss is employed, down-weighting easy negatives.
- Gradient Detachment: Gradients are detached between timesteps to maintain the Markov property.
- Test-Time Calibration: Scaling of negative logits can be applied at test time for improved uncertainty calibration.
4. Evaluation Protocols and Performance
Biological Motion Perceiver
- Benchmark: 62,656 videos, 10 actions, 24 BMP (biological motion perception) conditions manipulating temporal order, resolution, dot count ("joints"), dot lifetime, and viewpoint.
- Top-1 Accuracy: On the most information-limited J-6P condition (6 joints), MP outperforms the best prior model (MViT), with a further margin on the SP-8P-1LT condition.
- Generalization: MP attains high performance on standard CV joint-based datasets (NTU RGB+D60-Joint and NW-UCLA-Joint, where comparison video models reach only 6–7% and 10%, respectively) without BMP-task retraining.
Occupancy Forecasting MotionPerceiver
- Waymo Open Motion Dataset (WOMD): Soft IoU mean over 8 s is $0.523$, outperforming VectorFlow ($0.488$), STrajNet ($0.491$), HOPE ($0.235$), LookAround ($0.234$) with only $10.7$M parameters.
- AUC: ROC-AUC over 8s horizon is $0.770$.
- Efficiency: All modules benchmarked on Nvidia Xavier AGX support streaming inference at real-time rates, with constant time and space complexity per timestep.
| Model | Params | Soft IoU | AUC |
|---|---|---|---|
| MotionPerceiver | 10.7M | 0.523 | 0.770 |
| VectorFlow | 17.1M | 0.488 | N/A |
| STrajNet | 14.5M | 0.491 | N/A |
| HOPE | 81M | 0.235 | N/A |
5. Psychophysics and Human Alignment
Biological Motion Perceiver was evaluated in extensive psychophysical regimes:
- Human Top-1 Accuracy: Human observers were tested under the same conditions, from heavily degraded J-6P displays up to full RGB videos, providing per-condition reference accuracies.
- Behavioral Concordance: Model and humans share error profiles: greater sensitivity to frame reversal over shuffle, robust performance down to 5-joint displays, and minimal dot lifetime dependency.
- Statistical Alignment: MP's per-condition accuracy shows the highest Pearson correlation with human accuracy across manipulation conditions (exceeding the next-best model), with a near-linear relationship in accuracy-per-condition plots.
- Ablations: Removal of core modules sharply degrades human-like performance and generalization.
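The alignment statistics above reduce to a correlation and a regression slope over per-condition accuracies. The sketch below uses hypothetical numbers (the real values are in Han et al., 2024) purely to show the computation.

```python
import numpy as np

# Hypothetical top-1 accuracies, one entry per BMP manipulation condition.
human = np.array([0.95, 0.88, 0.80, 0.72, 0.60, 0.55, 0.46])
model = np.array([0.93, 0.85, 0.78, 0.70, 0.63, 0.52, 0.44])

r = np.corrcoef(human, model)[0, 1]             # Pearson correlation
slope, intercept = np.polyfit(human, model, 1)  # accuracy-per-condition slope
```

A correlation near 1 with a slope near 1 means the model does not merely rank conditions like humans do, but degrades by roughly the same amount as each manipulation is applied.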
6. Embedded and Computational Considerations
The streaming architecture of the occupancy-forecasting model allows:
- Constant Compute: Both per-timestep memory and runtime scale independently of history length and agent count.
- Edge Device Efficiency: Profiling on Xavier AGX yields:
- StateInit: $1.202$ms
- Propagate: $0.905$ms
- Agent/Signal Obs: $0.379$ms/$0.460$ms
- RoadContext: $0.395$ms
  - OccupancyQuery: $9.260$ms
  - Total per-step latency for an 8 s forecast: the listed modules sum to roughly $12.6$ms, i.e. on the order of $80$Hz, dominated by the occupancy query
- Model Size: The latent state occupies $128$kB; model weights total $42$MB in FP16.
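The per-step latency budget implied by the profile above can be totalled directly. Treating StateInit as a once-per-sequence cost (an assumption; the profile does not say how often it runs) gives a rough steady-state rate:

```python
# Profiled module latencies on Xavier AGX, in milliseconds (from the text).
modules = {
    "StateInit": 1.202,       # assumed amortised: once per sequence, not per step
    "Propagate": 0.905,
    "AgentObs": 0.379,
    "SignalObs": 0.460,
    "RoadContext": 0.395,
    "OccupancyQuery": 9.260,
}

# Steady-state step = everything except the (assumed) one-off initialisation.
per_step_ms = sum(v for k, v in modules.items() if k != "StateInit")
rate_hz = 1000.0 / per_step_ms
```

The occupancy query dominates the budget, so applications that query fewer points (or at lower resolution) would stream proportionally faster.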
7. Significance and Distinguishing Features
MotionPerceiver (biological motion) is the first model to match human error patterning and top-level generalization across diverse, psychophysically relevant manipulations of point-light action stimuli, enabled by extracting flow-based invariant motion representations and learned prototypical "snapshot" neurons (Han et al., 2024).
MotionPerceiver (occupancy forecasting) demonstrates a general, efficient, streaming paradigm for scene-centric motion prediction, contrasting with fixed-history, agent-centric approaches. The system's local query-based emission and recursive transformer update afford strict resource bounds and real-time feasibility on embedded automotive platforms. Its architecture achieves state-of-the-art soft IoU accuracy on WOMD with reduced parameter counts (Ferenczi et al., 2023).