Motus: Unified Motion Frameworks

Updated 3 July 2026

Motus is a framework that models complex, non-rigid motion using compressed, low-rank and latent representations.
It enables efficient real-time tracking in MRI, robotics, and wildlife monitoring by optimizing deformation fields.
The approach integrates diverse data modalities with advanced parameterizations, boosting performance in both clinical and simulation settings.

Motus encompasses a family of frameworks, models, and systems that address the measurement, representation, and modeling of motion across domains including magnetic resonance imaging (MRI), robotics, multi-object tracking, human motion-language modeling, and wildlife tracking. The central methodology in several lines of work is the mathematical modeling of complex, often non-rigid, time-varying deformations (for dynamic medical imaging, embodied agents, or physical and environmental monitoring) using compressed, low-rank, or latent representations. These representations enable efficient estimation, generation, and interpretation of motion from high-dimensional, heterogeneously sampled data.

1. Mathematical Foundations of MOTUS Frameworks

In the context of MRI motion field estimation, the MR-MOTUS (“Model-based Reconstruction of MOTion from Undersampled Signal”) framework represents motion as time-dependent deformation fields acting on an anatomical reference. Given a fixed reference image $m_0(x)$ on $x \in \Omega \subset \mathbb{R}^3$ and a deformation field $\Phi(x, t)$, the observed k-space signal is:

$s(k, t) = \int_{\Omega} m_0(x) \exp\left(-i2\pi k \cdot [x + \Phi(x, t)]\right) dx + \epsilon(k, t)$

This non-rigid signal model links the dynamic k-space signal directly to the underlying motion field, bypassing intermediate dynamic image reconstruction. The estimation task becomes an inverse problem: recover $\Phi(x,t)$ given undersampled $k$-space data and a reference image, typically by optimizing

$\min_{\Phi} \|F(\Phi) - s\|_2^2 + \lambda \mathcal{R}(\Phi)$

where $F(\Phi)$ denotes the forward signal model above evaluated at the sampled $k$-space locations, $s$ collects the measured data, $\lambda$ is a regularization weight, and $\mathcal{R}$ encodes priors such as spatial smoothness or Jacobian-based incompressibility. Practical tractability and identifiability are achieved by parameterizing motion via low-rank factorizations, spline bases, or (in other Motus frameworks) continuous variational autoencoders for motion latents (Huttinga et al., 2019, Huttinga et al., 2020, Huttinga et al., 2021, Olausson et al., 2024, Olausson et al., 4 Mar 2026).
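To make the signal model and the regularized objective concrete, the following is a minimal one-dimensional sketch, not the authors' implementation. The grid, the Gaussian reference object, the set of `k` samples, the test deformation, and the finite-difference smoothness penalty standing in for $\mathcal{R}$ are all illustrative assumptions.

```python
import numpy as np

def forward_signal(m0, x, phi, k):
    """Riemann-sum discretization of s(k) = int m0(x) exp(-i 2 pi k (x + phi(x))) dx."""
    dx = x[1] - x[0]
    warped = x + phi                                   # deformed coordinates x + Phi(x, t)
    phase = np.exp(-2j * np.pi * np.outer(k, warped))  # shape (num_k, num_x)
    return phase @ m0 * dx

def objective(phi, m0, x, k, s, lam):
    """Data misfit plus lam times a simple smoothness prior (assumed form of R)."""
    residual = forward_signal(m0, x, phi, k) - s
    smoothness = np.sum(np.diff(phi) ** 2)
    return np.sum(np.abs(residual) ** 2) + lam * smoothness

# Hypothetical test problem: Gaussian object, smooth sinusoidal deformation.
x = np.linspace(-0.5, 0.5, 64, endpoint=False)
m0 = np.exp(-(x / 0.1) ** 2)
k = np.arange(-16, 16)
phi_true = 0.02 * np.sin(2 * np.pi * x)

s = forward_signal(m0, x, phi_true, k)                 # noise-free "measurement"
cost_true = objective(phi_true, m0, x, k, s, lam=0.0)
cost_zero = objective(np.zeros_like(x), m0, x, k, s, lam=0.0)
```

With noise-free data the misfit vanishes at the true deformation and is strictly positive for the zero-motion guess, which is the property an optimizer over $\Phi$ exploits.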

In the domain of embodied AI, Motus introduces a unified latent action world model, encapsulating vision, language, and action distributions in a diffusion framework built upon a Mixture-of-Transformer (MoT) architecture that fuses semantic, generative, and control "experts." Latent actions are derived from optical flow and encoded through autoencoders, mapping high-dimensional sequences into compressed, learnable representations that facilitate training from heterogeneous or weakly labeled data (Bi et al., 15 Dec 2025).

Human motion modeling in Motus contexts similarly employs continuous latent variable models (e.g., VAEs), together with transformer-based bimodal attention mechanisms, to jointly handle language and motion in an autoregressive, unified sequence modeling framework (Zhu et al., 30 Jun 2025).

2. Motion-Field Parameterizations and Low-Rank Models

A defining feature of Motus-based estimation is explicit low-rank modeling of complex temporal deformations. In MRI, the space-time motion field $D(x, t)$ is factorized as:

$D(x, t) = \sum_{r=1}^R u_r(x) v_r(t) ~~\text{or}~~ D = \Phi \Psi^T$

where the columns of $\Phi$ contain the spatial basis fields $u_r(x)$ and the columns of $\Psi$ the temporal coefficients $v_r(t)$ (here $\Phi$ denotes the spatial factor matrix, not the deformation field of Section 1). Empirically, low ranks $R$ suffice for abdominal respiratory motion, reducing the memory and computational footprint from that of the full space-time field to that of the $R$ spatial and temporal factors, and enabling real-time or near-real-time 3D+t motion estimation (Huttinga et al., 2021, Huttinga et al., 2020).
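The storage argument behind the factorization $D = \Phi \Psi^T$ can be illustrated with a small synthetic example. The sizes `N`, `M`, and `R` below are hypothetical, and the truncated SVD is used only to show that an exactly rank-$R$ field is recovered from $R$ components; it is not claimed to be the estimation procedure of the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, R = 500, 200, 2            # hypothetical: spatial samples, time points, rank

# Synthetic rank-R space-time motion field D = Phi Psi^T.
Phi = rng.standard_normal((N, R))
Psi = rng.standard_normal((M, R))
D = Phi @ Psi.T

# Truncated SVD recovers an equivalent rank-R factorization.
U, svals, Vt = np.linalg.svd(D, full_matrices=False)
Phi_hat = U[:, :R] * svals[:R]   # spatial components scaled by singular values
Psi_hat = Vt[:R].T               # temporal components
rel_err = np.linalg.norm(D - Phi_hat @ Psi_hat.T) / np.linalg.norm(D)

full_storage = N * M             # entries in the full field
lowrank_storage = R * (N + M)    # entries in the two factors
```

For these sizes the factors hold 1,400 numbers instead of 100,000, and the gap widens as the spatial grid and the number of time points grow.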

For time-resolved contrast imaging (CMR-MOTUS), the reference image itself is allowed to vary temporally and is decomposed into a low-rank (anatomy) and sparse (contrast-dynamic) component:

$m(x, t) = L(x, t) + S(x, t)$

with $L$ the low-rank baseline and $S$ the temporally sparse contrast inflow, regularized by nuclear norm and time-sparsity. Motion estimation and image/contrast separation alternate within a joint optimization loop (Olausson et al., 2024).
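A generic low-rank plus sparse split of a space-time matrix can be sketched with two proximal operators: singular value thresholding for the nuclear norm and soft-thresholding for the sparsity penalty. The sketch below assumes a real-valued matrix, an elementwise $\ell_1$ penalty as a stand-in for time-sparsity, and a plain alternating scheme with hypothetical thresholds `tau_l` and `tau_s`; it omits the motion model and is not the CMR-MOTUS algorithm itself.

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: proximal operator of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def soft_threshold(X, tau):
    """Elementwise soft-thresholding: proximal operator of tau * l1 norm (real data)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def cost(A, L, S, tau_l, tau_s):
    """0.5 * ||A - L - S||_F^2 + tau_l * ||L||_* + tau_s * ||S||_1."""
    nuclear = np.linalg.svd(L, compute_uv=False).sum()
    return 0.5 * np.sum((A - L - S) ** 2) + tau_l * nuclear + tau_s * np.abs(S).sum()

def l_plus_s(A, tau_l, tau_s, n_iter=50):
    """Alternate exact minimization over L and S (block coordinate descent)."""
    L = np.zeros_like(A)
    S = np.zeros_like(A)
    for _ in range(n_iter):
        L = svt(A - S, tau_l)
        S = soft_threshold(A - L, tau_s)
    return L, S

# Hypothetical data: rank-1 "anatomy" (voxels x frames) plus a brief localized enhancement.
A = np.outer(np.linspace(1.0, 2.0, 30), np.ones(20))
A[5:8, 10:12] += 3.0
L, S = l_plus_s(A, tau_l=1.0, tau_s=0.5)
```

Because each update minimizes the cost exactly in one block, the cost never increases from one iteration to the next; the thresholds control how much signal is attributed to each component.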

In the context of unified world models for robotics, Motus leverages pixel-level optical flow autoencoded into low-dimensional latent actions, forming the substrate for cross-modal generative modeling and efficient transfer to real-world robot policy learning. The latent representation is aligned through reconstruction and KL losses to enforce fidelity and match robot control statistics (Bi et al., 15 Dec 2025).

3. Applications Across Domains

a. MR-Guided Radiotherapy and Cardiac MRI

MR-MOTUS and CMR-MOTUS make possible high-temporal-resolution, non-rigid 3D+t motion tracking with frame rates up to 40.8 Hz (2D), 7.6–9.3 Hz (3D), and clinically relevant latency (170 ms including acquisition/reconstruction). This supports real-time MR-guided radiotherapy (MR-Linac), continuous cardiac imaging under arrhythmia (enabling beatwise ejection fraction analysis in PVC patients), and free-running myocardial perfusion MRI without preparatory scans (Huttinga et al., 2019, Huttinga et al., 2020, Huttinga et al., 2021, Olausson et al., 2024, Olausson et al., 4 Mar 2026).

b. Embodied Multimodal Agents

Motus proposes an integrated latent action world model supporting diverse tasks: world modeling (predicting future observations from current observations and actions), vision-language-action prediction (predicting actions from observations and language instructions), inverse dynamics, video-action joint prediction, and more. The architecture’s three-phase recipe (VGM adaptation, unified pretraining, target-robot fine-tuning), combined with a six-layer data pyramid, yields transferability and data efficiency, with documented +15–45% improvements in simulated and real-robot success rates relative to established baselines (Bi et al., 15 Dec 2025).

c. Motion-Language Modeling and Multi-Object Tracking

Motion-language integration, as realized by MotionGPT3, demonstrates state-of-the-art results in biomechanical and animation tasks, with continuous motion latents enabling rich cross-modal reasoning in an autoregressive framework. Motion-Aware Transformers further leverage explicit motion forecasting to improve association accuracy in multi-object video tracking, yielding substantial improvements in HOTA and related metrics (Zhu et al., 30 Jun 2025, Yang et al., 26 Sep 2025).

d. Environmental and Wildlife Monitoring

The Motus Wildlife Tracking System employs automated VHF telemetry to track radio-tagged birds and bats. Accurate triangulation requires precise in situ calibration of antenna radiation patterns; field protocols and numerical kNN regression methods have been developed to produce high-fidelity, non-parametric models of the antenna radiation pattern, reducing localization error from kilometers to 100–500 m (Carlson et al., 2022).
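As a rough illustration of non-parametric pattern modeling, the sketch below fits a k-nearest-neighbour regressor that maps bearing to received signal strength. It assumes bearing is the only feature, uses a synthetic cosine-shaped pattern with Gaussian noise as hypothetical calibration data, and defines a helper `knn_pattern` that is not part of any Motus software; the published protocol may use additional features.

```python
import numpy as np

def knn_pattern(bearing_train_deg, rssi_train_dbm, bearing_query_deg, k=5):
    """Predict signal strength at query bearings as the mean of the k nearest calibration points."""
    train = np.asarray(bearing_train_deg, dtype=float)
    rssi = np.asarray(rssi_train_dbm, dtype=float)
    query = np.atleast_1d(np.asarray(bearing_query_deg, dtype=float))
    # Circular angular distance in degrees, so 359 and 1 are 2 degrees apart.
    diff = np.abs(query[:, None] - train[None, :]) % 360.0
    dist = np.minimum(diff, 360.0 - diff)
    nearest = np.argsort(dist, axis=1)[:, :k]
    return rssi[nearest].mean(axis=1)

# Hypothetical calibration sweep: one reading per degree around the antenna.
rng = np.random.default_rng(0)
bearings = np.arange(0.0, 360.0, 1.0)
rssi = -80.0 + 20.0 * np.cos(np.radians(bearings)) + rng.normal(0.0, 1.0, bearings.size)

pred = knn_pattern(bearings, rssi, [0.0, 180.0], k=5)   # main lobe and back lobe
```

Averaging neighbours makes no assumption about the lobe shape, which is the appeal of a non-parametric model when site effects distort the theoretical pattern.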

4. Workflow, Optimization Strategies, and Data Pipelines

Motus-based workflows in dynamic MRI combine multi-stage optimization:

Offline phase: Moderate-duration optimization computes spatial motion bases from respiratory-binned or free-breathing contrast-enhanced data using total variation or volume-preserving regularization and B-spline parameterization.
Online phase: Rapid inference updates a small set of temporal weights for each new $k$-space sample, permitting real-time tracking compatible with adaptive radiotherapy requirements (Huttinga et al., 2021).
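The online step can be pictured as a small nonlinear least-squares problem: with the spatial components fixed, only a few temporal weights are fitted to the newest data. The one-dimensional sketch below assumes two hypothetical spatial components, noise-free data, and a Gauss-Newton iteration with a finite-difference Jacobian; the solver used in the cited work may differ.

```python
import numpy as np

def forward_signal(m0, x, phi, k):
    """Discretized signal model s(k) = sum_x m0(x) exp(-i 2 pi k (x + phi(x))) dx."""
    dx = x[1] - x[0]
    return np.exp(-2j * np.pi * np.outer(k, x + phi)) @ m0 * dx

def stacked_residual(v, U, m0, x, k, s):
    """Real-valued residual for motion field U @ v (real and imaginary parts stacked)."""
    r = forward_signal(m0, x, U @ v, k) - s
    return np.concatenate([r.real, r.imag])

def fit_temporal_weights(U, m0, x, k, s, n_iter=20, eps=1e-6):
    """Gauss-Newton over the temporal weights v with the spatial basis U held fixed."""
    v = np.zeros(U.shape[1])
    for _ in range(n_iter):
        r = stacked_residual(v, U, m0, x, k, s)
        J = np.empty((r.size, v.size))
        for j in range(v.size):                      # forward-difference Jacobian column
            step = np.zeros_like(v)
            step[j] = eps
            J[:, j] = (stacked_residual(v + step, U, m0, x, k, s) - r) / eps
        v = v - np.linalg.lstsq(J, r, rcond=None)[0]
    return v

# Hypothetical setup: components from an "offline" phase, one new dynamic to explain.
x = np.linspace(-0.5, 0.5, 64, endpoint=False)
m0 = np.exp(-(x / 0.1) ** 2)
k = np.arange(-16, 16)
U = np.column_stack([np.sin(2 * np.pi * x), np.cos(2 * np.pi * x)])
v_true = np.array([0.01, -0.005])
s = forward_signal(m0, x, U @ v_true, k)

v_hat = fit_temporal_weights(U, m0, x, k, s)
```

Because only `U.shape[1]` unknowns are updated per dynamic, each solve is tiny compared with estimating a full deformation field, which is what makes the online phase fast.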

In CMR-MOTUS, reconstruction alternates between reference-image (L+S) updates and low-rank motion field updates, using proximal-gradient/FISTA schemes and L-BFGS for B-spline fields. Implementation complexity is dominated by repeated FFTs, regridding, and (for image separation) SVD computations; efficient GPU acceleration is increasingly central for 3D real-time reconstructions (Olausson et al., 2024, Olausson et al., 4 Mar 2026).

Unified latent action world models adopt a multi-phase training pipeline:

VGM adaptation utilizing large-scale video and synthetic robot data
Joint pretraining for action/observation chunks incorporating frozen vision-language models
End-to-end fine-tuning on target robot demonstrations, leveraging hierarchically organized, multi-modal data (Bi et al., 15 Dec 2025)

5. Performance, Metrics, and Clinical/Empirical Validation

In MRI applications, MR-MOTUS achieves sub-10% relative error norms against compressed sensing and optical flow registration baselines for both rigid and non-rigid motion, with Jacobian determinants within 5% of unity inside organs, confirming physiological plausibility. In CMR-MOTUS scenarios, the SSIM over myocardium is 0.9780±0.0145; in vivo sharpness is enhanced by 14% over compressed sensing. In arrhythmic cohorts, beat-to-beat EF distributions derived from propagated motion fields replicate clinical reference values and reveal bimodal patterns aligned with temporally resolved ECG PVCs (Huttinga et al., 2019, Huttinga et al., 2020, Huttinga et al., 2021, Olausson et al., 2024, Olausson et al., 4 Mar 2026).
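The Jacobian-determinant check mentioned above can be sketched for a two-dimensional displacement field. The helper below assumes arrays indexed as `[y, x]` on a regular grid and uses finite differences; the 5% tolerance follows the figure quoted in the text, while the test fields are hypothetical.

```python
import numpy as np

def jacobian_determinant_2d(phi_x, phi_y, spacing=1.0):
    """det of the Jacobian of the map (x, y) -> (x + phi_x, y + phi_y), arrays indexed [y, x]."""
    dphix_dy, dphix_dx = np.gradient(phi_x, spacing)
    dphiy_dy, dphiy_dx = np.gradient(phi_y, spacing)
    return (1.0 + dphix_dx) * (1.0 + dphiy_dy) - dphix_dy * dphiy_dx

Y, X = np.mgrid[0:32, 0:32].astype(float)

# Rigid translation: volume is preserved, determinant equals 1 everywhere.
det_shift = jacobian_determinant_2d(np.full_like(X, 1.5), np.full_like(Y, -0.5))

# Uniform 2% expansion: determinant equals 1.02 squared everywhere.
det_scale = jacobian_determinant_2d(0.02 * X, 0.02 * Y)

fraction_within_5pct = np.mean(np.abs(det_scale - 1.0) < 0.05)
```

A determinant of 1 means local area (or volume in 3D) is preserved; values far from 1 or below 0 indicate implausible compression, expansion, or folding.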

For embodied agents, simulation and real-world benchmarks show +11–48% relative performance increases; for multimodal motion-language modeling, text-to-motion FID reaches 0.22, substantially improving motion quality and diversity relative to discrete token-based models (Bi et al., 15 Dec 2025, Zhu et al., 30 Jun 2025).

Wildlife tracking site-calibration protocols yield mean absolute errors of 3–8 dBm (self) and 7–7.5 dBm (cross-site); embedding the resulting radiation patterns into Motus triangulation algorithms produces animal fix errors an order of magnitude smaller than when using theoretical antenna models (Carlson et al., 2022).

6. Limitations and Frontier Directions

Motus-based frameworks face computational bottlenecks in large-scale 3D reconstructions, with iterative alternations and high-dimensional SVDs (for L+S decompositions) challenging to scale without advanced GPU resources. Separation of motion from contrast or other dynamic sources is not always perfect, and empirical parameter tuning (rank, regularization penalties, basis parametrizations) remains necessary. In robotics and cross-modal learning, latent action models may underperform on tasks requiring micrometer precision or on objects with strongly out-of-distribution appearance or dynamics (Olausson et al., 2024, Bi et al., 15 Dec 2025, Zhu et al., 30 Jun 2025).

Ongoing directions include efficient multi-resolution motion fields, deeper integration with external motion surrogates (navigators, pilot-tone), structure-guided or learned regularization, and clinical trials assessing impact on therapy and diagnosis. In AI, further progress is anticipated in scaling multimodal, motion-centric LLMs to broader settings (object–scene interactions, multi-agent control), and in leveraging large-scale unlabeled data for action-space generalization (Olausson et al., 2024, Bi et al., 15 Dec 2025, Zhu et al., 30 Jun 2025, Olausson et al., 4 Mar 2026).

In summary, Motus methods provide a unified mathematical and algorithmic approach to motion estimation, modeling, and cross-modal integration across medical imaging, embodied AI, tracking, and environmental domains, grounded in low-rank, latent, and compressed representations optimized for scalability, interpretability, and real-world applicability (Huttinga et al., 2019, Huttinga et al., 2020, Huttinga et al., 2021, Olausson et al., 2024, Carlson et al., 2022, Zhu et al., 30 Jun 2025, Yang et al., 26 Sep 2025, Bi et al., 15 Dec 2025, Olausson et al., 4 Mar 2026).