SkyJEPA: Learning Long-Horizon World Models for Zero-Shot Sim-to-Real Control of Quadrotors

Published 22 Jun 2026 in cs.RO and cs.LG | (2606.23444v2)

Abstract: Accurate dynamics models are critical for informed decision-making in robotic systems, particularly for agile aerial vehicles operating under uncertainty. Neural network dynamics models are attractive for capturing complex nonlinear effects, but existing predictive approaches struggle with long-horizon forecasting because their autoregressive rollout mechanism amplifies errors over time. Joint Embedding Predictive Architectures (JEPAs) offer a compelling alternative by modeling dynamics in latent space, yet prior JEPA-style methods for robot navigation have been studied primarily for kinematic-level planning, with limited investigation in high-frequency control. In this work, we introduce the JEPA-style model for real-time quadrotor control. The proposed approach combines a latent dynamics model with a novel physics-inspired prober that maps frozen latents to interpretable state, enabling physically grounded long-horizon prediction. Additionally, we combine the learned model with a sampling-based optimal control solution to take advantage of its predictive capabilities for real-time control on embedded hardware. Finally, to reduce the dependence on expensive and unsafe real-world data collection, we develop a structured pipeline for automated dataset generation. Extensive open-loop and outdoor closed-loop experiments demonstrate accurate prediction, robust zero-shot sim-to-real transfer, and strong generalization across diverse operating conditions.

Summary

  • The paper presents an end-to-end JEPA-style latent dynamics model that reduces long-horizon prediction errors for quadrotor control.
  • It employs GRU-based predictors and a physics-inspired prober to achieve a 3–8x reduction in position and attitude errors.
  • Zero-shot sim-to-real transfer is demonstrated with up to a 45% reduction in tracking error, using domain-randomized simulation training and lightweight onboard inference.

SkyJEPA: JEPA-Style Latent World Modeling for Sim-to-Real Quadrotor Control

Motivation and Context

Reliable model-based control of aerial platforms requires accurate long-horizon forecasting of system dynamics, interpretability for enforcing physical constraints, real-time computational efficiency, and transferability across tasks and platforms. Traditional analytical models, despite encoding known structure, often suffer from limited fidelity due to real-world nonlinearities and hardware variation, necessitating extensive system identification and manual tuning. Neural network predictive models offer flexibility to capture such effects but exhibit instability over long horizons due to error compounding in autoregressive rollouts. Model-based RL approaches couple the world model to specific reward distributions, hindering task-agnostic generalization. These limitations motivate a paradigm shift towards compact, latent-space world modeling using JEPA-style architectures for aerial robot control.

Methodological Contributions

SkyJEPA introduces an end-to-end JEPA-style latent dynamics model for high-frequency quadrotor control. Instead of direct autoregressive state prediction, the architecture operates in latent space, learning temporally coherent dynamics representations informed by state-action history encoding. The system comprises:

  • Temporal Convolutional Encoders: Compact state and action history encodings capturing temporal trends and unmodeled system effects.
  • GRU-Based Latent Dynamics Predictor: Recursive prediction of latent representations over multi-step horizons, mitigating compounding error.
  • Physics-Inspired Prober: Structured residual-kinematic integration mapping latent rollouts to physically meaningful metric state variables, enabling controller compatibility.

The JEPA-style loss combines multi-step prediction consistency in latent space with SIGReg anti-collapse regularization. The SIGReg penalty enforces an isotropic Gaussian prior on projected latent distributions via the Epps–Pulley statistic, ensuring representation diversity and temporal coherence. The probing mechanism decodes frozen latent rollouts through differentiable residual-kinematic integration, preserving rigid-body geometry and enabling physically grounded rollouts suitable for real-time trajectory optimization.
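The anti-collapse idea can be illustrated with a minimal sketch. It assumes the regularizer projects latents onto random unit directions and, for each 1D projection, integrates the weighted squared gap between the empirical characteristic function and that of a standard Gaussian (an Epps–Pulley-type statistic). The Gaussian weight, the integration range, the trapezoid rule, and the names `epps_pulley` and `sigreg` are assumptions made for illustration, not the paper's settings.

```python
import math
import random


def epps_pulley(samples, t_max=5.0, n_knots=41):
    """Weighted squared distance between the empirical characteristic function
    of `samples` and that of N(0, 1), integrated with the trapezoid rule."""
    n = len(samples)
    dt = 2.0 * t_max / (n_knots - 1)
    total = 0.0
    for k in range(n_knots):
        t = -t_max + k * dt
        re = sum(math.cos(t * x) for x in samples) / n  # real part of the ECF
        im = sum(math.sin(t * x) for x in samples) / n  # imaginary part of the ECF
        target = math.exp(-0.5 * t * t)                 # CF of N(0, 1), purely real
        weight = math.exp(-0.5 * t * t)                 # Gaussian window (assumed)
        value = ((re - target) ** 2 + im ** 2) * weight
        total += value * (0.5 if k in (0, n_knots - 1) else 1.0)
    return total * dt


def sigreg(latents, num_projections=8, rng=None):
    """Average the 1D statistic over random unit directions."""
    rng = rng or random.Random(0)
    dim = len(latents[0])
    total = 0.0
    for _ in range(num_projections):
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in direction))
        direction = [c / norm for c in direction]
        projected = [sum(z * d for z, d in zip(latent, direction)) for latent in latents]
        total += epps_pulley(projected)
    return total / num_projections


data_rng = random.Random(1)
healthy = [[data_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(256)]
collapsed = [[0.0] * 4 for _ in range(256)]  # every latent identical: full collapse
print("healthy:", sigreg(healthy), "collapsed:", sigreg(collapsed))
```

In this toy setting the penalty is much larger for the collapsed batch than for the isotropic Gaussian batch, which is the behavior an anti-collapse term needs. In training, the same quantity would be computed with a differentiable tensor library and added to the latent prediction loss with a weight such as λ_sig.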

Domain-Randomized Dataset Synthesis

Training is performed entirely on simulation-generated datasets via domain randomization, systematically varying quadrotor physical parameters (mass, inertia matrix, thrust/torque coefficients, drag, motor time constants) to expose the model to a broad family of plausible real-world dynamics. Reference trajectories are automatically synthesized using multi-period Gaussian processes and tracked using NMPC and MPPI controllers for data diversity and dynamic feasibility. The resulting dataset densely covers the state, action, and transition spaces, promoting robustness to platform variation and enabling zero-shot sim-to-real transfer.
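A minimal sketch of parameter randomization follows. It assumes each parameter is scaled independently by a uniform factor, which is a simplification: the paper describes a structured (ancestral) sampling scheme whose details are not reproduced here. The `QuadParams` fields, the nominal values, and the ±50% default range are illustrative assumptions.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class QuadParams:
    mass: float              # kg
    inertia_diag: tuple      # diagonal of the inertia matrix, kg m^2
    thrust_coeff: float
    torque_coeff: float
    drag_coeff: float
    motor_time_const: float  # s


# Hypothetical nominal vehicle; these numbers are placeholders, not from the paper.
NOMINAL = QuadParams(
    mass=1.0,
    inertia_diag=(0.01, 0.01, 0.02),
    thrust_coeff=1.0,
    torque_coeff=0.02,
    drag_coeff=0.1,
    motor_time_const=0.03,
)


def randomize(nominal, rng, rel_range=0.5):
    """Scale every parameter by an independent factor in [1 - r, 1 + r]."""
    def scale(value):
        return value * rng.uniform(1.0 - rel_range, 1.0 + rel_range)

    return QuadParams(
        mass=scale(nominal.mass),
        inertia_diag=tuple(scale(i) for i in nominal.inertia_diag),
        thrust_coeff=scale(nominal.thrust_coeff),
        torque_coeff=scale(nominal.torque_coeff),
        drag_coeff=scale(nominal.drag_coeff),
        motor_time_const=scale(nominal.motor_time_const),
    )


rng = random.Random(0)
domains = [randomize(NOMINAL, rng) for _ in range(100)]
print(min(d.mass for d in domains), max(d.mass for d in domains))
```

Each sampled `QuadParams` would parameterize one simulated vehicle, and every rollout in the dataset would be collected from one such randomized domain.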

Sampling-Based Real-Time Control

Within the deployment pipeline, the learned world model is embedded in a sampling-based MPPI controller. At each control step, multiple candidate action sequences are sampled and rolled out through the latent dynamics and prober stack; trajectory costs reflecting state deviation and control effort are computed, and the control update is formed as a softmax-weighted average of the sampled sequences, so that lower-cost rollouts contribute more. The design enables high-frequency real-time control on embedded hardware (NVIDIA Jetson Orin NX), with lightweight inference facilitated by compact network architectures and TensorRT optimization.
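The sampling and weighting step can be sketched as follows. This is a toy under stated assumptions: a 1D double integrator stands in for the learned latent dynamics and prober, and the cost weights, noise scale `sigma`, sample count, and `temperature` are illustrative choices rather than the paper's configuration.

```python
import math
import random


def rollout_cost(x0, v0, actions, target, dt=0.05, effort_weight=0.01):
    """Toy stand-in for the world model: roll out a 1D double integrator and
    accumulate state-deviation and control-effort costs."""
    x, v, cost = x0, v0, 0.0
    for a in actions:
        v += a * dt
        x += v * dt
        cost += (x - target) ** 2 + effort_weight * a ** 2
    return cost


def mppi_step(x0, v0, nominal, target, rng, num_samples=64, sigma=1.0, temperature=1.0):
    """One MPPI update: sample perturbations, roll out, softmax-weight, average."""
    noises = [[rng.gauss(0.0, sigma) for _ in nominal] for _ in range(num_samples)]
    costs = [
        rollout_cost(x0, v0, [u + e for u, e in zip(nominal, eps)], target)
        for eps in noises
    ]
    c_min = min(costs)  # subtracting the minimum keeps the exponentials well scaled
    weights = [math.exp(-(c - c_min) / temperature) for c in costs]
    total = sum(weights)
    weights = [w / total for w in weights]
    updated = [
        u + sum(w * eps[t] for w, eps in zip(weights, noises))
        for t, u in enumerate(nominal)
    ]
    return updated, weights, costs


rng = random.Random(0)
nominal = [0.0] * 20  # 20 steps of 0.05 s, i.e. a 1 s horizon
cost_before = rollout_cost(0.0, 0.0, nominal, target=1.0)
updated, weights, costs = mppi_step(0.0, 0.0, nominal, 1.0, rng)
cost_after = rollout_cost(0.0, 0.0, updated, target=1.0)
print(cost_before, cost_after)
```

In a receding-horizon loop, only the first action of `updated` would be applied before the sequence is shifted and the step repeated. On the vehicle, the rollouts would be batched through the learned model instead of this scalar toy.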

Empirical Results

Long-Horizon Prediction: Compounding ratio (CR) and error growth (ER) analyses demonstrate that latent-space dynamics modeling substantially dampens recursive error accumulation compared to autoregressive state-space predictors. CR remains near unity for extended horizons, and ER is consistently lower, signifying stable temporal evolution.
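The exact CR and ER definitions are not reproduced in this summary. The toy below only illustrates the effect they are meant to capture: a model with a small one-step error accumulates a much larger error once it is fed its own predictions. The scalar linear system and its coefficients are assumptions chosen for illustration.

```python
def rollout(a, b, x0, controls):
    """Recursively apply x <- a * x + b * u and return the visited states."""
    states, x = [], x0
    for u in controls:
        x = a * x + b * u
        states.append(x)
    return states


controls = [1.0] * 20
true_traj = rollout(0.98, 0.1, 1.0, controls)   # "real" system
model_traj = rollout(0.99, 0.1, 1.0, controls)  # slightly wrong model, fed its own outputs

# Error when the model always restarts from the true previous state (one-step error).
true_inputs = [1.0] + true_traj[:-1]
one_step_err = [abs((0.99 * x + 0.1) - (0.98 * x + 0.1)) for x in true_inputs]

# Error when the model recurses on its own predictions (autoregressive rollout).
rollout_err = [abs(m - t) for m, t in zip(model_traj, true_traj)]

print("one-step error at step 20:", one_step_err[-1])
print("rollout error at step 20: ", rollout_err[-1])
```

The one-step error stays at the scale of the coefficient mismatch, while the recursive error grows at every step. That gap is the kind of quantity that long-horizon metrics are designed to expose.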

Temporal Straightening: Latent rollouts exhibit high temporal straightening, as quantified by cosine similarity of consecutive latent velocities, indicating more consistent and directionally aligned trajectory evolution in representation space and facilitating robust long-horizon recursion.
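A minimal sketch of this measurement, assuming latents are stored as plain lists of floats and that the score is the mean cosine similarity over consecutive latent velocity pairs (the function name `straightness` is hypothetical):

```python
import math


def straightness(latents):
    """Mean cosine similarity between consecutive latent velocities
    v_t = z_{t+1} - z_t. A value of 1.0 means a perfectly straight path."""
    velocities = [
        [b - a for a, b in zip(z0, z1)] for z0, z1 in zip(latents, latents[1:])
    ]
    cosines = []
    for v0, v1 in zip(velocities, velocities[1:]):
        n0 = math.sqrt(sum(c * c for c in v0))
        n1 = math.sqrt(sum(c * c for c in v1))
        if n0 > 0.0 and n1 > 0.0:  # skip stationary steps, where the angle is undefined
            cosines.append(sum(a * b for a, b in zip(v0, v1)) / (n0 * n1))
    if not cosines:
        raise ValueError("need at least three latents with non-zero motion")
    return sum(cosines) / len(cosines)


straight = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
zigzag = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]]
print(straightness(straight), straightness(zigzag))
```

A straight latent path scores close to 1, while the right-angle zigzag scores 0, so higher values indicate more directionally consistent rollouts.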

Physics Interpretability: Integration of the physics-inspired prober yields significant improvements in metric-state prediction, reducing position RMSE and attitude error by factors of 3–8 compared to unconstrained latent or predictive baselines.

Sim-to-Real Transfer: The framework demonstrates robust zero-shot transfer, achieving up to 45% reduction in tracking error and attitude deviation on real-world trajectories compared to learned predictive baselines trained with the same simulation data. No fine-tuning or real-world retraining is required.

Robustness: Quantitative and qualitative evaluations under propeller switching and payload modifications establish superior resilience to platform variation, with consistent tracking performance and reduced error variance, outperforming predictive state-based alternatives.

Data Distribution Quality: The Trajectory Distribution Quality (TDQ) score, based on entropy and coverage across clustered state-action, transition, and parameter spaces, exhibits clear correlation with prediction accuracy, validating the efficacy of domain-randomization and synthetic trajectory strategies for world model learning.
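The exact TDQ formula is not given in this summary. As a hedged illustration of its two named ingredients, the sketch below computes normalized entropy and coverage from cluster assignments. It assumes clustering has already been done, and how the paper combines these terms across the state-action, transition, and parameter spaces is not shown.

```python
import math
from collections import Counter


def entropy_and_coverage(cluster_ids, num_clusters):
    """Return (normalized Shannon entropy of cluster occupancy,
    fraction of clusters containing at least one sample)."""
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    normalized = entropy / math.log(num_clusters) if num_clusters > 1 else 0.0
    coverage = len(counts) / num_clusters
    return normalized, coverage


balanced = [0, 1, 2, 3] * 25  # samples spread evenly over 4 clusters
skewed = [0] * 100            # every sample falls in one cluster
print(entropy_and_coverage(balanced, 4), entropy_and_coverage(skewed, 4))
```

An evenly spread dataset reaches the maximum on both terms, while a dataset concentrated in one cluster has zero entropy and low coverage.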

Implications and Speculation

SkyJEPA provides evidence that spatially and temporally structured latent world models, regularized for diversity and grounded by physical decoders, can achieve real-time, robust, and generalizable control in aerial robotics without depending on risky real-world data collection. The architecture offers a scalable foundation for extending JEPA-style modeling to higher-dimensional sensory domains (e.g., vision-based navigation), where autoregressive reconstruction is both computationally prohibitive and semantically suboptimal. Further integration of safety-centric structures and uncertainty-aware planning objectives could enhance reliability in cluttered and uncertain environments. The simplicity of the SIGReg regularization, which reportedly has minimal hyperparameter sensitivity, suggests it could be practical for general world modeling tasks and may support broader work on latent-space planning and sim-to-real transfer in robotics.

Conclusion

SkyJEPA advances the state of neural world modeling for aerial robot control by combining JEPA-style latent dynamics learning, structured physics-based decoding, and comprehensive domain randomization for dataset synthesis. The resulting framework achieves improved long-horizon stability, robust sim-to-real deployment, and resilience to platform and environmental variation, all within real-time computational budgets. Extensions to vision-based navigation and enhanced safety integration represent promising directions for future work.

Explain it Like I'm 14

Overview of the paper

This paper is about teaching small flying robots (quadrotor drones) to predict how they will move in the future so they can control themselves safely and smoothly. The authors build a new kind of “world model” that helps a drone plan and act in real time, even when things change (like wind, extra weight, or different propellers). They train this model entirely in a simulator and then show it works in the real world without extra tuning.

Key objectives and questions

The paper asks whether we can create a learned model for drones that:

  • Predicts accurately far into the future (not just the next step)?
  • Uses understandable physical quantities (like position, speed, and angle) so we can enforce safety and limits?
  • Runs fast enough on a small onboard computer to control the drone in real time?
  • Works in the real world “zero-shot,” meaning it was trained in simulation only, but still flies well outdoors without extra training?

How did they do it? Methods and approach

The researchers combined a few ideas to reach these goals. Here’s how their system works, explained with simple analogies:

A world model in “latent space” (JEPA)

  • Instead of predicting every tiny detail of the drone’s future state step by step, they use a JEPA-style approach (Joint Embedding Predictive Architecture).
  • Think of this as predicting in a “secret code” or summary language (called latent space) that captures what matters for control, without getting distracted by noise.
  • Why this helps: If you repeatedly feed your own slightly-wrong predictions back into the model, small errors can snowball over time (like telling a long story one sentence at a time and slightly misremembering each previous sentence). Predicting in a compact, stable code helps reduce this error build-up over long horizons.

A physics-inspired “prober” (a translator)

  • The world model’s “secret code” still needs to be turned into real, physical values (position, velocity, angles, etc.) that controllers understand.
  • The authors add a “prober” that acts like a translator: it converts the code into physical quantities using simple physics equations (how thrust changes speed, how rotation changes orientation), plus small learned corrections for real-world effects (like drag and motor delays).
  • This keeps predictions physically meaningful and grounded.
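Here is a small sketch of what one “translation” step might look like. It is an assumption-heavy illustration, not the paper's prober: it uses plain Euler integration, a z-up world frame, a thrust acceleration along the drone's body z-axis, and a correction term `residual_acc` that a learned network would supply from the latent code. It leaves out any learned action gain.

```python
import math

GRAVITY = (0.0, 0.0, -9.81)  # world frame with z pointing up (assumed convention)


def quat_mul(a, b):
    """Hamilton product of quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )


def rotate(q, v):
    """Rotate a body-frame vector into the world frame with unit quaternion q."""
    conj = (q[0], -q[1], -q[2], -q[3])
    return quat_mul(quat_mul(q, (0.0,) + tuple(v)), conj)[1:]


def prober_step(p, v, q, omega, thrust_acc, residual_acc, dt):
    """One Euler step: physics first, then a small learned correction."""
    thrust_dir = rotate(q, (0.0, 0.0, 1.0))  # body z-axis expressed in the world frame
    acc = [thrust_acc * d + g + r for d, g, r in zip(thrust_dir, GRAVITY, residual_acc)]
    p_next = tuple(pi + vi * dt for pi, vi in zip(p, v))
    v_next = tuple(vi + ai * dt for vi, ai in zip(v, acc))
    # Quaternion kinematics: q_dot = 0.5 * q * (0, omega); renormalize to stay a rotation.
    dq = quat_mul(q, (0.0,) + tuple(omega))
    q_raw = [qi + 0.5 * dt * di for qi, di in zip(q, dq)]
    norm = math.sqrt(sum(c * c for c in q_raw))
    q_next = tuple(c / norm for c in q_raw)
    return p_next, v_next, q_next


# Hover: thrust cancels gravity, so only the residual changes the velocity.
p1, v1, q1 = prober_step(
    p=(0.0, 0.0, 1.0), v=(0.0, 0.0, 0.0), q=(1.0, 0.0, 0.0, 0.0),
    omega=(0.0, 0.0, 0.5), thrust_acc=9.81, residual_acc=(0.2, 0.0, 0.0), dt=0.05,
)
print(p1, v1, q1)
```

Because the orientation is renormalized after every step, the predicted attitude always stays a valid rotation, which is one way a decoder can preserve rigid-body geometry over a long rollout.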

A sampling-based controller (MPPI)

  • To decide what the drone should do next, they use a controller called MPPI.
  • Think of MPPI like this: it tries many different possible sequences of motor commands for the next second, uses the world model to “imagine” what would happen, scores each imagined future, and then picks a smart average of the best ones.
  • This runs quickly on the drone’s onboard computer, making it suitable for real-time control.

Training data from simulation with domain randomization

  • Collecting lots of real flight data is risky and time-consuming. So they build an automated simulator pipeline:
    • They generate many smooth, varied flight paths using Gaussian processes (a way to create trajectories that are random but not jerky).
    • They track these paths in simulation using two controllers to produce realistic state-action data.
    • They randomize the drone’s simulated physical parameters (mass, drag, motor response, etc.) over many “domains” so the model learns to handle different conditions.
  • This “train in a video game with lots of variations” strategy helps the model transfer to real-world flying, even though it never saw real-world training data.
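To make the “random but not jerky” idea concrete, here is a minimal sketch that draws one smooth 1D path from a Gaussian process. It assumes that “multi-period” means a sum of periodic kernels with different periods. The kernel form, the periods, the length scale, and the jitter are illustrative choices, not values from the paper.

```python
import math
import random


def multi_period_kernel(t1, t2, periods, length_scale=1.0):
    """Sum of periodic kernels, one per period (assumed form)."""
    total = 0.0
    for period in periods:
        s = math.sin(math.pi * abs(t1 - t2) / period)
        total += math.exp(-2.0 * s * s / length_scale ** 2)
    return total


def cholesky(matrix):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(matrix)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            # The floor guards against tiny negative pivots caused by round-off.
            low[i][j] = math.sqrt(max(s, 1e-12)) if i == j else s / low[j][j]
    return low


def sample_trajectory(times, periods, rng, jitter=1e-6):
    """Draw one zero-mean GP sample at the given times."""
    n = len(times)
    cov = [
        [multi_period_kernel(ti, tj, periods) + (jitter if i == j else 0.0)
         for j, tj in enumerate(times)]
        for i, ti in enumerate(times)
    ]
    low = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n)]


times = [0.05 * k for k in range(40)]  # 2 s of reference sampled at 20 Hz
path = sample_trajectory(times, periods=(2.0, 4.0), rng=random.Random(0))
print([round(x, 3) for x in path[:5]])
```

Running this once per axis with different seeds would give a varied 3D reference path, which a tracking controller can then follow in simulation to produce state-action data.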

Real-time on embedded hardware

  • The model is small and efficient (about 9,000 parameters) and runs at high speed on a tiny onboard computer (NVIDIA Jetson Orin NX), using TensorRT for fast inference.
  • Hitting around 100 updates per second is key for stable flight.

Main findings and why they matter

Here are the most important results:

  • Better long-horizon prediction: The model stays accurate when predicting further into the future, reducing the “error snowball” problem common in step-by-step predictive models.
  • Physically meaningful outputs: Thanks to the prober, the model produces interpretable states (like position and orientation), which lets the controller enforce limits and safety.
  • Real-time control on the drone: The whole system runs fast enough on embedded hardware to control the drone outdoors.
  • Zero-shot sim-to-real: Even though it was trained only in simulation, it flew well outdoors without extra fine-tuning.
  • Robust to changes: It handled different scenarios like carrying extra payload or switching propellers, showing strong generalization.

These results are important because drones must make reliable decisions quickly in changing environments. A model that predicts far ahead, stays stable, and runs fast onboard is a big step toward safer, more capable autonomous flight.

Implications and potential impact

  • Safer, more agile drones: With long-horizon, physics-aware predictions, drones can plan better and avoid unstable behavior during fast maneuvers.
  • Less risky data collection: Training in simulation saves time, lowers cost, and avoids crashes, while still transferring to real-world use.
  • General-purpose model: Because the world model is not tied to a single task or reward, it can be reused across different routes, objectives, and even slightly different drone setups.
  • Broader robotics use: The same ideas—predicting in a compact “code,” translating to physical states, and using sampling-based control—could help other robots (ground, marine, manipulation) operate more reliably in the real world.

In short, the paper shows a practical way to learn fast, accurate, and physically grounded models that make drones smarter and safer, all while reducing the need for dangerous real-world training flights.

Knowledge Gaps, Limitations, and Open Questions

The following list distills what remains uncertain or unexplored in the paper and highlights concrete directions for future work:

  • Lack of theoretical guarantees: no analysis of long-horizon latent rollout stability, contraction properties of the predictor, or closed-loop stability of the MPPI + JEPA + prober stack under modeling errors.
  • Full-state assumption: the method assumes access to accurate full-state histories; extensions to partial, noisy, and biased observations (e.g., visual-inertial, GPS outages) are not developed or evaluated.
  • Sensing realism: training uses noise-free simulator states; the impact of realistic sensor noise, latency, bias, dropout, and estimator lag on learned dynamics and control performance is not quantified.
  • Actuation mismatch: training and prober operate on rotor-force actions, while deployment issues collective thrust and body-rate commands to PX4; the mapping between these action spaces and its effect on sim-to-real transfer is not specified or validated.
  • Time-scale mismatch: the dataset is resampled at 20 Hz and the model is trained with T=20 (≈1 s at 20 Hz), whereas control targets ≥100 Hz; how multi-rate execution (substepping, interpolation, or holding) affects prediction fidelity and control is not analyzed.
  • Disturbance modeling: domain randomization omits external disturbances such as steady winds, gusts, ground effect, and turbulence; robustness under these conditions and the benefits of randomizing them remain untested.
  • Actuator/supply non-idealities: ESC dynamics, voltage sag, thermal effects, saturation, and motor failures are not modeled or randomized; robustness to such real-world non-idealities is unknown.
  • Coverage of aggressive regimes: GP-generated trajectories may under-excite high angular rates, accelerations, and fast transients; it is unclear whether the dataset adequately covers aggressive flight envelopes (e.g., flips, high-speed turns).
  • Domain randomization design: parameter ranges (e.g., ±50% mass, ±50% thrust/torque) and inter-parameter correlations are not justified; potential for unrealistic combinations and their impact on representation learning is unaddressed.
  • Prober expressivity and identifiability: the residual kinematic prober (Δv̇ and K_t a_t) is simple and may not capture velocity- and orientation-dependent aerodynamics or gyroscopic couplings; identifiability across randomized domains and limits of this parameterization are not studied.
  • Latent-to-metric consistency: the two-stage training freezes latents before prober fitting; potential mismatch between latent dynamics and prober assumptions (and resulting suboptimal control) is not quantified.
  • Uncertainty-awareness: the model provides point predictions without epistemic/aleatoric uncertainty; how to incorporate uncertainty into MPPI (e.g., ensembles, probabilistic latents) to improve safety and robustness is left open.
  • Hyperparameter sensitivity: no ablations for history length H, horizon T, latent dimension, λ_sig, number of SIGReg projections M, MPPI sample count S, or temperature λ; operational robustness to these choices is unknown.
  • Anti-collapse alternatives: the benefit of SIGReg relative to other anti-collapse methods (e.g., VICReg, BYOL variants, variance/whitening penalties) is not compared; sensitivity to SIGReg settings is not reported.
  • Architecture choices: the impact of using GRUs/TCNs versus alternatives (e.g., Transformers, linear state-space models, continuous-time models) for long-horizon prediction and real-time compute is not explored.
  • Safety and constraints: only action clamping is used; explicit state and actuator constraints, tilting/acceleration limits, and geofencing are not enforced or guaranteed; integration of constrained MPPI/barrier functions remains open.
  • Failure modes: no systematic analysis of edge cases (strong winds, GPS dropouts, actuator degradation, extreme payloads); lack of diagnostics and recovery strategies under large distribution shifts.
  • Real-world scope: evaluations focus on nominal tracking, payload change, and propeller switching; performance under broader conditions (wind, temperature, battery states, terrain proximity/ground effect) and across larger payload ranges is not reported.
  • Multi-rate estimator–controller interaction: the effect of estimator latency and noise on MPPI sampling and latent rollouts (closed-loop delay margins) is not characterized.
  • Long-horizon planning limits: beyond ≈1 s prediction, trade-offs between horizon length, rollout fidelity, and control performance are not quantified.
  • Compute portability: results rely on Jetson Orin NX + TensorRT; feasibility on lower-power processors (e.g., microcontrollers, smaller Jetsons) and the compute–accuracy trade-off are unassessed.
  • Energy and thermal budget: the impact of onboard compute load on flight time, thermal throttling, and reliability is not measured.
  • Task generality: while decoupled from rewards, demonstrations are limited to trajectory tracking; applicability to other tasks (e.g., obstacle avoidance, payload transport with constraints, interaction) is not validated.
  • Platform generalization: claims of generalization across platform variations are limited to a single quadrotor class; transfer to different sizes/form factors (e.g., micro, heavy-lift) and to other robot types remains open.
  • Data–controller coupling: the prober is trained with action sequences “generated by the optimal control procedure,” yet it is unclear whether this induces distributional coupling and how it affects generalization to different controllers/costs.
  • Dataset reproducibility: detailed parameter ranges, correlations, seeds, and the release of code/data (simulator, randomization pipeline, trained models) are not specified, limiting reproducibility and external validation.

Practical Applications

Below is a structured synthesis of practical applications that follow directly from the paper’s findings and innovations (JEPA-style latent world model, physics-inspired prober for interpretable states, sampling-based control integration, and domain-randomized sim-to-real dataset pipeline).

Immediate Applications

  • Robust, zero-shot trajectory tracking for commercial quadrotors
    • Sector: Robotics, logistics, inspection, cinematography, agriculture
    • What it enables: More stable, long-horizon flight in wind and under hardware changes (payloads/propellers) without retuning; improved tracking for missions like package drop-offs, line/bridge inspections, crop surveys, and film shots
    • Tools/products/workflows: ROS2 package that plugs into PX4; onboard JEPA model (TensorRT optimized) + C++ MPPI controller; reference trajectory uploader; on-drone constraint enforcement via prober outputs (position/velocity/attitude limits)
    • Assumptions/dependencies: Full-state estimation available (e.g., GPS/IMU); Jetson Orin NX–class compute for real-time rollouts; operator compliance with airspace rules
  • Safer data acquisition via simulation-first training
    • Sector: Robotics R&D, drone manufacturers, service providers
    • What it enables: Large-scale, automated dataset generation with domain randomization to reduce risky/expensive flight tests; coverage of broad flight envelopes before any real-world sorties
    • Tools/products/workflows: Dataset synthesizer (Gaussian-process trajectory generator + NMPC/MPPI closed-loop tracker in sim); parameter randomization configs; MLOps scripts to export trained models
    • Assumptions/dependencies: Adequate simulator fidelity (rigid-body + drag + actuator delay) and realistic parameter ranges; compute resources for large-batch training
  • Interpretable, constraint-aware control using latent-to-state probing
    • Sector: Safety-critical robotics, compliance and certification teams
    • What it enables: Explicit access to physical states (position, velocity, attitude, angular rates) from latent rollouts for constraint checks, actuator limits, and safety margins inside the controller
    • Tools/products/workflows: Physics-inspired prober module with stop-gradient training; runtime constraint monitors using probed states; mission rule-checkers
    • Assumptions/dependencies: Prober accuracy within operating envelope; well-chosen state/control penalty weights; calibrated actuator/sensor bounds
  • Onboard “what-if” trajectory evaluation for field operators
    • Sector: Utilities, construction, public safety
    • What it enables: Rapid sampling and evaluation of candidate trajectories on embedded hardware to pick safer/more energy-efficient paths on site
    • Tools/products/workflows: MPPI sampling UI; mission preview and scoring (tracking error, effort) using learned dynamics; on-drone plan selection
    • Assumptions/dependencies: Embedded inference within ~10 ms control budget; reliable state estimates; mission-specific cost tuning
  • Faster platform bring-up and variations handling for OEMs
    • Sector: Drone manufacturing and integration
    • What it enables: Reduced hand-tuning for new SKUs or field variations (prop changes, payload mounts); quicker pilots, demos, and customer PoCs
    • Tools/products/workflows: Domain-randomization presets reflecting SKU variants; hardware-in-the-loop test harness; pre-trained model drop-in for new builds
    • Assumptions/dependencies: Parameter ranges in training cover expected hardware variability; seamless PX4/ROS2 integration
  • Teaching and research modules for long-horizon world models
    • Sector: Academia (controls, ML, robotics)
    • What it enables: Course projects and labs on JEPA-style dynamics, anti-collapse regularization (SIGReg), physics-informed probing, and sim-to-real validation
    • Tools/products/workflows: Open-source training scripts; benchmark datasets; metric suite (compounding ratio and error growth); ablation templates
    • Assumptions/dependencies: Access to GPU for training; standard simulators (Gazebo/AirSim) or provided sim
  • Off-the-shelf upgrade path for prosumer drones
    • Sector: Consumer/prosumer UAVs, small businesses
    • What it enables: More robust tracking under wind and minor hardware changes with minimal user tuning; safer roof/facade/asset inspections
    • Tools/products/workflows: Firmware add-on/app enabling JEPA controller mode; model profiler to auto-select horizon T and sample count S per device
    • Assumptions/dependencies: Sufficient onboard compute or co-processor; reliable GNSS or alternative state estimation; regulatory compliance
  • Evidence and metrics to inform safety reviews
    • Sector: Policy, regulators, certification bodies
    • What it enables: Adoption of long-horizon stability metrics (compounding ratio, error growth) and interpretable prober outputs to assess ML-based flight controllers
    • Tools/products/workflows: Standardized test protocols using domain-randomized scenarios; report templates quantifying long-horizon fidelity and constraint violations
    • Assumptions/dependencies: Agreement on benchmark tasks/envelopes; test site access; openness of model artifacts for audit

Long-Term Applications

  • BVLOS autonomy with resilient world models
    • Sector: Logistics, infrastructure, environmental monitoring, public safety
    • What it could enable: Robust beyond-visual-line-of-sight operations under varying winds/hardware wear, with reduced revalidation effort
    • Tools/products/workflows: Fleet-level validation in domain-randomized digital twins; automated mission cost tuning per geography; onboard anomaly detection from residuals
    • Assumptions/dependencies: Regulatory approval; high-fidelity digital twins; reliable comms and detect-and-avoid stack
  • GPS-denied or vision-in-the-loop control
    • Sector: Warehousing, indoor inspection, mining, disaster response
    • What it could enable: Replace “full-state input” assumption with onboard perception (cameras/LiDAR) by extending JEPA to multimodal observations
    • Tools/products/workflows: JEPA encoders for images/LiDAR; learned latent fusion; robust prober trained with self-supervised/SLAM priors
    • Assumptions/dependencies: Perception model robustness; tight latency budget; lighting/dust tolerances
  • Cross-platform adoption (VTOL, fixed-wing, legged/ground robots)
    • Sector: Robotics at large (mobility, last-mile, AGVs)
    • What it could enable: Transferable JEPA-style world models with physics-aware probing tailored to each platform’s kinematics
    • Tools/products/workflows: Prober libraries for different Lie groups (SO(3), SE(2/3)); platform-specific domain randomization suites; unified MPPI interfaces
    • Assumptions/dependencies: Correct physical priors per platform; sufficient sim coverage; controller harmonization
  • Swarm planning with sampled latent rollouts
    • Sector: Defense, environmental sensing, agriculture
    • What it could enable: Decentralized “what-if” coordination where each agent predicts long-horizon outcomes under interaction constraints
    • Tools/products/workflows: Multi-agent MPPI with collision/communication constraints in cost; distributed inference optimization
    • Assumptions/dependencies: Scalable onboard compute; reliable inter-agent comms; safe emergent behavior guarantees
  • Formal safety layers leveraging interpretable latent-to-state maps
    • Sector: Certification, insurance, mission assurance
    • What it could enable: Runtime verification and barrier certificates using probed states over predicted horizons
    • Tools/products/workflows: Formal methods toolchain tied to the prober; monitor-act architecture to override unsafe samples; audit logs
    • Assumptions/dependencies: Verified bounds on model error; compositional proofs with learned components
  • Predictive maintenance and health monitoring
    • Sector: Fleet operations, OEM support, leasing/insurance
    • What it could enable: Detect subtle drift (e.g., motor degradation, prop damage) from changes in required residuals (Δv̇, K_t) and control effort trends
    • Tools/products/workflows: Residual trend analytics; alert thresholds; maintenance scheduling recommender
    • Assumptions/dependencies: Stable baselines; labeled incident data; robust change-point detection
  • Energy-aware and endurance-optimized planning
    • Sector: Logistics, long-duration monitoring
    • What it could enable: Sampling-based control that explicitly trades tracking error vs energy use using learned dynamics
    • Tools/products/workflows: Cost shaping with battery model; horizon adaptation by SOC; mission segmentation for loitering/cruising
    • Assumptions/dependencies: Accurate energy/thrust models; battery health estimation; environment forecasts (wind)
  • Cloud-edge training and continuous improvement pipelines
    • Sector: Robotics SaaS, platform providers
    • What it could enable: Periodic re-training with synthetic + curated real-world logs; OTA model updates validated in domain-randomized twins
    • Tools/products/workflows: Data versioning; domain randomization governance; A/B flight trials; rollback mechanisms
    • Assumptions/dependencies: Secure OTA; telemetry bandwidth; strong validation gates to prevent regressions
  • Standardized sim-to-real benchmarks and competitions
    • Sector: Academia, industry consortia
    • What it could enable: Shared datasets and tasks focused on long-horizon stability and zero-shot transfer; drive progress beyond autoregressive baselines
    • Tools/products/workflows: Open benchmark suite (GP trajectories, domain sets, metrics like CR/ER); leaderboards
    • Assumptions/dependencies: Community buy-in; fair and reproducible protocols
  • Regulatory frameworks for ML-based flight controllers
    • Sector: Policy and standards
    • What it could enable: Certification paths that recognize interpretable world models and long-horizon stability evidence
    • Tools/products/workflows: Compliance checklists (data generation, domain coverage, safety monitors); simulator-based proving grounds
    • Assumptions/dependencies: Engagement with authorities; harmonization across regions; liability clarity
  • Integrated digital twins for site-specific mission rehearsal
    • Sector: Energy, construction, municipalities
    • What it could enable: Pre-deployment rehearsal under site-specific wind/topography with domain randomization to bound risk
    • Tools/products/workflows: Parametrized local twins; mission template library; evidence packages for stakeholders
    • Assumptions/dependencies: Site data availability; twin fidelity; change management processes
  • General-purpose world-model toolkits for control research
    • Sector: Software/AI tooling
    • What it could enable: Reusable JEPA + prober stacks for diverse control domains with plug-and-play encoders and SIGReg-style regularizers
    • Tools/products/workflows: Modular libraries; recipe catalogs (hyperparameters, history windows, horizons); visualization for latent isotropy and rollout stability
    • Assumptions/dependencies: Broad community testing; documentation; maintenance and support

Notes on common dependencies and risks across applications:

  • Sensor quality and state estimation accuracy are critical; current results assume reliable full-state inputs (GPS/IMU). GPS-denied scenarios require additional perception work.
  • Real-time constraints typically need embedded GPUs (e.g., Jetson Orin NX) and careful tuning of horizon T and sample count S to stay under control-loop budgets.
  • Domain-randomization coverage must reflect real-world variability; gaps can degrade zero-shot transfer.
  • Weather, wind gusts, and electromagnetic interference can introduce distribution shift; safety layers and operational constraints remain necessary.
  • Regulatory acceptance of ML-based controllers depends on interpretability, testing rigor, and evidence of long-horizon stability.
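
The trade-off between sample count S and the control-loop budget noted above can be sketched as a simple feasibility check. This is an illustrative sketch, not the paper's procedure: the function names, the 100 Hz loop rate, the 50% timing margin, and the linear latency model are all hypothetical placeholders. In practice the latency would be measured on the target hardware.

```python
def control_period_ms(rate_hz):
    """Duration of one control cycle in milliseconds."""
    return 1000.0 / rate_hz


def largest_feasible_samples(candidates, latency_ms, rate_hz, margin=0.5):
    """Return the largest sample count whose latency fits the budget, or None.

    latency_ms is a callable mapping a sample count S to an inference time
    in milliseconds (hypothetical here; measure it on the real device).
    margin is the fraction of the control period reserved for the planner.
    """
    budget = margin * control_period_ms(rate_hz)
    feasible = [s for s in candidates if latency_ms(s) <= budget]
    return max(feasible) if feasible else None


# Hypothetical latency model: 1 ms fixed overhead plus 0.004 ms per sample.
def fake_latency_ms(s):
    return 1.0 + 0.004 * s


# At an assumed 100 Hz loop with a 50% margin, the budget is 5 ms.
print(largest_feasible_samples([256, 512, 1024, 2048], fake_latency_ms, 100.0))  # 512
```

The same check can be repeated per horizon T, since rollout latency generally grows with both T and S.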

Glossary

  • actuator delay: A latency in the response of motors or actuators to control commands, often modeled as a first-order dynamic. "This requires capturing both the dominant rigid-body dynamics and difficult-to-model effects such as aerodynamic drag, actuator delay, propeller--airframe interactions, wind disturbances, and hardware variations."
  • aerodynamic drag: The resistive force opposing motion of the vehicle through air, affecting translational dynamics. "This requires capturing both the dominant rigid-body dynamics and difficult-to-model effects such as aerodynamic drag, actuator delay, propeller--airframe interactions, wind disturbances, and hardware variations."
  • ancestral sampling: A procedure for drawing samples from a structured probabilistic model by sampling each variable in dependency order, conditioned on its already-sampled parents; used here to obtain physically plausible parameter combinations. "Instead, each rollout is collected from a randomized quadrotor model using a similar ancestral sampling approach described in \cite{eschmann2026raptor}."
  • anti-collapse regularization: A regularization strategy to prevent learned representations from collapsing to trivial (e.g., constant) solutions. "We therefore utilize an anti-collapse regularization term."
  • autoregressive: A modeling approach where predictions are fed back as inputs for subsequent predictions, often causing error accumulation over long horizons. "Neural network dynamics models are attractive for capturing complex nonlinear effects, but existing predictive approaches struggle with long-horizon forecasting because their autoregressive rollout mechanism amplifies errors over time."
  • characteristic function: A complex-valued function that uniquely defines a probability distribution, used here to compare latent projections to a Gaussian. "denote the empirical characteristic function of the projected samples, and let $\phi_0(t)$ denote the characteristic function of the standard Gaussian $\mathcal{N}(0,1)$."
  • differential flatness: A system property enabling trajectory generation by specifying a set of flat outputs and their derivatives. "and then use differential flatness~\cite{mellinger2011minimum} to obtain the full quadrotor reference."
  • distribution shift: A change between training and deployment data distributions that can degrade model performance. "online methods adapt the model during deployment to account for changing dynamics or distribution shift~\cite{fu2016one, saviolo2023active, wang2018safe, lew2022safe, o2022neural, jiahao2023online}."
  • domain randomization: Training with randomized simulator parameters to improve robustness and transfer to real-world variations. "we propose a domain-randomized simulation pipeline for automated dataset generation, reducing the need for extensive and potentially unsafe real-world data collection."
  • Epps--Pulley test statistic: A univariate goodness-of-fit statistic for normality, based on the empirical characteristic function; used here on projected latent variables. "For each projection, we evaluate the univariate Epps--Pulley test statistic, measuring the distribution mismatch."
  • exocentric simulation: Simulation from an external viewpoint rather than from the robot’s onboard perspective, often used in vision-centric studies. "with several demonstrated only on toy problems or in exocentric simulation without real-world validation~\cite{yin2026ddp, maes2026leworldmodel, sobal2025learning}."
  • exponential map: A mapping from angular velocities (in Lie algebra) to rotations (on the Lie group), preserving manifold structure in attitude updates. "This compact integrator preserves the geometric structure of the attitude dynamics through the $\mathrm{SO}(3)$ exponential map, while allowing the latent representation to correct for unmodeled translational and rotational dynamics effects."
  • exponential moving averages: A smoothing technique that tracks running averages with exponential decay, often used in self-supervised training. "This is in contrast to many self-supervised representation learning objectives that require balancing several regularization terms~\cite{bardes2021vicreg, sobal2025learning}, stop-gradient~\cite{grill2020bootstrap} design choices, exponential moving averages, or reconstruction weights."
  • first-order motor delay: A simple dynamic model of actuator lag where the response follows a first-order system. "Data collection is performed in simulation using standard quadrotor rigid-body dynamics~\cite{song2021flightmare}, with aerodynamic drag and first-order motor delay included."
  • flight envelope: The range of operating conditions (states and inputs) within which the vehicle can safely and effectively operate. "The data collection process must [be] informative (e.g., efficiently covering the overall flight envelope) and sample efficient and therefore covering a diverse range of states, control inputs, speeds, accelerations, and operating conditions."
  • Gaussian kernel: A weighting function based on the Gaussian (normal) distribution, commonly used in kernel-based statistics. "where $w(t)$ is a weighting function, typically chosen as a Gaussian kernel."
  • Gaussian process: A nonparametric Bayesian model defining distributions over functions, used here to generate smooth reference trajectories. "A Gaussian process, denoted by $\mathcal{GP}(0,k_j)$, defines a distribution over smooth functions with zero mean and covariance kernel $k_j(t,t')$."
  • GRU predictor: A Gated Recurrent Unit-based recurrent neural network used to model temporal latent dynamics. "Latent dynamics are modeled using a single-layer GRU predictor~\cite{pmlr-v37-chung15} with hidden dimension $24$"
  • importance weights: Weights assigned to sampled trajectories based on their costs, used to update control sequences in MPPI. "Following the MPPI formulation~\cite{williams2017information}, we compute importance weights"
  • inertia matrix: A 3x3 matrix characterizing the rotational inertia of the vehicle about its principal axes. "where $m$ is the mass, $\mathbf{D}$ is the drag matrix, $\mathbf{J}$ is the inertia matrix, $\alpha$ is the motor time constant, $k_f$ and $k_{\tau}$ are the thrust and torque coefficients, and $l$ is the arm length."
  • isotropic Gaussian distribution: A multivariate normal distribution with identical variance in all directions and zero covariance. "We employ Sketched Isotropic Gaussian Regularization (SIGReg)~\cite{balestriero2025lejepa}, which encourages the latent embeddings to match an isotropic Gaussian distribution."
  • Joint Embedding Predictive Architectures (JEPAs): Models that predict future embeddings in a latent space instead of reconstructing observations. "Joint Embedding Predictive Architectures (JEPAs) offer a compelling alternative by modeling dynamics in latent space"
  • kinematic model: A dynamics-agnostic model describing motion relationships (e.g., positions, velocities) without forces, used here as a differentiable structure. "We therefore introduce a physics-inspired probing mechanism that maps latent rollouts to interpretable state trajectories through a differentiable kinematic model."
  • latent dynamics: The evolution of a compact, learned representation of the system state over time. "We introduce the JEPA-style model for real-time quadrotor control. The proposed approach combines a latent dynamics model with a novel physics-inspired prober"
  • length scale: A kernel hyperparameter controlling the smoothness or variability of functions in Gaussian processes. "Each axis uses three periodic components: the first has length scale $1.3$, while the remaining two have length scales $3.0$ and $4.0$;"
  • Model Predictive Path Integral (MPPI): A sampling-based optimal control algorithm that uses path integral principles to update control sequences. "We integrate our learned dynamics model within a sampling-based optimization framework, MPPI, where future action sequences are optimized using Monte Carlo sampling."
  • Model-Based Reinforcement Learning (MBRL): Reinforcement learning approaches that use a learned dynamics model for planning and control. "Model-Based Reinforcement Learning (MBRL) provides an alternative route by jointly learning a world model and a task-driven policy from interaction data"
  • Nonlinear Model Predictive Control (NMPC): An optimization-based control method that solves finite-horizon nonlinear control problems online. "where $\pi_{\mathrm{track}}$ is implemented using a combination of nominal Nonlinear Model Predictive Control (NMPC) and Model Predictive Path Integral (MPPI)~\cite{williams2017information}."
  • periodic kernel: A kernel function that models periodic structure in data, used within Gaussian processes. "Each kernel is chosen as a sum of periodic kernels with different characteristic length scales and periods."
  • physics-inspired prober: A learned module that maps latent trajectories to interpretable physical states by incorporating simple physics structure. "The proposed approach combines a latent dynamics model with a novel physics-inspired prober that maps frozen latents to interpretable state"
  • quadrature: Numerical integration techniques used to approximate integrals, here for the test statistic computation. "In practice, the integral in $T^{(m)}$ is evaluated numerically using quadrature, following~\cite{balestriero2025lejepa}."
  • quadrotor rigid-body dynamics: The six-degree-of-freedom equations of motion governing a four-rotor aerial vehicle. "Data collection is performed in simulation using standard quadrotor rigid-body dynamics~\cite{song2021flightmare}, with aerodynamic drag and first-order motor delay included."
  • receding-horizon: A control strategy that repeatedly solves a finite-horizon problem and applies only the first control input. "Only the first action $\mathbf{a}^{\mathrm{nom}}_{0}$ is executed, and the procedure is repeated in a receding-horizon fashion."
  • residual learning: Learning corrections to a nominal model rather than modeling the full dynamics from scratch. "Prior offline approaches have explored residual learning over nominal models~\cite{bauersfeld2021neurobem, kulathunga2024residual}"
  • residual translational acceleration: An additive learned correction to nominal translational acceleration to account for unmodeled effects. "Here, $\Delta \dot{\mathbf{v}}_{t+k} \in \mathbb{R}^3$ represents a residual translational acceleration,"
  • rotation matrix: A 3x3 orthonormal matrix representing 3D orientation. "The attitude is represented by the rotation matrix"
  • ROS2: A robotics middleware framework for distributed communication and real-time control. "with the full software stack integrated through ROS2."
  • sampling-based optimization: Optimization methods that explore candidate solutions via random sampling rather than gradient-based updates. "we demonstrate that by integrating the learned latent dynamics model within a sampling-based optimization framework, we exploit the predictive capabilities of the learned model for real-time control"
  • SIGReg (Sketched Isotropic Gaussian Regularization): A regularizer that matches random projections of embeddings to a standard normal to prevent collapse and encourage isotropy. "We employ Sketched Isotropic Gaussian Regularization (SIGReg)~\cite{balestriero2025lejepa}, which encourages the latent embeddings to match an isotropic Gaussian distribution."
  • sim-to-real transfer: The process of deploying models trained in simulation on real-world systems; in the zero-shot setting used here, this is done without any real-world fine-tuning. "Extensive open-loop and outdoor closed-loop experiments demonstrate accurate prediction, robust zero-shot sim-to-real transfer, and strong generalization across diverse operating conditions."
  • SO(3): The special orthogonal group of 3x3 rotation matrices representing 3D orientations. "$\mathbf{R}_t = \begin{bmatrix} \mathbf{r}_{x,t} & \mathbf{r}_{y,t} & \mathbf{r}_{z,t} \end{bmatrix} \in \mathrm{SO}(3)$"
  • stop-gradient: A training operation that prevents gradients from flowing through certain tensors, decoupling objectives. "During prober training, a stop-gradient operation is applied to the predicted latent embeddings before they are passed to $\psi$."
  • Temporal Convolutional Networks (TCNs): Causal convolutional architectures for sequence modeling with receptive fields over time. "Both $\mathrm{Enc}_\theta$ and $\mathrm{Enc}_\phi$ are implemented as Temporal Convolutional Networks (TCNs)~\cite{lea2017temporal}"
  • TensorRT: NVIDIA’s inference optimization and runtime framework for accelerating neural network deployment. "The learned PyTorch latent dynamics model is exported and optimized using NVIDIA TensorRT for accelerated inference on the NVIDIA Jetson Orin NX."
  • thrust coefficient: A parameter relating motor input to produced thrust in the propellers. "$k_f$ and $k_{\tau}$ are the thrust and torque coefficients,"
  • torque coefficient: A parameter relating motor input to reaction torque generated by propellers. "$k_f$ and $k_{\tau}$ are the thrust and torque coefficients,"
  • visitation distribution: The distribution of states visited by a policy or controller, which can bias learned models. "the learned model is often coupled to a particular reward, policy, or visitation distribution."
  • Monte Carlo sampling: A stochastic technique using random samples to approximate solutions, here for control sequence optimization. "We integrate our learned dynamics model within a sampling-based optimization framework, MPPI, where future action sequences are optimized using Monte Carlo sampling."
  • temperature parameter: A scaling factor in softmax weighting that controls the sharpness of probability assignments. "where $\mathcal{J}_{\min} = \min_s \mathcal{J}^{(s)}$ and $\lambda>0$ is the temperature parameter."
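
The importance weights, Monte Carlo sampling, and temperature parameter entries above fit together in one small computation. The following is a minimal sketch, assuming the standard softmax form of the MPPI weights, $w^{(s)} \propto \exp(-(\mathcal{J}^{(s)} - \mathcal{J}_{\min})/\lambda)$; the exact expression is not quoted in this summary, so treat that form as an assumption. The function names and the scalar-per-step control representation are hypothetical simplifications.

```python
import math


def mppi_weights(costs, lam):
    """Softmax-style importance weights over sampled rollout costs.

    costs: one total cost J^(s) per sampled control sequence.
    lam:   temperature (lambda > 0); smaller values concentrate the weight
           on the lowest-cost samples.
    """
    if lam <= 0:
        raise ValueError("temperature must be positive")
    j_min = min(costs)  # shifting by J_min makes the best sample's term exp(0) = 1
    unnorm = [math.exp(-(j - j_min) / lam) for j in costs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def weighted_update(nominal, perturbations, weights):
    """Shift a nominal control sequence by the weighted mean perturbation.

    nominal:       list of scalar controls, one per time step (simplified).
    perturbations: one perturbation sequence per sample, same length as nominal.
    """
    return [
        nominal[t] + sum(w * eps[t] for w, eps in zip(weights, perturbations))
        for t in range(len(nominal))
    ]


weights = mppi_weights([3.0, 1.0, 2.0], lam=1.0)
updated = weighted_update([0.0, 0.0], [[1.0, 1.0], [-1.0, 0.5], [0.0, 0.0]], weights)
```

In a receding-horizon loop, only the first element of the updated sequence would be applied before resampling at the next step.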
