RAPTOR: A Foundation Policy for Quadrotor Control (2509.11481v1)

Published 15 Sep 2025 in cs.RO, cs.AI, and cs.LG

Abstract: Humans are remarkably data-efficient when adapting to new unseen conditions, like driving a new car. In contrast, modern robotic control systems, like neural network policies trained using Reinforcement Learning (RL), are highly specialized for single environments. Because of this overfitting, they are known to break down even under small differences like the Simulation-to-Reality (Sim2Real) gap and require system identification and retraining for even minimal changes to the system. In this work, we present RAPTOR, a method for training a highly adaptive foundation policy for quadrotor control. Our method enables training a single, end-to-end neural-network policy to control a wide variety of quadrotors. We test 10 different real quadrotors from 32 g to 2.4 kg that also differ in motor type (brushed vs. brushless), frame type (soft vs. rigid), propeller type (2/3/4-blade), and flight controller (PX4/Betaflight/Crazyflie/M5StampFly). We find that a tiny, three-layer policy with only 2084 parameters is sufficient for zero-shot adaptation to a wide variety of platforms. The adaptation through In-Context Learning is made possible by using a recurrence in the hidden layer. The policy is trained through a novel Meta-Imitation Learning algorithm, where we sample 1000 quadrotors and train a teacher policy for each of them using Reinforcement Learning. Subsequently, the 1000 teachers are distilled into a single, adaptive student policy. We find that within milliseconds, the resulting foundation policy adapts zero-shot to unseen quadrotors. We extensively test the capabilities of the foundation policy under numerous conditions (trajectory tracking, indoor/outdoor, wind disturbance, poking, different propellers).

Summary

  • The paper presents a compact GRU-based recurrent policy that generalizes to unseen quadrotor platforms through zero-shot adaptation.
  • It employs meta-imitation learning with a dual-phase training pipeline, using 1000 teacher policies to distill robust control behavior.
  • Experimental results validate RAPTOR's emergent system identification and its reliable performance across diverse real-world and simulated conditions.

RAPTOR: A Foundation Policy for Quadrotor Control

Overview and Motivation

The RAPTOR framework introduces a highly adaptive, end-to-end neural network policy for quadrotor control, designed to generalize across a wide spectrum of quadrotor platforms and dynamic conditions. The central innovation is the training of a single, compact recurrent policy capable of zero-shot adaptation to unseen quadrotors, leveraging in-context learning via recurrence. This approach addresses the limitations of conventional RL-based controllers, which typically overfit to specific platforms and require retraining or explicit system identification for even minor hardware changes (Figure 1).

Figure 1: (A) Motivation—comparison of adaptation capabilities between humans, RL-based policies, and RAPTOR; (B) RAPTOR architecture overview.

Methodology

Probabilistic Formulation and Architecture

Quadrotor control is formalized as a Bayes Adaptive POMDP, with the RAPTOR policy derived from probabilistic graphical modeling principles. The architecture consists of a three-layer GRU-based recurrent neural network with only 2084 parameters, enabling deployment on resource-constrained microcontrollers while maintaining real-time inference capabilities (Figure 2).

Figure 2: (A) Bayesian network for quadrotor dynamics/control; (B) RAPTOR policy network architecture; (C) Illustration of emergent system identification via input/output reasoning.
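
To make the shape of the architecture concrete, the following is a minimal PyTorch sketch of a tiny recurrent control policy. The observation layout (an 18-dimensional state vector here) and the tanh activations are illustrative assumptions; the paper only specifies a three-layer GRU-based network with 2084 parameters and, per the scaling study, a hidden dimension of 16.

```python
import torch
import torch.nn as nn

class RaptorPolicySketch(nn.Module):
    """Minimal sketch of a tiny recurrent control policy.

    Layer sizes and the observation layout are illustrative assumptions,
    not the exact RAPTOR configuration; the paper reports a three-layer
    GRU-based network with 2084 parameters and a hidden dimension of 16.
    """

    def __init__(self, obs_dim: int = 18, hidden_dim: int = 16, act_dim: int = 4):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)   # input layer
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)   # recurrent hidden layer
        self.head = nn.Linear(hidden_dim, act_dim)      # motor-command output layer

    def forward(self, obs: torch.Tensor, h: torch.Tensor):
        x = torch.tanh(self.encoder(obs))
        h = self.gru(x, h)                 # hidden state carries the flight history
        action = torch.tanh(self.head(h))  # normalized per-motor commands
        return action, h


# Usage: the hidden state is carried across control steps, which is what
# enables in-context adaptation to the specific airframe being flown.
policy = RaptorPolicySketch()
h = torch.zeros(1, 16)
obs = torch.zeros(1, 18)
action, h = policy(obs, h)
```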

Domain Randomization and Sampling

A physically plausible, factorized distribution over quadrotor dynamics parameters is constructed, covering mass, geometry, inertia, thrust curves, torque coefficients, and motor delays. Ancestral sampling is used to efficiently generate diverse quadrotor instances for training, ensuring broad coverage of real-world platforms (Figure 3).

Figure 3: Probabilistic graphical model for ancestral sampling of quadrotors.
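
As an illustration of ancestral sampling from a factorized distribution, here is a minimal sketch. The factorization order and all numerical ranges below are assumptions chosen for illustration (only the roughly 32 g–2.4 kg mass span and 1.75–12 thrust-to-weight range echo figures reported elsewhere in the summary); they are not the paper's actual training distributions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_quadrotor():
    """Ancestral sampling of one quadrotor instance from a factorized
    distribution. All ranges and the factorization order are illustrative
    assumptions, not the paper's exact values."""
    q = {}
    # Root node: mass (log-uniform to span small and large platforms).
    q["mass_kg"] = float(np.exp(rng.uniform(np.log(0.03), np.log(2.5))))
    # Geometry conditioned on mass: heavier platforms tend to be larger.
    q["arm_length_m"] = float(0.05 + 0.1 * q["mass_kg"] ** (1 / 3) + rng.normal(0, 0.01))
    # Inertia conditioned on mass and geometry (crude point-mass approximation).
    q["inertia_xx"] = q["mass_kg"] * q["arm_length_m"] ** 2 / 2
    # Thrust-to-weight ratio, then per-motor maximum thrust derived from it.
    q["thrust_to_weight"] = float(rng.uniform(1.75, 12.0))
    q["max_thrust_per_motor_N"] = q["mass_kg"] * 9.81 * q["thrust_to_weight"] / 4
    # Torque coefficient and first-order motor delay.
    q["torque_coeff"] = float(rng.uniform(0.005, 0.05))
    q["motor_delay_s"] = float(rng.uniform(0.01, 0.15))
    return q

# Draw a diverse population of training quadrotors.
population = [sample_quadrotor() for _ in range(1000)]
```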

Meta-Imitation Learning

The training pipeline is divided into two phases:

  1. Pre-Training: 1000 teacher policies are trained via RL, each specialized for a sampled quadrotor. Teachers are overparameterized for robust convergence and observe full state information.
  2. Meta-Imitation Learning: The behaviors of all teachers are distilled into a single student policy. The student, lacking explicit knowledge of the system parameters, must infer the relevant dynamics from observation-action histories. On-policy imitation learning is employed, minimizing the mean squared error (MSE) between student and teacher actions (Figure 4); a sketch of this distillation step is given after the figure below.

    Figure 4: Meta-Imitation Learning algorithm schematic.
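
The following is a minimal sketch of one on-policy distillation step. The `student`, `teachers`, and `envs` objects are hypothetical stand-ins (a recurrent student policy, per-quadrotor privileged teachers, and matching simulators), and the rollout and batching structure is an assumption rather than the paper's exact algorithm.

```python
import torch
import torch.nn.functional as F

def meta_imitation_step(student, teachers, envs, optimizer, horizon=100):
    """One on-policy distillation step over a batch of sampled quadrotors.

    `student`, `teachers[i]`, and `envs[i]` are hypothetical objects: the
    student is a recurrent policy, each teacher is a fully observed RL
    policy for quadrotor i, and each env simulates that quadrotor.
    This is a sketch of the idea, not the paper's implementation.
    """
    losses = []
    for env, teacher in zip(envs, teachers):
        obs = env.reset()
        h = student.initial_hidden()
        for _ in range(horizon):
            # The student acts from its observation history (on-policy rollout).
            student_action, h = student(obs, h)
            # The teacher labels the same state using privileged full-state access.
            with torch.no_grad():
                teacher_action = teacher(env.full_state())
            losses.append(F.mse_loss(student_action, teacher_action))
            # The simulator is stepped with the *student's* action.
            obs = env.step(student_action)
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```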

Experimental Results

Training Dynamics and Scaling

Pre-training reliably converges for all 1000 teachers, with robust episode lengths achieved after 100k steps. Meta-Imitation Learning enables the student policy to generalize to unseen quadrotors, with performance converging after ~1000 epochs. Scaling studies reveal that a hidden dimension of 16 suffices for high performance, and increasing the number of teachers improves generalization (Figure 5).

Figure 5: (A) Pre-training learning curve; (B) Meta-imitation learning curve; (C) Pareto frontier: performance vs. number of teachers; (D) Pareto frontier: performance vs. policy size.

Emergent System Identification

The RAPTOR policy demonstrates emergent, implicit system identification. Linear probing of the latent state reveals strong predictive power for the thrust-to-weight ratio (R² = 0.949, MSE = 0.047), indicating that the policy encodes the relevant dynamics in its hidden state through in-context learning (Figure 6).

Figure 6: Recovery from adverse initial condition; latent state trajectory and linear probe for system identification.
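
A linear probe of this kind can be sketched as follows. The arrays are random placeholders standing in for recorded GRU latent states and the corresponding true thrust-to-weight ratios; the plain least-squares readout and the 80/20 split are assumptions, not the paper's exact protocol.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

# Hypothetical data: hidden states collected after a short rollout on each
# sampled quadrotor, paired with that quadrotor's true thrust-to-weight ratio.
# Shapes are assumptions; the paper probes the GRU's 16-D latent state.
hidden_states = np.random.randn(1000, 16)                     # placeholder latents
thrust_to_weight = np.random.uniform(1.75, 12.0, size=1000)   # placeholder targets

# Fit a linear probe: if the latent state encodes the parameter, a simple
# linear readout should predict it well on held-out quadrotors.
split = 800
probe = LinearRegression().fit(hidden_states[:split], thrust_to_weight[:split])
pred = probe.predict(hidden_states[split:])
print("R^2:", r2_score(thrust_to_weight[split:], pred))
print("MSE:", mean_squared_error(thrust_to_weight[split:], pred))
```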

Real-World and Simulated Deployment

RAPTOR is deployed on 10 real quadrotors and 2 simulators, spanning a wide range of weights (32 g–2.4 kg), thrust-to-weight ratios (1.75–12), motor types, frame rigidities, and flight controllers. The policy adapts zero-shot to both in-distribution and out-of-distribution platforms, including flexible frames and mixed propeller configurations (Figure 7).

Figure 7: Diverse set of 10 real and 2 simulated quadrotors used in experiments.

Trajectory Tracking and Robustness

Trajectory tracking experiments show that RAPTOR achieves tracking RMSE comparable to state-of-the-art dedicated policies, with robust performance across all platforms. The policy generalizes to longer context windows and maintains repeatable performance over extended flights (Figure 8).

Figure 8: Trajectory tracking results for all quadrotors.

Disturbance Recovery and Adaptation

RAPTOR exhibits rapid recovery from aggressive initial states, wind disturbances, physical poking, and payload changes. The policy adapts to mixed propeller configurations and maintains stable flight under significant perturbations (Figure 9).

Figure 9: RAPTOR policy performance under various disturbances and configurations.

Computational Considerations

The separation of pre-training and meta-imitation learning enables embarrassingly parallel training, with pre-training distributed across multiple cores and meta-imitation learning requiring orders of magnitude less compute. The compact policy size allows deployment on microcontrollers with less than 10% CPU utilization at high control frequencies.

Theoretical and Practical Implications

RAPTOR demonstrates that a small, recurrent neural policy can achieve robust, zero-shot adaptation to a wide range of quadrotor platforms, challenging the notion that end-to-end neural policies are fundamentally limited by Sim2Real gaps. The emergent system identification in the latent state suggests that meta-learning via in-context reasoning is a viable alternative to explicit system identification or domain randomization.

The framework's reproducibility, open-source codebase, and ease of integration into existing flight controllers position RAPTOR as a strong baseline for future research in adaptive robotic control.

Future Directions

Potential avenues for extension include:

  • Incorporating reward function variability for broader task generalization.
  • Scaling to more complex aerial vehicles and multi-agent scenarios.
  • Integrating trajectory lookahead for improved agile tracking.
  • Exploring attention-based architectures for longer context windows.

Conclusion

RAPTOR establishes a principled, practical approach for training foundation policies in quadrotor control, achieving robust zero-shot adaptation, emergent system identification, and efficient deployment. The results suggest that meta-imitation learning with broad domain randomization and recurrence is a powerful paradigm for adaptive control in robotics, with significant implications for both theory and real-world applications.
