Simple but Stable, Fast and Safe: Achieve End-to-end Control by High-Fidelity Differentiable Simulation

Published 12 Apr 2026 in cs.RO | (2604.10548v2)

Abstract: Obstacle avoidance is a fundamental vision-based task essential for enabling quadrotors to perform advanced applications. When planning a trajectory, existing optimization-based and learning-based approaches typically treat the quadrotor as a point mass, issuing path or velocity commands that an outer-loop controller then tracks. However, at high speeds, planned trajectories sometimes become dynamically infeasible in actual flight, exceeding the capacity of the controller. In this paper, we propose a novel end-to-end policy that directly maps depth images to low-level bodyrate commands by reinforcement learning via differentiable simulation. The high-fidelity simulation used in training, after parameter identification, significantly reduces the gaps between training, simulation, and the real world. The analytical process of differentiable simulation provides accurate gradients, so the low-level policy can be trained efficiently without expert guidance. The policy employs a lightweight inference pipeline, the simplest possible, that runs without explicit mapping, backbone networks, primitives, recurrent structures, or backend controllers, and without curriculum or privileged guidance. By sending low-level commands directly to the hardware controller, the method enables full flight envelope control and avoids the dynamic infeasibility issue. Experimental results demonstrate that the proposed approach achieves the highest success rate and the lowest jerk among state-of-the-art baselines across multiple benchmarks. The policy also exhibits strong generalization, deploying zero-shot in unseen outdoor environments while reaching speeds of up to 7.5 m/s and flying stably in a super-dense forest. This work is released at https://github.com/Fanxing-LI/avoidance.


Authors (8)

Summary

The paper introduces an end-to-end pipeline mapping depth images directly to low-level CTBR commands using gradient-driven RL.
It integrates high-fidelity 6-DOF quadrotor dynamics with differentiable simulation to achieve robust, agile, and smooth collision avoidance.
Experimental results demonstrate near-perfect success rates, low trajectory jerk, and effective zero-shot transfer from simulation to hardware.

End-to-End Quadrotor Control via High-Fidelity Differentiable Simulation

Introduction

The paper "Simple but Stable, Fast and Safe: Achieve End-to-end Control by High-Fidelity Differentiable Simulation" (2604.10548) advances vision-based collision avoidance for agile quadrotors by formulating an end-to-end pipeline that directly maps depth images to collective thrust and bodyrate (CTBR) commands using RL within a high-fidelity differentiable simulation. The design eliminates trajectory representation, mapping, and auxiliary controllers, handling system dynamics, smoothness, and perception constraints in a unified optimization. The method obviates the need for complex architectures such as LSTM, backbones, or action primitives, yet achieves superior generalization, flight stability, and safety over a wide range of benchmarks in both simulation and hardware.

Figure 1: Overview of the proposed method—policy training with BPTT through a differentiable high-fidelity simulation precisely aligned to real-world control responses.

Methodological Contributions

The key technical contribution is the direct integration of high-fidelity 6-DOF quadrotor dynamics and collision-based supervision into the differentiable simulation backpropagation pipeline. This approach fully exposes the policy network to all relevant system state transitions, actuator delays, and sensory feedback (depth maps), enabling gradient-driven reinforcement learning with analytical accuracy rather than critic approximation.

Figure 2: Training and inference pipeline achieves true end-to-end CTBR command output, removing trajectory/velocity abstraction and closing the perception-to-controller loop.

The pipeline avoids simplified kinematics and decoupled control stages, which tend to introduce dynamic infeasibility and significant sim-to-real gaps in aggressive flight. It also uses a lightweight network (a 3-layer CNN followed by a 2-layer MLP feature aggregator) without temporal recurrence, which keeps onboard inference low-latency and resource-efficient.
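The last stage of such a network has to turn unbounded outputs into a bounded CTBR command. The sketch below shows one common way to do this with a tanh squash; it is an illustration only, not the authors' implementation. The function name `to_ctbr`, the mass-normalized thrust convention (m/s²), and every limit value are assumptions that do not come from the paper.

```python
import math

def to_ctbr(raw, hover_thrust=9.81, thrust_range=6.0, max_rate=(6.0, 6.0, 3.0)):
    """Map 4 unbounded network outputs to a bounded CTBR command.

    All limits are hypothetical placeholders:
      hover_thrust  mass-normalized thrust at hover, m/s^2
      thrust_range  maximum deviation from hover thrust, m/s^2
      max_rate      roll, pitch, yaw body-rate limits, rad/s
    """
    if len(raw) != 4:
        raise ValueError("expected 4 outputs: thrust, roll rate, pitch rate, yaw rate")
    squashed = [math.tanh(x) for x in raw]           # each value now lies in [-1, 1]
    thrust = hover_thrust + thrust_range * squashed[0]
    rates = tuple(m * s for m, s in zip(max_rate, squashed[1:]))
    return thrust, rates
```

With all-zero outputs this returns hover thrust and zero body rates, so an untrained policy starts near a neutral command. Because tanh is smooth, gradients can still flow through the squash during training.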

The reward structure incorporates velocity tracking, perception alignment, attitude stability, penalization of proximity to obstacles, and direct collision loss. Notably, the collision reward is implemented as a differentiable supervision over the predicted trajectory in the rollout, supporting robust and smooth avoidance behaviors.
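A per-step loss with these five terms could be assembled as in the following sketch. It is a planar simplification under stated assumptions: the weights, the hinge shapes, `safe_dist`, and `body_radius` are all hypothetical, since the summary names the terms but not their exact form or coefficients.

```python
import math

# Hypothetical weights: the source lists the terms, not their coefficients.
DEFAULT_WEIGHTS = {
    "track": 1.0, "align": 0.5, "attitude": 0.1, "proximity": 2.0, "collision": 10.0,
}

def step_loss(vel, target_vel, yaw, tilt, dist,
              weights=None, safe_dist=1.0, body_radius=0.2):
    """Return (total, terms) for one simulation step in a 2D toy setting.

    vel, target_vel  planar velocity and its target, m/s
    yaw              camera heading, rad
    tilt             tilt angle from level, rad
    dist             distance to the nearest obstacle, m
    """
    weights = DEFAULT_WEIGHTS if weights is None else weights
    speed = math.hypot(vel[0], vel[1])
    terms = {
        # velocity tracking: squared error against the commanded velocity
        "track": (vel[0] - target_vel[0]) ** 2 + (vel[1] - target_vel[1]) ** 2,
        # perception alignment: zero when the camera looks along the velocity
        "align": 0.0 if speed == 0.0
                 else 1.0 - math.cos(yaw - math.atan2(vel[1], vel[0])),
        # attitude stability: penalize large tilt
        "attitude": tilt ** 2,
        # proximity: smooth hinge that activates inside safe_dist
        "proximity": max(0.0, safe_dist - dist) ** 2,
        # collision: hinge that activates once the body radius is violated
        "collision": max(0.0, body_radius - dist),
    }
    total = sum(weights[name] * value for name, value in terms.items())
    return total, terms
```

Both obstacle terms are piecewise smooth in `dist`, which is what allows a collision signal of this kind to be differentiated along the rollout instead of being treated as a sparse terminal event.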

Figure 3: Multimodal representation supervision—gradient flow from geometrical (collision-based) reward is backpropagated through the differentiable simulation to graphical features extracted from images.

Analytical derivations and their PyTorch implementations allow backpropagation through time (BPTT) to run through a long, highly nonlinear computational graph composed of motor allocation, actuation delay, and high-frequency state updates, without exploding or vanishing gradients over modest horizons.
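To make the idea of BPTT through an actuation delay concrete, the following toy sketch backpropagates by hand through a 1D rollout in which the commanded acceleration passes through a first-order lag before it reaches the vehicle. It is not the paper's 6-DOF model and does not use PyTorch; the dynamics, the loss, and the parameters `dt`, `alpha`, `lam`, and `target` are all assumptions chosen for illustration.

```python
def rollout(cmds, dt=0.02, alpha=0.2):
    """Simulate a 1D point with a first-order actuator lag; return final position."""
    p = v = u = 0.0
    for c in cmds:
        u = u + alpha * (c - u)   # actuator lag: actual accel follows the command
        v = v + dt * u            # velocity update
        p = p + dt * v            # position update
    return p

def loss_and_grad(cmds, target=1.0, dt=0.02, alpha=0.2, lam=1e-3):
    """Loss = (p_T - target)^2 + lam * sum(c^2), with its gradient via manual BPTT."""
    p_T = rollout(cmds, dt, alpha)
    loss = (p_T - target) ** 2 + lam * sum(c * c for c in cmds)

    # Adjoints of the final state (p, v, u).
    gp, gv, gu = 2.0 * (p_T - target), 0.0, 0.0
    grads = [0.0] * len(cmds)
    for t in reversed(range(len(cmds))):
        gv = gv + dt * gp                  # p_next = p + dt * v_next
        gu = gu + dt * gv                  # v_next = v + dt * u_next
        grads[t] = alpha * gu + 2.0 * lam * cmds[t]   # u_next = (1-alpha)*u + alpha*c
        gu = (1.0 - alpha) * gu            # pass the adjoint back to the previous u
    return loss, grads

def gradient_step(cmds, lr=1.0):
    """One plain gradient-descent update on the command sequence."""
    _, grads = loss_and_grad(cmds)
    return [c - lr * g for c, g in zip(cmds, grads)]
```

The backward loop visits the steps in reverse order and applies the chain rule to each update, which is what an autograd framework does automatically. The factor `(1 - alpha)` applied at every step shows how a lag attenuates gradients over long horizons, one reason the horizon length matters in practice.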

Gradient Field Reshaping

A critical insight in the paper is the identification and mitigation of the "braking over avoidance" bias induced by Euclidean Signed Distance Field (ESDF) collision penalties in dense or complex clutter, which manifests as "death zones" in the gradient field. The authors apply a targeted transformation to the local gradient, adaptively rotating its direction laterally as a function of proximity and approach angle, so that the reshaped field promotes active avoidance rather than excessive braking.

Figure 4: ESDF map reshaping—transforming the gradient field from dominance by braking (red zones) to avoidance by adaptive lateral shifting of collision supervision.

This enables superior forward progress and stability, especially in super-dense scenarios, without sacrificing collision penalties.
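One way to picture the reshaping is the planar sketch below, which rotates the ESDF gradient toward the lateral direction when the vehicle is close to an obstacle and approaching it head-on. The rotation law here is an assumption made for illustration and is not the paper's formula; the function name and the parameters `safe_dist` and `max_angle` are hypothetical.

```python
import math

def reshape_gradient(grad, vel, dist, safe_dist=2.0, max_angle=math.pi / 2):
    """Rotate a 2D ESDF gradient laterally; its magnitude is preserved.

    grad  ESDF gradient, pointing away from the nearest obstacle
    vel   planar velocity of the vehicle
    dist  distance to the nearest obstacle, m
    """
    gn = math.hypot(grad[0], grad[1])
    vn = math.hypot(vel[0], vel[1])
    if gn == 0.0 or vn == 0.0:
        return grad                       # nothing to rotate

    gx, gy = grad[0] / gn, grad[1] / gn
    vx, vy = vel[0] / vn, vel[1] / vn

    # 1 when touching the obstacle, 0 at or beyond safe_dist.
    proximity = min(1.0, max(0.0, 1.0 - dist / safe_dist))
    # 1 for a head-on approach, 0 when flying parallel to or away from it.
    head_on = max(0.0, -(vx * gx + vy * gy))
    # Keep the side the gradient already leans to; default to the left on a tie.
    side = 1.0 if (vx * gy - vy * gx) >= 0.0 else -1.0

    theta = -side * max_angle * proximity * head_on
    c, s = math.cos(theta), math.sin(theta)
    return (gn * (c * gx - s * gy), gn * (s * gx + c * gy))
```

In a head-on approach at zero distance the raw gradient points straight back along the flight direction, which can only slow the vehicle down. After the rotation it points sideways, so the same penalty pushes the vehicle around the obstacle. Far from obstacles, or when flying away from them, the gradient is returned unchanged.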

Experimental Results

Simulated Benchmarks

An extensive evaluation across benchmarks (large-scale forests with densities up to 0.08 obstacles/m² and super-dense settings at 0.25 obstacles/m²) shows that the method consistently achieves near-perfect success and completion rates and the lowest jerk of all compared methods (including EGO-Planner, Agile, YOPOv2, and Newton) at speeds up to 12 m/s. The advantage grows as density and speed increase, where modular or imitation-based approaches degrade rapidly due to perception latency, planning lag, or controller tracking failure.
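The summary does not state how jerk is computed. A minimal sketch, assuming it is estimated from uniformly sampled positions with a third-order finite difference and averaged over the trajectory, could look like this (the function name and sampling convention are assumptions):

```python
import math

def average_jerk(positions, dt):
    """Mean jerk magnitude (m/s^3) from positions sampled every dt seconds.

    positions  sequence of (x, y, z) tuples
    Uses the third forward difference: (p3 - 3*p2 + 3*p1 - p0) / dt^3.
    """
    if len(positions) < 4:
        raise ValueError("need at least 4 samples for a third difference")
    n = len(positions) - 3
    total = 0.0
    for i in range(n):
        p0, p1, p2, p3 = positions[i:i + 4]
        jerk = [(d - 3.0 * c + 3.0 * b - a) / dt ** 3
                for a, b, c, d in zip(p0, p1, p2, p3)]
        total += math.sqrt(sum(j * j for j in jerk))
    return total / n
```

A constant-acceleration trajectory scores close to zero under this metric, while abrupt corrective maneuvers raise it, which is why lower average jerk is read as smoother flight.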

Figure 5: Large-scale forest environments at four obstacle densities, used for systematic performance evaluation.

Figure 6: Policy rollouts (successes and comparison to baseline failures) in one scene, highlighting smoothness and robustness of direct low-level control.

Figure 7: Top—success rates across densities and speeds; bottom—distribution of average jerk, demonstrating consistently smoother trajectories for the proposed method.

In super-dense cases, the method maintains full reliability and a higher effective forward velocity than even Newton (differentiable physics with a point-mass model), which suffers from conservative braking and residual dynamics mismatch. The policy requires neither privileged information nor curriculum learning, which further supports the effectiveness of precise gradient propagation through high-fidelity models.

Figure 8: Panoramic views of super-dense environments at highest tested densities.

Real-World Deployment

Zero-shot policy transfer is demonstrated on custom quadrotor hardware with onboard state estimation and depth sensing, in regular (0.1 m⁻²) and super-dense (∼1 m⁻²) forests, as well as diverse urban and wild scenes. The drone reliably maintains forward-looking perception and safe velocities, automatically adjusting speed in response to increased clutter. The system can attain up to 7.5 m/s in regular forest and 5 m/s in super-dense settings, with minimal oscillation or aggressive corrective maneuvers. In contrast, Newton-like methods experience perception loss due to unavoidable vertical oscillations stemming from misaligned dynamics/controls.

Figure 9: Real-world experiment—flight trajectory and first-person view in regular forest at high speeds (up to 7.5 m/s).

Figure 10: Real-world trajectory and first-person view in super-dense forest—robust velocity adaptation and collision avoidance.

Figure 11: Generalization tests in various wild and urban environments confirm broad deployment capability.

Theoretical and Practical Implications

This work shows that direct, differentiable, high-fidelity RL pipelines can achieve robust, agile flight control without complex architectures, handcrafted controllers, or multi-stage heuristics. Analytical gradients propagated through physical models scale to real-world, multi-module systems, extending beyond idealized kinematics to encompass perception, actuation, and collision. The approach directly targets the dynamic infeasibility and perception instability that affect prior modular or surrogate-model methods in high-speed, real-world settings.

The proposed pipeline is highly applicable to other complex robotic tasks where low-level control and perception are tightly integrated, and where the sim-to-real gap is dominated by unmodeled or misrepresented dynamics.

Limitations and Future Directions

While the approach performs strongly on generic collision avoidance and safety control, it does not explicitly encode global planning for non-convex environments (e.g., mazes), and may require high-level directional goals for full autonomy in such scenarios. The lack of recurrence, while beneficial for stability and latency, might limit adaptability in longer-horizon, heavily occluded, or multi-agent contexts. Integration with lightweight global planners or hierarchical RL is a natural avenue for future exploration.

Conclusion

By employing high-fidelity differentiable simulation and BPTT-driven RL, the paper achieves stable, agile UAV control that outperforms the compared baselines with a very simple end-to-end pipeline, one that avoids complex architectures and supports robust zero-shot deployment in both simulated and real-world dense environments. This research sets a reference point for the efficient use of analytical gradients and high-fidelity modeling in robotic learning, with substantial implications for deployable, computation-aware aerial autonomy.
