Bridging RL and MPC for mixed-integer optimal control with application to Formula 1 race strategies

Published 1 Apr 2026 in eess.SY (arXiv 2604.00826v1)

Abstract: We propose a hybrid reinforcement learning (RL) and model predictive control (MPC) framework for mixed-integer optimal control, where discrete variables enter the cost and dynamics but not the constraints. Existing hierarchical approaches use RL only for the discrete action space, leaving continuous optimization to MPC. Unlike these methods, we train the RL agent on the full hybrid action space, ensuring consistency with the cost of the underlying Markov decision process. During deployment, the RL actor is rolled out over the prediction horizon to parametrize an integer-free nonlinear MPC through the discrete action sequence and provide a continuous warm-start. The learned critic serves as a terminal cost to capture long-term performance. We prove recursive feasibility, and validate the framework on a Formula 1 race strategy problem. The hybrid method achieves near-optimal performance relative to an offline mixed-integer nonlinear program benchmark, outperforming a standalone RL agent. Moreover, the hybrid scheme enables adaptation to unseen disturbances through modular MPC extensions at zero retraining cost.

Summary

  • The paper proposes a novel hybrid RL-MPC framework that jointly trains on discrete and continuous actions for mixed-integer optimal control.
  • It leverages an RL actor for generating discrete action rollouts and an MPC solver for refining continuous variables, achieving near-MINLP performance.
  • Empirical results on F1 race strategies show race-time suboptimality under 10 ms relative to the MINLP benchmark, at per-step solve times compatible with real-time deployment.

Bridging RL and MPC for Mixed-Integer Optimal Control with Application to Formula 1 Race Strategies

Problem Formulation and Motivation

The paper addresses the intractability of solving mixed-integer optimal control problems (MIOCPs) arising in dynamical systems with both discrete and continuous control actions, such as pit stop selection and energy management in Formula 1 strategy. Traditional approaches utilizing either model predictive control (MPC) or reinforcement learning (RL) in isolation are inadequate for this hybrid setting: pure RL struggles with constraint satisfaction and function approximation errors, while MPC with mixed-integer variables rapidly becomes computationally prohibitive for online control. Existing hierarchical RL-MPC methods suffer from an inconsistent separation of discrete and continuous action handling, leading to policies not aligned with the true structure of the underlying Markov decision process.

Proposed Hybrid RL-MPC Framework

The core contribution is a hybrid RL-MPC framework wherein the RL agent is trained on the entire hybrid action space, i.e., both discrete and continuous actions, using an actor-critic approach. The critic network approximates the hybrid Q-function (cost-to-go C*) in a manner consistent with the underlying MDP, unlike prior works that handle only discrete action sequences or decouple continuous-discrete interactions. During deployment, the RL actor generates a trajectory rollout over the MPC horizon, yielding a sequence of discrete actions and a warm-start for the continuous variables. The MPC is then solved with these discrete actions fixed, resulting in a continuous-only (integer-free) nonlinear program (NLP) that is computationally tractable for real-time application. The critic is employed as a terminal cost to encapsulate long-term performance, reducing horizon length demands and mitigating critic approximation limitations.
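
To make the deployment procedure concrete, the following minimal Python sketch outlines one receding-horizon iteration under the scheme described above. The `actor`, `critic`, and `solve_nlp` interfaces are hypothetical placeholders standing in for the trained networks and the NLP solver, not the authors' implementation.

```python
def hybrid_rl_mpc_step(x0, actor, critic, solve_nlp, N):
    """One deployment step of the hybrid scheme: the RL actor fixes the
    discrete sequence, then an integer-free NLP refines the continuous inputs."""
    # 1) Roll out the learned actor over the horizon to obtain a discrete
    #    action sequence and a continuous warm start.
    x, discrete_seq, u_warm = x0, [], []
    for _ in range(N):
        d, u = actor.act(x)                # hybrid action: (discrete, continuous)
        discrete_seq.append(d)
        u_warm.append(u)
        x = actor.predict_next(x, d, u)    # rollout under the nominal model

    # 2) With the discrete actions frozen, the remaining problem is a
    #    continuous-only NLP; the learned critic supplies the terminal cost.
    u_opt = solve_nlp(
        x0=x0,
        discrete_seq=discrete_seq,             # integers fixed by the actor
        u_init=u_warm,                         # warm start from the rollout
        terminal_cost=lambda x_N: critic.value(x_N),
    )

    # 3) Apply only the first hybrid action (receding-horizon control).
    return discrete_seq[0], u_opt[0]
```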

The theoretical properties of the proposed scheme are grounded in structural assumptions. Specifically, the paper requires that for any discrete action choice the continuous input can always enforce feasibility, allowing the MPC layer to guarantee recursive feasibility irrespective of the discrete sequence prescribed by the RL actor. The MPC employs a terminal set and local policy to ensure positive invariance and constraint satisfaction.
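
Stated informally in our own notation (an assumption-level paraphrase, not the paper's exact formulation), the requirement reads:

```latex
% Feasibility is independent of the discrete action: from any feasible
% state, every admissible discrete action admits a continuous input that
% keeps the successor state feasible.
\forall x \in \mathcal{X},\ \forall d \in \mathcal{D}(x):\quad
\exists\, u \in \mathcal{U}(x, d)\ \text{such that}\ f(x, u, d) \in \mathcal{X}
```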

Empirical Validation on Formula 1 Race Strategy

Validation is carried out on a high-fidelity Formula 1 race strategy scenario involving 57 laps with hybrid-electric powertrains. The discrete decisions correspond to the pit stop timing and tire compound choice, while the continuous actions represent fuel and battery energy allocation per lap. The race strategy optimization aims to minimize total race time subject to regulatory and physical constraints.
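
As a rough illustration of the decision structure, the Python sketch below encodes one lap's hybrid decision; the field names, units, and values are illustrative assumptions rather than the paper's exact model.

```python
from dataclasses import dataclass

@dataclass
class LapDecision:
    """Hybrid per-lap decision for the race strategy problem (illustrative)."""
    # Discrete variables: whether to pit this lap and, if so, which compound.
    pit_stop: bool
    tire_compound: str          # e.g. "soft" | "medium" | "hard"
    # Continuous variables: per-lap energy management.
    fuel_use_kg: float          # fuel allocated to this lap
    battery_delta_mj: float     # net battery charge/discharge this lap

# A full strategy is one decision per lap of the 57-lap race; the MIOCP
# minimizes total race time over all such sequences subject to energy,
# fuel, and sporting-regulation constraints.
strategy = [LapDecision(False, "medium", 1.6, -0.4) for _ in range(57)]
```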

The paper compares three approaches:

  • A full mixed-integer nonlinear program (MINLP) solved offline, serving as the ground-truth optimality benchmark,
  • A standalone RL agent trained on the hybrid action space but deployed without MPC refinement,
  • The proposed hybrid RL-MPC framework with varying horizon lengths.

Key results include:

  • The standalone RL agent obtains a strategy whose pit stop timing matches the MINLP, but its continuous energy allocations are suboptimal, resulting in a +1.17 s suboptimality (a significant margin in F1 racing). This shortfall reflects RL's limited capacity to fine-tune continuous allocations in the presence of complex, temporally extended consequences.
  • The hybrid RL-MPC framework with a horizon of N = 15 (a lower bound on the stint length) closes 99% of the gap between the standalone RL agent and the MINLP, with a suboptimality of +9.4 ms at an average solve time of 168 ms per step. For N ≥ 40, performance matches the MINLP up to solver tolerances.
  • The trade-off between computation and performance is clearly quantified: longer prediction horizons yield diminishing suboptimality at increased computation time, paralleling classical MPC/DP theory.

Adaptation and Modularity

A salient practical advantage of the hybrid RL-MPC framework lies in architectural adaptability. The MPC layer can be easily extended to accommodate unforeseen disturbances, new constraints, or altered objectives without retraining the RL agent. This is demonstrated through a traffic scenario where the MPC is expanded to model “dirty air” lap time penalties behind a slower car. The RL policy remains unmodified while the MPC adapts fuel and energy discharge patterns in response, yielding a 5.19-second race-time advantage over a nominal policy, a competitive margin in F1.
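
A hedged sketch of what such a modular extension might look like: an extra lap-time penalty enters the MPC stage cost while the trained RL actor and critic are left untouched. The `dirty_air_penalty` model and its parameters are hypothetical, chosen only to illustrate the pattern.

```python
def nominal_stage_cost(x, u, d, base_lap_time):
    """Nominal MPC stage cost: predicted lap time for this lap."""
    return base_lap_time(x, u, d)

def dirty_air_penalty(gap_s, max_penalty_s=0.8, effect_range_s=2.0):
    """Illustrative aerodynamic penalty that grows as the gap to the
    car ahead closes (hypothetical model, not the paper's)."""
    if gap_s >= effect_range_s:
        return 0.0
    return max_penalty_s * (1.0 - gap_s / effect_range_s)

def traffic_aware_stage_cost(x, u, d, base_lap_time, gap_s):
    """Extended stage cost: nominal lap time plus the dirty-air term.
    Only the MPC objective changes; the trained RL actor and critic
    are reused without any retraining."""
    return nominal_stage_cost(x, u, d, base_lap_time) + dirty_air_penalty(gap_s)
```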

Theoretical Insights and Claims

The framework guarantees recursive feasibility for systems where discrete decisions do not affect feasibility (i.e., the continuous variables always admit feasible completion). This is a significant departure from existing RL-MPC hybrid methods and allows for clean integration of learning-based and optimization-based paradigms. However, the paper does not provide formal suboptimality bounds, instead relying on empirical demonstration and drawing connections to path-wise duality in approximate dynamic programming literature.

The computational load of the hybrid method is dominated by the NLP solve rather than RL or function approximation, and solutions scale favorably with action space size compared to mixed-integer branch-and-bound or online RL policy search.

Implications and Future Directions

This research offers a scalable solution for hybrid action space control problems in domains where high-dimensional, tightly coupled discrete-continuous decision-making is paramount. The ability of the framework to interpolate between pure RL (low computational latency, moderate optimality) and MINLP (offline optimality, prohibitive online cost) opens the path to real-time deployment in safety-critical or high-stakes environments, exemplified here by motorsport strategy.

Theoretically, the approach invites further extension: formal suboptimality characterizations of the hybrid value-augmented-MPC framework, and relaxation of the constraint-independence assumption for discrete actions via safety filters or control barrier functions for broader classes of hybrid MDPs.

Conclusion

The paper introduces a hybrid RL-MPC solution methodology for mixed-integer optimal control, validated on an F1 race strategy benchmark (2604.00826). By training the RL actor and critic on the full hybrid action space and leveraging MPC for constraint satisfaction and continuous refinement, the approach achieves better performance than standalone RL and near-optimality compared to MINLP with tractable online computation. Furthermore, the modular nature of the architecture supports rapid adaptation to scenario variation without costly retraining, making it broadly applicable in autonomous systems and advanced control. Future work should address theoretical guarantees and further relaxations to the discrete feasibility assumption.
