RL and GNN: Automatic Intersection Management
- The paper shows that RL+GNN approaches outperform traditional controls by significantly improving throughput and reducing stop rates in mixed traffic.
- RL+GNN systems represent traffic scenes as dynamic graphs where vehicles' kinematics and spatial relations are encoded to enable precise coordination.
- Using PPO and TD3 with GNN message passing, the approach yields robust, generalizable policies across diverse intersection geometries and traffic conditions.
Automatic intersection management using reinforcement learning (RL) and graph neural networks (GNNs) concerns the joint optimization of vehicular motion through urban intersections, leveraging high-capacity machine learning models to encode complex spatial-temporal dependencies and driver intentions. Recent research demonstrates that RL+GNN-based approaches can consistently outperform classical non-learning rules (such as reservation-based schemes, signalized control, or FIFO), enabling safe, cooperative, and high-throughput navigation in challenging traffic regimes, including both fully automated and mixed-traffic environments (Ma et al., 2020, Klimke et al., 2022, Klimke et al., 2022, Klimke et al., 2023).
1. Problem Formulation and Scene Representation
In the RL+GNN paradigm, automatic intersection management is typically modeled as a centralized or centralized-training multi-agent reinforcement learning problem, often cast as a Markov decision process (MDP) or partially observable MDP (POMDP). The full traffic scene is represented as a dynamic, directed, attributed graph G = (V, E), where vehicles correspond to nodes and their road/geometric relationships to edges.
- Nodes (V): Each node encapsulates features such as normalized longitudinal position, speed, acceleration, and control flags (e.g., "controllable" for automated vehicles) (Klimke et al., 2023). Behavioral mode and historical information (e.g., an LSTM hidden state) are present in some frameworks (Ma et al., 2020).
- Edges (E): Edge types express scene topology (e.g., "same_lane", "crossing") and encode spatial and relational context through features like Mahalanobis or Euclidean distance, bearing, and, for mixed traffic, priority relations (Klimke et al., 2022, Klimke et al., 2022, Klimke et al., 2023).
- Edge Typing:
- For mixed traffic, edge types distinguish agent interaction classes (AV/AV, AV/MV, MV/AV), supporting explicit modeling of partial controllability and intention ambiguity (Klimke et al., 2023).
This graph-centric abstraction supports variable numbers of vehicles and dynamic scene topology, capturing both geometric and behavioral complexities critical for coordinated intersection management.
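As a concrete illustration, the following minimal sketch assembles such a typed scene graph as plain feature arrays; the `Vehicle` container, feature layout, and edge-type vocabulary are illustrative assumptions rather than the exact encoding used in the cited works.

```python
# Illustrative scene-graph construction for intersection management.
# Feature layout and edge typing are assumptions, not the exact encoding of the cited papers.
from dataclasses import dataclass
import numpy as np

EDGE_TYPES = ("same_lane", "crossing", "av_av", "av_mv", "mv_av")

@dataclass
class Vehicle:
    pos: float          # normalized longitudinal position along its path
    speed: float
    accel: float
    controllable: bool  # True for automated vehicles (AVs), False for human-driven (MVs)

def build_scene_graph(vehicles, relations):
    """vehicles: list[Vehicle]; relations: list of (src, dst, edge_type, distance, bearing)."""
    # Node feature matrix: one row per vehicle.
    x = np.array([[v.pos, v.speed, v.accel, float(v.controllable)] for v in vehicles],
                 dtype=np.float32)
    # Edge index (2 x E), per-edge type id, and edge features (distance, bearing).
    edge_index = np.array([[s, d] for s, d, *_ in relations], dtype=np.int64).T
    edge_type = np.array([EDGE_TYPES.index(t) for _, _, t, *_ in relations], dtype=np.int64)
    edge_attr = np.array([[dist, bearing] for *_, dist, bearing in relations], dtype=np.float32)
    return x, edge_index, edge_type, edge_attr

# Example: two AVs on the same lane and one MV whose path crosses the first AV's path.
vehicles = [Vehicle(0.2, 8.0, 0.5, True), Vehicle(0.5, 6.0, 0.0, True), Vehicle(0.3, 7.0, -0.2, False)]
relations = [(0, 1, "same_lane", 12.0, 0.0), (2, 0, "mv_av", 9.0, 1.3)]
x, edge_index, edge_type, edge_attr = build_scene_graph(vehicles, relations)
```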
2. Reinforcement Learning Formulation
The intersection management task is formulated as either an MDP or POMDP:
- State (s): The state is the current graph representation, encapsulating the kinematics and intentions of all agents. In POMDP settings, observations are noisy or only partially observed, e.g., vehicles' physical states plus unobservable behavioral modes or measurement noise (Ma et al., 2020, Klimke et al., 2023).
- Action Space (A): The action for each agent (often each AV) is a continuous acceleration command, or a discrete speed setpoint in some applications (e.g., chosen from a fixed set of target speeds in m/s) (Ma et al., 2020, Klimke et al., 2022, Klimke et al., 2022).
- Transition (T): The simulator advances agent states according to kinematic bicycle or car-following models (e.g., IDM, EIDM), with human-driven vehicles (MVs) governed by stochastic or deterministic behavioral models (Klimke et al., 2022, Klimke et al., 2023).
- Reward Function (R): A composite reward incentivizes high flow, penalizes low speeds and stopping, encourages smooth accelerations, and penalizes proximity violations and collisions. Representative reward terms include progress (velocity), action penalties, idle/stopping penalties, proximity/collision losses, and additional terms to discourage pathological yielding or “reluctance” behaviors in mixed traffic (a sketch of such a composite reward follows below) (Klimke et al., 2022, Klimke et al., 2022, Klimke et al., 2023).
Discount factors, learning rates, and TD3 or PPO updates conform to standard values for deep RL (Ma et al., 2020, Klimke et al., 2022, Klimke et al., 2022, Klimke et al., 2023).
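To make the composite reward concrete, the sketch below combines the term types listed above in a single per-step reward function; the weights, thresholds, and exact term definitions are hypothetical and differ from those tuned in the cited papers.

```python
# Illustrative composite per-step reward for one controlled vehicle.
# Term structure follows the description above; all weights and thresholds are hypothetical.
def step_reward(speed, accel, dist_to_nearest, collided, needless_yield,
                w_speed=0.1, w_accel=0.05, w_idle=0.5, w_prox=1.0, w_collision=10.0,
                min_gap=3.0):
    reward = w_speed * speed                            # progress: reward driving at speed
    reward -= w_accel * abs(accel)                      # smoothness: penalize harsh actions
    if speed < 0.5:
        reward -= w_idle                                # discourage stopping / creeping
    if dist_to_nearest < min_gap:
        reward -= w_prox * (min_gap - dist_to_nearest)  # proximity violation penalty
    if collided:
        reward -= w_collision                           # terminal collision penalty
    if needless_yield:
        reward -= w_idle                                # "reluctance" penalty for pathological yielding
    return reward
```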
3. Graph Neural Network Policies and Latent Inference
GNNs are employed to encode spatial and relational dependencies among vehicles, leveraging neighborhood message passing for both the RL policy and supporting auxiliary inference heads.
- GNN Architectures:
- Relational Graph Convolutional Networks (R-GCNs) and variants with explicit edge features are state-of-the-art for these tasks (Klimke et al., 2022, Klimke et al., 2022, Klimke et al., 2023).
- Message passing aggregates information from edge-typed neighbors using max-pooling or other permutation-invariant functions, with multi-layer transformations to capture complex spatial interactions. Edge features (such as distance, bearing, and priorities) significantly enhance representational capacity, enabling the system to anticipate conflicts and negotiate crossing priorities (Klimke et al., 2022, Klimke et al., 2023); a minimal message-passing sketch is given after this list.
- Temporal History Encoding:
- Historical trajectories or observations are encoded either via a per-node LSTM or by stacking observations over multiple decision steps (Ma et al., 2020).
- Latent Behavioral Mode Inference:
- In scenarios with heterogeneous driver behavior, an auxiliary intention network is trained under supervised maximum likelihood, with output integrated as an input to the RL policy (separated inference architecture). This framework achieves high accuracy in online latent mode inference, yielding robust policy adaptation under behavioral diversity (Ma et al., 2020).
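The sketch referenced above shows one plausible form of a single relational message-passing layer with per-edge-type transformations, edge features, and max aggregation in PyTorch; the layer sizes, per-type MLPs, and zero-initialized max aggregation are illustrative design choices, not the exact architecture of the cited works.

```python
# Minimal relational message-passing layer with edge features (PyTorch).
# Per-edge-type MLPs and max aggregation are illustrative design choices.
import torch
import torch.nn as nn

class RelationalEdgeConv(nn.Module):
    def __init__(self, node_dim, edge_dim, out_dim, num_edge_types):
        super().__init__()
        self.out_dim = out_dim
        # One message MLP per edge type, consuming [source-node features || edge features].
        self.msg_mlps = nn.ModuleList(
            nn.Sequential(nn.Linear(node_dim + edge_dim, out_dim), nn.ReLU())
            for _ in range(num_edge_types))
        self.update = nn.Sequential(nn.Linear(node_dim + out_dim, out_dim), nn.ReLU())

    def forward(self, x, edge_index, edge_type, edge_attr):
        src, dst = edge_index
        # Messages are non-negative (ReLU), so zeros act as the identity for max aggregation
        # and nodes without incoming edges keep an all-zero aggregate.
        agg = x.new_zeros(x.size(0), self.out_dim)
        for t, mlp in enumerate(self.msg_mlps):
            mask = edge_type == t
            if mask.any():
                msg = mlp(torch.cat([x[src[mask]], edge_attr[mask]], dim=-1))
                # Max-aggregate this edge type's messages onto destination nodes.
                agg = agg.index_reduce(0, dst[mask], msg, reduce="amax", include_self=True)
        return self.update(torch.cat([x, agg], dim=-1))
```

Stacking two or three such layers and attaching per-node actor heads and graph-pooled critic heads would mirror the overall architecture described above.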
4. RL Training Algorithms and Optimization
RL-GNN intersection managers are typically trained using deep actor–critic methods:
- Policy Optimization:
- Proximal Policy Optimization (PPO) is employed for discrete-action settings and robust policy improvement with clipped objectives (Ma et al., 2020).
- Twin Delayed Deep Deterministic Policy Gradient (TD3) is used for continuous control, employing two critics and target networks to avoid overestimation bias (Klimke et al., 2022, Klimke et al., 2022, Klimke et al., 2023); a condensed update sketch is given after this list.
- Critic and Actor Heads:
- Both use GNN backbones; the actor outputs per-vehicle actions, while critics combine node and action embeddings through further message passing and global pooling.
- Replay and Exploration:
- Training uses prioritized or uniform replay, Gaussian exploration noise, and staged curriculum learning for mixed-traffic settings (gradually increasing MV share during training) (Klimke et al., 2023).
- Policy Generalization:
- Policies trained on sufficiently complex layouts generalize across intersection topologies (up to differences in lane/turn patterns), given that edge-feature representations encode topological structure (Klimke et al., 2022).
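As a reference point for the training loop described above, the condensed TD3 update below shows the clipped double-Q target, target-policy smoothing, and delayed actor/target updates; the network objects (which would wrap the GNN backbones here), the batch layout, and all hyperparameter values are placeholders assumed for illustration.

```python
# Condensed TD3 update step (PyTorch). Networks, batch format, and hyperparameters
# are illustrative placeholders, not the exact setup of the cited papers.
import torch
import torch.nn.functional as F

def td3_update(batch, actor, critic1, critic2, actor_t, critic1_t, critic2_t,
               actor_opt, critic_opt, step, gamma=0.99, tau=0.005,
               policy_noise=0.2, noise_clip=0.5, policy_delay=2):
    state, action, reward, next_state, done = batch

    with torch.no_grad():
        # Target-policy smoothing: perturb the target action with clipped Gaussian noise.
        noise = (torch.randn_like(action) * policy_noise).clamp(-noise_clip, noise_clip)
        next_action = (actor_t(next_state) + noise).clamp(-1.0, 1.0)
        # Clipped double-Q target: take the minimum of the two target critics.
        target_q = torch.min(critic1_t(next_state, next_action),
                             critic2_t(next_state, next_action))
        target = reward + gamma * (1.0 - done) * target_q

    # Both critics regress onto the same target.
    critic_loss = (F.mse_loss(critic1(state, action), target) +
                   F.mse_loss(critic2(state, action), target))
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    if step % policy_delay == 0:
        # Delayed actor update and Polyak averaging of the target networks.
        actor_loss = -critic1(state, actor(state)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()
        for net, net_t in ((actor, actor_t), (critic1, critic1_t), (critic2, critic2_t)):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```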
5. Handling Mixed Traffic and Realistic Uncertainty
Automatic intersection managers must explicitly address uncertainty in human driver intentions and measurement noise:
- Mixed Traffic Scene Graphs:
- The scene graph distinguishes AV and MV nodes, with edge types representing AV/AV, AV/MV, and MV/AV coordination; MV/MV relationships are excluded since the planner cannot control MVs (Klimke et al., 2023).
- Intention Ambiguity:
- For MVs with unknown turning intent, all potential conflicting trajectories are represented via overlapping edges to AV nodes, encouraging a conservative policy until disambiguation occurs (e.g., after an MV traverses a portion of the intersection) (Klimke et al., 2023).
- Measurement Noise:
- Observation noise is modeled using AR(1) processes parameterized from real-world vehicle data and injected into the observed kinematic measurements (e.g., position, speed, and acceleration) (Klimke et al., 2023); a minimal AR(1) sketch follows this list.
- Reward Adjustment:
- Additional penalties, such as a “reluctance” term, disincentivize unnecessarily conservative AV behaviors (e.g., stopping well before a stop line to yield to uncertain MVs) (Klimke et al., 2023).
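To make the AR(1) noise model concrete, the sketch below generates a temporally correlated noise sequence and adds it to a clean speed trace; the coefficient `phi` and scale `sigma` are hypothetical values, not parameters fitted to the real-world data used in the cited work.

```python
# Illustrative AR(1) measurement noise: eps_t = phi * eps_{t-1} + w_t, with w_t ~ N(0, sigma^2).
# phi and sigma are hypothetical; the cited work fits them to real vehicle data.
import numpy as np

def ar1_noise(n_steps, phi=0.9, sigma=0.05, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    eps = np.zeros(n_steps)
    for t in range(1, n_steps):
        eps[t] = phi * eps[t - 1] + rng.normal(0.0, sigma)
    return eps

# Example: corrupt a clean, slowly decelerating speed trace with correlated noise.
clean_speed = np.linspace(8.0, 6.0, 100)   # 100 time steps
noisy_speed = clean_speed + ar1_noise(100)
```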
6. Evaluation, Generalization, and Performance
Comprehensive evaluation demonstrates the superiority and robustness of RL+GNN intersection managers compared to legacy baselines, including both synthetic and real-world traffic replay (Klimke et al., 2022, Klimke et al., 2022, Klimke et al., 2023):
| Metric | Baseline (FIFO / priority rules (PR) / traffic light (TL)) | RL+GNN Approach | Notes/Scenario |
|---|---|---|---|
| Median Flow (veh/s) | 0.90–1.10 [FIFO,PR] | 1.30 (synthetic) [RL] | 4-way, fully automated (Klimke et al., 2022) |
| Median Stop Rate (%) | 32–99 (major/minor) | 10/22 [RL] | Major/minor (Klimke et al., 2022) |
| Collision Rate (%) | 0 (PR/FIFO) | 0.03 (synthetic RL) | 0.58 inD replay [RL] vs 1.92 (FIFO) |
| Median Flow (veh/s) | 0.76 (TL) | 0.83 (eRL) | Enhanced RL w/ edge features (Klimke et al., 2022) |
| Generalization | N/A | Robust (to smaller/larger layouts) | (Klimke et al., 2022) |
| Mixed-Traffic Flow | Gains mainly at >50% AV share | Increases approximately linearly w/ AV share | MVs benefit nearly equally (Klimke et al., 2023) |
| Mixed-Traffic Collisions | 0.028 (eFIFO) | 0.091 (RL+noise), 0.376 (eFIFO+noise) | RL more robust under noise |
The RL+GNN planners demonstrably achieve higher throughput (up to +44% over static priority rules), lower stop/delay rates, and successful generalization to unseen intersection layouts, provided policy capacity is sufficient and edge features encode topology appropriately (Klimke et al., 2022, Klimke et al., 2022). In mixed traffic, both AVs and MVs benefit from increased speeds and reduced delay, with RL planners outperforming rule-based eFIFO at all automation penetration levels. Notably, the RL controller is robust to measurement noise, in contrast to the strong performance degradation seen in eFIFO rules under identical conditions (Klimke et al., 2023).
7. Limitations and Future Directions
Current RL+GNN intersection-management systems exhibit noteworthy limitations:
- Layout and Scenario Generalization: Training remains geometry-specific; extension to arbitrary intersection types or unseen lane configurations currently depends on the diversity and complexity of the training set (Klimke et al., 2022).
- Mixed-Traffic Assumptions: Most frameworks assume full cooperation or partial compliance; integration of more realistic (non-cooperative) human driver models remains open (Klimke et al., 2022, Klimke et al., 2023).
- Sensing and Communication: Delays, dropouts, and adversarial failures are not fully addressed; performance under degraded observation is a subject for continued research (Klimke et al., 2022, Klimke et al., 2023).
- Longer-Horizon and Multi-Intersection Planning: Extensions to multi-intersection coordination and further integration with trajectory-level motion planners are ongoing (Klimke et al., 2022, Klimke et al., 2022).
- Safe Deployment: Current collision avoidance is learned via reward shaping and statistical post-hoc sanity checks; certified safety guarantees or closed-form robustness margins are not yet established (Klimke et al., 2022).
Future work is expected to focus on scalable meta-RL for layout generalization, explicit modeling of partial/human actor policies, hybrid planning with physical constraints, deployment in highly realistic simulation or real traffic, and policy synthesis that incorporates pedestrians/cyclists and dynamic environmental context.
By integrating deep RL with graph neural architectures and explicit modeling of behavior, intention uncertainty, and spatial context, automatic intersection management based on RL-GNN frameworks achieves substantial advances in throughput, cooperativity, and robustness relative to classic control and legacy ML baselines. This approach provides a promising pathway toward scalable, adaptive coordination policies for urban intersections under both fully autonomous and mixed human-automation regimes (Ma et al., 2020, Klimke et al., 2022, Klimke et al., 2022, Klimke et al., 2023).