Road-Extender Agent for Traffic Detouring

Updated 8 May 2026

Road-Extender Agent is a deep reinforcement learning controller that optimizes detour strategies by balancing freeway and arterial flows in congested urban networks.
It employs a Markov Decision Process framework with both count-based and speed-based rewards, demonstrating stable training and significant improvements in traffic speed and delay reduction.
The agent adapts to incident-induced congestion through transfer learning and compliance modeling, ensuring scalable and robust performance in real-world simulations.

A Road-Extender Agent is a deep reinforcement learning (DRL) controller designed to dynamically optimize detouring strategies on congested urban freeways by leveraging available capacity on adjacent arterial networks. The agent is formulated as an adaptive policy that, under extreme congestion scenarios such as major incidents or breakdowns, automatically rebalances flow between freeways and local roads by determining the optimal fraction of vehicles to reroute at freeway exits. The objective is to mitigate the sharp decline in system throughput and speed that arises when critical density thresholds are exceeded. Implemented in a simulated real-world network context, the Road-Extender Agent has demonstrated substantial improvements in aggregate traffic metrics under incident-induced congestion (Dutta et al., 2023).

1. Problem Setting and Markov Decision Process Formulation

Extreme congestion on urban freeways, often triggered by lane-blocking incidents, pushes traffic states far beyond the critical density point $P^*=(\rho^*,Q^*)$ of the macroscopic fundamental diagram (MFD), causing severely reduced speeds and flows. Infrastructure expansion is typically infeasible in the short term; thus, operational interventions that make use of nearby arterial capacity are preferable. The central control task is formulated as a Markov Decision Process (MDP) $(\mathcal{S},\mathcal{A},P,R,\gamma)$, with the following specifications:

State space $\mathcal{S}$ : Composed of real-time traffic counts and speeds from 22 inductive-loop detectors spanning all lanes across five freeway segments and two off-ramps. The per-step observation is a 44-dimensional state vector,

$s_t = \{n_{s_i,d_j,\ell_k}(t),\;v_{s_i,d_j,\ell_k}(t)\}.$

Action space $\mathcal{A}$ : For each of two exit ramps, the agent selects a duty-cycle fraction $f\in\{0.0,0.1,0.2,0.3\}$ , governing the proportion of time the right-most lane’s vehicles are guided to exit during the next 10 minutes. The joint action space consists of $16$ possible $(f_1, f_2)$ pairs.
Transition dynamics $P$ : Governed implicitly by the SUMO microscopic traffic simulator, which models all relevant freeway–arterial interactions.
Reward functions $R$ : Two families are evaluated for per-action feedback:
- Count-based: a reward defined on detector vehicle counts (freeway flow).
- Speed-based: a reward defined on detector speeds (mean speed footprint).
- The speed-based reward yielded more stable DRL training and directly captured system-level objectives.
Temporal granularity: Decisions occur every 10 minutes, with a discount factor $\gamma$ tailored per RL algorithm.
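The state and action spaces listed above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the names `DUTY_CYCLES`, `JOINT_ACTIONS`, `build_state`, and `decode_action` are hypothetical, and placing counts before speeds in the vector is an assumption, since the source does not specify an ordering.

```python
from itertools import product

# Duty-cycle fractions available at each exit ramp (from the action-space definition).
DUTY_CYCLES = (0.0, 0.1, 0.2, 0.3)

# Joint action space: one fraction per exit, 4 x 4 = 16 pairs (f1, f2).
JOINT_ACTIONS = list(product(DUTY_CYCLES, repeat=2))

NUM_DETECTORS = 22  # loop detectors feeding the observation


def build_state(counts, speeds):
    """Concatenate per-detector counts and speeds into one flat state vector.

    Assumption: counts come first, then speeds; the ordering is illustrative.
    """
    if len(counts) != NUM_DETECTORS or len(speeds) != NUM_DETECTORS:
        raise ValueError("expected one count and one speed per detector")
    return [float(c) for c in counts] + [float(v) for v in speeds]


def decode_action(index):
    """Map a discrete action index (0..15) to its (f1, f2) duty-cycle pair."""
    return JOINT_ACTIONS[index]
```

A discrete index is convenient because both DQN and A2C, as described in this article, output one value or probability per joint action.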

2. DRL Methodologies and Network Architecture

Two model-free DRL methods, Deep Q-Network (DQN) and Advantage Actor-Critic (A2C), were implemented to learn the detouring policy. Each utilizes a feed-forward neural network with the following architecture:

Input Layer: 44-dimensional state vector.
Hidden Layers: Two fully connected layers, each of 256 units with tanh activations.
Output Layer:
- DQN: 16 Q-values representing the action space.
- A2C: Actor output as a softmax distribution over 16 actions; Critic outputs a single state-value.
Optimization:
- DQN: Huber loss, RMSProp optimizer.
- A2C: Policy gradient loss + value-function loss + entropy regularization term, optimized via SGD (PyTorch).
Exploration: Linear $\epsilon$-decay schedule (DQN), with the number of decay steps set separately for DQN and A2C.

No specialized regularization beyond entropy was required for stable training. Hyperparameters such as batch size (32) and learning rate were consistent across agents.
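The layer sizes described in this section can be made concrete with a small forward-pass sketch. The source reports a PyTorch implementation; the version below deliberately uses only the Python standard library so that it runs without dependencies. The class name `TinyPolicyNet`, the weight initialization, and the choice to share one trunk between the DQN and A2C heads are all assumptions made for illustration.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Return (weights, biases); the uniform init scheme is an assumption."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    """Apply one fully connected layer to the input vector x."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


class TinyPolicyNet:
    """44 -> 256 -> 256 tanh trunk with DQN-style and A2C-style output heads."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.h1 = init_layer(44, 256, rng)
        self.h2 = init_layer(256, 256, rng)
        self.q_head = init_layer(256, 16, rng)       # DQN: 16 Q-values
        self.actor_head = init_layer(256, 16, rng)   # A2C actor logits
        self.critic_head = init_layer(256, 1, rng)   # A2C state value

    def trunk(self, state):
        h = [math.tanh(v) for v in linear(state, self.h1)]
        return [math.tanh(v) for v in linear(h, self.h2)]

    def q_values(self, state):
        return linear(self.trunk(state), self.q_head)

    def actor_critic(self, state):
        h = self.trunk(state)
        return softmax(linear(h, self.actor_head)), linear(h, self.critic_head)[0]
```

Calling `TinyPolicyNet().q_values(state)` on a 44-element list returns 16 values, one per joint action, while `actor_critic(state)` returns a 16-way probability distribution and a scalar value estimate.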

3. Experimental Framework and Simulation Setting

The system was evaluated on a 2.6-mile, four-lane southbound section of I-5 in Shoreline, Washington, USA. The simulation network, built using OpenStreetMap and SUMO, features:

Network Topology: 4-lane freeway, two off-ramps at 0.8 mi (Exit 1) and 1.8 mi (Exit 2).
Instrumentation: Five mainline segments each with lane-specific loop detectors (total: 22), two detectors per exit ramp.
Demand Modeling: Morning peak (06:00–12:00) partitioned into 36 intervals of 10 minutes each; historical flow and speed data were fitted to a Beta distribution to produce stochastic yet realistic demand scenarios.
Baseline Policy: Vehicles exit according to historical fractions; no active detouring control.
Performance Metrics: Mean freeway speed (mph), total delay (veh-h), vehicle counts (flow). Emissions metrics were not directly measured, but noted as strongly correlated with delay.
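The Beta-distributed demand generation can be sketched as follows. This is an illustrative simplification: it draws every interval from a single Beta distribution and rescales it to a flow range, whereas the study fitted the distribution to historical data. The function name `sample_demand_profile`, the shape parameters, and the flow bounds are hypothetical values chosen only to make the example run.

```python
import random


def sample_demand_profile(alpha, beta, flow_min, flow_max, n_intervals=36, seed=None):
    """Draw one stochastic demand profile: a flow (veh/h) per 10-minute interval.

    Each draw is a Beta(alpha, beta) variate rescaled to [flow_min, flow_max].
    The parameter values are illustrative, not fitted values from the study.
    """
    rng = random.Random(seed)
    span = flow_max - flow_min
    return [flow_min + span * rng.betavariate(alpha, beta) for _ in range(n_intervals)]


# Hypothetical shape parameters and flow bounds, for illustration only.
profile = sample_demand_profile(alpha=2.0, beta=3.0, flow_min=3000.0, flow_max=7000.0, seed=42)
```

Thirty-six intervals cover the six-hour morning peak at one value per 10 minutes, and re-sampling with a different seed yields a new scenario from the same distribution.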

4. Performance Analysis and System Impact

Learning with the speed-based reward resulted in significantly more stable A2C and DQN training than with the count-based reward, as measured by smoother convergence and reduced episodic variance. In 20 randomized incident scenarios involving a simulated 1-hour lane-blocking accident, results included:

Metric	No Action	DQN	A2C
Mean Speed (mph)	24.8	27.3 (+10%)	30.0 (+21%)
Peak Delay (veh-h)	450	410 (-9%)	360 (-20%)
Upstream Speed (mph)	18.5	25.0 (+35%)	27.5 (+49%)

Under severe congestion, the A2C agent achieved a 21% overall speed improvement and a 49% increase in upstream (pre-incident) speed, with a corresponding 20% reduction in peak delay relative to baseline.
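The percentages quoted in the table follow directly from the raw values. The short check below recomputes them as relative changes against the no-action baseline; the dictionary layout and the helper name `percent_change` are only for illustration.

```python
# Values copied from the results table: (no action, DQN, A2C).
results = {
    "Mean Speed (mph)": (24.8, 27.3, 30.0),
    "Peak Delay (veh-h)": (450.0, 410.0, 360.0),
    "Upstream Speed (mph)": (18.5, 25.0, 27.5),
}


def percent_change(baseline, value):
    """Relative change of value versus baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


# Rounded (DQN, A2C) changes per metric.
changes = {
    metric: (round(percent_change(base, dqn)), round(percent_change(base, a2c)))
    for metric, (base, dqn, a2c) in results.items()
}
```

Rounded to whole percentages, the computed changes match the table: +10% and +21% for mean speed, -9% and -20% for peak delay, and +35% and +49% for upstream speed.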

Jointly monitoring count and speed during reward-driven optimization revealed a trade-off in the MFD: moderate detouring raised both flow and speed by lowering density, but excessive offload depressed raw mainline vehicle counts while still improving speed.
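This trade-off can be illustrated with a textbook fundamental-diagram model. The sketch below assumes the Greenshields linear speed-density relation, which the source does not state, and uses hypothetical values for free-flow speed (60 mph) and jam density (200 veh/mi/lane).

```python
def greenshields_speed(density, free_flow_speed=60.0, jam_density=200.0):
    """Speed (mph) falls linearly with density (veh/mi/lane); Greenshields model (assumed)."""
    return free_flow_speed * (jam_density - density) / jam_density


def greenshields_flow(density, free_flow_speed=60.0, jam_density=200.0):
    """Flow (veh/h/lane) = density * speed; it peaks at half the jam density."""
    return density * greenshields_speed(density, free_flow_speed, jam_density)


# Hypothetical densities: congested, near-critical after moderate detouring,
# and under-critical after excessive offload.
scenarios = [("congested", 160.0), ("moderate detour", 100.0), ("excessive detour", 60.0)]
for label, rho in scenarios:
    speed = greenshields_speed(rho)
    flow = greenshields_flow(rho)
    print(f"{label:>17}: speed={speed:5.1f} mph, flow={flow:7.1f} veh/h/lane")
```

Under these assumed parameters, lowering density from 160 to 100 veh/mi/lane raises speed from 12 to 30 mph and flow from 1920 to 3000 veh/h/lane. Lowering it further to 60 raises speed to 42 mph but drops flow to 2520 veh/h/lane, which mirrors the pattern described above.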

5. Reward Function Trade-offs

A direct trade-off was observed between reward design choices. The count-based reward penalizes detours as fewer vehicles remain on the freeway, sometimes negating congestion-mitigation benefits. Conversely, the speed-based reward aligns more closely with the actual operational objectives by promoting higher travel speeds even at the cost of lower mainline vehicle counts. The speed-based reward further enables more stable and robust agent training.
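A small numeric sketch shows how the two reward families can disagree about the same action. The exact reward formulas are not recoverable from this article, so the forms below (a sum of detector counts and a mean of detector speeds) are assumptions, and the detector readings are invented for illustration.

```python
def count_reward(counts):
    """Count-based reward: total vehicles seen by freeway detectors (assumed form)."""
    return float(sum(counts))


def speed_reward(speeds):
    """Speed-based reward: mean detector speed in mph (assumed form)."""
    return sum(speeds) / len(speeds)


# Hypothetical detector readings before and after a detour action.
before = {"counts": [30, 28, 31, 29], "speeds": [15.0, 14.0, 16.0, 15.0]}
after = {"counts": [26, 25, 27, 26], "speeds": [27.0, 26.0, 28.0, 27.0]}

# Change in each reward caused by the detour.
count_delta = count_reward(after["counts"]) - count_reward(before["counts"])
speed_delta = speed_reward(after["speeds"]) - speed_reward(before["speeds"])
```

With these invented readings the count-based reward falls (fewer vehicles remain on the mainline) while the speed-based reward rises, so an agent trained on counts alone would be discouraged from an action that improves travel speed.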

6. Human Compliance and Practical Considerations

Human driver compliance with detour recommendations was explicitly modeled by imposing a range of compliance rates. Lowering compliance from perfect to 80% adherence sharply reduced achievable average speeds; below 80%, speeds plateaued while still outperforming the baseline. This suggests that detour efficacy becomes less sensitive to further losses in driver response below that threshold, emphasizing the importance of compliance modeling for real-world feasibility.
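One simple way to picture compliance modeling is a two-stage random draw per vehicle. The sketch below is an assumption about how such a model might look, not the mechanism used in the study: a vehicle is advised to exit with probability equal to the duty-cycle fraction, and an advised driver complies independently with probability `compliance`. The function name `simulate_detour` is hypothetical.

```python
import random


def simulate_detour(n_vehicles, duty_cycle, compliance, seed=None):
    """Count vehicles that actually exit when a detour is advised.

    Assumptions (not from the source): each vehicle is advised with probability
    `duty_cycle`, and each advised driver complies with probability `compliance`.
    """
    rng = random.Random(seed)
    exited = 0
    for _ in range(n_vehicles):
        advised = rng.random() < duty_cycle
        if advised and rng.random() < compliance:
            exited += 1
    return exited


# Compare full compliance with 80% compliance at the largest duty cycle.
full = simulate_detour(1000, duty_cycle=0.3, compliance=1.0, seed=1)
partial = simulate_detour(1000, duty_cycle=0.3, compliance=0.8, seed=1)
```

Under this model the expected share of detoured vehicles is the product of the duty cycle and the compliance rate, so lower compliance shrinks the share of traffic the agent can actually move to the arterials.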

7. Policy Generalization and Transfer Learning

To address the challenge of data sparsity in rare incident events and scalability to large networks, two transfer-learning experiments were conducted:

No-accident→Accident Transfer: A policy trained exclusively on standard (non-incident) congestion scenarios maintained approximately 95% effectiveness when deployed in incident-induced settings, minimizing the need for rare-event-specific training.
Single-exit→Dual-exit Transfer: Policies learned for a single exit ramp transferred effectively to other exits without the need for multi-exit joint training, thereby reducing the complexity of policy learning in expanded networks.

These findings indicate the Road-Extender Agent’s detouring policies are robust to operational changes and scalable across network topologies and demand conditions.

The Road-Extender Agent thus constitutes an effective, flexible, and generalizable framework for adaptive congestion mitigation through dynamic multi-exit detouring decisions, with empirical results demonstrating substantial system-wide traffic improvements in realistic, incident-driven scenarios (Dutta et al., 2023).

References

Dutta et al. (2023). Deep Reinforcement Learning to Maximize Arterial Usage during Extreme Congestion.
