Assessment of Reward Functions for Reinforcement Learning Traffic Signal Control under Real-World Limitations

Published 26 Aug 2020 in cs.AI, cs.LG, cs.NE, and eess.SY | (2008.11634v2)

Abstract: Adaptive traffic signal control is one key avenue for mitigating the growing consequences of traffic congestion. Incumbent solutions such as SCOOT and SCATS require regular and time-consuming calibration, can't optimise well for multiple road use modalities, and require the manual curation of many implementation plans. A recent alternative to these approaches are deep reinforcement learning algorithms, in which an agent learns how to take the most appropriate action for a given state of the system. This is guided by neural networks approximating a reward function that provides feedback to the agent regarding the performance of the actions taken, making it sensitive to the specific reward function chosen. Several authors have surveyed the reward functions used in the literature, but attributing outcome differences to reward function choice across works is problematic as there are many uncontrolled differences, as well as different outcome metrics. This paper compares the performance of agents using different reward functions in a simulation of a junction in Greater Manchester, UK, across various demand profiles, subject to real world constraints: realistic sensor inputs, controllers, calibrated demand, intergreen times and stage sequencing. The reward metrics considered are based on the time spent stopped, lost time, change in lost time, average speed, queue length, junction throughput and variations of these magnitudes. The performance of these reward functions is compared in terms of total waiting time. We find that speed maximisation resulted in the lowest average waiting times across all demand levels, displaying significantly better performance than other rewards previously introduced in the literature.


Citations (14)


Summary

The paper demonstrates that a speed-based reward function significantly reduces waiting times compared to traditional metrics.
The study employs a deep Q-network and realistic SUMO simulation, using calibrated sensor data to mimic various traffic demand scenarios.
The findings highlight that context-sensitive rewards, such as queue squared and average speed adjustments, adapt effectively to fluctuating traffic conditions.

Summary of "Assessment of Reward Functions for Reinforcement Learning Traffic Signal Control under Real-World Limitations"

Introduction

The paper "Assessment of Reward Functions for Reinforcement Learning Traffic Signal Control under Real-World Limitations" (2008.11634) addresses the challenge of optimizing adaptive traffic signal control using deep reinforcement learning (RL) algorithms. Traditional approaches like SCOOT and SCATS depend on regular calibration and manual plan curation, and they are not well suited to the complex, multimodal demands of modern urban traffic systems. The paper evaluates multiple reward functions within a deep RL framework to determine which is most effective at reducing waiting time in a simulation of a real junction in Greater Manchester, UK, while accounting for realistic constraints such as sensor inputs and traffic controller limitations.

Reinforcement Learning Framework

The control problem is formulated as a Markov Decision Process (MDP) comprising state observations, actions, transition probabilities, rewards, and a discount factor. The agent, implemented as a Deep Q-Network (DQN) in PyTorch, uses data from vision-based detectors (flow, queue length, and vehicle speed) to select traffic signal stages in real time.

Figure 1: Schematic representation of information flows between Environment and Agent in a Reinforcement Learning framework.

The MDP formulation allows the agent to learn an optimal policy $\pi^*$ that maximizes the expected discounted future reward $R_t$. This is achieved by iteratively updating Q-values according to the Bellman equation, guiding the agent towards effective traffic signal solutions.
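
For reference, the standard textbook forms of these quantities are given below. The symbols $\gamma$ (discount factor), $r_t$ (reward at step $t$), $s_t$ and $a_t$ (state and action) are conventional notation assumed here; the paper's exact indexing and any DQN-specific details (such as a target network) may differ.

$$R_t = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k}, \qquad 0 \le \gamma < 1$$

$$Q^*(s, a) = \mathbb{E}\left[\, r_t + \gamma \max_{a'} Q^*(s_{t+1}, a') \;\middle|\; s_t = s,\ a_t = a \right]$$

$$\pi^*(s) = \arg\max_{a} Q^*(s, a)$$

In a DQN, $Q^*$ is approximated by a neural network whose parameters are adjusted to reduce the gap between the two sides of the second equation on sampled transitions.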

Reward Function Definitions

The paper tests various reward functions based on queue length, waiting time, time lost, average speed, and throughput. Each function encapsulates different aspects of traffic system states and operational goals:

Queue Length Rewards: Functions like Queue Length and Queue Squared prioritize minimizing traffic queues at each sensor, with Queue Squared granting higher penalties for longer queues.
Waiting Time Rewards: Functions like Wait Time and Delta Wait Time emphasize reducing cumulative waiting durations within the intersection.
Time Lost Rewards: These functions penalize the delay vehicles accumulate by travelling below their maximum possible speed.
Average Speed Rewards: Encouraging higher speeds without congestion, these rewards aim to maintain traffic flow efficiency.
Throughput Rewards: These metrics value the number of vehicles passing through, aligning with systemic throughput goals.

Figure 2: Intersection model in SUMO.
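
The reward families listed above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the `Vehicle` record, its field names, the function names, the sign conventions, and the normalisation by maximum speed are all hypothetical choices made here, and the paper's exact definitions (for example, per-sensor aggregation or scaling) may differ.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Vehicle:
    """Hypothetical per-vehicle observation; field names are illustrative."""
    speed: float      # current speed in m/s
    max_speed: float  # maximum possible speed in m/s
    wait_time: float  # cumulative time spent stopped, in seconds


def queue_length_reward(queues: Sequence[int]) -> float:
    # Negative sum of the queue lengths observed at each sensor.
    return -float(sum(queues))


def queue_squared_reward(queues: Sequence[int]) -> float:
    # Squaring penalizes one long queue more than several short ones.
    return -float(sum(q * q for q in queues))


def wait_time_reward(vehicles: Sequence[Vehicle]) -> float:
    # Negative total time spent stopped by the observed vehicles.
    return -sum(v.wait_time for v in vehicles)


def delta_wait_time_reward(prev_total: float, curr_total: float) -> float:
    # Positive when total waiting time fell since the previous step.
    return prev_total - curr_total


def time_lost_reward(vehicles: Sequence[Vehicle], step_s: float = 1.0) -> float:
    # Each vehicle loses a fraction (1 - speed / max_speed) of the time step.
    return -sum((1.0 - v.speed / v.max_speed) * step_s for v in vehicles)


def average_speed_reward(vehicles: Sequence[Vehicle]) -> float:
    # Mean speed as a fraction of maximum speed.
    # Assumption: an empty junction is scored as free flow (1.0).
    if not vehicles:
        return 1.0
    return sum(v.speed / v.max_speed for v in vehicles) / len(vehicles)


def throughput_reward(vehicles_passed: int) -> float:
    # Number of vehicles that cleared the junction during the step.
    return float(vehicles_passed)
```

As a quick illustration of why the squared variant behaves differently, queues of `[1, 3]` and `[2, 2]` both score -4 under `queue_length_reward`, but score -10 and -8 respectively under `queue_squared_reward`, so the agent is pushed to serve the longer queue first.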

Experimental Setup and Results

The experimental design includes training agents under varying traffic demands (sub-saturated, near-saturated, and over-saturated scenarios) using SUMO for simulation and real data for calibration. The evaluation used total waiting time as the performance indicator, with lower values indicating better control.

Findings: Speed maximization yields the lowest average waiting times across all demand levels, outperforming queue-based and time-based reward functions. Average Speed Adjusted by Demand showed improved results, suggesting adaptation to contextual traffic conditions.

Figure 3: Distribution and medians of average waiting time in seconds across agents in Low Demand scenario. Sub-saturation demand of 1714 vehicles/hour (1 vehicle/2.1 seconds).
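
The headway quoted in the caption follows directly from the hourly demand, as this short check shows:

$$\frac{3600\ \text{s/h}}{1714\ \text{vehicles/h}} \approx 2.10\ \text{s per vehicle}$$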

Comparative Performance

The results indicated inconsistency in performance of traditional reward functions across different traffic scenarios. Queue Squared performed well in scenarios with varying demand levels, displaying adaptability to changing traffic densities. Conversely, throughput-based rewards consistently underperformed, highlighting the inadequacy of focusing solely on vehicular passage rates.

Figure 4: Stacked bar chart of Average Waiting Time across scenarios.

Conclusion

The research demonstrates that RL, specifically agents trained with speed-based rewards, significantly reduces congestion compared to incumbent algorithms like Maximum Occupancy and Vehicle Actuated Controllers. The study suggests scalable applications across various intersections, given sufficient input data accuracy and system representation fidelity.

Future work could explore pedestrian integration and alternative RL architectures beyond DQN for potential performance gains. The findings underscore the importance of context-sensitive reward functions tailored to real-world traffic dynamics.

This work offers valuable insights into the evolution of RL applications in traffic systems, continuing to push boundaries in smart city infrastructures.