Deep Q-Network Traffic Signal Control Agent
- Deep Q-Network Traffic Signal Control Agent is an adaptive algorithm that learns directly from raw image-based traffic states, eliminating the need for hand-crafted features.
- The method employs a minimal binary action space, North-South Green (NSG) versus East-West Green (EWG), and stabilizes training through experience replay and target network synchronization.
- Experiments show the agent achieves approximately 68% lower delay and 73% shorter queues than a shallow neural network baseline, while also outperforming a traditional fixed-time controller.
A Deep Q-Network Traffic Signal Control Agent (DQTSCA) is an adaptive traffic signal control algorithm that uses deep Q-learning to map high-dimensional intersection state observations directly to discrete phase actuations. The methodology eschews hand-crafted traffic features, learning end-to-end from raw environment data—specifically, image-based traffic states generated by a microsimulator such as SUMO. This approach allows the system to generalize across intersection geometries while optimizing for reduced vehicle delay and queue lengths. The following sections detail the core design of DQTSCA as formulated by Mousavi et al. (Mousavi et al., 2017).
1. Observation and State Representation
The agent receives at each decision epoch a raw image snapshot from the SUMO-GUI, representing the visual traffic state of the intersection:
- Raw input: RGB screenshot of the intersection at time step $t$.
- Preprocessing: Images are converted to grayscale and resized to 128×128 pixels.
- Temporal information: The state is constructed by stacking the last four consecutive preprocessed frames, yielding a 128×128×4 tensor. This allows the network to extract not only static configurations (vehicle positions, queues) but also implicitly encode motion (vehicle speeds and directions).
This approach avoids use of any explicit, hand-designed traffic features (e.g., queue lengths, velocities), relying entirely on the representational capacity of deep convolutional networks.
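The state-construction pipeline is straightforward to express in code. The sketch below, assuming NumPy and OpenCV are available, shows the grayscale conversion, resizing, and four-frame stacking described above; how the SUMO-GUI screenshot is obtained is left to the caller.

```python
# Minimal sketch of the state-construction pipeline (grayscale, resize, stack).
from collections import deque

import cv2
import numpy as np

FRAME_SIZE = (128, 128)   # target resolution after resizing
STACK_LEN = 4             # number of consecutive frames per state


def preprocess(rgb_frame: np.ndarray) -> np.ndarray:
    """Convert an RGB screenshot to a 128x128 grayscale image scaled to [0, 1]."""
    gray = cv2.cvtColor(rgb_frame, cv2.COLOR_RGB2GRAY)
    resized = cv2.resize(gray, FRAME_SIZE, interpolation=cv2.INTER_AREA)
    return resized.astype(np.float32) / 255.0


class FrameStacker:
    """Maintains the last four preprocessed frames as a 128x128x4 state tensor."""

    def __init__(self):
        self.frames = deque(maxlen=STACK_LEN)

    def reset(self, first_rgb_frame: np.ndarray) -> np.ndarray:
        frame = preprocess(first_rgb_frame)
        for _ in range(STACK_LEN):          # pad the stack at episode start
            self.frames.append(frame)
        return np.stack(self.frames, axis=-1)

    def step(self, rgb_frame: np.ndarray) -> np.ndarray:
        self.frames.append(preprocess(rgb_frame))
        return np.stack(self.frames, axis=-1)   # shape (128, 128, 4)
```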
2. Action Space and Traffic Phase Modeling
The action space is deliberately minimal:
- $\mathcal{A} = \{\text{NSG}, \text{EWG}\}$: At each decision step, the agent selects either North-South Green (NSG: all north-south directions green, east-west directions red) or East-West Green (EWG: all east-west directions green, north-south red).
- Phase actuation: The simulator transitions to the chosen phase immediately; yellow or all-red intervals are not explicitly modeled in the RL agent (this is managed by the simulator or system side as needed).
The binary action space simplifies policy learning and remains sufficient to demonstrate strong reductions in delay and queueing with respect to baselines.
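A minimal sketch of this action space follows. The phase state strings and link ordering are illustrative assumptions (they depend on how the intersection's connections are defined in the network file), and the actual actuation call is left to a caller-supplied function, for example a thin wrapper around TraCI's traffic-light interface.

```python
# Sketch of the binary action space; phase strings are illustrative only.
from enum import IntEnum


class Action(IntEnum):
    NSG = 0   # North-South Green, East-West red
    EWG = 1   # East-West Green, North-South red


# 'G' = green, 'r' = red; one character per signal-controlled connection.
# The exact string layout depends on the intersection definition.
PHASE_STATES = {
    Action.NSG: "GGGGrrrrGGGGrrrr",
    Action.EWG: "rrrrGGGGrrrrGGGG",
}


def apply_action(action: int, set_phase_state) -> None:
    """Push the chosen phase to the simulator through a caller-supplied setter
    (e.g., a small wrapper around the TraCI traffic-light API)."""
    set_phase_state(PHASE_STATES[Action(action)])
```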
3. Q-Network Architecture
The agent employs a deep convolutional Q-network (DQN) and a separate target network:
- Online Q-network $Q(s, a; \theta)$:
- Layer 1: Convolutional, 16 filters, stride 4, ReLU activation.
- Layer 2: Convolutional, 32 filters, stride 2, ReLU activation.
- Layer 3: Fully connected, 256 units, ReLU.
- Output: Fully connected, 2 units (one per action), linear activation.
- Target Q-network $Q(s, a; \theta^-)$:
Identical architecture; the parameters $\theta^-$ are periodically synchronized (copied) from the online network every $C$ update steps.
This configuration provides sufficient representational power to process raw stacked vision input, automatically extracting spatial queue, lane occupancy, and motion cues essential to control.
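For concreteness, the sketch below expresses this architecture in PyTorch. The 8×8 and 4×4 filter sizes are assumptions (only filter counts and strides are recorded above), and the flattened feature size is inferred at construction time rather than hardcoded.

```python
# Q-network sketch following the layer list above; filter sizes are assumed.
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    def __init__(self, num_actions: int = 2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # Layer 1 (8x8 assumed)
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # Layer 2 (4x4 assumed)
            nn.ReLU(),
            nn.Flatten(),
        )
        # Infer the flattened feature size with a dummy forward pass so the
        # fully connected head matches the conv output exactly.
        with torch.no_grad():
            n_features = self.conv(torch.zeros(1, 4, 128, 128)).shape[1]
        self.head = nn.Sequential(
            nn.Linear(n_features, 256),                  # Layer 3
            nn.ReLU(),
            nn.Linear(256, num_actions),                 # Output: one Q-value per action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.conv(x))


online_net = QNetwork()
target_net = QNetwork()
target_net.load_state_dict(online_net.state_dict())  # periodic synchronization
```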
4. Deep Q-Learning and Training Protocol
The DQTSCA is trained using standard DQN methods with key stabilizations:
- Reward function:
At step $t$, $r_t = D_{t-1} - D_t$, i.e., the decrease in cumulative delay between consecutive steps, where $D_t$ is the cumulative delay of all vehicles up to step $t$. Thus, $r_t > 0$ indicates a reduction in total delay.
- Bellman target for the parameter update:
$y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-)$, with discount factor $\gamma$.
- Loss (mean squared TD error over a minibatch $B$):
$L(\theta) = \frac{1}{|B|} \sum_{(s_t, a_t, r_t, s_{t+1}) \in B} \big( y_t - Q(s_t, a_t; \theta) \big)^2$
- Gradient-based parameter update:
$\theta$ is updated by descending the gradient $\nabla_\theta L(\theta)$ using the Adam optimizer (learning rate not recorded).
- Experience replay:
Transitions $(s_t, a_t, r_t, s_{t+1})$ are stored in a replay buffer (capacity not explicitly stated; assumed 50,000). Each gradient step samples a random minibatch of 32 transitions to break sample autocorrelation.
- Exploration:
$\epsilon$-greedy policy, with $\epsilon$ annealed over training (annealing schedule not specified).
- Training schedule:
2 million training steps in total (1,050 epochs of 10 SUMO episodes each). After every 10 training episodes, 5 evaluation episodes are run for metric tracking.
- Target network update:
Parameters are copied from the online network every $C$ steps (the specific value of $C$ is not recorded; standard DQN defaults are assumed).
Training is stable throughout, without the reward divergence or oscillations reported in some previous vision-based traffic DQN experiments.
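The update described above can be sketched as follows, reusing `online_net` and `target_net` from the Section 3 sketch. The buffer capacity, learning rate, discount value, fixed $\epsilon$, and sync interval $C$ are illustrative assumptions, since the section leaves them unspecified; the block shows replay sampling, the Bellman target, the mean squared TD loss, the Adam step, and periodic target synchronization.

```python
# Sketch of one DQN training iteration; hyperparameter values are assumed.
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

GAMMA = 0.99            # discount factor gamma (assumed value)
BATCH_SIZE = 32         # minibatch size, as stated above
SYNC_EVERY = 1_000      # target-network sync interval C (assumed value)
EPSILON = 0.1           # exploration rate; annealing schedule omitted

replay_buffer = deque(maxlen=50_000)   # stores (s, a, r, s_next, done) tuples
optimizer = torch.optim.Adam(online_net.parameters(), lr=1e-4)  # lr assumed
loss_fn = nn.MSELoss()


def select_action(state: np.ndarray, epsilon: float = EPSILON) -> int:
    """Epsilon-greedy choice over the two phase actions (0 = NSG, 1 = EWG)."""
    if random.random() < epsilon:
        return random.randrange(2)
    with torch.no_grad():
        s = torch.as_tensor(state[None], dtype=torch.float32).permute(0, 3, 1, 2)
        return int(online_net(s).argmax(dim=1).item())


def train_step(step: int) -> None:
    """One gradient update on a random minibatch from the replay buffer."""
    if len(replay_buffer) < BATCH_SIZE:
        return
    batch = random.sample(replay_buffer, BATCH_SIZE)
    s, a, r, s_next, done = (np.stack(x) for x in zip(*batch))
    s = torch.as_tensor(s, dtype=torch.float32).permute(0, 3, 1, 2)
    s_next = torch.as_tensor(s_next, dtype=torch.float32).permute(0, 3, 1, 2)
    a = torch.as_tensor(a, dtype=torch.int64)
    r = torch.as_tensor(r, dtype=torch.float32)
    done = torch.as_tensor(done, dtype=torch.float32)

    # Bellman target y = r + gamma * max_a' Q(s', a'; theta^-), cut off at episode end.
    with torch.no_grad():
        y = r + GAMMA * (1.0 - done) * target_net(s_next).max(dim=1).values

    # Mean squared TD error over the minibatch.
    q_sa = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = loss_fn(q_sa, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Periodically copy online weights into the target network.
    if step % SYNC_EVERY == 0:
        target_net.load_state_dict(online_net.state_dict())
```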
5. Benchmarks, Experimental Protocol, and Results
- Simulator: SUMO-GUI v0.28; 4-way intersection with 4 incoming and 4 outgoing lanes per approach; vehicles are generated per route with a constant probability of 0.1 per simulation step.
- Baselines:
- Fixed-time controller: Alternates the green phase on a fixed cycle, with an equal time split for all phases (cycle length not recorded).
- Shallow neural network (SNN): RL agent with queue length and phase as state, single hidden layer (64 units).
- Evaluation metrics (a bookkeeping sketch appears at the end of this section):
- Average cumulative delay per vehicle
- Average queue length (vehicles waiting per intersection)
- Average episode reward
- Results (mean over the last 100 epochs):
- DQTSCA vs. SNN:
- Delay reduced by 68%
- Queue reduced by 73%
Learning curves exhibit monotonically improving reward together with decreasing delay and queue length. No instability or unlearning events are reported during extended training.
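As a complement, the sketch below shows one way to track the two traffic metrics during an evaluation episode. It assumes the caller can query each vehicle's cumulative waiting time and the number of currently halted vehicles at every step (e.g., through TraCI); the query calls themselves are left abstract.

```python
# Sketch of the evaluation bookkeeping for delay and queue-length metrics.
from statistics import mean
from typing import Dict, List


class EpisodeMetrics:
    def __init__(self):
        self.delay_per_vehicle: Dict[str, float] = {}   # vehicle id -> cumulative delay (s)
        self.queue_samples: List[int] = []              # halted-vehicle count per step

    def record_step(self, waiting_times: Dict[str, float], halted_count: int) -> None:
        # Keep the latest cumulative delay seen for each vehicle.
        self.delay_per_vehicle.update(waiting_times)
        self.queue_samples.append(halted_count)

    def average_delay(self) -> float:
        """Average cumulative delay per vehicle over the episode."""
        return mean(self.delay_per_vehicle.values()) if self.delay_per_vehicle else 0.0

    def average_queue_length(self) -> float:
        """Average number of waiting vehicles per simulation step."""
        return mean(self.queue_samples) if self.queue_samples else 0.0
```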
6. Discussion: Representation, Stability, and Limitations
- Raw image input enables generalization across intersection geometries without relying on hand-crafted feature extraction, detector logic, or lane-parsing rules.
- Convolutional networks efficiently extract latent spatial (vehicle presence, queues) and motion (frame-to-frame change yields speed, flow) features.
- Replay buffer and target network are critical for stabilizing learning from high-dimensional, rapidly varying input streams. The approach avoids instability issues observed in earlier DQN traffic signal experiments that lacked these mechanisms.
Limitations and Potential Extensions
- Scope: The reported DQTSCA operates solely on isolated intersections; generalization to arterial corridors or networked signals is not addressed in this configuration.
- Phase modeling: Yellow and all-red intervals (clearance phases) are not explicitly represented and are not agent-controlled. Their introduction would be required for deployment in real-world controllers.
- Exploration and replay buffer sizes: Schedules and buffer capacities are not fully specified and may merit further tuning for optimal generalization and learning efficiency.
- Reward function: The present design uses only delay as feedback; richer signals (e.g., incorporating throughput, fairness, emissions) could be considered in future iterations.
7. Significance and Contributions
This DQTSCA formulation demonstrates that direct deep RL methods leveraging raw vision as state, minimal action spaces, and carefully structured deep Q-learning can yield stable, high-performance adaptive traffic signal control in simulation. The agent achieves large reductions in vehicle queueing and delay compared to both fixed-time and conventional shallow RL controllers. The design provides a strong baseline for researchers developing DQN agents for traffic control, emphasizing stability benefits from experience replay and target networks, as well as the value of end-to-end perceptual learning pipelines for urban mobility control (Mousavi et al., 2017).