GRU Policy in Reinforcement Learning
- GRU Policy is a reinforcement learning strategy that uses GRU networks to approximate value functions from fixed-length sequences of one-hot encoded observations, enhancing learning in non-Markovian environments.
- It integrates GRU cells within fitted Q-iteration and advantage-learning frameworks to improve sample efficiency, reduce reward variance, and accelerate convergence.
- Empirical results demonstrate that GRU policies outperform LSTM and evolutionary methods in convergence speed, final rewards, and computational efficiency in discrete, partially observable tasks.
A Gated Recurrent Unit (GRU) Policy in reinforcement learning leverages the GRU neural architecture as a recurrent function approximator within fitted Q-iteration schemes, facilitating efficient policy learning in partially observable environments. The essence of the GRU policy is the use of GRU networks to estimate value functions, either Q-values or Advantage-values, from fixed-length sequences of discrete, one-hot encoded observations. Distinct from classical memoryless approaches, a GRU-based policy is capable of utilizing temporal memory to infer unobserved state information, thereby improving sample complexity and final policy quality in non-Markovian domains (Steckelmacher et al., 2015).
1. GRU Cell Architecture
The GRU cell operates at each time step on input $x_t$ and previous hidden state $h_{t-1}$, executing the following computations:
- Update gate: $z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$
- Reset gate: $r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$
- Candidate activation: $\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$
- Hidden state update: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$
Here, $\sigma$ denotes the logistic sigmoid, $\odot$ is element-wise multiplication, and the $W$, $U$, $b$ matrices and vectors are trainable parameters. In the examined architecture, each GRU layer is configured with 100 units.
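As a concrete illustration, the four equations above can be implemented directly. The sketch below is a minimal NumPy forward pass, not the paper's code; the parameter names (`W`, `U`, `b`) and the initialization scheme are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell implementing the update/reset/candidate equations.

    The paper's 100-unit layers correspond to hidden_size=100; the uniform
    initialization here is an illustrative choice, not the source's.
    """

    def __init__(self, input_size, hidden_size, rng=None):
        rng = rng or np.random.default_rng(0)
        scale = 1.0 / np.sqrt(hidden_size)
        # One (W, U, b) triple per gate: update (z), reset (r), candidate (h).
        self.W = {g: rng.uniform(-scale, scale, (hidden_size, input_size)) for g in "zrh"}
        self.U = {g: rng.uniform(-scale, scale, (hidden_size, hidden_size)) for g in "zrh"}
        self.b = {g: np.zeros(hidden_size) for g in "zrh"}

    def step(self, x, h_prev):
        # Update gate: z_t = sigma(W_z x_t + U_z h_{t-1} + b_z)
        z = sigmoid(self.W["z"] @ x + self.U["z"] @ h_prev + self.b["z"])
        # Reset gate: r_t = sigma(W_r x_t + U_r h_{t-1} + b_r)
        r = sigmoid(self.W["r"] @ x + self.U["r"] @ h_prev + self.b["r"])
        # Candidate: h~_t = tanh(W_h x_t + U_h (r_t * h_{t-1}) + b_h)
        h_cand = np.tanh(self.W["h"] @ x + self.U["h"] @ (r * h_prev) + self.b["h"])
        # Hidden state update: h_t = (1 - z_t) * h_{t-1} + z_t * h~_t
        return (1.0 - z) * h_prev + z * h_cand
```

Because the reset gate only modulates the previous hidden state inside the candidate computation, the cell can discard stale history while the update gate interpolates between memory and new input.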
2. Integration into Fitted Q-Iteration
The GRU policy is implemented as the function approximator for $Q$ within the Neural Fitted Q-Iteration (NFQ) paradigm. The protocol comprises:
- Data collection: For each of 5000 episodes (max 500 steps per episode), the agent observes $o_t$, selects action $a_t$ via softmax over $Q(o_t, \cdot)$ (temperature 0.5), receives reward $r_t$, and stores the transition $(o_t, a_t, r_t, o_{t+1})$.
- Batch target computation: Every 10 episodes, for each transition, the target label is
$y_t = Q(o_t, a_t) + \alpha \left( r_t + \gamma \max_{a'} Q(o_{t+1}, a') - Q(o_t, a_t) \right)$
with learning rate $\alpha$ and discount factor $\gamma$ as update parameters.
- Offline training: Sequences of up to 10 one-hot encoded observations (padded/truncated as needed) are fed into the network: Input → Dense(100, tanh) → GRU(100) → Dense(2, linear). The mean-squared error between estimated and target values forms the loss. Optimization is performed with RMSProp or Adam, using batch size 10 and 2 epochs per update.
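The collection and target-computation steps above can be sketched as follows. This is a hedged illustration, not the paper's implementation: `q_fn` stands in for the recurrent network's forward pass, and the `alpha`/`gamma` values passed in the usage below are placeholders, since the source text does not state them.

```python
import numpy as np

def softmax_action(q_values, temperature=0.5, rng=None):
    """Sample an action from a softmax over Q-values (temperature 0.5,
    matching the data-collection protocol described above)."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(q_values, dtype=float) / temperature
    p = np.exp(logits - logits.max())  # subtract max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

def nfq_targets(batch, q_fn, alpha, gamma):
    """Soft fitted-Q target labels for transitions (o, a, r, o_next, done).

    q_fn maps an observation history to a vector of Q-values, one per action.
    Only the taken action's entry is moved toward the bootstrapped return:
    y = Q(o,a) + alpha * (r + gamma * max_a' Q(o',a') - Q(o,a)).
    """
    targets = []
    for o, a, r, o_next, done in batch:
        q = np.array(q_fn(o), dtype=float)
        bootstrap = 0.0 if done else gamma * max(q_fn(o_next))
        q[a] += alpha * (r + bootstrap - q[a])
        targets.append(q)
    return np.array(targets)

# Hypothetical usage with a dummy Q-function returning zeros:
batch = [(None, 0, 10.0, None, True)]
targets = nfq_targets(batch, lambda o: [0.0, 0.0], alpha=0.5, gamma=0.9)
```

These targets are then regressed offline against the network's outputs with a mean-squared-error loss, as described in the offline-training step.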
3. Advantage-Learning Variant
An alternative training regime is provided by Advantage-learning, where the value function takes the form $A(o, a)$ with a scaling parameter $\kappa$. The parameter update involves the temporal-difference error:
$\delta_t = r_t + \gamma V(o_{t+1}) - V(o_t)$
where $V(o) = \max_a A(o, a)$. The regression target is updated by
$y_t = V(o_t) + \frac{\delta_t}{\kappa}$
with a batch procedure similar to that of Q-learning. This approach tends to yield lower variance and faster convergence relative to standard Q-learning, except in stochastic environments, where Q-learning may exhibit marginally faster learning (Steckelmacher et al., 2015).
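The advantage-learning target computation can be sketched in the same style as the Q-learning batch procedure. This is an illustrative reading of the update, with `a_fn` standing in for the network's advantage output and the `gamma`/`kappa` values in the usage below chosen arbitrarily, since the source does not state them.

```python
import numpy as np

def advantage_targets(batch, a_fn, gamma, kappa):
    """Advantage-learning regression targets for transitions
    (o, a, r, o_next, done), using V(o) = max_a A(o, a).

    The TD error delta = r + gamma * V(o') - V(o) is scaled by 1/kappa,
    which amplifies the action gap relative to plain Q-learning.
    """
    targets = []
    for o, a, r, o_next, done in batch:
        adv = np.array(a_fn(o), dtype=float)
        v = adv.max()
        v_next = 0.0 if done else max(a_fn(o_next))
        delta = r + gamma * v_next - v          # temporal-difference error
        y = adv.copy()
        y[a] = v + delta / kappa                # target for the taken action
        targets.append(y)
    return np.array(targets)

# Hypothetical usage: A(o, .) = [1, 0], taken action 1, terminal reward 2.
targets = advantage_targets([(None, 1, 2.0, None, True)],
                            lambda o: [1.0, 0.0], gamma=0.9, kappa=0.5)
```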
4. Implementation Details and Hyperparameters
Key parameters and design choices follow:
- Input representation: Each discrete observation scalar (e.g., position or orientation) is one-hot encoded and concatenated, yielding an input vector of length up to 15.
- Sequence window: All training sequences are of fixed length 10 (with initial padding).
- Network layers: Dense(100, tanh) → GRU(100) → Dense(3, linear); a softmax output is used for action selection during experience collection.
- Training regime:
- 5000 episodes per experiment
- Max 500 steps per episode
- Batch update every 10 episodes
- 2 training epochs per batch
- Batch size 10
- Update parameters: learning rate $\alpha$, discount factor $\gamma$
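The input-representation and sequence-window choices above can be sketched as a small preprocessing step. The helper names and the example cardinalities below are hypothetical; only the one-hot concatenation and the fixed 10-step, zero-padded window follow the description.

```python
import numpy as np

def encode_observation(values, cardinalities):
    """Concatenate one-hot encodings of discrete observation scalars.

    values: one integer per observation component; cardinalities: the
    number of possible values for each component (assumed, for illustration).
    """
    parts = [np.eye(c)[v] for v, c in zip(values, cardinalities)]
    return np.concatenate(parts)

def make_window(encoded_obs, length=10):
    """Fixed-length history window: keep the last `length` encoded
    observations and left-pad with zeros (assumes a non-empty history)."""
    dim = encoded_obs[0].shape[0]
    window = np.zeros((length, dim))
    tail = encoded_obs[-length:]
    window[length - len(tail):] = np.stack(tail)
    return window

# Hypothetical usage: two components with 5 and 4 possible values each
# give a 9-dimensional input vector, within the stated bound of 15.
x = encode_observation([2, 1], [5, 4])
window = make_window([x, x, x], length=10)  # shape (10, 9), first 7 rows zero
```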
- Computational profile: On the referenced hardware, GRU agents completed the full training sequence in approximately half the CPU time required by LSTM agents.
5. Empirical Results and Performance Metrics
Performance was quantified using two primary metrics:
- Learning time: The earliest step at which the mean reward over the ensuing 1000 steps surpasses –15 (with standard deviation at most 20).
- Learning performance: Maximum average reward achieved in any 1000-step interval.
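Both metrics are simple functions of the per-step reward trace, and can be sketched as follows; the function names are hypothetical, and the thresholds mirror the definitions above.

```python
import numpy as np

def learning_time(rewards, window=1000, threshold=-15.0):
    """Earliest step t at which the mean reward over the next `window`
    steps exceeds `threshold`; returns None if never reached."""
    r = np.asarray(rewards, dtype=float)
    for t in range(len(r) - window + 1):
        if r[t:t + window].mean() > threshold:
            return t
    return None

def learning_performance(rewards, window=1000):
    """Maximum mean reward over any `window`-step interval."""
    r = np.asarray(rewards, dtype=float)
    # Sliding means via convolution with a uniform kernel.
    means = np.convolve(r, np.ones(window) / window, mode="valid")
    return float(means.max())
```

Tracking both metrics separates "how fast the agent learns" from "how good the final policy is", which is why the hyperparameter advice below recommends monitoring the two jointly.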
Across multiple environments, especially partially observable grid worlds, GRU policies achieved faster convergence and higher final rewards than both LSTM and the evolutionary MUT1 architecture, with most pairwise improvements statistically significant, including the final-reward advantage over LSTM. Advantage-learning further reduced reward variance and improved convergence speed, except in the stochastic variant. Reward curves confirm that GRU-based policies achieve superior early and final rewards relative to alternatives (Steckelmacher et al., 2015).
6. Practical Recommendations and Insights
- GRU cells combine parameter efficiency with robust temporal memory, making them well-suited for partially observable tasks with non-Markovian structure.
- Maintaining fixed-length (e.g., 10-step) history windows for recurrent input provides a favorable balance of representational capacity and computational cost.
- One-hot encoding of discrete observations is effective for input preprocessing.
- Small-batch, low-epoch retraining (every 10 episodes, 2 epochs) helps prevent overfitting.
- Monitoring both learning time and learning performance is critical when optimizing hyperparameters.
- Empirical evidence supports the use of GRUs over LSTM and evolved architectures (MUT1) under the fitted Q-iteration protocol in partially observable environments, both in terms of sample efficiency and computational overhead (Steckelmacher et al., 2015).