GRU Policy in Reinforcement Learning
- GRU Policy is a reinforcement learning strategy that uses GRU networks to approximate value functions from fixed-length sequences of one-hot encoded observations, enhancing learning in non-Markovian environments.
- It integrates GRU cells within fitted Q-iteration and advantage-learning frameworks to improve sample efficiency, reduce reward variance, and accelerate convergence.
- Empirical results demonstrate that GRU policies outperform LSTM and evolutionary methods in convergence speed, final rewards, and computational efficiency in discrete, partially observable tasks.
A Gated Recurrent Unit (GRU) Policy in reinforcement learning leverages the GRU neural architecture as a recurrent function approximator within fitted Q-iteration schemes, facilitating efficient policy learning in partially observable environments. The essence of the GRU policy is the use of GRU networks to estimate value functions, either Q-values or Advantage-values, from fixed-length sequences of discrete, one-hot encoded observations. Distinct from classical memoryless approaches, a GRU-based policy is capable of utilizing temporal memory to infer unobserved state information, thereby improving sample complexity and final policy quality in non-Markovian domains (Steckelmacher et al., 2015).
1. GRU Cell Architecture
The GRU cell operates at each time step on input $x_t$ and previous hidden state $h_{t-1}$, executing the following computations:
- Update gate: $z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)$
- Reset gate: $r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)$
- Candidate activation: $\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)$
- Hidden state update: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$
Here, $\sigma$ denotes the logistic sigmoid, $\odot$ is element-wise multiplication, and the $W$, $U$, $b$ matrices and vectors are trainable parameters. In the examined architecture, each GRU layer is configured with 100 units.
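As a concrete illustration, the four equations above can be implemented directly. The sketch below is a minimal NumPy forward pass, not the paper's code; the parameter names (`W`, `U`, `b`) and the initialization scheme are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell implementing the update/reset/candidate equations.

    The paper's 100-unit layers correspond to hidden_size=100; the uniform
    initialization here is an illustrative choice, not the source's.
    """

    def __init__(self, input_size, hidden_size, rng=None):
        rng = rng or np.random.default_rng(0)
        scale = 1.0 / np.sqrt(hidden_size)
        # One (W, U, b) triple per gate: update (z), reset (r), candidate (h).
        self.W = {g: rng.uniform(-scale, scale, (hidden_size, input_size)) for g in "zrh"}
        self.U = {g: rng.uniform(-scale, scale, (hidden_size, hidden_size)) for g in "zrh"}
        self.b = {g: np.zeros(hidden_size) for g in "zrh"}

    def step(self, x, h_prev):
        # Update gate: z_t = sigma(W_z x_t + U_z h_{t-1} + b_z)
        z = sigmoid(self.W["z"] @ x + self.U["z"] @ h_prev + self.b["z"])
        # Reset gate: r_t = sigma(W_r x_t + U_r h_{t-1} + b_r)
        r = sigmoid(self.W["r"] @ x + self.U["r"] @ h_prev + self.b["r"])
        # Candidate: h~_t = tanh(W_h x_t + U_h (r_t * h_{t-1}) + b_h)
        h_cand = np.tanh(self.W["h"] @ x + self.U["h"] @ (r * h_prev) + self.b["h"])
        # Hidden state update: h_t = (1 - z_t) * h_{t-1} + z_t * h~_t
        return (1.0 - z) * h_prev + z * h_cand
```

Because the reset gate only modulates the previous hidden state inside the candidate computation, the cell can discard stale history while the update gate interpolates between memory and new input.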
2. Integration into Fitted Q-Iteration
The GRU policy is implemented as the function approximator for $Q$ within the Neural Fitted Q-Iteration (NFQ) paradigm. The protocol comprises:
- Data collection: For each of 5000 episodes (max 500 steps per episode), the agent observes $o_t$, selects action $a_t$ via softmax over $Q(o_t, \cdot)$ (temperature 0.5), receives reward $r_t$, and stores the transition $(o_t, a_t, r_t, o_{t+1})$.
- Batch target computation: Every 10 episodes, for each transition, the target label is
$y_t = Q(o_t, a_t) + \alpha \left( r_t + \gamma \max_{a'} Q(o_{t+1}, a') - Q(o_t, a_t) \right)$
with learning rate $\alpha$ and discount factor $\gamma$ as update parameters.
- Offline training: Sequences of up to 10 one-hot encoded observations (padded/truncated as needed) are fed into the network: Input → Dense(100, tanh) → GRU(100) → Dense(2, linear). The mean-squared error between estimated and target values forms the loss. Optimization is performed with RMSProp or Adam, using batch size 10 and 2 epochs per update.
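The collection and target-computation steps above can be sketched as follows. This is a hedged illustration, not the paper's implementation: `q_fn` stands in for the recurrent network's forward pass, and the `alpha`/`gamma` values passed in the usage below are placeholders, since the source text does not state them.

```python
import numpy as np

def softmax_action(q_values, temperature=0.5, rng=None):
    """Sample an action from a softmax over Q-values (temperature 0.5,
    matching the data-collection protocol described above)."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(q_values, dtype=float) / temperature
    p = np.exp(logits - logits.max())  # subtract max for numerical stability
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

def nfq_targets(batch, q_fn, alpha, gamma):
    """Soft fitted-Q target labels for transitions (o, a, r, o_next, done).

    q_fn maps an observation history to a vector of Q-values, one per action.
    Only the taken action's entry is moved toward the bootstrapped return:
    y = Q(o,a) + alpha * (r + gamma * max_a' Q(o',a') - Q(o,a)).
    """
    targets = []
    for o, a, r, o_next, done in batch:
        q = np.array(q_fn(o), dtype=float)
        bootstrap = 0.0 if done else gamma * max(q_fn(o_next))
        q[a] += alpha * (r + bootstrap - q[a])
        targets.append(q)
    return np.array(targets)

# Hypothetical usage with a dummy Q-function returning zeros:
batch = [(None, 0, 10.0, None, True)]
targets = nfq_targets(batch, lambda o: [0.0, 0.0], alpha=0.5, gamma=0.9)
```

These targets are then regressed offline against the network's outputs with a mean-squared-error loss, as described in the offline-training step.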
3. Advantage-Learning Variant
An alternative training regime is provided by Advantage-learning, where the value function takes the form $A(o, a)$ with a scaling parameter $\kappa$. The parameter update involves the temporal-difference error:
$\delta_t = r_t + \gamma V(o_{t+1}) - V(o_t)$
where $V(o) = \max_a A(o, a)$. The regression target is updated by
$y_t = V(o_t) + \frac{\delta_t}{\kappa}$
with a batch procedure similar to that of Q-learning. This approach tends to yield lower variance and faster convergence relative to standard Q-learning, except in stochastic environments, where Q-learning may exhibit marginally faster learning (Steckelmacher et al., 2015).
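The advantage-learning target computation can be sketched in the same style as the Q-learning batch procedure. This is an illustrative reading of the update, with `a_fn` standing in for the network's advantage output and the `gamma`/`kappa` values in the usage below chosen arbitrarily, since the source does not state them.

```python
import numpy as np

def advantage_targets(batch, a_fn, gamma, kappa):
    """Advantage-learning regression targets for transitions
    (o, a, r, o_next, done), using V(o) = max_a A(o, a).

    The TD error delta = r + gamma * V(o') - V(o) is scaled by 1/kappa,
    which amplifies the action gap relative to plain Q-learning.
    """
    targets = []
    for o, a, r, o_next, done in batch:
        adv = np.array(a_fn(o), dtype=float)
        v = adv.max()
        v_next = 0.0 if done else max(a_fn(o_next))
        delta = r + gamma * v_next - v          # temporal-difference error
        y = adv.copy()
        y[a] = v + delta / kappa                # target for the taken action
        targets.append(y)
    return np.array(targets)

# Hypothetical usage: A(o, .) = [1, 0], taken action 1, terminal reward 2.
targets = advantage_targets([(None, 1, 2.0, None, True)],
                            lambda o: [1.0, 0.0], gamma=0.9, kappa=0.5)
```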
4. Implementation Details and Hyperparameters
Key parameters and design choices follow:
- Input representation: Each discrete observation scalar (e.g., position or orientation) is one-hot encoded and concatenated, yielding an input vector of length up to 15.
- Sequence window: All training sequences are of fixed length 10 (with initial padding).
- Network layers: Dense(100, tanh) → GRU(100) → Dense(3, linear); a softmax output is used for action selection during experience collection.
- Training regime:
- 5000 episodes per experiment
- Max 500 steps per episode
- Batch update every 10 episodes
- 2 training epochs per batch
- Batch size 10
- Update parameters: learning rate $\alpha$, discount factor $\gamma$
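The input-representation and sequence-window choices above can be sketched as a small preprocessing step. The helper names and the example cardinalities below are hypothetical; only the one-hot concatenation and the fixed 10-step, zero-padded window follow the description.

```python
import numpy as np

def encode_observation(values, cardinalities):
    """Concatenate one-hot encodings of discrete observation scalars.

    values: one integer per observation component; cardinalities: the
    number of possible values for each component (assumed, for illustration).
    """
    parts = [np.eye(c)[v] for v, c in zip(values, cardinalities)]
    return np.concatenate(parts)

def make_window(encoded_obs, length=10):
    """Fixed-length history window: keep the last `length` encoded
    observations and left-pad with zeros (assumes a non-empty history)."""
    dim = encoded_obs[0].shape[0]
    window = np.zeros((length, dim))
    tail = encoded_obs[-length:]
    window[length - len(tail):] = np.stack(tail)
    return window

# Hypothetical usage: two components with 5 and 4 possible values each
# give a 9-dimensional input vector, within the stated bound of 15.
x = encode_observation([2, 1], [5, 4])
window = make_window([x, x, x], length=10)  # shape (10, 9), first 7 rows zero
```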
- Computational profile: On the referenced hardware, GRU agents completed the full training sequence in approximately half the CPU time required by LSTM agents.
5. Empirical Results and Performance Metrics
Performance was quantified using two primary metrics:
- Learning time: The earliest step at which the mean reward over the ensuing 1000 steps surpasses –15 (with standard deviation at most 20).
- Learning performance: Maximum average reward achieved in any 1000-step interval.
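Both metrics are simple functions of the per-step reward trace, and can be sketched as follows; the function names are hypothetical, and the thresholds mirror the definitions above.

```python
import numpy as np

def learning_time(rewards, window=1000, threshold=-15.0):
    """Earliest step t at which the mean reward over the next `window`
    steps exceeds `threshold`; returns None if never reached."""
    r = np.asarray(rewards, dtype=float)
    for t in range(len(r) - window + 1):
        if r[t:t + window].mean() > threshold:
            return t
    return None

def learning_performance(rewards, window=1000):
    """Maximum mean reward over any `window`-step interval."""
    r = np.asarray(rewards, dtype=float)
    # Sliding means via convolution with a uniform kernel.
    means = np.convolve(r, np.ones(window) / window, mode="valid")
    return float(means.max())
```

Tracking both metrics separates "how fast the agent learns" from "how good the final policy is", which is why the hyperparameter advice below recommends monitoring the two jointly.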
Across multiple environments, especially partially observable grid worlds, GRU policies achieved faster convergence and higher final rewards than both LSTM and the evolutionary MUT1 architecture, with most pairwise improvements statistically significant, including the final-reward advantage over LSTM. Advantage-learning further reduced reward variance and improved convergence speed, except in the stochastic variant. Reward curves confirm that GRU-based policies achieve superior early and final rewards relative to alternatives (Steckelmacher et al., 2015).
6. Practical Recommendations and Insights
- GRU cells combine parameter efficiency with robust temporal memory, making them well-suited for partially observable tasks with non-Markovian structure.
- Maintaining fixed-length (e.g., 10-step) history windows for recurrent input provides a favorable balance of representational capacity and computational cost.
- One-hot encoding of discrete observations is effective for input preprocessing.
- Small-batch, low-epoch retraining (every 10 episodes, 2 epochs) helps prevent overfitting.
- Monitoring both learning time and learning performance is critical when optimizing hyperparameters.
- Empirical evidence supports the use of GRUs over LSTM and evolved architectures (MUT1) under the fitted Q-iteration protocol in partially observable environments, both in terms of sample efficiency and computational overhead (Steckelmacher et al., 2015).