
GRU Policy in Reinforcement Learning

Updated 20 January 2026
  • GRU Policy is a reinforcement learning strategy that uses GRU networks to approximate value functions from fixed-length, one-hot encoded observations, enhancing learning in non-Markovian environments.
  • It integrates GRU cells within fitted Q-iteration and advantage-learning frameworks to improve sample efficiency, reduce reward variance, and accelerate convergence.
  • Empirical results demonstrate that GRU policies outperform LSTM and evolutionary methods in convergence speed, final rewards, and computational efficiency in discrete, partially observable tasks.

A Gated Recurrent Unit (GRU) Policy in reinforcement learning leverages the GRU neural architecture as a recurrent function approximator within fitted Q-iteration schemes, facilitating efficient policy learning in partially observable environments. The essence of the GRU policy is the use of GRU networks to estimate value functions, either Q-values or Advantage values, from fixed-length sequences of discrete, one-hot encoded observations. Distinct from classical memoryless approaches, a GRU-based policy is capable of utilizing temporal memory to infer unobserved state information, thereby improving sample complexity and final policy quality in non-Markovian domains (Steckelmacher et al., 2015).

1. GRU Cell Architecture

The GRU cell operates at each time step t on input x_t and previous hidden state h_{t-1}, executing the following computations:

  • Update gate:

z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)

  • Reset gate:

r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)

  • Candidate activation:

\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h)

  • Hidden state update:

h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t

Here, \sigma denotes the logistic sigmoid, \odot is element-wise multiplication, and the weight matrices W_z, W_r, W_h, U_z, U_r, U_h and biases b_z, b_r, b_h are trainable parameters. In the examined architecture, each GRU layer is configured with 100 units.
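The four update equations above can be checked with a minimal NumPy sketch of a single GRU step. The toy dimensions and random parameters below are illustrative only (the paper's layers use 100 units):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU time step: update gate, reset gate, candidate, hidden update."""
    z = sigmoid(params["Wz"] @ x_t + params["Uz"] @ h_prev + params["bz"])   # update gate
    r = sigmoid(params["Wr"] @ x_t + params["Ur"] @ h_prev + params["br"])   # reset gate
    h_tilde = np.tanh(params["Wh"] @ x_t
                      + params["Uh"] @ (r * h_prev) + params["bh"])          # candidate
    return (1.0 - z) * h_prev + z * h_tilde                                  # interpolation

# Tiny example: one-hot inputs of dimension 4, hidden dimension 3.
rng = np.random.default_rng(0)
dims = {"W": (3, 4), "U": (3, 3), "b": (3,)}
params = {k + g: rng.standard_normal(dims[k]) * 0.1
          for g in "zrh" for k in "WUb"}

h = np.zeros(3)
for x in np.eye(4):        # feed a short one-hot sequence step by step
    h = gru_step(x, h, params)
print(h.shape)  # (3,)
```

Because h_t is a convex combination of h_{t-1} and a tanh-bounded candidate, the hidden state stays in (-1, 1) at every step.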

2. Integration into Fitted Q-Iteration

The GRU policy is implemented as the function approximator for Q within the Neural Fitted Q-Iteration (NFQ) paradigm. The protocol comprises:

  • Data collection: For each of 5000 episodes (max 500 steps per episode), the agent observes o_t, selects a_t via softmax over the Q-values (temperature 0.5), receives reward r_t, and stores the transition (o_t, a_t, r_t, o_{t+1}).
  • Batch target computation: Every 10 episodes, for each transition, the target label is

Q_{k+1}(o_t, a_t) = (1 - \alpha) Q_k(o_t, a_t) + \alpha \left( r_t + \gamma \max_{a'} Q_k(o_{t+1}, a') \right)

with learning rate \alpha and discount factor \gamma.

  • Offline training: Sequences of up to 10 one-hot encoded observations (padded/truncated as needed) are fed into the network: Input → Dense(100, tanh) → GRU(100) → Dense(|A|, linear), where |A| is the number of actions. The mean-squared error between estimated and target values forms the loss. Optimization is performed with RMSProp or Adam, using batch size 10 and 2 epochs per update.
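The batch target computation can be sketched as follows. This is a minimal illustration, not the paper's code: the `nfq_targets` helper, the done-flag handling for terminal steps, and the example values of \alpha and \gamma are assumptions.

```python
import numpy as np

def nfq_targets(batch, q_values, alpha, gamma):
    """Fitted-Q regression targets: move Q(o_t, a_t) toward the bootstrapped
    return r_t + gamma * max_a Q(o_{t+1}, a), with step size alpha."""
    targets = q_values["current"].copy()   # Q_k(o_t, .) for each transition
    for i, (a_t, r_t, done) in enumerate(batch):
        bootstrap = 0.0 if done else gamma * np.max(q_values["next"][i])
        targets[i, a_t] = (1 - alpha) * targets[i, a_t] + alpha * (r_t + bootstrap)
    return targets

# Hypothetical 2-transition batch over 3 actions: (action, reward, terminal?).
q = {"current": np.array([[0.0, 1.0, 0.5],
                          [0.2, 0.0, 0.0]]),
     "next":    np.array([[0.5, 2.0, 0.0],
                          [0.0, 0.0, 0.0]])}
batch = [(1, 1.0, False), (0, -1.0, True)]
t = nfq_targets(batch, q, alpha=0.5, gamma=0.9)
print(t)  # only the taken actions' entries change
```

Only the entry for the action actually taken is regressed toward the bootstrapped return; the other Q-values keep their current estimates as targets, which is the standard fitted-Q convention for batch regression.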

3. Advantage-Learning Variant

An alternative training regime is provided by Advantage-learning, where the function approximator takes the form A(o_t, a_t), with associated state value V(o) = \max_a A(o, a). The parameter update involves the temporal-difference error

\delta_t = r_t + \gamma V(o_{t+1}) - V(o_t).

The regression target is updated by

A_{k+1}(o_t, a_t) = V(o_t) + \delta_t / \kappa,

where \kappa \le 1 is a scaling parameter, with the same batch procedure as Q-learning. This approach tends to yield lower variance and faster convergence relative to standard Q-learning, except in stochastic environments, where Q-learning may exhibit marginally faster learning (Steckelmacher et al., 2015).
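The advantage target can be sketched in NumPy. This is a minimal Baird-style illustration under the assumptions stated above (V as the max over advantages, \kappa \le 1); the array values, \gamma, and \kappa are arbitrary:

```python
import numpy as np

def advantage_target(A_curr, A_next, a_t, r_t, gamma, kappa):
    """Advantage-learning regression target for the taken action a_t,
    assuming V(o) = max_a A(o, a) and scaling parameter kappa <= 1."""
    v_curr = np.max(A_curr)                  # V(o_t)
    v_next = np.max(A_next)                  # V(o_{t+1})
    delta = r_t + gamma * v_next - v_curr    # TD error
    target = A_curr.copy()
    target[a_t] = v_curr + delta / kappa     # scale the correction by 1/kappa
    return target

A_curr = np.array([0.0, 0.4, 0.1])
A_next = np.array([0.2, 0.0, 0.6])
t = advantage_target(A_curr, A_next, a_t=1, r_t=1.0, gamma=0.9, kappa=0.5)
print(t)
```

With \kappa = 1 the target reduces to the ordinary Q-learning bootstrap; smaller \kappa amplifies the TD correction for the chosen action, which widens the gap between the best action and the rest.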

4. Implementation Details and Hyperparameters

Key parameters and design choices follow:

  • Input representation: Each discrete observation component (e.g., position, orientation) is one-hot encoded and concatenated, yielding an input vector of length up to 15.
  • Sequence window: All training sequences are of fixed length 10 (with initial padding).
  • Network layers: Dense(100, tanh) → GRU(100) → Dense(|A|, linear); a softmax output is used for action selection during experience collection.
  • Training regime:
    • 5000 episodes per experiment
    • Max 500 steps per episode
    • Batch update every 10 episodes
    • 2 training epochs per batch
    • Batch size 10
    • Per-update learning rate \alpha, discount factor \gamma
  • Computational profile: On the referenced hardware, GRU agents completed the full training sequence in approximately half the CPU time required by LSTM agents.
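The input representation and fixed-length sequence window above can be sketched as follows. The particular component split (a position in {0..9} plus an orientation in {0..4}, giving input dimension 15) is a hypothetical example, not the paper's exact encoding:

```python
import numpy as np

def encode_observation(components, sizes):
    """One-hot encode each discrete component and concatenate them."""
    return np.concatenate([np.eye(n)[c] for c, n in zip(components, sizes)])

def make_sequence(observations, length=10):
    """Fixed-length window: left-pad with zeros, or keep the last `length` steps."""
    obs = np.asarray(observations, dtype=float)
    if len(obs) >= length:
        return obs[-length:]                       # truncate: keep most recent
    pad = np.zeros((length - len(obs), obs.shape[1]))
    return np.vstack([pad, obs])                   # pad at the start of the episode

# Hypothetical observation: position in {0..9} plus orientation in {0..4}.
x = encode_observation([3, 2], sizes=[10, 5])
seq = make_sequence([x] * 4, length=10)
print(x.shape, seq.shape)  # (15,) (10, 15)
```

Left-padding keeps the most recent observation in the final row, so the recurrent layer always sees the newest information last, regardless of how far into the episode the agent is.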

5. Empirical Results and Performance Metrics

Performance was quantified using two primary metrics:

  • Learning time: The earliest step at which the mean reward over the ensuing 1000 steps surpasses –15 (with standard deviation below 20).
  • Learning performance: Maximum average reward achieved in any 1000-step interval.
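Both metrics can be computed from a per-step reward trace with a sliding window. The helper names and the synthetic trace below are illustrative assumptions:

```python
import numpy as np

def learning_time(step_rewards, threshold=-15.0, window=1000):
    """Earliest step t whose following `window` steps have mean reward above threshold."""
    r = np.asarray(step_rewards, dtype=float)
    # means[t] = mean of r[t : t + window]
    means = np.convolve(r, np.ones(window) / window, mode="valid")
    hits = np.where(means > threshold)[0]
    return int(hits[0]) if hits.size else None

def learning_performance(step_rewards, window=1000):
    """Maximum mean reward achieved over any `window`-step interval."""
    r = np.asarray(step_rewards, dtype=float)
    return float(np.max(np.convolve(r, np.ones(window) / window, mode="valid")))

# Synthetic trace: per-step reward ramps linearly from -50 toward 0.
trace = np.linspace(-50, 0, 5000)
print(learning_time(trace))
print(learning_performance(trace))
```

The two metrics are complementary: learning time rewards fast early progress, while learning performance captures the best policy quality ever reached, so tuning against only one can hide regressions in the other.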

Across multiple environments—especially partially observable grid worlds—GRU policies achieved faster convergence and higher final rewards than both LSTM and the evolutionary MUT1 architecture, with most pairwise improvements statistically significant, including a higher final reward versus LSTM. Advantage-learning further reduced reward variance and improved convergence speed, except in the stochastic variant. Reward curves confirm that GRU-based policies achieve superior early and final rewards relative to alternatives (Steckelmacher et al., 2015).

6. Practical Recommendations and Insights

  • GRU cells combine parameter efficiency with robust temporal memory, making them well-suited for partially observable tasks with non-Markovian structure.
  • Maintaining fixed-length (e.g., 10-step) history windows for recurrent input provides a favorable balance of representational capacity and computational cost.
  • One-hot encoding of discrete observations is effective for input preprocessing.
  • Small-batch, low-epoch retraining (every 10 episodes, 2 epochs) helps prevent overfitting.
  • Monitoring both learning time and learning performance is critical when optimizing hyperparameters.
  • Empirical evidence supports the use of GRUs over LSTM and evolved architectures (MUT1) under the fitted Q-iteration protocol in partially observable environments, both in terms of sample efficiency and computational overhead (Steckelmacher et al., 2015).