Dynamic Reinforcement Recurrent Network (DRRN)

Updated 9 March 2026
  • DRRN is a reinforcement learning architecture that decouples state and action embeddings to enable robust Q-value approximations in dynamic, natural language environments.
  • It employs dual feed-forward networks for text-based action spaces and LSTM-based variants for memory-enabled control in POMDPs.
  • Empirical results demonstrate that DRRN outperforms baseline models in text-based games and disturbance-prone systems, highlighting its efficiency and adaptability.

The Dynamic Reinforcement Recurrent Network (DRRN) refers to a class of reinforcement learning (RL) architectures designed to address environments where both states and actions are structured or variable—commonly as sequences or sets of natural language strings—and/or where the environment is only partially observable, necessitating memory and sequence modeling. There are effectively two complementary lines of DRRN methodology: the original Deep Reinforcement Relevance Network, which disentangles state and action embeddings for large, natural language action spaces (He et al., 2015), and the recurrent, memory-based DRRN variants that leverage LSTM-based sequence modeling for control in Partially Observable Markov Decision Processes (POMDPs) subject to unknown and time-varying disturbances (Omi et al., 2023). Both frameworks enable robust Q-value approximation in nontrivial, non-tabular settings, but employ distinct architectural strategies.

1. Foundational Architectures of DRRN

The DRRN introduced in “Deep Reinforcement Learning with a Natural Language Action Space” (He et al., 2015) targets problems where both the state and available actions are described by unconstrained natural language. The architecture uses two deep, parameter-disjoint feed-forward networks: one encodes the state text, and the other encodes each action text, yielding continuous embeddings for each. An interaction function $g(\cdot,\cdot)$, typically the inner product, scores each (state, action) embedding pair to approximate the Q-function $Q(s,a)$.

In parallel, recurrent DRRN variants are developed to address partial observability and dynamic disturbances, as described in “Dynamic deep-reinforcement-learning algorithm in Partially Observed Markov Decision Processes” (Omi et al., 2023). Here, the agent employs LSTM-based actor and critic networks to aggregate historical sequences of observations and actions for internal belief-state estimation, supporting policy optimization under incomplete information.

2. Mathematical Formulation and Network Structure

2.1 DRRN for Natural Language Action Spaces

Let $s \in \mathcal{S}$ (state text) and $a \in \mathcal{A}$ (action text). The embedding functions are

$$\phi_s(s; \theta_s) = h_{L,s}, \qquad \phi_a(a; \theta_a) = h_{L,a}$$

where $h_{l,s} = f(W_{l,s} h_{l-1,s} + b_{l,s})$ and $h_{l,a} = f(W_{l,a} h_{l-1,a} + b_{l,a})$, starting from bag-of-words or bag-of-n-grams representations. Q-values are estimated via

$$Q(s,a;\Theta) = g\bigl(\phi_s(s), \phi_a(a)\bigr)$$

with $g(u,v) = u^\top v$ (inner product) or $g(u,v) = u^\top M v$ (bilinear).
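
A minimal PyTorch-style sketch of this dual-encoder Q-function is shown below; the two-layer tanh encoders, hidden dimension, and bag-of-words inputs are illustrative assumptions rather than the exact configuration reported by He et al. (2015).

```python
import torch
import torch.nn as nn

class DRRN(nn.Module):
    """Sketch of the dual-encoder DRRN Q-function.

    State and action texts are assumed to arrive as fixed-size
    bag-of-words vectors; layer sizes here are illustrative.
    """

    def __init__(self, vocab_size: int, hidden_dim: int = 100):
        super().__init__()
        # Parameter-disjoint encoders for state text and action text.
        self.state_net = nn.Sequential(
            nn.Linear(vocab_size, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
        )
        self.action_net = nn.Sequential(
            nn.Linear(vocab_size, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
        )

    def forward(self, state_bow: torch.Tensor, action_bows: torch.Tensor) -> torch.Tensor:
        """Return Q(s, a_i) for one state and a variable-size set of candidate actions.

        state_bow:   (vocab_size,)              bag-of-words for the state text
        action_bows: (num_actions, vocab_size)  one row per candidate action text
        """
        phi_s = self.state_net(state_bow)      # (hidden_dim,)
        phi_a = self.action_net(action_bows)   # (num_actions, hidden_dim)
        # Interaction function g: inner product between state and action embeddings.
        return phi_a @ phi_s                   # (num_actions,) Q-values
```

Because the action set is re-encoded at every step, the same network handles action spaces whose size and wording change from state to state.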

2.2 Recurrent DRRN for POMDPs

Given a POMDP $(\mathcal S, \mathcal A, \mathcal O, \mathcal T, \mathcal Z, \mathcal R)$, the history $I_t^{\mathfrak c} = (o_0, \ldots, o_t, a_0, \ldots, a_{t-1})$ is summarized by the LSTM hidden state $h_t$. The critic evaluates

$$Q^\pi(h_t, a_t) = E\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, h_t, a_t\right]$$

with the Bellman target (for the TD3 backbone)

$$y_t = r_t + \gamma\, \min_{i=1,2} \bar Q_{\theta_i'}\bigl(h_{t+1}, \bar\pi_{\phi'}(h_{t+1})\bigr)$$

and loss

$$L(\theta_i) = E_{(h_t, a_t, r_t, h_{t+1}) \sim D}\left[\bigl(Q_{\theta_i}(h_t, a_t) - y_t\bigr)^2\right]$$

All networks are trained via backpropagation through time (BPTT).
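
The sketch below illustrates how an LSTM critic can summarize an observation-action history into $h_t$ and how the clipped double-Q target $y_t$ above is formed. The layer sizes, the assumed `target_actor` signature, and the omission of TD3's target-policy smoothing noise are simplifications, not the exact setup of Omi et al. (2023).

```python
import torch
import torch.nn as nn

class LSTMCritic(nn.Module):
    """Sketch of an LSTM critic mapping an (observation, action) history to Q(h_t, a_t)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden_dim: int = 128):
        super().__init__()
        # The LSTM aggregates past observations and actions into a hidden
        # state h_t, serving as an internal belief-state estimate.
        self.lstm = nn.LSTM(obs_dim + act_dim, hidden_dim, batch_first=True)
        self.q_head = nn.Linear(hidden_dim + act_dim, 1)

    def forward(self, obs_seq, prev_act_seq, action):
        # obs_seq:      (batch, seq_len, obs_dim)  past observations
        # prev_act_seq: (batch, seq_len, act_dim)  past actions (shifted by one step)
        # action:       (batch, act_dim)           action being evaluated
        history = torch.cat([obs_seq, prev_act_seq], dim=-1)
        _, (h_t, _) = self.lstm(history)           # final hidden state
        return self.q_head(torch.cat([h_t[-1], action], dim=-1)).squeeze(-1)

def td3_target(reward, next_obs_seq, next_prev_act_seq,
               target_actor, target_q1, target_q2, gamma=0.99):
    """Clipped double-Q Bellman target y_t (target-policy smoothing noise omitted)."""
    with torch.no_grad():
        next_action = target_actor(next_obs_seq, next_prev_act_seq)
        q1 = target_q1(next_obs_seq, next_prev_act_seq, next_action)
        q2 = target_q2(next_obs_seq, next_prev_act_seq, next_action)
        return reward + gamma * torch.min(q1, q2)  # gamma is an illustrative default
```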

3. Training Procedures and Exploration Policies

3.1 DRRN for Natural Language

The training uses standard Q-learning with experience replay. At each step:

  1. Encode state and each available action;
  2. Compute $Q$-values for all legal (state, action) pairs;
  3. Sample actions via a softmax policy:

$$\pi(a_t = a^i \mid s) = \frac{\exp\bigl(\alpha\, Q(s, a^i)\bigr)}{\sum_j \exp\bigl(\alpha\, Q(s, a^j)\bigr)}$$

  4. Update via stochastic gradient descent on the squared temporal-difference loss, using a Q-learning target with delayed target network parameters. A minimal sketch of the softmax sampling step follows this list.
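
The following sketch shows the softmax (Boltzmann) sampling step; the inverse temperature $\alpha$ and the example Q-values are illustrative only.

```python
import numpy as np

def softmax_policy(q_values: np.ndarray, alpha: float = 1.0, rng=None) -> int:
    """Sample an action index from the softmax exploration policy.

    q_values: Q(s, a_i) for every currently legal action a_i.
    alpha:    inverse temperature; larger alpha means greedier behavior.
    """
    rng = rng or np.random.default_rng()
    logits = alpha * q_values
    logits -= logits.max()                     # subtract max for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q_values), p=probs))

# Example: three candidate action texts scored by the DRRN.
action_index = softmax_policy(np.array([1.2, 0.3, -0.5]), alpha=2.0)
```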

3.2 Recurrent DRRN for POMDPs

Training employs off-policy RL (TD3 backbone), with LSTM-based actor and twin-critic architectures. Key steps include:

  • Sequence sampling of fixed length $l$ for LSTM input.
  • Optimization of actor and critic losses over minibatches.
  • Replay buffer storing full trajectories, including hidden/cell states for efficient critic updates (notably in the H-TD3 variant); see the sketch after this list.
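
A simplified sketch of trajectory storage and fixed-length sequence sampling is given below; the class name, capacity, and sequence length are illustrative assumptions, the buffer is assumed to contain trajectories of at least the requested length, and the hidden/cell-state caching used by H-TD3 is omitted for brevity.

```python
import random
from collections import deque

class TrajectoryReplayBuffer:
    """Sketch of a replay buffer that stores whole trajectories and samples
    fixed-length sub-sequences (length l) as LSTM input."""

    def __init__(self, capacity: int = 1000, seq_len: int = 8):
        self.trajectories = deque(maxlen=capacity)
        self.seq_len = seq_len

    def add(self, trajectory):
        """trajectory: list of (obs, action, reward, next_obs, done) tuples."""
        self.trajectories.append(trajectory)

    def sample(self, batch_size: int):
        """Return batch_size sub-sequences of length seq_len."""
        batch = []
        while len(batch) < batch_size:
            traj = random.choice(self.trajectories)
            if len(traj) < self.seq_len:
                continue                       # skip trajectories that are too short
            start = random.randrange(len(traj) - self.seq_len + 1)
            batch.append(traj[start:start + self.seq_len])
        return batch
```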

4. Empirical Performance and Robustness

4.1 Natural Language Domains

On the text-based games “Saving John” (deterministic transitions) and “Machine of Death” (stochastic, paraphrased actions), DRRN achieved:

  • Faster convergence and higher final reward than linear or monolithic DQN baselines.
  • “Saving John”: DRRN ≈ 18.7 average cumulative reward, best baseline ≈ 9.0.
  • “Machine of Death”: DRRN ≈ 11.2, best baseline ≈ 5.2.
  • For paraphrased actions, reward decreased only slightly (11.2 to 10.5), with $R^2 \approx 0.95$ between Q-values for original vs. paraphrased action strings.
  • DRRN performance is on par with strong human players in terms of reward, exceeding novice-level human baselines (He et al., 2015).

4.2 POMDPs and Dynamic Disturbance Robustness

In Pendulum-v0 with structured and unstructured disturbances, normalized returns were:

  • Vanilla TD3: ≈ 0 under disturbance.
  • LSTM-TD3 (no past action): ≈ 0.45 under random sinusoidal disturbance.
  • LSTM-TD3 (+actions): ≈ 0.68.
  • Single-head LSTM-TD3 (1ha1hc): ≈ 0.82, best among variants.
  • H-TD3: 0.70, with reduced computational cost (30–50% less wall-clock per iteration) (Omi et al., 2023).

Key findings include the criticality of including past action sequences in environments with dynamic disturbances, and the benefit of longer LSTM input sequences in both structured and unstructured POMDP regimes.

5. Generalization, Ablations, and Architectural Analysis

DRRN’s separation of state and action embeddings enables:

  • Robust semantic generalization; architectures that learn a joint embedding or rely on text token enumeration overfit the specific wording and perform poorly on paraphrased inputs (He et al., 2015).
  • Empirical ablations in recurrent DRRN variants show the necessity of including both past observations and actions for robust performance in environments with temporal or dynamic disturbances (Omi et al., 2023).
  • Sequence-length ablations indicate that longer LSTM windows improve identification of environment periodicity and suppression of noise.

Single-headed LSTM network structures outperformed multi-head variants, supporting the design of compact, integrated sequential modules for both policy and value networks.

6. Practical Implications and Scalability

The design of DRRN frameworks makes them naturally extensible to:

  • Large or unbounded, even open-vocabulary, action spaces (e.g., natural language command spaces), without requiring architecture reconfiguration (He et al., 2015).
  • Complex, memory-dependent control in physical or simulated systems with partial observability, by leveraging recurrent belief-state estimation and action-augmented sequence input (Omi et al., 2023).

Efficient variants, such as H-TD3, reduce inference and training overhead without major performance degradation, supporting real-time deployment in operational POMDP control.

7. Comparative Analysis and Key Insights

  • The explicit decoupling of state and action encoders permits each network to specialize to its respective domain (long-form narrative for state, short imperative for action), a property not shared by single-encoder DQN or linear concatenative models.
  • Embedding both state and action in continuous vector spaces is essential for semantic generalization and robust value estimation in combinatorially large domains.
  • In LSTM-based DRRN, explicit inclusion of the action sequence is causally and empirically critical to distinguish between latent environmental dynamics and spurious disturbance inputs; omitting this information results in significant degradation of control performance in disturbed environments.
  • Recurrent DRRN variants provide a framework for practically robust POMDP control, with quantifiable advantages over non-recurrent or shallow RL approaches in both performance and training efficiency (Omi et al., 2023).

Further advances may involve extending DRRN-style models to hierarchical RL and multi-agent POMDPs, as well as integrating transformers and attention mechanisms for richer sequential processing within the recurrent DRRN paradigm.
