
Shared Recurrent Memory Transformer (SRMT)

Updated 1 December 2025
  • SRMT is an architectural framework that integrates a shared recurrent memory to enable implicit coordination in multi-agent reinforcement learning and document-level translation.
  • It employs attention-based self and cross-memory updates, allowing agents and processing steps to flexibly share context while maintaining computational efficiency.
  • Empirical results reveal that SRMT enhances performance in complex pathfinding and translation tasks by robustly handling sparse rewards and long-range dependencies.

The Shared Recurrent Memory Transformer (SRMT) is an architectural extension of memory-augmented transformers designed to enable implicit information exchange and coordination in decentralized, partially observable, multi-agent sequential decision processes, as well as document-level sequence modeling. Two main research contributions define SRMT: its application to multi-agent lifelong pathfinding (Sagirova et al., 22 Jan 2025) and its use for capturing long-range dependencies in document-level machine translation (Feng et al., 2022). Both approaches share the central concept of a recurrent, shared memory that is updated and broadcast among agents or processing steps through attention mechanisms. This memory allows for scalable, flexible coordination and context accumulation while maintaining computational tractability.

1. Problem Formalization and Motivation

In decentralized, partially observable multi-agent Markov decision processes (Dec-POMDPs), each agent $u \in \{1, \dots, n\}$ perceives only a local observation $o_t^{(u)} = \mathcal{O}(s_t)$ and must determine its actions $a_t^{(u)}$ solely from its action-observation history $h_t^{(u)} = (o_1^{(u)}, a_1^{(u)}, \dots, o_t^{(u)})$. The global system is described as

$$M = \langle S, U, A, P, R, O, \mathcal{O}, \gamma \rangle$$

where $S$ denotes the environment state space, $U$ the set of agents, $A$ the per-agent actions, $P$ the transition probability, $R$ the reward, $O$ the observation space, $\mathcal{O}$ the observation function, and $\gamma$ the discount factor (Sagirova et al., 22 Jan 2025). The objective is to maximize the expected return

$$J(\theta) = \mathbb{E} \left[ \sum_{t=0}^T \gamma^t R\left(s_t, u, a_t^{(1..n)}\right) \right]$$

with decentralized stochastic policies $\pi_\theta^{(u)}(a_t^{(u)} \mid h_t^{(u)})$.
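To make the objective concrete, the following minimal Python sketch estimates the discounted return of one episode under decentralized policies; the `env` and `policies` interfaces are hypothetical stand-ins for exposition, not part of the cited papers:

```python
def rollout_return(env, policies, gamma=0.99, max_steps=512):
    """Monte Carlo estimate of the discounted team return J for one episode.

    `env` and `policies` are hypothetical stand-ins: `env.reset()` returns a list of
    per-agent observations, `env.step(actions)` returns (observations, reward, done),
    and `policies[u](history)` samples agent u's action from a decentralized policy
    conditioned only on that agent's own action-observation history h_t^(u).
    """
    observations = env.reset()
    histories = [[obs] for obs in observations]          # h^(u) starts with o_1^(u)
    total, discount = 0.0, 1.0

    for _ in range(max_steps):
        # Decentralized acting: each agent sees only its own history.
        actions = [policies[u](histories[u]) for u in range(len(policies))]
        observations, reward, done = env.step(actions)   # shared team reward R
        total += discount * reward
        discount *= gamma
        for u, (a, o) in enumerate(zip(actions, observations)):
            histories[u].extend([a, o])                  # append a_t^(u), o_{t+1}^(u)
        if done:
            break
    return total
```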

Similarly, document-level machine translation models must capture dependencies across distant sentences, which standard (sentence-wise) transformers cannot effectively represent. A major limitation of vanilla transformers is their lack of recurrence and bounded context window, resulting in sub-optimal exploitation of contextual coherence over long spans (Feng et al., 2022).

SRMT addresses these challenges by introducing a shared, recurrent memory for implicit context transmission—whether across agents in multi-agent reinforcement learning (MARL) or sentences in document-level natural language processing.

2. SRMT Core Architecture

The core SRMT mechanism centers on a memory-augmented transformer cell, denoted "RMTCell," extended to multi-agent or sequential settings. Each agent $i$ maintains a working memory vector $\mathrm{mem}_{i,t} \in \mathbb{R}^{d_m}$ at time $t$ (or, analogously, per processing step in sequential applications). The architecture comprises:

  • Spatial encoder: ResNet followed by an MLP that encodes local observations to an embedding space.
  • Recurrent transformer cell (RMTCell): Performs multi-head self-attention over the agent’s or sentence’s recent internal states and cross-attention with the shared memory pool.
  • Memory head: Projects the attentional output to an updated memory slot for the next step.

The shared memory pool at each time $t$ is $M_t = [\mathrm{mem}_{1,t}, \dots, \mathrm{mem}_{n,t}] \in \mathbb{R}^{n \times d_m}$. At each update, an agent (or step) forms a sequence $X_{i,t} = [\mathrm{mem}_{i,t-1};\, h_{i,t-\hat{h}}, \dots, h_{i,t}]$ and performs:

  1. Multi-head self-attention on $X_{i,t}$:

$$\text{SelfAttn}(X) = \text{softmax} \left( \frac{Q K^\top}{\sqrt{d_k}} \right) V$$

  2. Cross-attention between the current hidden states and the shared memory pool:

$$\text{CrossAttn}(H, M_{t-1}) = \text{softmax} \left( \frac{Q' (K')^\top}{\sqrt{d_k}} \right) V'$$

  3. The combined output is mapped by the memory head to produce $\mathrm{mem}_{i,t}$.

The update may be summarized as:

$$\mathrm{mem}_{i,t} = \text{RMTCell}\left(\mathrm{mem}_{i,t-1},\, H_{i,t},\, M_{t-1}\right)$$

After all updates, MtM_t is formed and broadcast for the subsequent step. This implicit, differentiable memory-pooling approach induces peer coordination without bespoke message-passing protocols (Sagirova et al., 22 Jan 2025, Feng et al., 2022).
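A minimal PyTorch-style sketch of this update is shown below. Tensor shapes, the single attention layer per step, and the module wiring are illustrative assumptions for exposition rather than the authors' exact implementation:

```python
import torch
import torch.nn as nn

class RMTCell(nn.Module):
    """Sketch of a shared-recurrent-memory update: self-attention over
    [previous memory; recent hidden states], cross-attention to the shared
    memory pool of all agents, and a memory head producing the new slot."""

    def __init__(self, d_model=16, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.memory_head = nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, d_model))

    def forward(self, mem_prev, hidden_window, memory_pool):
        # mem_prev:       (batch, 1, d_model)   agent i's memory at t-1
        # hidden_window:  (batch, k, d_model)   recent hidden states h_{t-k+1..t}
        # memory_pool:    (batch, n, d_model)   M_{t-1}, memories of all n agents
        x = torch.cat([mem_prev, hidden_window], dim=1)       # X_{i,t}
        h, _ = self.self_attn(x, x, x)                        # SelfAttn(X)
        h, _ = self.cross_attn(h, memory_pool, memory_pool)   # CrossAttn(H, M_{t-1})
        new_mem = self.memory_head(h[:, :1, :])               # updated mem_{i,t}
        return new_mem

# One agent's update; the new slots of all agents are then stacked into M_t.
cell = RMTCell()
mem = torch.zeros(1, 1, 16)
hidden = torch.randn(1, 8, 16)      # recurrence window of 8 steps
pool = torch.zeros(1, 4, 16)        # shared pool for n = 4 agents
new_mem = cell(mem, hidden, pool)
```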

3. Implicit Coordination via Shared Memory

SRMT implements a global workspace via the pooled memory vectors of all agents, which serve as the sole communication substrate. Agents read from and write to the pool using cross-attention layers, implicitly conveying intentions, bottleneck conflicts, and planned paths. Unlike explicit communication protocols that require agents to transmit discrete messages, SRMT’s attention-based interaction allows each agent to selectively focus on relevant peers and adaptively weigh input. This design has several salient implications:

  • Robust emergent coordination, observed as agents yielding in bottleneck scenarios even under highly sparse rewards.
  • Scalability to larger agent teams by maintaining a compact shared memory tensor.
  • Improved generalization capabilities, as coordination patterns are learned without overfitting to explicit signaling routines.

In document-level sequence modeling, this paradigm enables each sentence or processing step to incorporate and propagate long-range contextual knowledge via the shared memory, resulting in improved coherence and capturing global document structure (Feng et al., 2022).
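To make the read/write cycle concrete, one per-timestep coordination step might look as follows (building on the RMTCell sketch above; the list-based pool layout is an illustrative assumption):

```python
import torch

def srmt_step(cell, hidden_windows, memories):
    """One coordination step: every agent reads the shared pool M_{t-1} via
    cross-attention and writes its updated slot; the new pool M_t is then
    broadcast to all agents for the next step.

    hidden_windows: list of n tensors, each (batch, k, d_model)
    memories:       list of n tensors, each (batch, 1, d_model)
    """
    pool_prev = torch.cat(memories, dim=1)                 # M_{t-1}: (batch, n, d_model)
    new_memories = [
        cell(memories[i], hidden_windows[i], pool_prev)    # read pool, write own slot
        for i in range(len(memories))
    ]
    pool_next = torch.cat(new_memories, dim=1)             # M_t, broadcast next step
    return new_memories, pool_next
```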

4. Optimization and Training Methodologies

SRMT-based multi-agent policies are trained with Proximal Policy Optimization (PPO), utilizing the clipped surrogate loss:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta)\, \hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_t \right) \right]$$

where $r_t(\theta)$ is the importance ratio and $\hat{A}_t$ the generalized advantage estimator (Sagirova et al., 22 Jan 2025). The total loss additionally includes a value prediction term and an entropy regularization term. Typical hyperparameter regime: Adam optimizer, learning rate $\sim 10^{-4}$, entropy bonus $c_{\text{ent}} \sim 10^{-2}$, value loss coefficient $c_v = 0.5$, recurrence window $\hat{h} = 8$.
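A minimal sketch of the corresponding loss computation in PyTorch (generic PPO terms, not the authors' released training code):

```python
import torch

def ppo_loss(log_probs_new, log_probs_old, advantages, values, returns, entropy,
             clip_eps=0.2, c_value=0.5, c_entropy=0.01):
    """Clipped surrogate objective with value and entropy terms (to be minimized)."""
    ratio = torch.exp(log_probs_new - log_probs_old)          # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()       # negative L^CLIP
    value_loss = (values - returns).pow(2).mean()             # value prediction term
    return policy_loss + c_value * value_loss - c_entropy * entropy.mean()
```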

In document-level machine translation, SRMT employs a two-stage training strategy: sentence-level pretraining using standard cross-entropy, followed by document-level fine-tuning where weights are initialized from pretraining and document memory units are recurrently updated (Feng et al., 2022). Gradients are passed only through the current and previous step for computational efficiency; memory updates are not backpropagated across full documents.
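The truncated gradient flow through the memory can be sketched as below. This is a simplification that detaches the memory after every sentence, so each backward pass stops at the adjacent memory update rather than spanning the document; the model interface (`init_memory`, a forward returning loss and updated memory) is an assumed placeholder:

```python
def document_finetune(model, documents, optimizer):
    """Document-level fine-tuning sketch: the recurrent memory is carried across
    sentences but detached before being reused, so gradients never flow through
    the full document history."""
    for sentences in documents:                          # each document: list of (src, tgt)
        memory = model.init_memory().detach()
        for src, tgt in sentences:
            loss, new_memory = model(src, tgt, memory)   # cross-entropy + updated memory
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            memory = new_memory.detach()                 # truncate backprop at this step
```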

5. Empirical Findings and Benchmark Performance

Multi-Agent Pathfinding

In the Bottleneck navigation task, SRMT demonstrates:

  • Consistent outperformance over deep RL baselines (MAMBA, QPLEX, ATM, RATE, RRNN) under Directional, Moving-Negative, and notably Sparse reward regimes.
  • Robustness: Cooperative Success Rate (CSR) $\approx 1$ for corridors up to length $\approx 400$ under the Sparse reward, together with lower Sum-of-Costs (SoC), significantly surpassing ablations without memory sharing.
  • Generalization: Successfully extrapolates to longer corridors (length up to 1000) and larger unseen maps.

On the POGEMA lifelong MAPF benchmark, SRMT:

  • Achieves strong throughput (goals per agent-step) on Mazes, Random, Puzzle, and MovingAI map types, outperforming MAMBA/QPLEX and matching or exceeding planning-based (Follower, MATS-LP, RHCR) algorithms except in the Warehouse setting.
  • Generalizes across agent counts (64 and 128) when trained in mixed-team configurations.
  • SRMT augmented with Follower-path planning surpasses all baselines including centralized planners on Warehouse maps (Sagirova et al., 22 Jan 2025).

Document-Level Translation

For document-level machine translation, SRMT attains:

  • Average improvement of $+0.91$ s-BLEU over standard sentence-level baselines across TED, News, and Europarl datasets.
  • New state-of-the-art on TED Talks ($+0.50$ s-BLEU vs. prior best) and News Commentary ($+1.49$ d-BLEU), highlighting its capacity for long-range context modeling.
  • Negligible additional computational overhead per sentence due to fixed-size shared memory, avoiding the quadratic scaling of naive sequence concatenation (Feng et al., 2022).

The table below summarizes the evaluation metrics for different MARL baselines and SRMT in the Bottleneck and POGEMA tasks, as reported in (Sagirova et al., 22 Jan 2025):

| Metric | Bottleneck (SRMT) | Bottleneck (MAMBA) | POGEMA Maze (SRMT) | POGEMA Maze (MAMBA) |
|---|---|---|---|---|
| Cooperative Success Rate | $\approx 1$ | < 1 (varies) | High | Lower |
| Generalization length | Up to 1000 | Fails beyond 30–50 | Robust | Lower |
| Throughput | N/A | N/A | High | Lower |

6. Implementation and Practical Considerations

SRMT’s implementation parameters for multi-agent RL tasks involve a lightweight encoder (ResNet with a single block, 8 filters) and transformer core ($d_h = 16$, 4 heads, $d_m = 16$) for the Bottleneck benchmark, and a larger configuration ($d_h = 512$, 8 heads) for POGEMA/Lifelong MAPF. PPO is run with learning rates of $1.3$–$2.2 \times 10^{-4}$, an episode length of $512$ for the Bottleneck, and up to 600M steps for large-scale experiments. Codebase and pretrained models are publicly available (Sagirova et al., 22 Jan 2025).
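For reference, the reported Bottleneck settings can be gathered into a small configuration object; the field names are illustrative, and the values follow the figures quoted above:

```python
from dataclasses import dataclass

@dataclass
class SRMTBottleneckConfig:
    # Spatial encoder: single-block ResNet with 8 filters
    resnet_blocks: int = 1
    resnet_filters: int = 8
    # Transformer core for the Bottleneck benchmark
    hidden_dim: int = 16           # d_h
    memory_dim: int = 16           # d_m
    attention_heads: int = 4
    recurrence_window: int = 8
    # PPO training
    learning_rate: float = 2.2e-4  # reported range 1.3e-4 to 2.2e-4
    value_loss_coef: float = 0.5
    entropy_coef: float = 1e-2
    episode_length: int = 512
```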

For document-level MT, the memory unit consists of $d_M = 16$ slots per layer and is injected exclusively into the top encoder and decoder layers, minimizing per-sentence computational overhead (Feng et al., 2022).
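One plausible way to realize this injection, assumed here for illustration, is to expose the memory slots as extra key/value positions in the top layer's attention; the module below is a sketch under that assumption, not the authors' exact wiring:

```python
import torch
import torch.nn as nn

class TopLayerMemoryAttention(nn.Module):
    """Sketch: in the top layer only, attention keys/values are extended with
    16 recurrent memory slots, so every token can read document-level context
    at a cost that is constant in document length."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.memory_proj = nn.Linear(d_model, d_model)

    def forward(self, x, memory):
        # x:      (batch, seq_len, d_model)  token states in the top layer
        # memory: (batch, 16, d_model)       recurrent document memory slots
        context = torch.cat([x, self.memory_proj(memory)], dim=1)
        out, _ = self.attn(x, context, context)   # tokens attend to tokens + memory
        return out
```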

Noteworthy limitations include reliance on perfect localization, synchronized agent actuations, and static obstacle assumptions in MARL scenarios. There are no theoretical completeness guarantees for the pathfinding solution. Future research directions include memory slot scheduling, adaptive pooling for scalability, and deeper integration with differentiable planners (Sagirova et al., 22 Jan 2025).

7. Relation to Prior Work

SRMT descends from the memory transformer and RMT (recurrent memory transformer) lineages. Distinctively, it generalizes the notion of recurrence and shared context to a global, jointly-addressable memory pool, as opposed to the local or sequential recurrence found in standard RNNs or per-sequence memory approaches.

In comparison with explicit multi-agent communication protocols, SRMT’s implicit coordination mechanism via shared attention offers greater flexibility and robustness, particularly under sparse or delayed reward feedback. The approach contrasts with “concatenate-all” document modeling in NLP, which suffers quadratic complexity growth with input length.

Previous work demonstrated the efficacy of recurrent memory augmentations in document-level translation (Feng et al., 2022); SRMT formalizes and extends these techniques to multi-agent pathfinding, yielding improvements in coordination, scalability, and generalization to novel environments (Sagirova et al., 22 Jan 2025).


The Shared Recurrent Memory Transformer thus provides a scalable, recurrent, and attention-centric framework for both multi-agent coordination under partial observability and long-range sequential modeling in NLP, substantiated by empirical advances in pathfinding and document-level translation benchmarks.
