LMRL-Gym: Multilingual RL Benchmark

Updated 16 June 2026

LMRL-Gym is a framework for multi-turn reinforcement learning with language models that integrates diverse tasks and multilingual support.
It features procedurally generated environments including maze navigation, text-based puzzles, and dialogue tasks to assess strategic planning and delayed rewards.
The suite supports both offline and online RL methods via standardized APIs, promoting reproducibility, scalability, and extensibility in research.

LMRL-Gym refers to a set of frameworks and benchmark environments for multi-turn reinforcement learning (RL) with LLMs. The term encompasses recent efforts to develop reproducible, extensible, and linguistically diverse platforms where RL-trained language agents can be evaluated on nontrivial, temporally extended reasoning and interaction tasks. LMRL-Gym environments are designed to probe core RL properties—such as partial observability, delayed credit assignment, and strategic language generation—under both monolingual and multilingual regimes. The most prominent instances are outlined in (Abdulhai et al., 2023, Dobler et al., 11 Mar 2026), and are informed by advances in agentic memory testbeds (Xu et al., 20 May 2026).

1. Task Suite Design and Environment Structure

LMRL-Gym environments consist of procedurally generated language-based tasks that require multi-turn decision making and goal-directed interaction. The base structure, as implemented in (Abdulhai et al., 2023), includes eight tasks divided into two categories:

RL-Capability Tests: Maze (FO/PO), Text Navigation (FO/PO), Wordle, Chess, and Chess Endgames. These target core RL challenges such as trajectory stitching, strategic planning, and handling partial observability.
Interactive Dialogue Tasks: Twenty Questions, Guess My City, Car Dealer. These tasks focus on goal-driven conversational interaction, negotiation, and strategic information gathering.

Each environment is formally modeled as a (partially observable) Markov Decision Process (POMDP), where:

State $s_t \in \mathcal{S}$ is typically the entire token history or a symbolic representation (e.g., grid coordinates, FEN string).
Action $a_t \in \mathcal{A}$ is a token sequence (command or utterance) or structured move.
Reward $r(s_t, a_t)$ is task-specific (e.g., sparse on goal achievement or dense for progress).
Discount factor $\gamma \in (0,1]$ and horizon $T$ define the RL optimization objective.

The environments support both direct RL-from-environment (online) and offline RL training using logged trajectories (Abdulhai et al., 2023).
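
As a small illustration of the optimization objective defined by $\gamma$ and $T$, the following sketch computes the discounted return of one finite episode. The function name and the two example reward sequences are illustrative assumptions, not part of the LMRL-Gym code.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 1.0) -> float:
    """Return sum_t gamma**t * r_t for one finite episode."""
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0, 1]")
    total = 0.0
    # Fold from the last step backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Sparse reward: only the final, goal-reaching step pays out (hypothetical data).
sparse_episode = [0.0, 0.0, 0.0, 1.0]
# Dense reward: a step penalty until the goal is reached (hypothetical data).
dense_episode = [-1.0, -1.0, -1.0, 0.0]

print(discounted_return(sparse_episode, gamma=0.5))  # 0.125
print(discounted_return(dense_episode, gamma=1.0))   # -3.0
```

With a sparse reward, a smaller $\gamma$ shrinks the value of a late success, which is one reason delayed credit assignment is hard in these tasks.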

2. Procedural Multilingual Reasoning (MLRL-Gym Extension)

The "Multilingual Reasoning Gym" (Dobler et al., 11 Mar 2026) extends the LMRL-Gym paradigm by providing a procedural environment that generates verifiable reasoning problems in 14 languages: English, Chinese, German, Spanish, French, Italian, Brazilian Portuguese, Russian, Japanese, Korean, Thai, Bengali, Telugu, and Swahili.

Key properties of the multilingual extension include:

Uniform procedural core: Each task is defined by a generator $G_i(\cdot)$, verifier $V_i(\cdot)$, and string template $T_i(\theta; \ell)$, parameterized by difficulty $d \in [0,1]$. The same core code and parametric distributions are used across all languages, with only the rendering layer redirected to language-specific templates.
Parallel corpora generation: Given a random seed and difficulty setting, the environment produces perfectly parallel problem instances across all supported languages—i.e., the same underlying example rendered in multiple natural languages.
Translation pipeline: Non-English templates are obtained by LLM-based translation (Claude Sonnet-4) with native-speaker validation and systematic template adaptation. Template fixes address issues such as English-specific morphology and formatting, ensuring linguistic naturalness and problem equivalence.
Empirical throughput: Parallel instance generation is implemented at ~10,000 samples/sec on commodity hardware, enabling scalable evaluation and RL data generation (Dobler et al., 11 Mar 2026).
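
The generator, verifier, and template split described above can be sketched as follows. This is a toy illustration under stated assumptions: the addition task, the three templates, and all function names are hypothetical and do not reproduce the Multilingual Reasoning Gym code.

```python
import random
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical templates; a real suite would hold validated translations.
TEMPLATES: Dict[str, str] = {
    "en": "What is {a} + {b}?",
    "de": "Was ist {a} + {b}?",
    "es": "¿Cuánto es {a} + {b}?",
}


@dataclass
class Instance:
    params: Dict[str, int]  # language-independent problem parameters
    answer: str             # ground truth used by the verifier


def generate(seed: int, difficulty: float) -> Instance:
    """Generator: sample parameters from a seed, independent of language."""
    rng = random.Random(seed)
    upper = 10 + int(difficulty * 990)  # larger operands at higher difficulty
    a, b = rng.randint(1, upper), rng.randint(1, upper)
    return Instance(params={"a": a, "b": b}, answer=str(a + b))


def render(instance: Instance, lang: str) -> str:
    """Template: the only language-specific layer."""
    return TEMPLATES[lang].format(**instance.params)


def verify(instance: Instance, response: str) -> bool:
    """Verifier: exact match against the stored answer."""
    return response.strip() == instance.answer


def parallel_prompts(seed: int, difficulty: float) -> Tuple[Instance, Dict[str, str]]:
    """Render one underlying instance in every available language."""
    instance = generate(seed, difficulty)
    return instance, {lang: render(instance, lang) for lang in TEMPLATES}


instance, prompts = parallel_prompts(seed=7, difficulty=0.5)
for lang, prompt in prompts.items():
    print(lang, prompt)
```

Because the seed and difficulty fully determine the parameters, every language receives the same underlying problem, and a single verifier serves all of them.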

3. Reinforcement Learning Algorithms and API Integration

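LMRL-Gym provides a standardized, Gym-style API to facilitate RL algorithm development and benchmarking: an environment is reset to obtain an initial observation and is then stepped with the agent's text action, returning the next observation, a reward, and a termination flag.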
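
A minimal sketch of what a Gym-style text environment and rollout loop might look like is given below. The class `ToyGuessEnv`, the `BinarySearchPolicy`, and the exact return tuple are assumptions made for illustration; the actual LMRL-Gym classes and signatures may differ.

```python
import random
from typing import Callable, Tuple


class ToyGuessEnv:
    """Hypothetical text environment: guess a hidden digit within a turn budget."""

    def __init__(self, max_turns: int = 5, seed: int = 0):
        self.max_turns = max_turns
        self._rng = random.Random(seed)
        self._target = 0
        self._turn = 0

    def reset(self) -> str:
        self._target = self._rng.randint(0, 9)
        self._turn = 0
        return "Guess a digit between 0 and 9."

    def step(self, action: str) -> Tuple[str, float, bool, dict]:
        self._turn += 1
        info = {"turns": self._turn}
        try:
            guess = int(action.strip())
        except ValueError:
            guess = None
        if guess == self._target:
            return "Correct.", 1.0, True, info  # sparse reward on success
        if guess is None:
            obs = "Please answer with a digit."
        elif guess < self._target:
            obs = "Higher."
        else:
            obs = "Lower."
        return obs, 0.0, self._turn >= self.max_turns, info


class BinarySearchPolicy:
    """Scripted stand-in for a language model policy."""

    def __init__(self):
        self.lo, self.hi, self.last = 0, 9, None

    def __call__(self, observation: str) -> str:
        if observation == "Higher.":
            self.lo = self.last + 1
        elif observation == "Lower.":
            self.hi = self.last - 1
        self.last = (self.lo + self.hi) // 2
        return str(self.last)


def run_episode(env: ToyGuessEnv, policy: Callable[[str], str]) -> float:
    """Roll out one episode and return the undiscounted return."""
    observation, total, done = env.reset(), 0.0, False
    while not done:
        observation, reward, done, _ = env.step(policy(observation))
        total += reward
    return total


print(run_episode(ToyGuessEnv(seed=3), BinarySearchPolicy()))
```

The same loop serves online RL (the policy is updated from the collected rewards) and offline data collection (the transitions are logged for later training).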

RL algorithms integrated and benchmarked on LMRL-Gym include:

Behavioral Cloning (BC) and variants: supervised fine-tuning on demonstration data.
Monte-Carlo Returns (MC): Q-head regression to return-to-go, policy as logit-perturbed BC.
Implicit Language Q-Learning (ILQL): offline value-based RL using expectile regression and a conservative Q-learning (CQL) regularizer. The policy is extracted via logit perturbation.
Proximal Policy Optimization (PPO): online RL with GAE advantage estimation, KL and behavior-cloning regularization.
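
The logit perturbation used for policy extraction in the MC and ILQL baselines can be sketched as follows. The specific form, BC logits shifted by $\beta\,(Q(s,a) - V(s))$, follows the common ILQL formulation and is an assumption here; the function names and the value of $\beta$ are illustrative.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a small action (token) set."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def perturbed_policy(
    bc_logits: Sequence[float],
    q_values: Sequence[float],
    v_value: float,
    beta: float = 1.0,
) -> List[float]:
    """Shift BC logits by beta * advantage, then renormalize.

    Equivalent to pi(a|s) proportional to pi_BC(a|s) * exp(beta * (Q(s,a) - V(s))).
    """
    adjusted = [
        logit + beta * (q - v_value) for logit, q in zip(bc_logits, q_values)
    ]
    return softmax(adjusted)


# Two candidate tokens that BC rates equally; the critic prefers the first.
print(perturbed_policy([0.0, 0.0], q_values=[1.0, 0.0], v_value=0.5, beta=2.0))
```

Setting `beta = 0` recovers the BC policy, while larger values trust the learned critic more.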

Objective functions, as precisely specified in (Abdulhai et al., 2023), include Bellman error minimization, value expectile regression, and CQL for ILQL:

$\mathcal{L}_Q = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left(Q_{\theta}(s,a) - (r + \gamma V_{\theta}(s'))\right)^2 \right]$

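$\mathcal{L}_V = \mathbb{E}_{(s,a) \sim \mathcal{D}} \left[ L_2^{\tau}\left(Q_{\theta}(s,a) - V_{\theta}(s)\right) \right]$

$L_2^{\tau}(u) = \left| \tau - \mathbb{1}(u < 0) \right| u^2$

$\mathcal{L}_{\mathrm{CQL}} = \mathbb{E}_{(s,a) \sim \mathcal{D}} \left[ \log \sum_{a'} \exp Q_{\theta}(s,a') - Q_{\theta}(s,a) \right]$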
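
For concreteness, the value expectile regression and CQL terms named above can be written as scalar functions. This sketch uses the standard forms from the offline RL literature, with an asymmetric squared loss for the expectile and a log-sum-exp penalty for CQL; the function names and the default $\tau = 0.7$ are assumptions, not values taken from the LMRL-Gym code.

```python
import math
from typing import Sequence


def expectile_loss(u: float, tau: float) -> float:
    """Asymmetric squared loss |tau - 1(u < 0)| * u**2."""
    weight = tau if u >= 0 else 1.0 - tau
    return weight * u * u


def value_loss(q_values: Sequence[float], v_values: Sequence[float], tau: float = 0.7) -> float:
    """Mean expectile loss of Q(s, a) - V(s) over a batch."""
    terms = [expectile_loss(q - v, tau) for q, v in zip(q_values, v_values)]
    return sum(terms) / len(terms)


def cql_term(q_all_actions: Sequence[float], taken: int) -> float:
    """log-sum-exp of Q over all actions minus Q of the dataset action."""
    peak = max(q_all_actions)
    log_sum_exp = peak + math.log(sum(math.exp(q - peak) for q in q_all_actions))
    return log_sum_exp - q_all_actions[taken]


# With tau > 0.5, underestimating Q (u > 0) costs more than overestimating it.
print(expectile_loss(1.0, tau=0.7), expectile_loss(-1.0, tau=0.7))
print(cql_term([0.0, 0.0], taken=0))  # log(2)
```

With $\tau > 0.5$ the value estimate is pulled toward an upper expectile of the Q-values, and the CQL term pushes down Q-values of actions that do not appear in the dataset.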

4. Memory and Long-Horizon Extensions

Recent work (Xu et al., 20 May 2026) introduces memory-centric RL environments that build on the LMRL-Gym architecture. Notable design components applicable to an LMRL-Gym-style suite include:

Memory boundary abstraction: The environment separates memory management into a wrapper module (BaseMemoryEnvironment + BaseMemoryManager), enabling explicit evaluation and optimization of agentic memory components.
Memory-isolated scoring: Evaluations compare task performance under a fixed reasoning policy $\pi$ with and without a memory module $M$, yielding the memory-isolated gain $\Delta_M = \mathrm{Score}(\pi, M) - \mathrm{Score}(\pi, \varnothing)$.
Synthetic memory-grounded pipelines: Environments generate tasks where progress explicitly requires long-term retention, with ablation and difficulty-dial verified test cases.
Dense learned critics: Introduction of reward models (e.g., MemRM), trained to provide step-level reward signals for memory updates, enabling actor-critic or policy-gradient optimization on memory actions.

These features suggest that the LMRL-Gym paradigm is extensible to rich, long-horizon, and memory-dependent agentic reasoning scenarios (Xu et al., 20 May 2026).
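
The memory-isolated gain can be illustrated with a paired comparison: the same reasoning policy is scored on the same tasks with and without the memory module, and the mean difference is reported. The function name and the example scores below are hypothetical.

```python
from statistics import mean
from typing import Sequence


def memory_isolated_gain(
    with_memory: Sequence[float], without_memory: Sequence[float]
) -> float:
    """Mean per-task score difference under a fixed reasoning policy."""
    if not with_memory or len(with_memory) != len(without_memory):
        raise ValueError("need two non-empty, equally long score lists")
    # Pairing by task keeps task difficulty from leaking into the gain.
    return mean(w - wo for w, wo in zip(with_memory, without_memory))


# Hypothetical per-task success indicators for four tasks.
scores_with = [1.0, 1.0, 0.0, 1.0]
scores_without = [0.0, 1.0, 0.0, 0.0]
print(memory_isolated_gain(scores_with, scores_without))  # 0.5
```

Holding the reasoning policy fixed is what attributes the difference to the memory component rather than to a stronger base model.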

5. Evaluation Metrics and Experimental Protocols

LMRL-Gym supports rigorous and comparable evaluation protocols:

Task success metrics: Average return, final success rate, sample efficiency, and normalized score across tasks. Normalization maps raw returns to a [0, 100] range in which 0 is the lowest possible return, 50 is the offline dataset mean, and 100 is the optimal return.
RLVR metrics (Multilingual Reasoning Gym): average@8 (mean accuracy over 8 attempts), pass@8 (fraction of problems solved by at least one of 8 attempts), and language consistency (fraction of rollouts whose output remains in the expected query language) (Dobler et al., 11 Mar 2026).
Parallel evaluations: Multilingual and cross-seed evaluations allow systematic comparison of reasoning abilities under controlled variation.
Model performance trends: Larger models (e.g., Qwen3-14B) achieve higher reasoning accuracy, especially in high-resource languages and at lower difficulty, while performance systematically degrades under increased difficulty or in low-resource linguistic settings.
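
One way to realize the score normalization is a piecewise-linear map through three anchors. The sketch below assumes the lowest return maps to 0, the offline dataset mean to 50, and the optimal return to 100; the function name and the example numbers are illustrative.

```python
def normalized_score(
    raw: float, min_return: float, dataset_mean: float, max_return: float
) -> float:
    """Piecewise-linear map: min -> 0, dataset mean -> 50, max -> 100."""
    if not min_return < dataset_mean < max_return:
        raise ValueError("anchors must satisfy min < dataset mean < max")
    if raw <= dataset_mean:
        # Lower segment: between the worst return and the dataset average.
        return 50.0 * (raw - min_return) / (dataset_mean - min_return)
    # Upper segment: between the dataset average and the optimal return.
    return 50.0 + 50.0 * (raw - dataset_mean) / (max_return - dataset_mean)


# Hypothetical task with returns in [0, 10] and a dataset mean of 2.
for raw in (0.0, 1.0, 2.0, 6.0, 10.0):
    print(raw, normalized_score(raw, 0.0, 2.0, 10.0))
```

Under this convention, a score above 50 indicates that a policy improves on the data it was trained from, which makes results comparable across tasks with different reward scales.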

These protocols expose strategic and cross-lingual generalization weaknesses in current LLMs and track progress of RL-driven improvements (Abdulhai et al., 2023, Dobler et al., 11 Mar 2026).
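
The RLVR metrics can be computed from a table of per-attempt outcomes, as in the following sketch. It assumes that correctness has already been decided by a verifier and that the output language has already been identified; the function names and the toy data are hypothetical.

```python
from typing import Sequence


def average_at_k(outcomes: Sequence[Sequence[bool]]) -> float:
    """Mean accuracy over all attempts, averaged across problems."""
    per_problem = [sum(attempts) / len(attempts) for attempts in outcomes]
    return sum(per_problem) / len(per_problem)


def pass_at_k(outcomes: Sequence[Sequence[bool]]) -> float:
    """Fraction of problems with at least one correct attempt."""
    return sum(any(attempts) for attempts in outcomes) / len(outcomes)


def language_consistency(output_langs: Sequence[str], expected_langs: Sequence[str]) -> float:
    """Fraction of rollouts whose output language matches the query language."""
    matches = [out == exp for out, exp in zip(output_langs, expected_langs)]
    return sum(matches) / len(matches)


# Two problems, eight attempts each (k = 8).
outcomes = [
    [True, False, False, True, False, False, False, False],  # 2 of 8 correct
    [False] * 8,                                              # never solved
]
print(average_at_k(outcomes))  # 0.125
print(pass_at_k(outcomes))     # 0.5
print(language_consistency(["sw", "en", "th", "th"], ["sw", "sw", "th", "th"]))  # 0.75
```

The gap between pass@8 and average@8 indicates how often a model can solve a problem but does not do so reliably.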

6. Codebase, Extensibility, and Practical Usage

A modular, extensible codebase is provided (Abdulhai et al., 2023, Dobler et al., 11 Mar 2026), with environment APIs, offline datasets, RL algorithm implementations, and configuration templates. Key design features include:

Environment wrappers for each task, adhering to the Gym interface.
Config-driven experimentation: Experiments are run via simple YAML files specifying environment, algorithm, and hyperparameters.
Easy task extension: Custom tasks are added by subclassing BaseTaskEnv, providing reset/step/render methods, offline data, and configuration files.

The modular design facilitates rapid prototyping and benchmarking of new RL algorithms, memory systems, and task variations, promoting reproducibility and extensibility in LLM-based RL research.
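
Adding a custom task might look like the sketch below. The `BaseTaskEnv` class defined here is a hypothetical stand-in written for this example, since the real base class and its signatures are not reproduced in this article; the registry, the `EchoTask`, and the configuration dictionary (standing in for a YAML file) are likewise illustrative assumptions.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple, Type


class BaseTaskEnv(ABC):
    """Hypothetical stand-in for the suite's base class; real signatures may differ."""

    @abstractmethod
    def reset(self) -> str: ...

    @abstractmethod
    def step(self, action: str) -> Tuple[str, float, bool, dict]: ...

    def render(self) -> str:
        return ""


class EchoTask(BaseTaskEnv):
    """Toy task: the agent is rewarded for repeating a target phrase."""

    def __init__(self, phrase: str = "hello", max_turns: int = 3):
        self.phrase, self.max_turns = phrase, max_turns
        self.turn = 0
        self.history: List[str] = []

    def reset(self) -> str:
        self.turn = 0
        self.history = [f"Repeat the phrase: {self.phrase}"]
        return self.history[0]

    def step(self, action: str) -> Tuple[str, float, bool, dict]:
        self.turn += 1
        solved = action.strip() == self.phrase
        observation = "Well done." if solved else "Try again."
        self.history += [action, observation]
        done = solved or self.turn >= self.max_turns
        return observation, float(solved), done, {"turn": self.turn}

    def render(self) -> str:
        return "\n".join(self.history)


# Name-to-class registry so that a config file can select the task.
REGISTRY: Dict[str, Type[BaseTaskEnv]] = {"echo": EchoTask}

# Dictionary standing in for a parsed YAML experiment file (keys are assumptions).
CONFIG = {
    "env": {"name": "echo", "kwargs": {"phrase": "hello", "max_turns": 3}},
    "algorithm": "bc",
    "hyperparameters": {"learning_rate": 1e-4},
}


def make_env(config: dict) -> BaseTaskEnv:
    """Instantiate the environment named in the config."""
    env_cfg = config["env"]
    return REGISTRY[env_cfg["name"]](**env_cfg.get("kwargs", {}))


env = make_env(CONFIG)
env.reset()
env.step("hello")
print(env.render())
```

Keeping task construction behind a registry means that a new task needs only a class and a config entry, with no changes to the training loop.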

7. Significance and Research Context

LMRL-Gym benchmarks constitute foundational infrastructure for research on RL with LLMs. They address the need for standardized, multi-turn, and linguistically diverse evaluation environments that provide rigorous assessment of goal-directed agentic behavior, strategic language generation, and memory retention capabilities. The procedural, parallel, and extensible design of these environments has enabled large-scale, controlled studies on scaling trends, cross-lingual transfer, and the effect of RL algorithms on emergent reasoning abilities. They also highlight the persistent challenges in aligning RL-trained LLMs to perform reliably in settings requiring delayed credit assignment and long-horizon memory.

Comprehensive documentation, code, and instructions are available at the official repositories: https://github.com/abdulhaim/LMRL-Gym (Abdulhai et al., 2023) and https://github.com/apple/ml-multilingual-reasoning-gym (Dobler et al., 11 Mar 2026).