General Reinforcement Learning Algorithm
- General RL is a paradigm that designs agents to operate in arbitrary, non-Markovian, and partially observable environments by optimizing the expected discounted reward.
- Methodologies like MERL use candidate elimination and optimistic planning to achieve PAC guarantees and control sample complexity across diverse models.
- Recent advances incorporate meta-learning approaches, such as LPG, that automate the discovery of RL update rules and transfer to benchmarks like Atari, complementing domain-agnostic systems such as AlphaZero.
A general reinforcement learning (RL) algorithm is one designed to succeed across broad classes of environment models with minimal reliance on assumptions such as Markovian dynamics, stationarity, or full observability. The objective is to develop agents that efficiently approach optimal behavior regardless of the specific structural properties of the environment, including non-Markovian, history-dependent, and partially observable scenarios. This concept motivates both theory—quantifying the learnability and sample complexity of arbitrary environment families—and practice, where the goal is to automate or meta-learn RL update rules effective across diverse tasks and context distributions.
1. Fundamental Problem Formulation
General RL posits that the true environment belongs to an arbitrary (finite or infinite) class $\mathcal{M}$, with interaction defined through episodic or non-episodic protocols. At each time-step $t$, the agent selects an action $a_t$, observes $o_t$ and receives reward $r_t$, accumulating a history $h_t = a_1 o_1 r_1 \dots a_t o_t r_t$. The agent’s objective is to maximize the expected discounted sum of rewards under the unknown environment $\mu \in \mathcal{M}$, formalized as

$$V^\pi_\mu = \mathbb{E}^\pi_\mu \left[ \sum_{t=1}^{\infty} \gamma^{t-1} r_t \right],$$

where $\gamma \in (0, 1)$ is a discount factor and $\pi$ is a (potentially history-dependent) policy. The general RL setting admits arbitrary stochastic, non-Markovian, and non-ergodic dynamics so long as rewards remain bounded (Lattimore et al., 2013, Leike, 2016).
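As a minimal illustration of this formalism, the sketch below (all names illustrative) runs a toy history-dependent environment, one whose reward depends on the parity of the whole action history rather than on the last observation, and computes the discounted return $\sum_t \gamma^{t-1} r_t$:

```python
def discounted_return(rewards, gamma):
    """Discounted sum of a reward sequence: sum_t gamma^(t-1) * r_t (t from 1)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

class ParityEnv:
    """Toy history-dependent environment: the reward depends on the parity of
    how many times action 1 has ever been chosen, so no fixed-size function of
    the last observation suffices to predict it."""
    def __init__(self):
        self.ones_so_far = 0
    def step(self, action):
        if action == 1:
            self.ones_so_far += 1
        reward = 1.0 if self.ones_so_far % 2 == 0 else 0.0
        observation = 0   # uninformative: the relevant statistic is hidden in the history
        return observation, reward

env = ParityEnv()
history, rewards = [], []
for t in range(10):
    a = t % 2                      # a fixed, history-independent policy
    o, r = env.step(a)
    history.append((a, o, r))      # h_t accumulates actions, observations, rewards
    rewards.append(r)

print(discounted_return(rewards, gamma=0.9))   # ≈ 3.2939
```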
2. Minimax and PAC Sample Complexity in Arbitrary Model Classes
The sample complexity question in general RL concerns the number of rounds on which a policy fails to be near-optimal (i.e., on which $V^*_\mu - V^\pi_\mu > \varepsilon$). For a finite model class $\mathcal{M}$ with $|\mathcal{M}| = N$, the core result is the existence of the MERL (Maximum Exploration Reinforcement Learning) algorithm, which is $(\varepsilon, \delta)$-PAC: with probability at least $1 - \delta$, all but

$$\tilde{O}\!\left( \frac{N}{\varepsilon^2 (1-\gamma)^3} \log \frac{N}{\delta} \right)$$

steps have value loss at most $\varepsilon$ (Lattimore et al., 2013). MERL maintains a candidate environment set $\mathcal{M}_t \subseteq \mathcal{M}$, repeatedly solves an optimistic planning problem over $\mathcal{M}_t$, and eliminates models via statistically decisive deviations in observed returns. A minimax lower bound, based on bandit-like hard instances, matches the dependence on $N$ and $\varepsilon$ up to logarithmic factors.
Critically, the result applies completely independently of Markov or stationarity assumptions—it requires only that $\mathcal{M}$ is known and finite. For infinite $\mathcal{M}$, compactness with respect to a suitable pseudo-metric on environments is necessary and sufficient to move from vacuous to nontrivial sample complexity: the bound becomes proportional to the covering number of $\mathcal{M}$ by $\varepsilon$-balls under this pseudo-metric. In the absence of such compactness, no uniform guarantee is possible; adversarially constructed “needle-in-haystack” environments can subvert any learning algorithm (Lattimore et al., 2013).
3. Algorithmic Frameworks for General RL
3.1. MERL and Nonparametric Bayesian Reinforcement Learning
MERL alternates between exploration and exploitation based on the “value gap” between optimistic and pessimistic environments in the current candidate set. When the gap exceeds a threshold, it commits to an exploration phase of fixed length (an effective horizon determined by $\gamma$ and $\varepsilon$), gathers returns under a computed policy, and uses martingale concentration to statistically rule out inconsistent candidate models. The process recurses until the candidate set contracts to a singleton or all value gaps are small; in that event, the agent exploits the optimistic policy.
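The elimination step can be illustrated in miniature. The sketch below is not the full algorithm (there is no optimistic planning, and a simple Hoeffding bound stands in for the paper’s martingale concentration); it merely discards candidate models whose predicted mean return is statistically inconsistent with the observed returns. All names are illustrative:

```python
import math
import random

def hoeffding_radius(n, delta):
    """Width of a (1 - delta) confidence interval for a mean bounded in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def eliminate(candidates, observed_returns, delta=0.05):
    """Keep only the candidate models whose predicted mean return is
    statistically consistent with the empirical mean of observed returns."""
    n = len(observed_returns)
    mean = sum(observed_returns) / n
    radius = hoeffding_radius(n, delta)
    return [m for m in candidates if abs(m["mean_return"] - mean) <= radius]

random.seed(0)
true_mean = 0.7
candidates = [{"name": f"model{i}", "mean_return": i / 10} for i in range(11)]
returns = [1.0 if random.random() < true_mean else 0.0 for _ in range(500)]
surviving = eliminate(candidates, returns)
print([m["name"] for m in surviving])   # models far from the truth are gone
```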
For nonparametric and countably infinite environment classes (e.g., all computable semimeasures), Bayesian agents based on Thompson sampling are asymptotically optimal in mean: the expected value of the agent’s policy approaches the value of the optimal policy in the true environment (Leike, 2016). Under mild recoverability conditions on the environment, Thompson sampling further offers provably sublinear regret.
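A minimal sketch of Thompson sampling over a finite environment class (here, three candidate two-armed bandits standing in for a countable class of semimeasures; all names are illustrative): sample an environment from the posterior, act optimally for the sample, then apply a Bayes update on the observed outcome.

```python
import random

def update_posterior(posterior, envs, action, reward):
    """Bayes rule over a finite environment class: reweight each candidate
    by the likelihood it assigns to the observed (action, reward) pair."""
    new = {}
    for name, means in envs.items():
        p = means[action] if reward == 1.0 else 1.0 - means[action]
        new[name] = posterior[name] * p
    z = sum(new.values())
    return {name: w / z for name, w in new.items()}

def thompson_action(posterior, envs, rng=random):
    """Sample one environment from the posterior and act optimally for it."""
    names = list(envs)
    sampled = rng.choices(names, weights=[posterior[n] for n in names])[0]
    means = envs[sampled]
    return max(range(len(means)), key=means.__getitem__)

rng = random.Random(1)
envs = {"A": [0.9, 0.1], "B": [0.1, 0.9], "C": [0.5, 0.5]}  # candidate bandits
true_means = envs["B"]                                       # the unknown truth
posterior = {name: 1.0 / 3.0 for name in envs}
for _ in range(200):
    a = thompson_action(posterior, envs, rng)
    r = 1.0 if rng.random() < true_means[a] else 0.0
    posterior = update_posterior(posterior, envs, a, r)
```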
Crucially, these algorithms operate with no reference to structure—Markov, finite-memory, or stationarity—beyond membership in $\mathcal{M}$. This establishes a universal learning framework in the broadest possible sense.
3.2. Meta-Learning and Automated Algorithm Discovery
Recent work has shifted toward automated discovery of RL update rules that generalize across entire environment families. The Learned Policy Gradient (LPG) approach parameterizes the agent update (including “what to predict” and “how to bootstrap”) as a meta-learned function, then optimizes expected returns over a diverse distribution of environments (Oh et al., 2020). Outer meta-optimization tunes these meta-parameters via gradient ascent, typically using batched environments, surrogate losses, and regularization (entropy, L2).
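The outer/inner structure of such meta-optimization can be sketched in a deliberately tiny setting. Below, the "update rule" is reduced to a single meta-parameter (a step size) and the "environments" to one-dimensional quadratic losses; a finite-difference meta-gradient stands in for LPG's backpropagation through agent updates. This is an analogy for the structure, not an implementation of LPG, and all names are illustrative:

```python
import random

def inner_train(eta, target, steps=20):
    """Inner loop ('agent lifetime'): gradient descent on (x - target)^2,
    using the meta-parameter eta as the update rule's step size."""
    x = 0.0
    for _ in range(steps):
        x -= eta * 2.0 * (x - target)
    return (x - target) ** 2          # final loss on this task

def meta_objective(eta, tasks):
    """Outer objective: average final loss across the task distribution."""
    return sum(inner_train(eta, c) for c in tasks) / len(tasks)

def meta_train(eta, tasks, meta_lr=0.05, meta_steps=50, h=1e-4):
    """Outer loop: finite-difference gradient descent on the meta-objective."""
    for _ in range(meta_steps):
        g = (meta_objective(eta + h, tasks) - meta_objective(eta - h, tasks)) / (2 * h)
        eta -= meta_lr * g
    return eta

rng = random.Random(0)
tasks = [rng.uniform(-2.0, 2.0) for _ in range(8)]   # a 'distribution of environments'
eta_learned = meta_train(0.05, tasks)                # meta-learned step size
```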
Subsequent frameworks, such as GROOVE, adversarially design the meta-training set by maximizing algorithmic regret—levels or environments where the discovered algorithm notably underperforms baseline RL optimizers are preferentially selected for future meta-updates (Jackson et al., 2023). This mechanism produces meta-optimizers with demonstrably improved out-of-distribution generalization, e.g., transfer from gridworlds to Atari benchmarks.
4. Generalization and Limits in Function Approximation
When value functions are approximated by a general function class $\mathcal{F}$ (possibly nonlinear), the complexity of general RL decomposes into the complexity of $\mathcal{F}$. Under the Bellman-closure property (every Bellman backup of a function in $\mathcal{F}$ lies in $\mathcal{F}$), algorithms combining least-squares fitting to one-step backups with bonus terms derived from the eluder dimension of $\mathcal{F}$ achieve regret

$$\widetilde{O}\!\left( \mathrm{poly}(d, H) \sqrt{T} \right),$$

with $d$ the eluder dimension of $\mathcal{F}$, $H$ the horizon, and $T$ the number of interaction steps (Wang et al., 2020, Zhao et al., 2023). These results generalize tabular and linear RL sample-complexity guarantees to broad classes including small neural networks and generalized linear models, sidestepping parametric assumptions.
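A scalar-feature sketch of the least-squares-plus-bonus template (illustrative; real algorithms use a full feature map and an elliptical-confidence bonus): ridge-fit the one-step backup targets, then add an optimism bonus that shrinks as the regularized Gram term grows.

```python
import math

def lsvi_backup_1d(phis, rewards, next_values, lam=1.0, beta=0.5):
    """One least-squares value-iteration step with an exploration bonus, in
    the scalar-feature case: ridge-fit w to targets r + V(next), then add a
    bonus proportional to each feature's leverage 1/sqrt(cov)."""
    targets = [r + v for r, v in zip(rewards, next_values)]
    cov = lam + sum(p * p for p in phis)                 # regularized Gram term
    w = sum(p * t for p, t in zip(phis, targets)) / cov  # ridge regression solution
    q_optimistic = [p * w + beta * abs(p) / math.sqrt(cov) for p in phis]
    return q_optimistic, w

phis = [0.5, 1.0, -1.5, 2.0]       # features of visited state-action pairs
true_w = 0.8
rewards = [p * true_w for p in phis]
next_values = [0.0] * 4            # last step of the horizon: no bootstrap target
q_opt, w = lsvi_backup_1d(phis, rewards, next_values)
```

The bonus term is what drives exploration: rarely seen feature directions keep a large optimistic offset until data accumulates.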
The optimal trade-off between exploration (robust upper-confidence bounds), rare policy switching (to reduce deployment cost), and sample reuse (to improve efficiency) is achieved via monotone value-iteration structures, variance-weighted regression, and confidence sets whose widths are controlled by function-class covering numbers (Zhao et al., 2023, Queeney et al., 2022).
5. Broader Generalizations: Occupancy, Utility, and Task-Conditioning
General RL methodologies are being further generalized along multiple axes:
- General-utility RL: Agents optimize arbitrary smooth functionals $F(d^\pi)$ of the normalized occupancy measure $d^\pi$, not just expected return. Critic-actor schemes approximate occupancy measures and allow for global optimality guarantees when $F$ is concave, with sample complexity scaling only with the parameterization dimension (Barakat et al., 2024).
- Distribution-conditioned and context-general policies: Algorithms such as DisCo RL parameterize the policy by a goal distribution $p_z$, represented by a context vector $z$, and construct reward functions as log densities $\log p_z(s)$ under that distribution. This enables the learning of universal policies capable of adapting to entire families of reward and constraint objectives via off-policy actor-critic schemes (Nasiriany et al., 2021).
- Meta-Parameter Discovery and Automated Curriculum: Systematic automatic design of training curricula and update rules through quantifying task informativeness (via algorithmic regret, diversity, or hand-crafted challenge metrics) closes the learning-theoretic loop and realizes practical generalization on real-world and synthetic benchmarks (Jackson et al., 2023, Oh et al., 2020).
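A minimal sketch of the occupancy-measure viewpoint behind general-utility RL: compute the normalized discounted state occupancy of a small Markov chain by fixed-point iteration, then evaluate a concave utility (here entropy, a coverage objective that is not an expected cumulative reward). The example chain and names are illustrative:

```python
import math

def occupancy_measure(P, mu0, gamma=0.9, iters=500):
    """Normalized discounted state occupancy d = (1-gamma) * sum_t gamma^t mu_t,
    computed by iterating d <- (1-gamma)*mu0 + gamma * P^T d to its fixed point."""
    n = len(mu0)
    d = list(mu0)
    for _ in range(iters):
        d = [(1 - gamma) * mu0[j] + gamma * sum(P[i][j] * d[i] for i in range(n))
             for j in range(n)]
    return d

def entropy_utility(d, eps=1e-12):
    """A concave general utility F(d): entropy of the occupancy measure."""
    return -sum(x * math.log(x + eps) for x in d)

P = [[0.9, 0.1, 0.0],   # P[i][j]: probability of moving i -> j under the policy
     [0.0, 0.9, 0.1],
     [0.1, 0.0, 0.9]]
mu0 = [1.0, 0.0, 0.0]
d = occupancy_measure(P, mu0)   # a probability vector over the three states
```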
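The log-density reward construction used by distribution-conditioned policies can be sketched for a Gaussian goal distribution (illustrative; DisCo RL's actual distribution class and conditioning are richer):

```python
import math

def log_density_reward(state, mean, var):
    """DisCo-style reward: the log-density of the current state under a goal
    distribution p_z, here a diagonal Gaussian N(mean, diag(var))."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (s - m) ** 2 / v)
        for s, m, v in zip(state, mean, var)
    )

goal_mean, goal_var = [1.0, -1.0], [1.0, 1.0]   # one sampled goal context z
r_at_goal = log_density_reward(goal_mean, goal_mean, goal_var)   # maximal reward
r_far = log_density_reward([4.0, 4.0], goal_mean, goal_var)      # heavily penalized
```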
6. Applications and Empirical Successes
The most prominent practical instance is AlphaZero (Silver et al., 2017), a domain-agnostic deep RL algorithm combining tabula-rasa self-play, deep policy-value networks, and guided Monte Carlo tree search. It attained superhuman performance in Go, chess, and shogi purely from the game rules, with identical learning and search recipes across all three games, demonstrating in practice that a domain-transferable general RL algorithm is achievable.
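The tree-search side of this recipe centers on the PUCT selection rule, which can be sketched at a single node as follows (a simplified fragment, not AlphaZero's full search):

```python
import math

def puct_select(q, prior, visits, c_puct=1.5):
    """AlphaZero-style PUCT action selection at one tree node:
    argmax_a Q(s,a) + c * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a))."""
    total = sum(visits)
    def score(a):
        u = c_puct * prior[a] * math.sqrt(total) / (1 + visits[a])
        return q[a] + u
    return max(range(len(q)), key=score)

# An unvisited action with a strong network prior can outrank a
# well-explored action with a higher current value estimate.
a = puct_select(q=[0.5, 0.0], prior=[0.1, 0.9], visits=[100, 0])
```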
Meta-learned algorithms such as LPG have been shown to generalize from toy environments to robust performance on the Atari-57 suite, and approaches using adversarial environment design (GROOVE) report superior generalization across unseen MinAtar and Atari games (Oh et al., 2020, Jackson et al., 2023).
Model-free deep RL architectures such as MR.Q demonstrate that a single hyperparameter configuration and model architecture can achieve competitive learning curves and asymptotic performance across diverse continuous and discrete control benchmarks (MuJoCo, DMC, Atari), with no domain-specific adaptation or planning (Fujimoto et al., 2025).
7. Open Problems and Research Frontiers
Fundamental limitations remain regarding computational tractability (optimistic planning in MERL is generally intractable), unavoidable scaling with model- or function-class complexity, and the structure-slip problem: universal algorithms cannot exploit additional, unknown structure (e.g., Markovianity) unless it is explicitly encoded or discovered. Open questions include the tightness of sample-complexity exponents in the discount factor $\gamma$, automatic exploitation of shared substructure to reduce dependence on the class size $N$ or the eluder dimension $d$, and the design of practical heuristics that approximate general RL planning steps in realistic settings (Lattimore et al., 2013).
Emerging directions include hybrid generalization schemes that unify dynamic task-conditioning, occupancy-based utilities, and meta-learned update rules; leveraging richer environment generators for meta-training; and universal policy-search methods scaling to high-dimensional or multi-agent settings (Barakat et al., 2024, Jackson et al., 2023, Leike, 2016). The ultimate goal remains a unified theory and algorithmic toolkit that enables single-shot, data-efficient, and robust RL in arbitrary, possibly hostile or unstructured, interactive environments.