
LEO: Learning Everything All at Once

Updated 4 July 2026
  • LEO is a framework unifying incremental skill learning with parallel all-goals updates, merging diverse methods into one recurrent system.
  • The method uses a vectorized Bellman update and replay-based consolidation to efficiently update over an entire discrete goal set, achieving >250× speedup over naive approaches.
  • Dual implementations in LEO leverage teacher–student paradigms and consolidated retraining to enhance cross-task performance while managing computational trade-offs.

Learning Everything all at Once (LEO) denotes a family of ideas about extracting maximal cross-task or cross-goal information into a single learner, but the term is used in two distinct senses in the literature. In Schmidhuber’s "One Big Net For Everything," the relevant idea is an increasingly general problem solver, called ONE, that acquires skills incrementally and periodically consolidates them into a single recurrent neural system through replay, imitation, prediction, and compression (Schmidhuber, 2018). In the later goal-conditioned reinforcement learning usage, LEO is a specific architectural reformulation in which a network outputs predictions for the entire discrete goal set in one forward pass, enabling efficient parallel all-goals updates rather than naive relabelling over goals (Matthews et al., 22 May 2026). The shared theme is not literal simultaneous end-to-end optimization over every task from the outset, but the use of one representational substrate to absorb broad competence.

1. Terminological scope and conceptual framing

The phrase “Learning Everything all at Once” is most explicitly formalized in the 2026 paper "Goal-Conditioned Agents that Learn Everything All at Once," where it names a method for making all-goals learning practical in goal-conditioned reinforcement learning (Matthews et al., 22 May 2026). There, “everything” means predictions for every goal in a finite goal set are produced simultaneously from a state input, allowing one transition to update all goal heads in parallel.

The same phrase can also be used to read Schmidhuber’s 2018 proposal "One Big Net For Everything" through a broader systems lens (Schmidhuber, 2018). In that setting, the “all at once” effect is realized through periodic global consolidation into one big net rather than through a single monolithic joint training run. The proposal is therefore best understood as a combination of continual learning, multitask consolidation, transfer learning through subroutine reuse, and PowerPlay-style self-improving problem solving inside one recurrent general-purpose computer.

This dual usage matters because it prevents a common misconception. LEO does not necessarily mean that all tasks are jointly optimized from scratch in one static objective. In ONE, tasks arrive incrementally and are folded back into a single model through offline replay-based retraining. In the 2026 goal-conditioned RL formulation, by contrast, the “all at once” aspect is architectural and vectorized: the network predicts values or actions for all goals at once.

2. ONE as a recurrent consolidation architecture

ONE is described as a single recurrent neural network, or more generally any differentiable general-purpose computer, trained via black-box optimization, reinforcement learning, artificial evolution, supervised learning, and unsupervised learning (Schmidhuber, 2018). The central objective is an increasingly general problem solver that can continually acquire additional abilities without losing previous ones, while remaining a single neural system.

At each time step $t$, ONE may receive normal sensory input $in(t) \in \mathbb{R}^m$, reward input $r(t) \in \mathbb{R}^n$, and optionally a dedicated goal vector $goal(t) \in \mathbb{R}^p$. The full sensory stream is defined as

$$sense(t) \in \mathbb{R}^{m+p+n},$$

the concatenation of $in(t)$, $goal(t)$, and $r(t)$. Rewards are vector-valued, with

$$R(t) = \sum_{i=1}^{n} r_i(t), \qquad CR(t) = \sum_{\tau=1}^{t} R(\tau).$$

The network emits an action vector

$$out(t) \in \mathbb{R}^o,$$

and may additionally emit prediction outputs, cumulative reward predictions, and optional learned representations.
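The input and reward bookkeeping above can be illustrated with a minimal sketch. The function names and the toy three-step trial are hypothetical and only mirror the definitions of sense(t), R(t), and CR(t) given in the text; they are not taken from the paper.

```python
from itertools import accumulate

def sense(inp, goal, reward):
    """Concatenate in(t), goal(t), r(t) into sense(t) of length m + p + n."""
    return list(inp) + list(goal) + list(reward)

def scalar_rewards(reward_vectors):
    """R(t): sum of the n reward components at each time step."""
    return [sum(r) for r in reward_vectors]

def cumulative_rewards(reward_vectors):
    """CR(t): running sum of R(1), ..., R(t)."""
    return list(accumulate(scalar_rewards(reward_vectors)))

# Hypothetical trial: m = 2 sensory inputs, p = 2 goal units, n = 2 reward channels.
inputs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
goal = [1.0, 0.0]
rewards = [[0.0, 1.0], [0.5, 0.5], [0.0, 0.0]]

stream = [sense(i, goal, r) for i, r in zip(inputs, rewards)]
print(len(stream[0]))               # 6, i.e. m + p + n
print(cumulative_rewards(rewards))  # [1.0, 2.0, 2.0]
```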

A distinctive mechanism is the use of goal-defining input patterns. If tasks are not transmitted through ordinary inputs, a unique task-specific goal vector in $\mathbb{R}^p$ is selected for each task description, and during the corresponding trial $goal(t)$ is set to that vector.

This makes task-conditioned behavior a property of one shared parameterization rather than a collection of separate controllers.
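As an illustration of goal-defining input patterns, the sketch below assigns each task a unique vector and presents it during a trial. The one-hot coding, the choice of p equal to the number of tasks, and the helper names are assumptions made for this example; the text only requires that the vectors be unique per task.

```python
def make_goal_vectors(task_names):
    """Assign each task a unique goal vector (one-hot coding is an assumption)."""
    p = len(task_names)
    return {
        name: [1.0 if i == k else 0.0 for i in range(p)]
        for k, name in enumerate(task_names)
    }

def goal_stream(goal_vector, trial_length):
    """goal(t) for t = 1..trial_length: the task's vector at every step."""
    return [list(goal_vector) for _ in range(trial_length)]

# Hypothetical task names, used only for illustration.
goal_vectors = make_goal_vectors(["reach", "push", "stack"])
trial_goals = goal_stream(goal_vectors["push"], trial_length=4)
print(trial_goals[0])  # [0.0, 1.0, 0.0]
```

Because the task identity enters only through this input, the same weights serve every task, which is the sense in which behavior is conditioned rather than split across controllers.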

The architecture is explicitly multi-head and recurrent. It combines acting, prediction, cumulative reward estimation, and optional representation learning in one body. This suggests that LEO, in the ONE sense, is primarily a representational unification program: acting, predicting, compressing experience, estimating rewards, and representing task goals are all meant to be absorbed into a single recurrent substrate.

3. Incremental skill acquisition and replay-based consolidation

The training pipeline in ONE has two phases: skill acquisition and dream/consolidation (Schmidhuber, 2018). The starting point is an existing ONE that already solves several tasks and contains predictive knowledge from prior data. Algorithm 1 creates two copies: ONE1, intended to benefit from prior knowledge, and ONE0, a copy of the original or pretraining state that acts as a from-scratch baseline or safety belt.

A black-box optimizer is then applied to ONE0 and ONE1 for up to a fixed budget of seconds. The paper allows neuroevolution, hierarchical neuroevolution, hierarchical policy gradient methods, and asymptotically optimal algorithmic transfer learning. The objective for control is to maximize cumulative reward at the end of the trial, that is, $CR(T)$ for a trial that ends at step $T$.

During this phase, a copy may freeze old weights and add a few new units or connections, retrain all weights, or exploit inherited subroutines. It may also forget prior skills, and the framework explicitly permits this.

The key LEO-like step is offline replay-based retraining for a bounded number of seconds. Relevant traces from old tasks still worth memorizing are replayed; successful traces from the newly trained copy are replayed; and all traces, including failures, are replayed for prediction and compression learning. The paper defines a per-step record as the concatenation of the sensory stream, the action outputs, and the prediction outputs, and optionally the cumulative reward predictions and learned codes; a trial trace is the sequence of these records over the trial.

Retraining then uses standard gradient methods to imitate action outputs $out(t)$ on relevant successful traces, predict future sensory and reward inputs through the prediction outputs, optionally learn codes, and optionally simplify the network via regularization and pruning. Successful relevant traces are used as action targets for behavior retention or new-skill incorporation. Unsuccessful and superseded traces are not used as action targets, but all traces, including failed ones, are used for prediction and world knowledge.

This division between behavior replay and predictive replay is central. Failed behavior should not be preserved as policy, but failed experience still contains information about environment dynamics. The framework therefore treats lifelong experience as both a source of behavioral targets and a source of predictive compression.
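The split between behavior replay and predictive replay can be sketched as a simple selection step over stored traces. The `Trace` record and its `successful` and `relevant` flags are hypothetical bookkeeping introduced for this example; the paper does not prescribe a data structure.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    task: str
    steps: list        # per-step records, left opaque in this sketch
    successful: bool   # did the trial reach its goal?
    relevant: bool     # still worth memorizing, i.e. not superseded

def split_replay_targets(traces):
    """Return (behavior_traces, prediction_traces) for consolidation."""
    # Only successful, still-relevant traces supply action targets.
    behavior = [t for t in traces if t.successful and t.relevant]
    # Every trace, including failures, supplies prediction targets.
    prediction = list(traces)
    return behavior, prediction

history = [
    Trace("old_task", [], successful=True, relevant=True),
    Trace("old_task", [], successful=True, relevant=False),  # superseded
    Trace("new_task", [], successful=False, relevant=True),  # failed attempt
    Trace("new_task", [], successful=True, relevant=True),
]
behavior, prediction = split_replay_targets(history)
print(len(behavior), len(prediction))  # 2 4
```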

4. LEO in goal-conditioned reinforcement learning

In the 2026 formulation, LEO is defined within a standard goal-conditioned MDP. A goal-conditioned policy $\pi(a \mid s, g)$, which selects action $a$ in state $s$ under commanded goal $g$, is trained to maximize the expected discounted return for that goal.

The commanded goal affects the policy but does not affect environment dynamics, so trajectories gathered for one goal can be reused off-policy for other goals (Matthews et al., 22 May 2026).

The standard UVFA-style Q-network takes the goal as an input,

$$Q_\theta : \mathcal{S} \times \mathcal{G} \to \mathbb{R}^{|\mathcal{A}|},$$

with $\mathcal{S}$, $\mathcal{A}$, and $\mathcal{G}$ the state, action, and goal sets. LEO instead defines an all-goals Q-function

$$Q_\theta : \mathcal{S} \to \mathbb{R}^{|\mathcal{G}| \times |\mathcal{A}|}.$$

The paper describes this as currying the goal variable from the input to the output. Operationally, this means a single forward pass computes a matrix with one row per goal and one column per action. Acting for goal $g$ then consists of indexing the corresponding row.
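A minimal sketch of acting with an all-goals output follows. The Q-matrix values are made up, and greedy selection is an assumption chosen for simplicity; the text states only that acting consists of indexing the row for the commanded goal.

```python
def greedy_action(q_matrix, goal_index):
    """Pick the highest-valued action from the row of the commanded goal."""
    row = q_matrix[goal_index]   # one row per goal, one column per action
    return max(range(len(row)), key=row.__getitem__)

# Hypothetical output of one forward pass: 3 goals x 2 actions.
q_matrix = [
    [0.2, 0.9],   # goal 0
    [0.7, 0.1],   # goal 1
    [0.4, 0.5],   # goal 2
]
print(greedy_action(q_matrix, goal_index=0))  # 1
print(greedy_action(q_matrix, goal_index=1))  # 0
```

The same matrix serves every goal, so switching the commanded goal changes only the row index and needs no additional forward pass.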

The resulting all-goals update rule is the vectorized Bellman loss, which for a transition $(s, a, s')$ can be written, up to implementation details such as termination handling, as

$$\mathcal{L}(\theta) = \left\lVert \mathbf{r}(s') + \gamma \max_{a'} Q_\theta(s')_{:, a'} - Q_\theta(s)_{:, a} \right\rVert_2^2.$$

Here, $\mathbf{r}(s')$ is the reward vector over all goals upon entering $s'$, $\gamma$ is the discount factor, $Q_\theta(s)_{:, a}$ is the vector over goals of Q-values for the chosen action, and the max over $a'$ is taken per goal. A single transition $(s, a, s')$ therefore supplies positive signal for goals achieved at $s'$, negative or bootstrapped signal for goals not achieved, and simultaneous updates to all goal-specific heads.
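To make the per-goal structure of the update concrete, here is a small sketch that computes bootstrapped targets and a squared TD loss for every goal from one transition. It uses plain lists for readability, whereas a practical implementation would use a single array operation. The mean-squared form, the omission of terminal masking, and all numbers are assumptions for illustration, not the paper's exact loss.

```python
def all_goals_td_loss(q_s, q_next, action, rewards, gamma):
    """Mean squared TD error over all goals for one transition (s, a, s').

    q_s, q_next: |G| x |A| nested lists of Q-values at s and s'.
    rewards:     length-|G| list, reward for each goal upon entering s'.
    """
    num_goals = len(q_s)
    # Per-goal bootstrapped target: r_g + gamma * max over a' of Q(s')[g][a'].
    targets = [rewards[g] + gamma * max(q_next[g]) for g in range(num_goals)]
    # Per-goal prediction for the action that was actually taken.
    predictions = [q_s[g][action] for g in range(num_goals)]
    errors = [t - p for t, p in zip(targets, predictions)]
    # Averaging over goals is an arbitrary scaling choice for this sketch.
    loss = sum(e * e for e in errors) / num_goals
    return loss, targets

# Hypothetical values: 2 goals, 2 actions; goal 0 is achieved at s'.
q_s = [[0.0, 0.5], [0.2, 0.1]]
q_next = [[1.0, 0.0], [0.0, 0.4]]
loss, targets = all_goals_td_loss(q_s, q_next, action=1,
                                  rewards=[1.0, 0.0], gamma=0.5)
print(targets)
print(loss)
```

Every goal receives a target from the same transition, which is what lets one sample update all goal heads at once.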

For continuous control, the same currying idea is extended to actor-critic, with an all-goals critic loss and an all-goals policy loss. The paper emphasizes, however, that the efficiency gain is strongest on the critic side; actor updates in continuous control still scale with the number of goals because different goal heads imply different actions.

5. Computational properties, empirical behavior, and Dual LEO

The core computational claim of LEO is that naive all-goals relabelling with a UVFA-style architecture is prohibitively expensive because compute scales linearly with the number of goals. In CraftaxGC, where the full goal set has size 512, naive all-goals relabelling enlarges each batch by a factor of 512. LEO avoids this by sharing the expensive state encoder once, producing all goal heads in one forward pass, computing all Bellman errors as one vectorized loss, and updating all heads with one backward pass (Matthews et al., 22 May 2026).
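The scaling argument can be checked with simple arithmetic. The sketch below counts how many inputs pass through the encoder per batch under each scheme; the batch size of 256 is a hypothetical value, and the count is a proxy for compute rather than a wall-clock measurement.

```python
def encoder_inputs_per_batch(batch_size, num_goals, all_goals_head):
    """Number of inputs pushed through the state encoder for one batch."""
    if all_goals_head:
        # LEO: each state is encoded once; all goals are read off the output.
        return batch_size
    # Naive relabelling: each transition is repeated once per goal.
    return batch_size * num_goals

batch_size = 256   # hypothetical
num_goals = 512    # full CraftaxGC goal set, as stated in the text

naive = encoder_inputs_per_batch(batch_size, num_goals, all_goals_head=False)
leo = encoder_inputs_per_batch(batch_size, num_goals, all_goals_head=True)
print(naive, leo, naive // leo)  # 131072 256 512
```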

The paper reports a direct speed comparison on CraftaxGC with goal set size 512: LEO learns with respect to the entire goal set with only a 34% slowdown compared to regular single-goal learning, whereas naive all-goals relabelling is more than two orders of magnitude slower than LEO. The abstract summarizes this as a >250× speed-up compared to all-goals relabelling. In CraftaxGC, this computational benefit is coupled to stronger empirical performance: on the full 512-goal benchmark, PQN and PPO perform poorly, HER gives only marginal improvement, LEO provides a significant boost, Dual LEO (PQN) improves further, and Dual LEO (PPO) is the best overall.

The main benchmarks are goal-conditioned Craftax and JaxGCRL ant maze tasks. CraftaxGC is partially observed, procedurally generated, and built around semantically structured discrete goals, with Craftax-Classic using a goal set of size 136 and full Craftax using a goal set of size 512. Evaluation uses mean success rate across all goals, reported as mean with standard error over 5 seeds. The paper also evaluates continuous-control ant maze by discretizing the continuous goal space into a grid and snapping continuous goals to the closest grid point.
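The grid discretization used for the ant maze evaluation can be sketched as follows. The two-dimensional goal space, the grid bounds, and the resolution are assumptions chosen for the example; the text says only that continuous goals are snapped to the closest grid point.

```python
def make_grid(low, high, points_per_axis):
    """Evenly spaced 2-D grid points covering [low, high] on each axis."""
    step = (high - low) / (points_per_axis - 1)
    axis = [low + i * step for i in range(points_per_axis)]
    return [(x, y) for x in axis for y in axis]

def snap_to_grid(goal, grid):
    """Index of the grid point closest to goal (squared Euclidean distance)."""
    def squared_distance(i):
        gx, gy = grid[i]
        return (gx - goal[0]) ** 2 + (gy - goal[1]) ** 2
    return min(range(len(grid)), key=squared_distance)

grid = make_grid(low=0.0, high=4.0, points_per_axis=5)   # 25 discrete goals
index = snap_to_grid((1.2, 2.9), grid)
print(index, grid[index])  # 8 (1.0, 3.0)
```

The returned index then plays the role of the discrete goal, selecting one output head of the all-goals network.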

A central empirical observation is that LEO does well on hard goals but can underperform on easy goals. The paper attributes this to a late-fusion bottleneck: because the goal is not fed into the shared trunk and only appears in the final decomposition into goal heads, the network must learn a representation that supports all goals simultaneously rather than specializing early to the commanded goal. This makes LEO strong as a broad learner yet sometimes weaker as a direct actor.

Dual LEO is introduced as the practical response to that bottleneck. In the PQN variant, a LEO Q-network and a UVFA Q-network are trained in parallel on the same data stream, and action values are mixed by a linear combination of the two networks' Q-values. The best reported hyperparameter setting for CraftaxGC Dual LEO (PQN) is linear-combination acting with a fixed mixing coefficient and no annealing. In the PPO variant, additional cloning losses push the PPO policy and the PPO value toward targets derived from the LEO network. The best reported hyperparameters in the appendix combine a policy cloning coefficient and a value cloning coefficient, with annealing enabled.
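The linear-combination acting of the PQN variant can be illustrated with a short sketch. The convex form `alpha * Q_LEO + (1 - alpha) * Q_UVFA`, the name `alpha`, and all numeric values are assumptions for this example; the paper's exact mixing rule and coefficient are not reproduced here.

```python
def mixed_greedy_action(q_leo_matrix, q_uvfa_row, goal_index, alpha):
    """Act greedily on a linear mix of LEO and UVFA action values.

    q_leo_matrix: |G| x |A| all-goals output of the LEO network.
    q_uvfa_row:   length-|A| output of the UVFA network for the same goal.
    alpha:        weight on the LEO values (assumed to lie in [0, 1]).
    """
    leo_row = q_leo_matrix[goal_index]
    mixed = [alpha * l + (1.0 - alpha) * u for l, u in zip(leo_row, q_uvfa_row)]
    action = max(range(len(mixed)), key=mixed.__getitem__)
    return action, mixed

# Hypothetical values: the two networks disagree about the best action.
q_leo = [[1.0, 0.0], [0.3, 0.6]]
q_uvfa_goal0 = [0.0, 2.0]

action_mixed, values_mixed = mixed_greedy_action(q_leo, q_uvfa_goal0, 0, alpha=0.5)
action_leo_only, _ = mixed_greedy_action(q_leo, q_uvfa_goal0, 0, alpha=1.0)
print(action_mixed, values_mixed)  # 1 [0.5, 1.0]
print(action_leo_only)             # 0
```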

A useful summary is the following.

| Formulation | Core mechanism | Main bottleneck |
| --- | --- | --- |
| ONE | Incremental skill discovery followed by replay-and-consolidate retraining into one recurrent network | Storage and repeated replay over growing history |
| LEO | Reparameterize the network to output values or actions for the entire discrete goal set in one pass | Finite goal set requirement and late-fusion bottleneck |
| Dual LEO | Use LEO as teacher for a UVFA-style student or mixed controller | Added architectural complexity and dependence on teacher-student complementarity |

6. Relations, assumptions, and limitations

ONE is explicitly presented as an overview of prior lines of work: PowerPlay contributes incremental extension of a problem solver while preserving old competencies; “learning to think” contributes the collapse of controller and world model into a single recurrent computation; the neural history compressor and chunker-automatizer supply the template of discovering difficult skills and then compressing them by gradient descent into a shared network; and earlier predictive world models contribute future observation and reward prediction together with special goal input vectors (Schmidhuber, 2018). A plausible implication is that the ONE interpretation of LEO is chiefly about recurrent consolidation, predictive compression, and subroutine discovery.

The 2026 LEO formulation, by contrast, is not a successor-representation or successor-feature method. It is directly implemented as all-goals value and policy prediction over a discrete goal set (Matthews et al., 22 May 2026). Its main assumptions are that the goal space is finite or discretizable, that goals can be represented as output heads, that the environment allows computing rewards for all relevant goals, and that the output size (one entry per goal-action pair) or the all-goals actor-critic heads are computationally manageable.

Both formulations have important limitations. ONE is primarily a conceptual and algorithmic proposal rather than a large-scale empirical demonstration. It advocates storing essentially the whole sensorimotor life history, which is expensive, and repeated replay over many traces can become computationally heavy. Its transfer argument relies on tasks sharing algorithmic information; if tasks are unrelated, interference may dominate. It also permits forgetting during the exploratory acquisition phase and relies on replay-based reconsolidation to recover competencies. LEO, in the goal-conditioned RL sense, requires a finite goal set, does not scale naturally to high-dimensional continuous goals, can underperform as a direct actor because of the late-fusion bottleneck, and in continuous control the all-goals policy update remains expensive, with throughput reported to drop by about 70%.

Taken together, the two lines of work define complementary meanings of “Learning Everything all at Once.” ONE treats it as continual multitask consolidation into one recurrent general-purpose solver. The later LEO method treats it as parallel all-goals learning through architectural currying of the goal variable. The shared technical intuition is that a learner should not discard side information contained in trajectories: in ONE, stored traces are replayed to preserve skills and improve prediction; in LEO, each transition updates the full goal-conditioned prediction set. In that restricted but important sense, both instantiate the same broader program: compress as much useful structure as possible into one model rather than fragmenting it across isolated tasks or goals.
