
Horizon Reduction Makes RL Scalable (2506.04168v2)

Published 4 Jun 2025 in cs.LG and cs.AI

Abstract: In this work, we study the scalability of offline reinforcement learning (RL) algorithms. In principle, a truly scalable offline RL algorithm should be able to solve any given problem, regardless of its complexity, given sufficient data, compute, and model capacity. We investigate if and how current offline RL algorithms match up to this promise on diverse, challenging, previously unsolved tasks, using datasets up to 1000x larger than typical offline RL datasets. We observe that despite scaling up data, many existing offline RL algorithms exhibit poor scaling behavior, saturating well below the maximum performance. We hypothesize that the horizon is the main cause behind the poor scaling of offline RL. We empirically verify this hypothesis through several analysis experiments, showing that long horizons indeed present a fundamental barrier to scaling up offline RL. We then show that various horizon reduction techniques substantially enhance scalability on challenging tasks. Based on our insights, we also introduce a minimal yet scalable method named SHARSA that effectively reduces the horizon. SHARSA achieves the best asymptotic performance and scaling behavior among our evaluation methods, showing that explicitly reducing the horizon unlocks the scalability of offline RL. Code: https://github.com/seohongpark/horizon-reduction

Summary

  • The paper shows that standard offline RL methods suffer from the 'curse of horizon', where 1-step TD learning accumulates errors and limits scalability on long-horizon tasks.
  • It compares several horizon reduction techniques, such as n-step SAC+BC, hierarchical flow BC, HIQL, and SHARSA, to address these scalability issues.
  • Experimental results reveal that SHARSA, which reduces both value and policy horizons, consistently improves performance across complex goal-conditioned tasks with datasets up to 1B transitions.

This paper, "Horizon Reduction Makes RL Scalable" (2506.04168), investigates the scalability of offline reinforcement learning (RL) algorithms on challenging, long-horizon tasks, using datasets significantly larger than standard benchmarks. The core question explored is whether simply increasing data and compute is sufficient for current offline RL methods to solve complex problems.

The authors find that many standard offline RL algorithms exhibit poor scaling behavior on complex tasks, with performance saturating well below optimal, even when provided with datasets up to 1000x larger (1 billion transitions) than typical. They hypothesize that the primary cause of this poor scaling is the "curse of horizon," which affects both value learning and policy learning.

Key Challenges Identified:

  1. Curse of Horizon in Value Learning: Temporal Difference (TD) learning, commonly used in offline RL algorithms, suffers from accumulating biases (errors) over long horizons. The paper demonstrates this empirically using a didactic "combination-lock" task (2506.04168), showing that Q-errors are significantly higher for standard 1-step TD learning than for n-step TD learning, and that these errors grow with distance from the goal state (see the sketch after this list). Importantly, the authors show that simply increasing network size or tuning other hyperparameters does not effectively mitigate this issue, suggesting it is a fundamental limitation of deep TD learning on long horizons.
  2. Curse of Horizon in Policy Learning: Even with a perfect value function, learning a direct mapping from states to optimal actions can be complex, especially for long horizons where optimal actions might depend on distant goals. The paper suggests that the complexity of this mapping grows with the horizon.
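
To make the value-horizon issue concrete, the sketch below contrasts a standard 1-step TD target with an n-step TD target. With n-step bootstrapping, a reward that is H steps away is propagated through roughly H/n bootstrapped estimates instead of H, which is the intuition behind the paper's horizon-reduction argument. This is an illustrative sketch, not the paper's code; q_target and policy are assumed callables mapping batched states (and actions) to values and actions.

import torch

def td_target_1step(r, s_next, gamma, q_target, policy):
    # Standard 1-step TD target: r_t + gamma * Q_target(s_{t+1}, a_{t+1}).
    # Every backup bootstraps, so errors can compound over ~H steps of a horizon-H task.
    with torch.no_grad():
        a_next = policy(s_next)
        return r + gamma * q_target(s_next, a_next)

def td_target_nstep(rewards, s_n, gamma, q_target, policy):
    # n-step TD target: sum_{i=0}^{n-1} gamma^i r_{t+i} + gamma^n * Q_target(s_{t+n}, a_{t+n}).
    # rewards has shape (batch, n) and holds r_t, ..., r_{t+n-1}.
    # Only ~H/n backups bootstrap, which is the value-horizon-reduction effect.
    with torch.no_grad():
        n = rewards.shape[1]
        discounts = gamma ** torch.arange(n, dtype=rewards.dtype, device=rewards.device)
        n_step_return = (rewards * discounts).sum(dim=1)
        a_n = policy(s_n)
        return n_step_return + (gamma ** n) * q_target(s_n, a_n)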

Proposed Solution: Horizon Reduction Techniques

Based on these insights, the paper demonstrates that explicitly reducing the effective horizon in value learning, policy learning, or both substantially enhances the scalability of offline RL.

The paper evaluates four types of horizon reduction techniques:

  • n-step SAC+BC: Reduces the value horizon by using n-step TD targets instead of 1-step targets, similar to n-step DQN. This addresses the bias accumulation in value learning.
  • Hierarchical Flow BC (HFBC): Reduces the policy horizon by training a two-level hierarchical policy. A high-level policy learns to propose intermediate subgoals (future states), and a low-level policy learns to reach these subgoals. Both levels use flow-based behavioral cloning (BC). This simplifies the policy learning problem.
  • HIQL: Similar to HFBC, it uses a hierarchical policy structure extracted from a flat value function learned with a decoupled method (Implicit V-learning). This primarily reduces the policy horizon.
  • SHARSA (State-High-level Action-Reward-State-High-level Action): A novel, minimal method proposed by the authors that reduces both value and policy horizons. It combines hierarchical flow BC with n-step SARSA for high-level value learning.

SHARSA Implementation Details:

SHARSA aims to be simple and scalable. It uses:

  • Hierarchical Flow BC: Trains a high-level flow policy \pi_\beta^h(w \mid s, g) to predict the subgoal w = s_{h+n} (the state n steps ahead) from the state s_h and the final goal g, and a low-level flow policy \pi_\beta^\ell(a \mid s, w) to predict the action a_h from the state s_h needed to reach the subgoal w = s_{h+n}. Both are trained with standard flow-matching BC objectives on dataset trajectories.
  • n-step SARSA: Learns a high-level Q-function Q^h(s, w, g) and V-function V^h(s, g) using n-step SARSA targets. The Q-function estimates the value of reaching subgoal w from state s given final goal g, over an n-step horizon. The authors found that using a binary cross-entropy (BCE) loss for the value functions (treating values as probabilities) worked better than standard regression, especially on sparse-reward tasks.
  • Rejection Sampling for High-level Policy Extraction: Instead of gradient-based policy optimization, the high-level policy used at execution time is obtained by sampling multiple candidate subgoals w_i from the high-level BC policy \pi_\beta^h(\cdot \mid s, g) and selecting the one with the highest value under the learned Q^h(s, w_i, g). This avoids issues with ill-defined gradients in the state space and leverages the expressiveness of flow policies.
  • Low-level Policy: The low-level policy used at execution time can either be the low-level BC policy \pi_\beta^\ell(\cdot \mid s, w) directly (SHARSA) or use another layer of rejection sampling based on a low-level SARSA value function (Double SHARSA).
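
For concreteness, the following is a minimal sketch of the two loss types referenced above: a conditional flow-matching BC loss (matching the velocity-field notation in the pseudocode below) and a BCE-based value loss. This is illustrative rather than the authors' implementation; velocity_net is an assumed network taking (t, s, x_t, cond), and the value targets are assumed to be scaled into [0, 1].

import torch
import torch.nn.functional as F

def flow_bc_loss(velocity_net, s, x, cond):
    # Flow-matching BC loss for a conditional flow policy.
    # x is the target output (a subgoal state for the high level, an action for the low level);
    # cond is the extra conditioning (final goal g for the high level, subgoal w for the low level).
    z = torch.randn_like(x)                         # noise sample
    t = torch.rand(x.shape[0], 1, device=x.device)  # interpolation time in [0, 1]
    x_t = (1.0 - t) * z + t * x                     # point on the straight path from z to x
    target_velocity = x - z                         # velocity of that path
    pred_velocity = velocity_net(t, s, x_t, cond)
    return F.mse_loss(pred_velocity, target_velocity)

def bce_value_loss(pred_logits, target_values):
    # BCE loss for value learning, treating (scaled) values as probabilities in [0, 1].
    return F.binary_cross_entropy_with_logits(pred_logits, target_values)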

The training process optimizes the flow BC losses for the hierarchical policies and the SARSA losses for the hierarchical value functions simultaneously on minibatches sampled from the dataset, as sketched in the pseudocode below.

for step in range(total_gradient_steps):
    # Sample a batch of n-step transitions (s_h, a_h, ..., s_{h+n}) and goals g from dataset
    batch = sample_batch_from_dataset(dataset, n_step_horizon)

    # Update high-level flow BC policy v_h
    # Minimize flow-matching loss: ||v_h(t, s_h, w^t, g) - (w - z)||
    optimize(high_level_bc_loss(batch))

    # Update low-level flow BC policy v_l
    # Minimize flow-matching loss: ||v_l(t, s_h, a^t, w) - (a - z)|| where w = s_{h+n}
    optimize(low_level_bc_loss(batch))

    # Update high-level V function V_h
    # Minimize D(V_h(s_h, g), Q_h_target(s_h, s_{h+n}, g))
    optimize(high_level_v_loss(batch))

    # Update high-level Q function Q_h
    # Minimize D(Q_h(s_h, s_{h+n}, g), sum(gamma^i * r(s_{h+i}, g)) + gamma^n * V_h(s_{h+n}, g))
    optimize(high_level_q_loss(batch))

    # Update target networks
    update_target_networks()

def pi(s, g):
    # High-level: Rejection sampling for subgoal w
    candidate_subgoals = sample_from_high_level_flow_bc(s, g, num_samples=N)
    high_level_values = evaluate_high_level_q(s, candidate_subgoals, g)
    best_subgoal = candidate_subgoals[argmax(high_level_values)]

    # Low-level: Behavioral cloning for action a
    action = sample_from_low_level_flow_bc(s, best_subgoal)
    return action

(Note: The full Double SHARSA pseudocode includes training low-level value functions and using rejection sampling for the low-level policy as well, as shown in the paper's appendix.)
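
As a usage illustration, a goal-conditioned evaluation rollout with the extracted policy pi(s, g) might look as follows. The environment API and the choice to re-select the subgoal at every step are assumptions made for clarity, not details taken from the paper.

def evaluate_episode(env, goal, max_steps=1000):
    # Roll out the hierarchical policy pi(s, g) defined above in a goal-conditioned episode.
    s = env.reset()
    for _ in range(max_steps):
        a = pi(s, goal)                     # rejection-sampled subgoal + low-level flow BC action
        s, reward, done, info = env.step(a)
        if done:                            # e.g. the goal was reached or the episode ended
            break
    return info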

Experimental Results:

The paper evaluates methods on four challenging goal-conditioned tasks from OGBench: cube-octuple, puzzle-4x5, puzzle-4x6, and humanoidmaze-giant (2506.04168), using datasets of 1M, 10M, 100M, and 1B transitions. These tasks require complex, long-horizon reasoning and manipulation.

  • Standard Methods (Flow BC, IQL, CRL, SAC+BC): Consistently struggle on the hardest tasks, achieving near-zero performance on cube-octuple even with 1B data. Their performance saturates quickly on other tasks. Increasing model size or adjusting other standard hyperparameters provides only limited improvements and does not solve the fundamental scaling issue.
  • Horizon Reduction Methods:
    • n-step SAC+BC shows substantially improved scaling and asymptotic performance compared to standard SAC+BC, demonstrating the benefit of value horizon reduction.
    • Hierarchical Flow BC shows improvements on tasks like cube-octuple, indicating the importance of policy horizon reduction, even without RL.
    • HIQL, which primarily uses a hierarchical policy, also shows improvements on some tasks but relies on a flat value function which can still be affected by the value horizon issue.
    • SHARSA (and Double SHARSA) exhibits the best overall scaling behavior and asymptotic performance among evaluated methods. It is the only method to achieve non-trivial performance on all four challenging tasks, highlighting the combined benefit of reducing both value and policy horizons. While not reaching 100% success on all tasks even with 1B data, it shows a clear scaling trend where standard methods fail to improve.

Practical Implications and Future Work:

The key takeaway for practitioners is that for complex, long-horizon problems, simply collecting more offline data and using larger standard RL models is unlikely to yield significant performance improvements due to fundamental issues related to the horizon. Techniques that explicitly tackle the horizon, such as n-step returns or hierarchical policies, are crucial for unlocking scalability.

SHARSA provides a practical starting point for building scalable offline RL agents by combining simple, robust components (BC, SARSA, flow networks) with horizon reduction via hierarchy and n-step returns. The ablation studies suggest that the choice of value learning method (SARSA vs. IQL) within SHARSA is less critical than the policy extraction method (rejection sampling preferred over gradient-based for state-space subgoals) and value loss function (BCE loss found beneficial).

The paper concludes with a call for research, emphasizing the need for developing and evaluating offline RL algorithms directly on large-scale datasets and complex tasks to assess their true scalability. Open questions include:

  • Can we develop offline RL methods that avoid TD learning entirely?
  • Can hierarchical methods scale beyond two levels easily?
  • Is the curse of horizon fundamentally solvable in deep RL architectures?

To facilitate this research, the authors have open-sourced their tasks, datasets (up to 1B transitions), and implementations (2506.04168).
