History-Aware Policy Optimization (HAPO)

Updated 1 March 2026
  • History-Aware Policy Optimization (HAPO) is a framework that integrates historical data to enhance policy updates in reinforcement and imitation learning.
  • It leverages trajectory-level statistics and memory compression techniques to improve exploration, credit assignment, and efficiency in sparse-reward and long-horizon environments.
  • Applications span language model reasoning, robotic control, and GUI navigation, demonstrating reduced computational overhead and improved performance metrics.

History-Aware Policy Optimization (HAPO) refers to a class of algorithms and frameworks in reinforcement learning (RL) and imitation learning that explicitly leverage historical data—either as trajectory-level statistics, problem-specific histories, or temporally aggregated representations—to shape policy updates, improve exploration, and induce efficient behavior. Unlike memoryless or strictly per-timestep policy optimization schemes, HAPO methods encode, compress, or reward the model’s use of history (e.g., observation-action sequences, answer length records, or moment-token memories) to address the credit assignment, exploration, and efficiency challenges that arise in long-horizon, partially-observable, or sample-inefficient regimes (Huang et al., 16 May 2025, Trivedi et al., 26 Aug 2025, Koo et al., 1 Oct 2025, Zhou et al., 1 Dec 2025).

1. Conceptual Foundations and Motivation

HAPO frameworks are motivated by several shortcomings of conventional policy optimization methods:

  • Sparse reward environments: In long-horizon tasks, signal vanishes when updates are distributed per token or timestep instead of via the full trajectory. HAPO uses trajectory-level or aggregated statistics to address this issue (Trivedi et al., 26 Aug 2025).
  • Exploration under trajectory correlation: Standard approaches (e.g., PPO, token-level RLHF) can collapse exploration, particularly for listwise structured outputs or RL environments where return is jointly determined by entire action sequences, not individual steps.
  • Task-specific efficiency goals: In settings like LLM reasoning, where concise yet correct outputs are preferred (e.g., math problem-solving), history-based policies can reward incremental improvements over the model’s own previously discovered solutions (Huang et al., 16 May 2025).
  • Partial observability and long-term dependency: In robotic control, GUI navigation, or vision-language-action (VLA) models, optimal policies must condition on varying lengths of the recent context, which HAPO retrieves and compresses dynamically (Koo et al., 1 Oct 2025, Zhou et al., 1 Dec 2025).

Thus, HAPO provides a unified framework to integrate historical information into the optimization protocol—either as an optimization target (e.g., per-query best-so-far), as dynamic memory (moment tokens/anchors), or via exploration/exploitation tradeoffs aggregated over entire trajectories.

2. Methodological Templates and Key Algorithms

The term “History-Aware Policy Optimization” encompasses diverse implementations, characterized primarily by their mechanism of encoding and leveraging history. The following table summarizes representative approaches:

| Approach | History Encoding | Optimization Target |
|---|---|---|
| HAPO (Huang et al., 16 May 2025) | Per-problem minimum solution length | Reward shaping: reward for producing solutions shorter than the best prior correct solution |
| HAEPO (Trivedi et al., 26 Aug 2025) | Trajectory log-likelihood + returns | Weighting and exploration via a Plackett–Luce softmax over entire trajectories |
| HAMLET (Koo et al., 1 Oct 2025) | Learnable moment tokens + memory network | Action prediction conditioned on compressed cross-timestep features |
| HCPO (Zhou et al., 1 Dec 2025) | Variable-length truncated history + anchor tokens | Adaptive sampling, dual-branch compression, and alignment losses |

HAPO (LLM Compression):

  • Maintains a scalar “history state” $h_i$ for each training problem, which records the minimum length of any previous correct solution.
  • Defines a combined reward: binary correctness plus a per-trace length reward that is positive if and only if the new solution is both correct and shorter than $h_i$.
  • Updates $h_i$ after each batch, so the policy incrementally “beats its own best,” compressing reasoning over training (Huang et al., 16 May 2025).
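
The minimal Python sketch below illustrates this per-problem bookkeeping. The data structures, the immediate (rather than per-batch) update of $h_i$, and the handling of the first correct solution are simplifying assumptions for illustration, not the paper's implementation.

```python
import math

# history maps each problem id to the shortest correct solution length seen so far.
history: dict[str, int] = {}

def length_reward(length: int, h: int) -> float:
    # Cosine shaping (cf. Section 3): positive only when length < h, zero at
    # length == h, and negative (clipped at -1) when the trace is longer.
    return math.cos(min(math.pi / 2 * length / h, math.pi))

def reward_and_update(problem_id: str, is_correct: bool, length: int) -> float:
    r = float(is_correct)                      # binary correctness term
    h = history.get(problem_id)
    if is_correct and h is not None:
        r += length_reward(length, h)          # positive only if the trace beats h_i
    if is_correct and (h is None or length < h):
        history[problem_id] = length           # updated per trace here for simplicity;
                                               # the paper updates h_i once per batch
    return r
```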

HAEPO (Exploratory RL and LLM Tuning):

  • For each batch, compresses every trajectory $k$ into its summed log-likelihood $L_k$.
  • Computes a Plackett–Luce softmax over these, yielding per-trajectory weights $w_k$ that encourage preference for diverse, high-likelihood rollouts.
  • The objective combines a weighted return with an entropy bonus (to prevent collapse) and a soft KL penalty toward a frozen previous policy (for stability). Returns are normalized per batch (sum-normalization or z-score) (Trivedi et al., 26 Aug 2025). A schematic implementation is sketched below.
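
As a rough illustration (not the authors' code), the listwise weighting and regularizers described above could be assembled as follows; the z-score normalization variant and the Monte-Carlo proxy for the KL penalty are assumptions made for the sketch.

```python
import torch

def haepo_style_loss(traj_logps: torch.Tensor,     # (K,) summed log-likelihoods L_k
                     returns: torch.Tensor,         # (K,) raw trajectory returns
                     ref_traj_logps: torch.Tensor,  # (K,) same trajectories under a frozen reference
                     beta_ent: float = 0.01,
                     lam: float = 0.1) -> torch.Tensor:
    # Plackett–Luce-style softmax: w_k = exp(L_k) / sum_j exp(L_j)
    w = torch.softmax(traj_logps, dim=0)
    # Batch-level normalization of returns (z-score variant).
    norm_ret = (returns - returns.mean()) / (returns.std() + 1e-8)
    weighted_return = (w * norm_ret).sum()
    # Entropy bonus H(w) keeps weight from collapsing onto a single trajectory.
    entropy = -(w * torch.log(w + 1e-8)).sum()
    # Crude Monte-Carlo proxy for KL(pi || pi_ref) over the sampled trajectories.
    kl_to_ref = (traj_logps - ref_traj_logps).mean()
    # Minimize the negative of the regularized objective.
    return -(weighted_return + beta_ent * entropy) + lam * kl_to_ref
```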

HAMLET (Vision-Language-Action Policies):

  • At each timestep, represents state history using moment tokens (compact learnable embeddings) extracted from the visual input via a frozen backbone.
  • Aggregates these with a causal Transformer memory module, forming a history-augmented feature used for action selection.
  • Initial moment tokens are pre-trained with a time-contrastive loss to ensure temporal discrimination (Koo et al., 1 Oct 2025).
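
A minimal PyTorch sketch of such a causal memory over moment tokens is given below; the dimensions, the use of `nn.TransformerEncoder`, and taking the final position as the history feature are illustrative assumptions rather than the HAMLET architecture itself.

```python
import torch
import torch.nn as nn

class MomentMemory(nn.Module):
    """Causal Transformer over per-timestep moment tokens (illustrative only)."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, moment_tokens: torch.Tensor) -> torch.Tensor:
        # moment_tokens: (batch, T, d_model), one compressed token per past timestep.
        T = moment_tokens.size(1)
        causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        hidden = self.encoder(moment_tokens, mask=causal_mask)
        # The representation at the current timestep serves as the history-augmented
        # feature that would be fused with the observation before the action head.
        return hidden[:, -1]

memory = MomentMemory()
tokens = torch.randn(2, 8, 256)     # 2 rollouts, 8 past timesteps of moment tokens
history_feature = memory(tokens)    # shape (2, 256)
```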

HCPO (GUI Agent Optimization):

  • Employs Dynamic Context Sampling (DCS), providing variable-length observation/action histories at each step, driven by a schedule that starts uniform and increasingly favors longer contexts.
  • During policy updates, utilizes Anchor-guided History Compression (AHC): branches with full vs. action-only compressed histories are both optimized, and a KL alignment term ensures consistency and efficiency (Zhou et al., 1 Dec 2025).
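
The schedule behind Dynamic Context Sampling can be pictured with a small sketch like the following; the linear interpolation between a uniform and a length-proportional distribution is my own assumption about how a schedule that "increasingly favors longer contexts" might be realized.

```python
import numpy as np

def sample_history_length(step: int, total_steps: int, max_len: int = 8,
                          rng: np.random.Generator | None = None) -> int:
    """Sample how many past (observation, action) pairs to include in the context."""
    rng = rng or np.random.default_rng()
    lengths = np.arange(1, max_len + 1)
    uniform = np.full(max_len, 1.0 / max_len)
    long_biased = lengths / lengths.sum()          # mass proportional to length
    alpha = min(step / total_steps, 1.0)           # 0 -> uniform, 1 -> long-biased
    probs = (1 - alpha) * uniform + alpha * long_biased
    return int(rng.choice(lengths, p=probs))

# Example: at training step 5,000 of 20,000 the sampler is still close to uniform;
# by the final steps it heavily favors the longest contexts.
k = sample_history_length(step=5_000, total_steps=20_000)
```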

3. Loss Functions, Regularization, and Pseudocode

HAPO-style methods typically deviate from classical per-timestep or per-token RL objectives. Some notable design elements:

  • Plackett–Luce softmax (HAEPO): Batch-level weights $w_k = \exp(L_k) / \sum_j \exp(L_j)$ distribute credit across entire trajectories.
  • Entropy regularization: HAEPO adds $H(w) = -\sum_k w_k \log w_k$ with a tunable coefficient $\beta_{\mathrm{ent}}$ to prevent all weight from concentrating on a single trajectory, sustaining exploration.
  • KL trust-region penalty: Soft regularization toward a reference policy ensures update stability, with coefficient $\lambda$ (Trivedi et al., 26 Aug 2025).
  • History state update (HAPO for LLMs): For each problem, upon generating a correct solution (one matching the gold answer), update $h_i$ if the new trace is shorter. The length reward is shaped by $f(\ell, h) = \cos\left(\min\left(\frac{\pi}{2} \frac{\ell}{h}, \pi\right)\right)$, with careful clipping to avoid over-penalizing short but incorrect attempts (Huang et al., 16 May 2025).
  • Dual-branch and compressed-history alignment: HCPO’s $\mathcal{L}_{\mathrm{HCPO}}$ combines GRPO-style policy losses on both the full-history and compressed-history branches plus a KL alignment objective (Zhou et al., 1 Dec 2025); a schematic sketch of this structure follows below.
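
To make the dual-branch idea concrete, the sketch below pairs a simple policy-gradient loss on each branch with a KL term pulling the compressed-history branch toward the full-history branch. The plain policy-gradient surrogate (standing in for GRPO) and the interface assumed for `policy` are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dual_branch_loss(policy, full_ctx, compressed_ctx, actions, advantages,
                     kl_coeff: float = 0.1) -> torch.Tensor:
    # policy(ctx) is assumed to return per-action log-probabilities, (batch, n_actions).
    logp_full = policy(full_ctx)
    logp_comp = policy(compressed_ctx)

    def pg_loss(logp: torch.Tensor) -> torch.Tensor:
        chosen = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
        return -(advantages * chosen).mean()

    # KL(full-history distribution || compressed-history distribution): keeps the
    # cheaper compressed branch consistent with the full-context branch.
    kl = F.kl_div(logp_comp, logp_full, log_target=True, reduction="batchmean")
    return pg_loss(logp_full) + pg_loss(logp_comp) + kl_coeff * kl
```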

Pseudocode provided in each source paper (see (Trivedi et al., 26 Aug 2025) for HAEPO, (Huang et al., 16 May 2025) for HAPO, (Zhou et al., 1 Dec 2025) for HCPO) demonstrates the centrality of batch-level history statistics and joint optimization over trajectory-level history as opposed to standard on-policy or off-policy gradient methods.

4. Empirical Evaluation and Applications

HAPO frameworks have been empirically tested across a range of domains, each exploiting different facets of the history-aware paradigm:

  • LLM reasoning compression: HAPO achieves 33–59% reduction in reasoning chain length with only 2–5% accuracy degradation relative to base LLMs on math problems spanning GSM8K, MATH500, and AIME2024 (Huang et al., 16 May 2025).
  • RL benchmarks: In multi-armed bandit settings, HAEPO matches or exceeds PPO and DPO in regret minimization, sustaining higher entropy and lower variance for large $K$; on 1D random walks, it attains near-optimal success rates faster than PPO and DPO while exhibiting reduced variance (Trivedi et al., 26 Aug 2025).
  • LLM reward modeling: On TL;DR summarization (LLaMA, Qwen), HAEPO’s listwise, trajectory-level reward yields the best human preference scores and fastest training throughput compared to DPO and GRPO (Trivedi et al., 26 Aug 2025).
  • Robotic manipulation: HAMLET, built on GR00T N1.5, lifts success rates by up to 47.2 percentage points on real-world long-horizon tasks compared to VLA baselines, with only marginal computational overhead (Koo et al., 1 Oct 2025).
  • GUI navigation: HCPO enables a 3B-parameter agent to outperform much larger (7B) models on out-of-distribution GUI benchmarks (e.g., GUI-Odyssey, AndroidInTheWild) while reducing FLOPs by 60% (Zhou et al., 1 Dec 2025).

These results collectively suggest that HAPO-style architectures—by leveraging explicit history—excel in domains where long-horizon dependency, sparse rewards, or efficiency constraints are bottlenecks.

5. Limitations and Open Challenges

Despite encouraging results, multiple limitations are documented in the literature:

  • Hyperparameter sensitivity: Regularization coefficients (entropy, KL), normalization schemes, and memory module sizes typically require nontrivial tuning between tasks (Trivedi et al., 26 Aug 2025, Zhou et al., 1 Dec 2025).
  • Compute and memory overhead: Although techniques such as moment tokens or anchor pruning reduce incremental cost, storing, maintaining, and reprocessing batch- or task-level history can be expensive, especially for hierarchical or continual learning extensions (Trivedi et al., 26 Aug 2025, Koo et al., 1 Oct 2025).
  • Per-query history and generalization: HAPO methods with per-query history (HAPO for LLMs) maintain state for each training example, potentially limiting sample efficiency when generalization across problems is desired (Huang et al., 16 May 2025).
  • Scale and multi-agent compatibility: Demonstrations are currently limited to tasks with $\lesssim 10^3$ steps or single-agent scenarios. Large-scale public LLM benchmarks and multi-agent settings are untested, and generalization to cross-episode or lifelong learning is an open problem (Trivedi et al., 26 Aug 2025).
  • History encoding compression: Optimal strategies for history compression, anchoring, or gating (e.g., moment token number, Transformer memory depth) remain empirical and domain-specific (Koo et al., 1 Oct 2025).

6. Future Directions and Potential Extensions

Several directions are identified for extending HAPO:

  • Automated regularization schedules: Meta-gradient or annealing approaches for auto-tuning entropy and KL penalties, potentially improving stability and transfer across domains (Trivedi et al., 26 Aug 2025).
  • Hierarchical and continual memory: Hierarchical versions of HAEPO are proposed for ultra-long-horizon or continual settings, with truncated backpropagation or memory buffers (Trivedi et al., 26 Aug 2025, Koo et al., 1 Oct 2025).
  • Adaptive memory allocation: Learning to represent and select only the most temporally informative features (e.g., via gating or attention) for history encoding (Koo et al., 1 Oct 2025).
  • Integration with other RL paradigms: Combining history-aware weighting with on-policy objectives (e.g., PPO) or offline RL, and extending it to joint trajectory ranking in multi-agent contexts (Trivedi et al., 26 Aug 2025).
  • Broader benchmarks: Future work will likely apply HAPO to larger LLM alignment corpora (AlpacaEval, MT-Bench), real-world robotics, and lifelong reasoning benchmarks as computational resources allow (Trivedi et al., 26 Aug 2025, Huang et al., 16 May 2025).
  • Richer history representations: For problem-specific history, tracking more than a scalar (e.g., distributional length statistics, histograms) could potentially drive even more efficient behavioral compression (Huang et al., 16 May 2025).

Beyond these specific instantiations, HAPO subsumes and extends several earlier approaches:

  • Trajectory-level and listwise RL: Unlike pairwise (DPO) or per-timestep (GRPO) objectives, HAEPO and similar listwise-weighted frameworks use trajectory-level normalization and ranking to amplify high-quality, diverse trajectories (Trivedi et al., 26 Aug 2025).
  • Contrastive and memory-based architectures: Models like HAMLET demonstrate that temporally-aware contrastive pretraining and lightweight memory transformers efficiently transmit historical perceptual cues, achieving substantial improvements even with frozen vision-language backbones (Koo et al., 1 Oct 2025).
  • History-enhanced policy optimization (HCPO): For domains where context size must be dynamically tuned for each decision (e.g., GUI agents), HCPO provides evidence that dual-branch, anchor-guided compression can trade off context fidelity and efficiency, outperforming fixed-window or fully compressed baselines (Zhou et al., 1 Dec 2025).

In sum, History-Aware Policy Optimization provides a rigorously tested, flexible, and increasingly adopted paradigm for addressing the challenges of sparse-reward, long-horizon, and history-dependent environments in RL, LLM alignment, and sequential decision-making (Huang et al., 16 May 2025, Trivedi et al., 26 Aug 2025, Koo et al., 1 Oct 2025, Zhou et al., 1 Dec 2025).
