History-Aware Policy Optimization (HAPO)
- History-Aware Policy Optimization (HAPO) is a framework that integrates historical data to enhance policy updates in reinforcement and imitation learning.
- It leverages trajectory-level statistics and memory compression techniques to improve exploration, credit assignment, and efficiency in sparse-reward and long-horizon environments.
- Applications span language model reasoning, robotic control, and GUI navigation, demonstrating reduced computational overhead and improved performance metrics.
History-Aware Policy Optimization (HAPO) refers to a class of algorithms and frameworks in reinforcement learning (RL) and imitation learning that explicitly leverage historical data—either as trajectory-level statistics, problem-specific histories, or temporally aggregated representations—to shape policy updates, improve exploration, and induce efficient behavior. Unlike memoryless or strictly per-timestep policy optimization schemes, HAPO methods encode, compress, or reward the model’s use of history (e.g., observation-action sequences, answer length records, or moment-token memories) to address the credit assignment, exploration, and efficiency challenges that arise in long-horizon, partially-observable, or sample-inefficient regimes (Huang et al., 16 May 2025, Trivedi et al., 26 Aug 2025, Koo et al., 1 Oct 2025, Zhou et al., 1 Dec 2025).
1. Conceptual Foundations and Motivation
HAPO frameworks are motivated by several shortcomings of conventional policy optimization methods:
- Sparse reward environments: In long-horizon tasks, the learning signal is diluted when updates are distributed per token or per timestep rather than over the full trajectory. HAPO methods use trajectory-level or aggregated statistics to address this issue (Trivedi et al., 26 Aug 2025).
- Exploration under trajectory correlation: Standard approaches (e.g., PPO, token-level RLHF) can collapse exploration, particularly for listwise structured outputs or RL environments where return is jointly determined by entire action sequences, not individual steps.
- Task-specific efficiency goals: In settings like LLM reasoning, where concise yet correct outputs are preferred (e.g., math problem-solving), history-based policies can reward incremental improvements over the model's own previously discovered solutions (Huang et al., 16 May 2025).
- Partial observability and long-term dependency: In robotic control, GUI navigation, or vision-language-action (VLA) models, optimal policies must condition on varying lengths of the recent context, which HAPO retrieves and compresses dynamically (Koo et al., 1 Oct 2025, Zhou et al., 1 Dec 2025).
Thus, HAPO provides a unified framework to integrate historical information into the optimization protocol—either as an optimization target (e.g., per-query best-so-far), as dynamic memory (moment tokens/anchors), or via exploration/exploitation tradeoffs aggregated over entire trajectories.
2. Methodological Templates and Key Algorithms
The term “History-Aware Policy Optimization” encompasses diverse implementations, characterized primarily by their mechanism of encoding and leveraging history. The following table summarizes representative approaches:
| Approach | History Encoding | Optimization Target |
|---|---|---|
| HAPO (Huang et al., 16 May 2025) | Per-problem minimum solution length | Reward shaping: bonus for correct solutions shorter than the best prior correct solution |
| HAEPO (Trivedi et al., 26 Aug 2025) | Trajectory log likelihood + returns | Weighting and exploration via Plackett–Luce softmax over entire trajectories |
| HAMLET (Koo et al., 1 Oct 2025) | Learnable moment tokens + memory network | Action prediction conditioned on compressed cross-timestep features |
| HCPO (Zhou et al., 1 Dec 2025) | Variable-length truncated history + anchor tokens | Adaptive sampling, dual-branch compression, and alignment losses |
HAPO (LLM Compression):
- Maintains a scalar "history state" $h_i$ for each training problem $i$, recording the minimum length of any previously generated correct solution.
- Defines a combined reward: binary correctness plus a per-trace length reward that is positive if and only if the new solution is both correct and shorter than $h_i$.
- Updates $h_i$ after each batch, so the policy incrementally "beats its own best," compressing its reasoning over the course of training (Huang et al., 16 May 2025); see the sketch below.
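A minimal sketch of this history-state mechanism is given below, assuming a simple linear length bonus with clipping (the paper's exact shaping function and coefficients may differ):

```python
# Minimal sketch of HAPO-style per-problem history state and combined reward.
# The linear, clipped length bonus below is an illustrative assumption.

def length_reward(new_len: int, best_len: float) -> float:
    """Positive only when the new trace beats the recorded best length."""
    if best_len == float("inf"):           # no correct solution recorded yet
        return 0.0
    # Relative improvement, clipped to [-1, 1] for stability.
    return max(-1.0, min(1.0, (best_len - new_len) / best_len))


def hapo_reward(correct: bool, new_len: int, history: dict, problem_id: str,
                length_weight: float = 0.5) -> float:
    """Binary correctness reward plus history-aware length reward; updates history."""
    best_len = history.get(problem_id, float("inf"))
    reward = float(correct)
    if correct:
        reward += length_weight * length_reward(new_len, best_len)
        # History state update: keep the shortest correct solution seen so far.
        history[problem_id] = min(best_len, new_len)
    return reward


# Usage: the history dict persists across batches, so the policy must keep
# "beating its own best" to earn the length bonus.
history = {}
hapo_reward(correct=True, new_len=420, history=history, problem_id="gsm8k_17")
hapo_reward(correct=True, new_len=310, history=history, problem_id="gsm8k_17")
```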
HAEPO (Exploratory RL and LLM Tuning):
- For each batch, compresses every trajectory $\tau$ into its summed log-likelihood $\log \pi_\theta(\tau) = \sum_t \log \pi_\theta(a_t \mid s_t)$.
- Computes a Plackett–Luce softmax over these log-likelihoods, yielding per-trajectory weights that favor diverse, high-likelihood rollouts.
- The objective is the weighted, batch-normalized return, augmented with an entropy bonus (to prevent collapse) and a soft KL penalty toward a frozen previous policy (for stability); returns are normalized per batch via sum-normalization or z-scoring (Trivedi et al., 26 Aug 2025). A simplified sketch of this objective appears below.
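The following NumPy sketch illustrates this batch-level weighting; the full ranking-based Plackett–Luce construction is reduced here to a single softmax over summed log-likelihoods, and coefficient names are illustrative assumptions:

```python
import numpy as np

def haepo_objective(logps, ref_logps, returns, ent_coef=0.01, kl_coef=0.1):
    """Toy HAEPO-style objective over a batch of trajectories (to be maximized)."""
    logps = np.asarray(logps, dtype=float)          # sum_t log pi(a_t|s_t), per trajectory
    ref_logps = np.asarray(ref_logps, dtype=float)  # same sums under the frozen reference
    returns = np.asarray(returns, dtype=float)

    # Softmax over whole-trajectory log-likelihoods -> per-trajectory weights.
    w = np.exp(logps - logps.max())
    w /= w.sum()

    # Batch normalization of returns (z-score variant).
    adv = (returns - returns.mean()) / (returns.std() + 1e-8)

    weighted_return = float((w * adv).sum())
    entropy = float(-(w * np.log(w + 1e-12)).sum())      # keeps weight mass spread out
    kl_to_ref = float((w * (logps - ref_logps)).sum())   # crude trust-region surrogate

    return weighted_return + ent_coef * entropy - kl_coef * kl_to_ref
```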
HAMLET (Vision-Language-Action Policies):
- At each timestep, represents state history using moment tokens (compact learnable embeddings) extracted from the visual input via a frozen backbone.
- Aggregates these with a causal Transformer memory module, forming a history-augmented feature used for action selection.
- Initial moment tokens are pre-trained with a time-contrastive loss to ensure temporal discrimination (Koo et al., 1 Oct 2025).
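A rough PyTorch sketch of such a moment-token memory module is shown below; dimensions, token counts, and layer sizes are illustrative assumptions rather than the paper's settings:

```python
import torch
import torch.nn as nn

class MomentMemory(nn.Module):
    """Learnable moment tokens summarize each timestep; a causal Transformer
    aggregates them into a history-augmented feature for the action head."""

    def __init__(self, feat_dim=512, n_moment_tokens=4, n_layers=2, n_heads=8):
        super().__init__()
        self.moment_tokens = nn.Parameter(torch.randn(n_moment_tokens, feat_dim))
        self.cross_attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        layer = nn.TransformerEncoderLayer(feat_dim, n_heads, batch_first=True)
        self.memory = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, frame_feats):
        # frame_feats: (batch, time, n_patches, feat_dim) from a frozen backbone.
        b, t, p, d = frame_feats.shape
        q = self.moment_tokens.expand(b * t, -1, -1)       # (b*t, m, d) queries
        kv = frame_feats.reshape(b * t, p, d)
        moments, _ = self.cross_attn(q, kv, kv)            # per-step compact summaries
        moments = moments.mean(dim=1).reshape(b, t, d)
        # Causal mask so each step attends only to its own past.
        mask = nn.Transformer.generate_square_subsequent_mask(t)
        history_feat = self.memory(moments, mask=mask)
        return history_feat[:, -1]                         # feature for action prediction
```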
HCPO (GUI Agent Optimization):
- Employs Dynamic Context Sampling (DCS), providing variable-length observation/action histories at each step, driven by a schedule that starts uniform and increasingly favors longer contexts.
- During policy updates, utilizes Anchor-guided History Compression (AHC): branches with full vs. action-only compressed histories are both optimized, and a KL alignment term ensures consistency and efficiency (Zhou et al., 1 Dec 2025).
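An illustrative DCS schedule is sketched below, assuming a simple linear shift from uniform sampling toward longer contexts; the paper's actual schedule and maximum history length may differ:

```python
import random

def sample_history_length(step: int, total_steps: int, max_len: int = 16) -> int:
    """Draw a history length; early training samples uniformly, late training
    increasingly favors longer contexts (linear interpolation is an assumption)."""
    progress = min(1.0, step / total_steps)
    lengths = list(range(1, max_len + 1))
    uniform = [1.0] * len(lengths)
    long_biased = [float(l) for l in lengths]   # weight grows with context length
    weights = [(1 - progress) * u + progress * b for u, b in zip(uniform, long_biased)]
    return random.choices(lengths, weights=weights, k=1)[0]

# Usage: truncate the observation/action history to the sampled length
# before building the policy input for this step.
hist_len = sample_history_length(step=5000, total_steps=20000)
```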
3. Loss Functions, Regularization, and Pseudocode
HAPO-style methods typically deviate from classical per-timestep or per-token RL objectives. Some notable design elements:
- Plackett–Luce softmax (HAEPO): Batch-level weights distribute credit across entire trajectories.
- Entropy regularization: HAEPO adds an entropy term over the trajectory weights, scaled by a tunable coefficient, to prevent all weight from concentrating on a single trajectory and to sustain exploration.
- KL trust-region penalty: A soft KL regularizer toward a frozen reference policy, with its own tunable coefficient, keeps updates stable (Trivedi et al., 26 Aug 2025).
- History state update (HAPO for LLMs): For each problem, upon generating a correct solution (one that matches the gold answer), the history state $h_i$ is updated if the new trace is even shorter. The length reward is shaped by the new trace length relative to $h_i$, with careful clipping to avoid over-penalizing short but incorrect attempts (Huang et al., 16 May 2025).
- Dual-branch and compressed-history alignment: HCPO's total objective combines GRPO-style policy losses on both the full-history and compressed-history branches with a KL alignment term (Zhou et al., 1 Dec 2025); a schematic version is sketched below.
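The dual-branch combination might look roughly as follows; the GRPO-style surrogate is simplified (no ratio clipping shown) and the loss weights are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def hcpo_style_loss(full_logits, comp_logits, actions, advantages,
                    comp_weight=1.0, align_weight=0.1):
    """Policy losses on full- and compressed-history branches plus a KL term
    that aligns the compressed branch to the full-history branch."""

    def policy_loss(logits):
        logp = F.log_softmax(logits, dim=-1)
        chosen = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
        return -(advantages * chosen).mean()   # simplified surrogate, no clipping

    align = F.kl_div(F.log_softmax(comp_logits, dim=-1),
                     F.log_softmax(full_logits, dim=-1),
                     log_target=True, reduction="batchmean")
    return (policy_loss(full_logits)
            + comp_weight * policy_loss(comp_logits)
            + align_weight * align)
```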
Pseudocode provided in each source paper (see (Trivedi et al., 26 Aug 2025) for HAEPO, (Huang et al., 16 May 2025) for HAPO, (Zhou et al., 1 Dec 2025) for HCPO) demonstrates the centrality of batch-level history statistics and joint optimization over trajectory-level history as opposed to standard on-policy or off-policy gradient methods.
4. Empirical Evaluation and Applications
HAPO frameworks have been empirically tested across a range of domains, each exploiting different facets of the history-aware paradigm:
- LLM reasoning compression: HAPO achieves 33–59% reduction in reasoning chain length with only 2–5% accuracy degradation relative to base LLMs on math problems spanning GSM8K, MATH500, and AIME2024 (Huang et al., 16 May 2025).
- RL benchmarks: In multi-armed bandit settings, HAEPO matches or exceeds PPO and DPO in regret minimization, sustaining higher entropy and lower variance as the number of arms grows; on 1D random walks, it attains near-optimal success rates faster than PPO and DPO while exhibiting reduced variance (Trivedi et al., 26 Aug 2025).
- LLM reward modeling: On TL;DR summarization (LLaMA, Qwen), HAEPO’s listwise, trajectory-level reward yields the best human preference scores and fastest training throughput compared to DPO and GRPO (Trivedi et al., 26 Aug 2025).
- Robotic manipulation: HAMLET, built on GR00T N1.5, lifts success rates by up to 47.2 percentage points on real-world long-horizon tasks compared to VLA baselines, with only marginal computational overhead (Koo et al., 1 Oct 2025).
- GUI navigation: HCPO enables a 3B-parameter agent to outperform much larger (7B) models on out-of-distribution GUI benchmarks (e.g., GUI-Odyssey, AndroidInTheWild) while reducing FLOPs by 60% (Zhou et al., 1 Dec 2025).
These results collectively suggest that HAPO-style architectures—by leveraging explicit history—excel in domains where long-horizon dependency, sparse rewards, or efficiency constraints are bottlenecks.
5. Limitations and Open Challenges
Despite encouraging results, multiple limitations are documented in the literature:
- Hyperparameter sensitivity: Regularization coefficients (entropy, KL), normalization schemes, and memory module sizes typically require nontrivial tuning between tasks (Trivedi et al., 26 Aug 2025, Zhou et al., 1 Dec 2025).
- Compute and memory overhead: Although techniques such as moment tokens or anchor pruning reduce incremental cost, storing, maintaining, and reprocessing batch- or task-level history can be expensive, especially for hierarchical or continual learning extensions (Trivedi et al., 26 Aug 2025, Koo et al., 1 Oct 2025).
- Per-problem history and generalization: HAPO variants with per-query history (e.g., HAPO for LLMs) maintain state for each training example, which can limit sample efficiency when generalization across problems is desired (Huang et al., 16 May 2025).
- Scale and multi-agent compatibility: Demonstrations are currently limited to tasks with modest horizon lengths or to single-agent scenarios. Large-scale public LLM benchmarks and multi-agent settings remain untested, and generalization to cross-episode or lifelong learning is an open problem (Trivedi et al., 26 Aug 2025).
- History encoding compression: Optimal strategies for history compression, anchoring, or gating (e.g., moment token number, Transformer memory depth) remain empirical and domain-specific (Koo et al., 1 Oct 2025).
6. Future Directions and Potential Extensions
Several directions are identified for extending HAPO:
- Automated regularization schedules: Meta-gradient or annealing approaches for auto-tuning entropy and KL penalties, potentially improving stability and transfer across domains (Trivedi et al., 26 Aug 2025).
- Hierarchical and continual memory: Hierarchical versions of HAEPO are proposed for ultra-long-horizon or continual settings, with truncated backpropagation or memory buffers (Trivedi et al., 26 Aug 2025, Koo et al., 1 Oct 2025).
- Adaptive memory allocation: Learning to represent and select only the most temporally informative features (e.g., via gating or attention) for history encoding (Koo et al., 1 Oct 2025).
- Integration with other RL paradigms: Joint application of history-aware weighting and on-policy RL objectives (e.g., PPO, offline RL) or expanded to joint trajectory ranking in multi-agent contexts (Trivedi et al., 26 Aug 2025).
- Broader benchmarks: Future work will likely apply HAPO to larger LLM alignment corpora (AlpacaEval, MT-Bench), real-world robotics, and lifelong reasoning benchmarks as computational resources allow (Trivedi et al., 26 Aug 2025, Huang et al., 16 May 2025).
- Richer history representations: For problem-specific history, tracking more than a scalar (e.g., distributional length statistics, histograms) could potentially drive even more efficient behavioral compression (Huang et al., 16 May 2025).
7. Related Frameworks and Distinctions
HAPO subsumes and extends several earlier approaches:
- Trajectory-level and listwise RL: Unlike pairwise (DPO) or per-timestep (GRPO), HAEPO and similar listwise-weighted frameworks use trajectory-level normalization and ranking to amplify high-quality, diverse trajectories (Trivedi et al., 26 Aug 2025).
- Contrastive and memory-based architectures: Models like HAMLET demonstrate that temporally-aware contrastive pretraining and lightweight memory transformers efficiently transmit historical perceptual cues, achieving substantial improvements even with frozen vision-language backbones (Koo et al., 1 Oct 2025).
- History-enhanced policy optimization (HCPO): For domains where context size must be dynamically tuned for each decision (e.g., GUI agents), HCPO provides evidence that dual-branch, anchor-guided compression can trade off context fidelity and efficiency, outperforming fixed-window or fully compressed baselines (Zhou et al., 1 Dec 2025).
In sum, History-Aware Policy Optimization provides a rigorously tested, flexible, and increasingly adopted paradigm for addressing the challenges of sparse-reward, long-horizon, and history-dependent environments in RL, LLM alignment, and sequential decision-making (Huang et al., 16 May 2025, Trivedi et al., 26 Aug 2025, Koo et al., 1 Oct 2025, Zhou et al., 1 Dec 2025).