
History Context-aware Policy Optimization

Updated 8 December 2025
  • History Context-aware Policy Optimization (HCPO) is a framework that leverages historical trajectory data to inform decision-making in non-Markovian environments.
  • It employs techniques such as full-trajectory aggregation, adaptive curriculum, and history compression to enhance training stability and long-horizon credit assignment.
  • HCPO has demonstrated significant empirical gains in robotics, language model alignment, and GUI agents by effectively utilizing past context.

History Context-aware Policy Optimization (HCPO) refers to a suite of techniques for reinforcement learning (RL) and sequential decision making where policies explicitly account for, leverage, or optimize over historical information beyond the current state. These approaches generalize classical Markovian policy optimization by integrating history-dependent context, aggregation, or curriculum, thereby addressing non-Markov environments, long-horizon dependencies, and tasks with complex history-conditioned targets. HCPO methods have been instantiated in RL for control, recommendation, vision-language-action models, LLM alignment, GUI agents, and mathematical reasoning.

1. Theoretical Foundations of History Context-aware Policy Optimization

HCPO formally departs from the Markov Decision Process (MDP) setting by treating the agent’s state as augmented with latent or observable history-dependent context statistics. The Dynamic Contextual MDP (DCMDP) framework explicitly represents context evolution as non-Markov functions of historical state-action pairs. Let a trajectory history at timestep $h$ be

\textrm{hist}_h = (s_1, a_1, x_1, \dots, s_{h-1}, a_{h-1}, x_{h-1})

with context transition

x_h \sim P(x_h \mid \textrm{hist}_h)

and subsequent reward and dynamics $r(s_h, a_h, x_h)$, $s_{h+1} \sim P(s_{h+1} \mid s_h, a_h, x_h)$. The agent’s policy $\pi(s_h, \textrm{hist}_h)$ depends on the full history. The logistic DCMDP subclass leverages a compressed feature statistic $\phi(\textrm{hist}_h; \theta)$ (e.g., a discounted sum of feature maps), with softmax context transitions. Model-based planning or RL, such as the DCZero algorithm, is then performed over the augmented state $(s, \phi)$, supporting non-Markovian long-horizon credit assignment and improved regret under history-dependent dynamics (Tennenholtz et al., 2023).
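
As an illustration of this logistic DCMDP machinery, the following sketch maintains $\phi$ as a discounted sum of per-step feature maps and uses it to parameterize a softmax context transition. The feature map, weight matrix, and discount value are hypothetical placeholders for exposition, not quantities from the paper.

```python
import numpy as np

# Sketch of a logistic-DCMDP-style history statistic: phi is a discounted sum of
# per-step feature maps, and the next context x_h is drawn from a softmax whose
# logits depend on phi. Feature map and scoring weights are illustrative only.

def feature_map(state, action, context, dim=8):
    """Hypothetical per-step feature map psi(s, a, x) in R^dim."""
    rng = np.random.default_rng(hash((state, action, context)) % (2**32))
    return rng.standard_normal(dim)

def update_phi(phi, state, action, context, gamma=0.9):
    """Discounted aggregation: phi_h = gamma * phi_{h-1} + psi(s, a, x)."""
    return gamma * phi + feature_map(state, action, context)

def context_transition_probs(phi, W):
    """Softmax context transition P(x_h | hist_h) proportional to exp(W @ phi)."""
    logits = W @ phi
    logits -= logits.max()            # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Toy rollout over the augmented state (s, phi).
dim, n_contexts = 8, 4
W = np.random.default_rng(0).standard_normal((n_contexts, dim))
phi = np.zeros(dim)
for h, (s, a, x) in enumerate([(0, 1, 2), (1, 0, 3), (2, 2, 1)], start=1):
    phi = update_phi(phi, s, a, x)
    print(f"step {h}: P(x | hist) =", np.round(context_transition_probs(phi, W), 3))
```

Planning algorithms such as DCZero then treat $(s, \phi)$ as the effective state, so standard model-based search applies despite the underlying non-Markovian context dynamics.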

This theoretical structure underpins practical HCPO algorithms across diverse domains. The benefit of HCPO arises in environments where the true task dynamics, utility, or decision optimality depend on a nontrivial function of the trajectory, not solely the instantaneous state. Typical examples include delayed outcomes, evolving user preferences in recommendations, or action policies where historical cues are necessary for disambiguation or alignment.

2. Algorithmic Implementations and Loss Functions

HCPO methods instantiate various algorithmic mechanisms to incorporate historical context:

  • Full-Trajectory Aggregation and Listwise Weighting: HAEPO (History-Aggregated Exploratory Policy Optimization) compresses an entire RL trajectory $\tau$ into a scalar

L(\tau; \theta) = \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t)

and constructs a Plackett–Luce softmax weighting

w_k(\theta) = \frac{\exp(\eta L(\tau_k; \theta))}{\sum_j \exp(\eta L(\tau_j; \theta))}

over batch trajectories, yielding a full-trajectory, exploration-promoting learning signal. The loss

L(\theta) = -\sum_{k=1}^{M} w_k \tilde{R}_k + \beta \sum_k w_k \log w_k + \lambda \sum_k w_k \left( \log w_k - \log w_k^{\mathrm{ref}} \right)

combines this listwise reward with entropy and KL regularization against a reference policy, allowing robust, low-variance optimization across long horizons (Trivedi et al., 26 Aug 2025); a runnable sketch of this objective appears after this list.

  • History-Scoping and Adaptive Curriculum: HCPO can sample history windows of varying length during RL training, as in Dynamic Context Sampling (DCS) for GUI agents. Here, the history length $\tau_i$ is drawn from a parameterized schedule $P(\tau_i \mid u)$, where $u$ is the training step, enabling the agent to learn when to use short versus long context and progressively bias toward full history as training matures (Zhou et al., 1 Dec 2025); an illustrative schedule sketch appears after this list.
  • Action-Anchored History Compression: To reduce computational cost, models such as HiconAgent use a dual-branch architecture: an uncompressed branch with full historical visual/auditory observations, and a compressed branch that retains only action history as "anchors" after an early fusion layer. These are aligned using a KL-divergence loss, and at inference, only the efficient compressed branch is executed (Zhou et al., 1 Dec 2025).
  • Length-Progressive Multi-Stage RL: MiroMind-M1’s context-aware multi-stage policy optimization (also referred to as CAMPO) introduces a staged rollout curriculum (with gradually increasing max-allowed rollout lengths) and an adaptive repetition penalty to discourage redundant reasoning in mathematical LLM alignment. The objective includes both group-normalized rewards and repetition penalties per rollout, improving token efficiency and training stability (Li et al., 19 Jul 2025).
  • History-Aware Auxiliary Rewards: In sequence modeling (e.g., HAPO), a history variable records the minimum length of correct outputs for a given query. The reward structure incentivizes new outputs that are both correct and more concise than the best-so-far, employing a clipped cosine function to smoothly reward improvements while avoiding severe penalties for exploratory shorter but incorrect generations (Huang et al., 16 May 2025).
  • Model-Based Planning with History-Dependent Latents: DCZero adapts MuZero for DCMDP scenarios, planning in a latent state consisting of the current state and compressed (ensemble-regularized) history statistic, supporting optimism under uncertainty and non-Markovian context transitions (Tennenholtz et al., 2023).
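
The HAEPO objective above maps directly onto a few lines of differentiable code. The PyTorch sketch below computes the Plackett–Luce weights from per-trajectory log-likelihood sums and assembles the reward, entropy, and KL-to-reference terms; tensor shapes, hyper-parameter values, and the use of reference-policy log-likelihoods to form $w^{\mathrm{ref}}$ are illustrative assumptions rather than the authors' implementation.

```python
import torch

def haepo_loss(traj_logps, rewards, ref_logps, eta=1.0, beta=0.01, lam=0.1):
    """Listwise HAEPO-style loss over a batch of M trajectories.

    traj_logps: (M,) tensor, L(tau_k; theta) = sum_t log pi_theta(a_t | s_t)
    rewards:    (M,) tensor of trajectory returns R~_k
    ref_logps:  (M,) tensor of trajectory log-likelihoods under the reference policy
    """
    w = torch.softmax(eta * traj_logps, dim=0)                # Plackett-Luce weights w_k
    w_ref = torch.softmax(eta * ref_logps, dim=0).detach()    # reference weights (assumed form)
    eps = 1e-12                                               # avoid log(0)

    reward_term = -(w * rewards).sum()                        # listwise reward term
    entropy_term = beta * (w * torch.log(w + eps)).sum()      # entropy regularizer
    kl_term = lam * (w * (torch.log(w + eps) - torch.log(w_ref + eps))).sum()
    return reward_term + entropy_term + kl_term

# Toy usage: four trajectories with summed log-probs and scalar returns.
traj_logps = torch.tensor([-12.3, -10.1, -15.7, -11.0], requires_grad=True)
rewards = torch.tensor([1.0, 0.2, -0.5, 0.8])
ref_logps = torch.tensor([-12.0, -10.5, -15.0, -11.2])
haepo_loss(traj_logps, rewards, ref_logps).backward()
```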
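
For Dynamic Context Sampling, the schedule $P(\tau_i \mid u)$ is characterized above only as progressively biasing toward full history; the sketch below assumes a simple linear ramp in the probability of sampling the full window, an illustrative choice that may differ from the paper's parameterization.

```python
import random

def sample_history_length(step, total_steps, full_len, min_len=1):
    """Draw a history window length tau_i from a step-dependent schedule P(tau_i | u).

    Assumed schedule: with probability step / total_steps the full history is used;
    otherwise a window length is drawn uniformly from [min_len, full_len].
    """
    p_full = min(1.0, step / max(1, total_steps))
    if random.random() < p_full:
        return full_len                        # bias toward full history late in training
    return random.randint(min_len, full_len)   # shorter windows dominate early on

# Toy usage: the average sampled window length grows as training matures.
for step in (0, 5_000, 9_500):
    lengths = [sample_history_length(step, 10_000, full_len=16) for _ in range(1_000)]
    print(step, round(sum(lengths) / len(lengths), 1))
```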

3. Empirical Performance and Application Domains

HCPO has demonstrated state-of-the-art or competitive empirical results across a broad set of domains:

  • Robotic Manipulation & Vision-Language-Action (VLA) Models: HAMLET wraps pretrained VLAs with a moment token and memory module, enabling history-conditioned action prediction with minimal overhead. On long-horizon manipulation tasks, HAMLET raises average success by up to 47.2 points over vanilla VLA baselines. In RoboCasa Kitchen and LIBERO benchmarks, history context enables further gains (66.4% and 97.7% success, respectively) (Koo et al., 1 Oct 2025).
  • LLM Conciseness and Reasoning: HAPO achieves 33–59% reductions in average answer length with only 2–5 percentage points accuracy drop on math benchmarks (GSM8K, MATH500, AIME2024), outperforming universal and in-batch length-constraining baselines (Huang et al., 16 May 2025). MiroMind-M1, using context-aware multi-stage RL, achieves superior performance (AIME24=77.5) and better token efficiency relative to Skywork-OR1-32B and DeepSeek baselines (Li et al., 19 Jul 2025).
  • GUI Sequential Agents: HiconAgent, using both DCS and AHC, attains a gain of up to +8.46 percentage points in grounding accuracy and a 2.47× speedup over larger baseline models on GUI-Odyssey, achieving a significant FLOPs reduction (–60%) without compromising core success rates (Zhou et al., 1 Dec 2025).
  • Recommendation Systems: DCZero, optimizing over latent context statistics in DCMDPs, consistently surpasses both Markovian and transformer-history baselines on the MovieLens user recommendation environment, especially as the intensity of history-dependence increases (Tennenholtz et al., 2023).

The following table summarizes representative empirical results:

| Domain | HCPO Instantiation | Notable Metric | Relative Gain |
|---|---|---|---|
| Robotics/VLA | HAMLET | Real-world history-dependent tasks, avg. success | 76.4% (+47.2 pts vs. baseline) |
| LLM Math RL | HAPO | Token reduction (DeepSeek-R1-1.5B, Pass@1 accuracy) | –49% tokens, –1.8 pp accuracy |
| LLM Math RL | MiroMind-M1 CAMPO | AIME24 benchmark (32B model) | 77.5 (+6.7 pp vs. baseline) |
| GUI Agents | HiconAgent (DCS+AHC) | GUI-Odyssey, grounding accuracy, FLOPs reduction | +8.46 pp, –60% FLOPs |
| Recommendation | DCZero | MovieLens, cumulative reward/long-term performance | Best as $\alpha$ increases |

4. Architectural and Implementation Considerations

HCPO algorithm selection and design involve multiple architectural and hyper-parameter trade-offs specific to the target application:

  • History Representation: Choices range from full-trajectory log-likelihoods, token-level moment embeddings, scalar summary statistics, or fully latent feature aggregation (e.g., DCMDP feature maps).
  • Policy Update Mechanism: Listwise Plackett–Luce weighting (HAEPO), group-relative normalization (GRPO/CAMPO/HiconAgent), curriculum learning (multi-stage or history-length progressive), and auxiliary reward design (HAPO) all offer unique inductive biases. Selection of entropy/regularization terms (β, λ, KL penalties) is critical for stability; a minimal sketch of group-relative normalization appears after this list.
  • History Usage Efficiency: Dual-branch architectures and anchor-based compression (as in GUI agents) allow efficient scaling by avoiding quadratic attention or redundant processing of historical context at inference.
  • Hyper-parameter Sensitivity and Curriculum Scheduling: Performance and stability are typically sensitive to entropy and consistency weights (β, λ), softmax sharpness (η), rollout length schedules, repetition penalties, and batch size. For LLM alignment, multi-stage curricula enable smoother optimization and faster throughput.
  • Transfer and Generalizability: In robotic and vision-language domains, lightweight and modular history-aware memory can generalize across task distributions with minimal retraining, suggesting strong potential for plug-and-play history conditioning in pretrained architectures (Koo et al., 1 Oct 2025).
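
As a concrete reference point for the group-relative normalization mentioned in the list above, the sketch below standardizes rewards within a group of rollouts generated for the same prompt or state, i.e., the generic GRPO-style computation; it is not the specific CAMPO or HiconAgent variant, which add further terms such as repetition penalties.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize scalar rewards within one rollout group.

    rewards: (G,) tensor of rewards for G rollouts of the same prompt/state.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy usage: six rollouts for one prompt; above-average rollouts get positive advantage.
print(group_relative_advantages(torch.tensor([1.0, 0.0, 0.5, 1.0, 0.0, 0.0])))
```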

5. Limitations, Open Challenges, and Future Directions

While HCPO frameworks yield substantial improvements in history-sensitive or non-Markovian tasks, several open limitations persist:

  • Scalability to Ultra-Long Horizons: Existing empirical validations for listwise/aggregate HCPO are primarily in horizons up to $10^3$ steps; extension to multi-thousand-step or multi-agent coordination remains unproven (Trivedi et al., 26 Aug 2025).
  • History Statistic Expressivity: Current methods often use compressed, scalar, or fixed embedding statistics; richer, learned, or probabilistic representations (e.g., nonparametric or Bayesian latent histories) remain underexplored.
  • Reward and Advantage Estimation: Noisy or imperfect verification can destabilize optimization; higher-fidelity reward models or auxiliary calibrations can improve outcomes (as shown for verifier upgrades in MiroMind-M1 (Li et al., 19 Jul 2025)).
  • Generalization Beyond Domain-Specific Benchmarks: While HCPO’s benefits are clear in math, GUI, recommendation, and RL control, the transferability to more ambiguous, real-world, multi-modal, or robust human alignment tasks is an active research frontier.
  • Unified Theoretical Guarantees: Regret bounds and convergence rates are largely limited to structured cases (e.g., logistic DCMDPs (Tennenholtz et al., 2023)). Further development is required for deep RL instantiations with function approximation and adaptive curriculum.

Potential directions include the development of hybrid memory architectures, data-driven or learned episode-specific history compression, richer forms of history-tracking (distributional, percentile, or composite), and deployment for large-scale multi-task RL, sequential program synthesis, or model-based RL in highly non-Markov environments.

6. Historical Trajectory and Synoptic View

The emergence of HCPO reflects the growing recognition that many real-world sequential decision processes—ranging from personalized recommendations, robotics, alignment of generative models, to human-computer interfaces—require explicit modeling and optimization over historical dependencies. Early approaches focused on richer context representation (e.g., DCMDP theory (Tennenholtz et al., 2023)); subsequent developments introduced efficient trajectory aggregation, adaptive curriculum, and explicit history rewards (HAEPO (Trivedi et al., 26 Aug 2025), CAMPO (Li et al., 19 Jul 2025), HAMLET (Koo et al., 1 Oct 2025), HAPO (Huang et al., 16 May 2025)). Recent work in GUI navigation and efficient inference, such as HiconAgent (Zhou et al., 1 Dec 2025), underscores the field’s movement toward practical, scalable, and cost-efficient history-conditioning mechanisms suitable for real-world agents operating in non-Markovian, high-dimensional, or resource-constrained settings.

History context-aware policy optimization thus stands as a central methodological pillar for the next generation of robust, general-purpose RL and sequential modeling systems, equipped to learn from, exploit, and reason about the temporal structures intrinsic to their environments.
