
Heuristic Transformer (HT) Overview

Updated 20 November 2025
  • Heuristic Transformer (HT) is a framework that combines Transformer architectures with explicit belief or heuristic augmentation to address sequential decision-making and combinatorial planning challenges.
  • The RL instantiation uses a two-phase training process: a VAE infers a belief over rewards, and a causal transformer policy predicts actions conditioned on that belief, enabling sample-efficient in-context learning.
  • Empirical results show HT's benefits in sample efficiency and return on reinforcement learning tasks, and in reduced node expansions and planning time for risk-aware path planning.

The term Heuristic Transformer (HT) denotes two distinct, independently developed models that use Transformer architectures to inject inductive biases into sequential decision-making and combinatorial planning. One instantiation arises in belief-augmented in-context reinforcement learning, addressing rapid adaptation to new Markov Decision Processes. The other, developed for advanced air mobility planning, uses transformer-learned heuristics to accelerate risk-aware path planning in time-critical domains. Both leverage the Transformer's ability to encode complex structure and generalize efficiently, but they differ fundamentally in architecture, algorithmic integration, and application domain.

1. Belief-Augmented In-Context RL: Heuristic Transformer (HT)

Heuristic Transformer in reinforcement learning recasts In-Context Reinforcement Learning (ICRL) as a form of supervised learning by leveraging transformer sequence models, further augmenting them via explicit belief inference over reward functions. Let $T_{\mathrm{pre}}$ denote a distribution over MDP tasks $\tau = (S, A, T, R, H, \omega, \gamma)$. Each task provides an in-context dataset

$$D = \{(s_j, a_j, r_j, s'_j)\}_{j=1}^{n} \sim D_{\mathrm{pre}}(\cdot;\tau)$$

and the goal is to predict the optimal action in a query state $s_{\mathrm{query}}$. Standard ICRL minimizes

$$\min_{\theta} \; \mathbb{E}_{\tau \sim T_{\mathrm{pre}},\, D \sim D_{\mathrm{pre}}(\tau),\, s_{\mathrm{query}} \sim \omega} \Big[ -\log \pi_\theta(a^* \mid D, s_{\mathrm{query}}) \Big],$$

with $a^* \sim \pi^*_\tau(\cdot \mid s_{\mathrm{query}})$. The Heuristic Transformer enhances this paradigm by inferring a belief $b$ over the unknown reward $R$ and prompting the transformer policy on $(D, b, s_{\mathrm{query}})$, yielding the heuristic Bayesian policy $\pi_\theta(a \mid D, b, s_{\mathrm{query}})$ (Dippel et al., 13 Nov 2025).
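Spelled out (a direct restatement of the objective above with the sampled belief added as conditioning, not a formula quoted from the paper, where $q_\phi$ denotes the belief encoder introduced in Section 2), the belief-augmented training objective reads

$$\min_{\theta} \; \mathbb{E}_{\tau \sim T_{\mathrm{pre}},\, D \sim D_{\mathrm{pre}}(\tau),\, s_{\mathrm{query}} \sim \omega,\, b \sim q_\phi(\cdot \mid D)} \Big[ -\log \pi_\theta(a^* \mid D, b, s_{\mathrm{query}}) \Big].$$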

2. Architectural Design and Inference

HT operates in two distinct phases:

Phase 1: Belief Inference via VAE

A variational auto-encoder (VAE) infers a low-dimensional latent variable $m \in \mathbb{R}^d$ representing the posterior over rewards:

  • Encoder: $q_\phi(m \mid \eta_{:h})$ with $\eta_{:h} = \{(s_j, a_j, r_j, s'_j)\}_{j=1}^{h}$
  • Decoder: $p_\Phi(r_{:h} \mid m, \eta_{:h-1}) = \prod_{j=1}^{h} p_\Phi(r_j \mid s_{j-1}, a_{j-1}, s'_{j-1}, m)$

The evidence lower bound (ELBO) for each $h$ is

$$\begin{aligned} \log p_\Phi(r_{:h} \mid \eta_{:h-1}) &\geq \mathbb{E}_{q_\phi(m \mid \eta_{:h})}\big[\log p_\Phi(r_{:h} \mid m, \eta_{:h-1})\big] \\ &\qquad - \mathrm{KL}\big(q_\phi(m \mid \eta_{:h}) \,\|\, p_\Phi(m)\big) =: \mathrm{ELBO}_h. \end{aligned}$$
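A minimal PyTorch-style sketch of this belief module is given below; the permutation-invariant mean-pooling encoder, Gaussian reward decoder, and layer sizes are illustrative assumptions rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class RewardBeliefVAE(nn.Module):
    """Sketch of HT's Phase-1 belief module: encode a context of transitions
    into a latent belief m over the reward function, then reconstruct rewards."""

    def __init__(self, state_dim, action_dim, latent_dim=8, hidden=128):
        super().__init__()
        trans_dim = 2 * state_dim + action_dim + 1  # one (s, a, r, s') tuple
        # Permutation-invariant encoder: embed each transition, then mean-pool.
        self.embed = nn.Sequential(nn.Linear(trans_dim, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        # Decoder: predicts each transition's reward from (s, a, s') and m.
        self.decoder = nn.Sequential(
            nn.Linear(2 * state_dim + action_dim + latent_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def encode(self, s, a, r, s_next):
        # s, a, s_next: (batch, h, dim); r: (batch, h, 1)
        ctx = torch.cat([s, a, r, s_next], dim=-1)
        pooled = self.embed(ctx).mean(dim=1)          # aggregate the context
        return self.to_mu(pooled), self.to_logvar(pooled)

    def elbo(self, s, a, r, s_next):
        mu, logvar = self.encode(s, a, r, s_next)
        std = torch.exp(0.5 * logvar)
        m = mu + std * torch.randn_like(std)          # reparameterisation trick
        m_rep = m.unsqueeze(1).expand(-1, s.shape[1], -1)
        r_hat = self.decoder(torch.cat([s, a, s_next, m_rep], dim=-1))
        recon = -((r_hat - r) ** 2).sum(dim=(1, 2))   # Gaussian log-likelihood (up to a constant)
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1)
        return (recon - kl).mean()                    # maximise; negate to obtain a loss
```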

Phase 2: Transformer Policy

A causal GPT-style transformer $M_\theta$ is prompted with $[s_{\mathrm{query}}, m, (s_1, a_1, r_1, s'_1), \ldots, (s_n, a_n, r_n, s'_n)]$, where the sampled latent $m$ serves as the belief $b$; positional embeddings are omitted to reflect the unordered context. The model outputs the action distribution for $s_{\mathrm{query}}$. At inference, a sample $m \sim q_\phi(m \mid D)$ is used as the belief input.
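A simplified sketch of how such a prompt could be assembled and passed through a causal backbone follows; the projection layers, model sizes, and the choice to decode the action logits from the final token are assumptions of this sketch, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class HTPolicy(nn.Module):
    """Sketch of HT's Phase-2 policy: a causal transformer prompted with
    [s_query, belief m, context transitions] and no positional embeddings."""

    def __init__(self, state_dim, action_dim, latent_dim, d_model=128, n_heads=4, n_layers=4):
        super().__init__()
        trans_dim = 2 * state_dim + action_dim + 1           # (s, a, r, s') per context token
        self.query_proj = nn.Linear(state_dim, d_model)
        self.belief_proj = nn.Linear(latent_dim, d_model)
        self.trans_proj = nn.Linear(trans_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, action_dim)    # logits over discrete actions

    def forward(self, s_query, m, context):
        # context: (batch, n, 2*state_dim + action_dim + 1); no positional embeddings are added.
        tokens = torch.cat(
            [
                self.query_proj(s_query).unsqueeze(1),        # query-state token
                self.belief_proj(m).unsqueeze(1),             # belief token b
                self.trans_proj(context),                     # unordered context tokens
            ],
            dim=1,
        )
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1]).to(tokens.device)
        h = self.backbone(tokens, mask=mask)
        # Read the action logits at the final position, which attends to the
        # whole prompt under the causal mask (a design choice of this sketch).
        return self.action_head(h[:, -1])
```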

3. Training Regimen

The joint training algorithm consists of:

  • Belief VAE Stage: Parameters $(\phi, \Phi)$ are trained via gradient descent to maximize the task-averaged sum of ELBOs, with KL regularization.
  • Transformer Policy Stage: The policy parameters $\theta$ are trained by cross-entropy on the likelihood of the optimal action, using the AdamW optimizer with weight decay. No further parameter updates are performed at test time; adaptation occurs entirely through the updated context and belief $b$.

Pseudocode:

Both phases are explicitly specified in algorithmic form, ensuring reproducibility and rigour in implementation (Dippel et al., 13 Nov 2025).
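The outline below compresses both stages into a single training routine built on the sketches above; the sampler callbacks (`sample_task`, `sample_context`, `sample_query`), step counts, and learning rates are placeholders, not the paper's algorithmic interface:

```python
import torch
import torch.nn.functional as F

def train_ht(belief_vae, policy, sample_task, sample_context, sample_query,
             vae_steps=50_000, policy_steps=100_000, lr=3e-4):
    """Compressed outline of HT's two-phase training (illustrative only)."""
    vae_opt = torch.optim.Adam(belief_vae.parameters(), lr=lr)
    pol_opt = torch.optim.AdamW(policy.parameters(), lr=lr, weight_decay=1e-4)

    # Phase 1: fit the belief VAE by maximising the task-averaged ELBO.
    for _ in range(vae_steps):
        s, a, r, s_next = sample_context(sample_task())      # D ~ D_pre(.; tau)
        loss = -belief_vae.elbo(s, a, r, s_next)
        vae_opt.zero_grad(); loss.backward(); vae_opt.step()

    # Phase 2: train the policy by cross-entropy on the optimal action,
    # conditioning on a belief sampled from the frozen VAE.
    for _ in range(policy_steps):
        tau = sample_task()
        s, a, r, s_next = sample_context(tau)
        s_query, a_star = sample_query(tau)                  # a* ~ pi*_tau(. | s_query)
        with torch.no_grad():
            mu, logvar = belief_vae.encode(s, a, r, s_next)
            m = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # b ~ q_phi(m | D)
        context = torch.cat([s, a, r, s_next], dim=-1)
        logits = policy(s_query, m, context)
        loss = F.cross_entropy(logits, a_star)               # discrete-action case; a_star are class indices
        pol_opt.zero_grad(); loss.backward(); pol_opt.step()
```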

4. Experimental Results and Empirical Analysis

HT is evaluated across three domains: discrete Darkroom, visual Miniworld, and continuous-control MuJoCo. Representative MuJoCo returns (mean ± standard deviation) are:

Algorithm   Hopper           Walker2d          HalfCheetah      Swimmer
PPO         1710.8 ± 523.9   2267.6 ± 1020.8   1646.7 ± 108.1   119.1 ± 2.2
SAC         1839.3 ± 164.8   5252.4 ± 51.5     2328.1 ± 11.9    143.5 ± 4.9
DPT-SP      1620.1 ± 313.9   3099.3 ± 432.7    1878.8 ± 61.0    123.5 ± 5.3
HT-P        1494.4 ± 431.9   1900.4 ± 890.6    1524.1 ± 124.7   115.2 ± 3.9
HT-S        1541.3 ± 384.2   3020.5 ± 408.8    1905.4 ± 56.7    121.9 ± 3.7
HT-SP       1711.5 ± 317.1   3565.2 ± 433.2    1968.3 ± 60.5    133.0 ± 3.9

Empirical evidence demonstrates that HT consistently surpasses Decision Pre-trained Transformers (DPT) and Generalist Function Transformers (GFT) in return, sample efficiency, and robustness to stochasticity across discrete (Darkroom), visual (Miniworld), and continuous (MuJoCo) tasks.

Ablation studies show that the two-phase VAE+policy structure is essential; a single-phase "HT(MO)" fails to achieve similar returns. On multi-armed bandits, HT achieves regret competitive with UCB/Thompson Sampling, outperforming DPT despite lacking explicit exploration modules (Dippel et al., 13 Nov 2025).

5. Theoretical Properties and Limitations

The HT policy approximates the Bayesian RL (BAMDP) setting by inferring a belief $b$ as a low-dimensional sufficient statistic over reward uncertainty, but only over $R$, not over the transition dynamics $T$. Therefore, HT is a heuristic—rather than Bayes-optimal—Bayesian policy. This belief augmentation accelerates generalization to new tasks by aligning the transformer's attention with reward structure, surpassing context-only models.
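For context (standard BAMDP background rather than a result from the paper), the Bayes-optimal policy in a BAMDP conditions on a posterior over both dynamics and reward, whereas HT's belief covers the reward only:

$$\pi^{\mathrm{Bayes}}(a \mid s, b), \quad b = p(T, R \mid \eta_{:h}), \qquad \text{versus HT's} \quad \pi_\theta(a \mid D, b_R, s_{\mathrm{query}}), \quad b_R \approx p(R \mid D).$$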

Constraints include reliance on optimal-action demonstrations at pretraining and a requirement for sufficient offline experience. The lack of transition-uncertainty inference limits theoretical guarantees. Addressing these limitations will be important for extending HT toward fully Bayesian decision-making in future research (Dippel et al., 13 Nov 2025).

6. Heuristic Transformers in Risk-Aware Path Planning

Independent of the RL setting, "Heuristic Transformer" also refers to Transformer-based heuristic function learners for constrained shortest path (CSP) planning under safety constraints (Xiang et al., 21 Nov 2024). Here, the problem is formulated on a grid graph $G = (N, E)$ where each node $u$ carries a survival probability $S(u)$, and movement is subject to both distance minimization and the cumulative survival probability exceeding $\epsilon$:

$$\min_{z} \sum_{(u,v)\in E} c_{uv}\, z_{uv} \quad \text{subject to flow conservation, safety, and subtour constraints.}$$

HT is integrated into ASD A* search as a learned heuristic $\mathcal{H}$, computed via transformer models (Riskmap2.0 or Riskmap-State) that encode risk maps, start/goal indices, and node-specific information. The model outputs either a per-node classified heuristic (Riskmap2.0) or a real-valued estimate (Riskmap-State), both trained on supervised datasets labeled by exact CSP solutions.
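As a rough illustration of how a learned heuristic slots into a risk-constrained A* search, the sketch below implements a generic label-based variant; it is not the paper's ASD A* procedure, and `heuristic`, `neighbors`, `cost`, and `survival` are placeholder callbacks (the heuristic callback would wrap a Riskmap model in practice):

```python
import heapq

def risk_aware_astar(neighbors, cost, survival, start, goal, eps, heuristic):
    """Generic A* with a cumulative survival constraint and a (possibly
    learned) heuristic callback; a sketch, not the paper's ASD A*."""
    # Each label is (f, g, node, surv, path); surv is the product of node
    # survival probabilities along the path and must stay above eps.
    open_list = [(heuristic(start, goal), 0.0, start, survival(start), [start])]
    labels = {}  # node -> list of non-dominated (g, surv) labels seen so far

    def dominated(node, g, surv):
        return any(g2 <= g and s2 >= surv for g2, s2 in labels.get(node, []))

    while open_list:
        f, g, node, surv, path = heapq.heappop(open_list)
        if node == goal:
            return path, g                       # first goal pop if the heuristic is admissible
        if dominated(node, g, surv):
            continue
        labels.setdefault(node, []).append((g, surv))
        for nxt in neighbors(node):
            nsurv = surv * survival(nxt)
            if nsurv < eps:                      # violates the safety constraint
                continue
            ng = g + cost(node, nxt)
            if not dominated(nxt, ng, nsurv):
                heapq.heappush(
                    open_list,
                    (ng + heuristic(nxt, goal), ng, nxt, nsurv, path + [nxt]),
                )
    return None, float("inf")                    # no feasible path
```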

These models achieve significant reductions in nodes expanded (up to 39.5%) and planning time (up to 24%) with minimal losses in success-weighted path length (SPL), even on large and realistic maps. Strict admissibility is not enforced but is empirically observed (Xiang et al., 21 Nov 2024).

7. Comparative Impact and Outlook

Whether applied to RL or to CSP planning, the Heuristic Transformer framework demonstrates the practical benefit of combining learned inductive structure with strong domain and architectural priors. In RL, explicit reward-belief modeling drives efficient and robust adaptation. In CSP planning, transformer-learned heuristics enable real-time deployment on safety-critical UAV tasks by accelerating classical search without compromising constraint satisfaction.

A plausible implication is that explicit belief or heuristic augmentation—rather than end-to-end RL or planning—will remain a critical component for sample efficiency and generalization in high-stakes, partially-observed, or combinatorially complex environments. Ongoing research directions include extending belief modeling to transition uncertainty for RL, enforcing explicit admissibility for planning heuristics, and relaxing supervised data assumptions (Dippel et al., 13 Nov 2025, Xiang et al., 21 Nov 2024).
