Decision Pre-Trained Transformer (DPT)

Updated 4 July 2026

DPT is a family of transformer-based decision models that pretrain on large datasets to predict optimal actions via in-context sequence modeling.
It employs supervised pretraining and parameter-efficient adaptations, such as prompt tuning or LoRA, to specialize pre-trained backbones for control tasks.
Empirical results demonstrate that DPT achieves competitive regret bounds and improved efficiency in diverse settings including robotics, HVAC control, and multi-agent RL.

Decision Pre-Trained Transformer (DPT) denotes a family of transformer-based decision models in which pretraining precedes downstream reinforcement learning or control use. In one explicit formulation, DPT is a supervised pretraining method where a transformer predicts an optimal action given a query state and an in-context dataset of interactions, and at test time no weights are updated (Lee et al., 2023). In other uses, DPT refers more broadly to a Decision Transformer whose sequence model is initialized from a pre-trained LLM such as GPT-2 or DistilGPT2 and then specialized to control or offline RL through fine-tuning or parameter-efficient adaptation (Zhang et al., 2024, Yang et al., 2024). Domain-specific instantiations, including HVAC-DPT and large-scale multi-domain in-context RL systems, preserve the same core idea: decision-making is cast as conditional sequence modeling over trajectories, prompts, or in-context datasets, with adaptation occurring through context rather than, or in addition to, gradient updates (Berkes, 2024, Polubarov et al., 6 Apr 2026).

1. Conceptual scope and relation to Decision Transformer

DPT emerged from the Decision Transformer paradigm, which models trajectories as token sequences such as $(\hat r_t,s_t,a_t)$ or $(R_t,s_t,a_t)$ and trains a causal transformer to predict actions autoregressively (Rietz et al., 7 Feb 2025, Zhang et al., 2024). The distinctive feature of DPT is the role of pretraining. In the in-context RL formulation, pretraining is supervised over a distribution of tasks so that the model learns to infer the optimal action from a query state and a context of transitions; in PLM-initialized variants, the transformer backbone is first trained on large-scale non-RL data and then transferred to decision-making (Lee et al., 2023, Yang et al., 2024).

The literature therefore uses the term in at least two closely related senses. First, DPT can mean a specific supervised pretraining recipe for in-context reinforcement learning, with a fixed model used at inference by prompt augmentation alone (Lee et al., 2023). Second, DPT can mean an architectural or systems-level category: any Decision Transformer policy whose core sequence model has been pre-trained before task specialization, including GPT-initialized controllers, prompt-conditioned multi-task policies, and domain-specific zero-shot control systems (Zhang et al., 2024, Berkes, 2024).

This multiplicity of usage is not merely terminological. It reflects a shift from viewing transformers as task-specific offline RL models to viewing them as reusable sequence models for decision processes. A plausible implication is that DPT functions as an umbrella concept spanning supervised in-context learning, transfer from foundation LLMs, and prompt-based task adaptation.

2. Sequence representations and architectural forms

A recurrent architectural pattern is the encoding of decision problems as causal token sequences. In return-conditioned settings, an offline RL trajectory is written as

$(\hat r_0,s_0,a_0),(\hat r_1,s_1,a_1),\dots,(\hat r_T,s_T,a_T),$

with $\hat r_t=\sum_{t'=t}^T r_{t'}$, and each modality is projected into a shared embedding space before interleaving and positional encoding (Rietz et al., 7 Feb 2025). In partially observable continuous control, the corresponding input is a truncated context

$[R_{t-K+1},o_{t-K+1},a_{t-K+1},\dots,R_t,o_t],$

where separate MLPs embed return-to-go, observation, and action tokens into a shared $d$-dimensional space prior to processing by a GPT-2 backbone (Zhang et al., 2024).
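The two return-conditioned encodings above can be illustrated with a minimal Python sketch. It computes returns-to-go as suffix sums of the rewards, interleaves the three modalities, and truncates to the last $K$ timesteps so that the sequence ends on the current observation. The helper names (`returns_to_go`, `interleave`, `truncated_context`) and the tagged-tuple token format are illustrative assumptions, not taken from the cited papers, which embed each modality with learned projections instead.

```python
from typing import List, Sequence, Tuple

Token = Tuple[str, object]


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Suffix sums: rtg[t] = r[t] + r[t+1] + ... + r[T]."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        rtg[t] = running
    return rtg


def interleave(rtg: Sequence[float], states: Sequence[object],
               actions: Sequence[object]) -> List[Token]:
    """Flatten per-step triples into one (rtg, state, action, ...) sequence."""
    tokens: List[Token] = []
    for r, s, a in zip(rtg, states, actions):
        tokens += [("rtg", r), ("state", s), ("action", a)]
    return tokens


def truncated_context(tokens: Sequence[Token], K: int) -> List[Token]:
    """Keep the last K timesteps (K >= 1) and drop the final action,
    so the context ends on the current observation."""
    return list(tokens[-3 * K:])[:-1]


rtg = returns_to_go([1.0, 0.0, 2.0])
tokens = interleave(rtg, ["s0", "s1", "s2"], ["a0", "a1", "a2"])
context = truncated_context(tokens, K=2)
```

Here `context` holds five tokens: the return-to-go, state, and action of the second-to-last step, followed by the return-to-go and state of the last step, whose action is the one to be predicted.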

The in-context RL formulation uses a different but related encoding. For an in-context dataset $D=\{(s_j,a_j,s_j',r_j)\}_{j=1}^n$ and query state $\bar s$, the sequence is

$X=\bigl[(\bar s,0\dots 0),\xi_1,\xi_2,\dots,\xi_n\bigr],$

where each $\xi_j\in\mathbb{R}^{2d_s+d_a+1}$ concatenates state, action, next state, and reward. That formulation omits positional encoding on the dataset tokens to respect the set-invariance of $D$, while still using a causal transformer to produce a distribution over the next optimal action (Lee et al., 2023).
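As a rough illustration of this encoding, the sketch below builds the token list from a query state and a list of transitions, padding the query token with zeros to the common width $2d_s+d_a+1$. The function name `build_dpt_input` and the plain-list representation are assumptions made for the example; the cited work feeds such vectors through learned embeddings.

```python
from typing import List, Sequence, Tuple

# (state, action, next_state, reward); a hypothetical container for this sketch
Transition = Tuple[Sequence[float], Sequence[float], Sequence[float], float]


def build_dpt_input(query_state: Sequence[float],
                    dataset: Sequence[Transition],
                    d_a: int) -> List[List[float]]:
    """Return [query_token, xi_1, ..., xi_n], each of width 2*d_s + d_a + 1."""
    d_s = len(query_state)
    width = 2 * d_s + d_a + 1
    # Query token: the state followed by zeros in the action/next-state/reward slots.
    tokens = [list(query_state) + [0.0] * (width - d_s)]
    for s, a, s_next, r in dataset:
        xi = list(s) + list(a) + list(s_next) + [float(r)]
        if len(xi) != width:
            raise ValueError("transition does not match d_s and d_a")
        tokens.append(xi)
    return tokens


D = [
    ([0.0, 0.0], [1.0], [0.0, 1.0], 0.0),
    ([0.0, 1.0], [1.0], [1.0, 1.0], 1.0),
]
X = build_dpt_input([0.5, -0.5], D, d_a=1)
```

Because no positional information is attached to the dataset tokens, reordering `D` leaves the set of tokens unchanged, which is the property the omission of positional encoding is meant to respect.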

Across variants, the backbone remains a masked self-attention stack with feed-forward sublayers and residual connections. PLM-initialized systems retain most of the original transformer internals. LPDT freezes the pre-trained causal LLM and only replaces the input/output embedding layers while adding LoRA adaptation matrices (Yang et al., 2024). The control-oriented DPT of Zhang et al. initializes all transformer parameters from GPT-2-117M and inserts low-rank adapters into attention and MLP weights while leaving the original weights frozen (Zhang et al., 2024). HVAC-DPT similarly flattens state-action trajectories into one long autoregressive sequence and predicts actions from state-action prefixes, but frames the task as multi-zone HVAC control with per-zone continuous actions (Berkes, 2024).
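The low-rank adaptation used in these PLM-initialized variants can be sketched as follows: the frozen weight $W$ is left untouched and a trainable product $BA$ of rank $r$ is added to its output, scaled by $\alpha/r$. This is the standard LoRA formulation written in plain Python for readability; the function names and the toy matrices are assumptions for illustration and do not reproduce either paper's code.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(M: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(x: Vector, W: Matrix, A: Matrix, B: Matrix, alpha: float) -> Vector:
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B are trained."""
    r = len(A)  # rank of the adapter
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def trainable_fraction(d_out: int, d_in: int, r: int) -> float:
    """Adapter parameters divided by the parameters of the frozen weight."""
    return r * (d_in + d_out) / (d_in * d_out)


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pre-trained weight (toy values)
A = [[1.0, 1.0]]               # r x d_in
B_zero = [[0.0], [0.0]]        # zero-initialised B leaves the output unchanged
B = [[1.0], [0.0]]             # d_out x r
y_init = lora_forward([1.0, 2.0], W, A, B_zero, alpha=2.0)
y_adapted = lora_forward([1.0, 2.0], W, A, B, alpha=2.0)
```

With `B_zero` the adapted layer reproduces the frozen layer exactly, which is why LoRA training can start from the pre-trained behaviour. For a 768 x 768 weight and rank 8, `trainable_fraction` gives about 2% of the original parameter count.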

These choices indicate that DPT is not tied to a single tokenization scheme. What is stable is the reduction of sequential decision-making to conditional next-action prediction, whether the conditioning signal is return-to-go, a prompt, a future latent, or an in-context dataset of transitions.

3. Pretraining objectives and adaptation interfaces

The canonical supervised DPT objective asks the model to predict the optimal action from progressively larger prefixes of an in-context dataset. If $a^\star$ is sampled from the optimal policy for task $\tau$, the loss is

$\mathcal{L}(\theta)=-\,\mathbb{E}\Bigl[\sum_{j=1}^{n}\log M_\theta\bigl(a^\star\mid\bar s,D_j\bigr)\Bigr],$

where $M_\theta$ is the transformer and $D_j$ is the prefix of $D$ of length $j$. Training proceeds over many tasks, and the learned model is then fixed for future in-context use (Lee et al., 2023).
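The prefix structure of this objective can be made concrete with a small sketch, assuming a log-likelihood loss and discrete actions. The `toy_model` below is a hypothetical stand-in for the transformer (it simply normalizes reward-weighted action counts and assumes non-negative rewards); only the loop over prefixes reflects the objective described above.

```python
import math
from typing import Callable, List, Sequence, Tuple

Transition = Tuple[int, int, int, float]  # (state, action, next_state, reward)
Model = Callable[[int, Sequence[Transition], int], List[float]]


def toy_model(query_state: int, prefix: Sequence[Transition],
              n_actions: int) -> List[float]:
    """Hypothetical stand-in for the transformer: smoothed reward-weighted counts.
    Assumes rewards are non-negative; ignores the query state."""
    scores = [1.0] * n_actions
    for _s, a, _s_next, r in prefix:
        scores[a] += r
    total = sum(scores)
    return [x / total for x in scores]


def dpt_loss(model: Model, query_state: int, dataset: Sequence[Transition],
             optimal_action: int, n_actions: int) -> float:
    """Sum of negative log-likelihoods of the optimal action over all prefixes."""
    loss = 0.0
    for j in range(1, len(dataset) + 1):
        probs = model(query_state, dataset[:j], n_actions)
        loss -= math.log(probs[optimal_action])
    return loss


dataset = [(0, 1, 0, 1.0), (0, 0, 0, 0.0)]
loss_correct = dpt_loss(toy_model, 0, dataset, optimal_action=1, n_actions=2)
loss_wrong = dpt_loss(toy_model, 0, dataset, optimal_action=0, n_actions=2)
```

In this toy setting the loss is lower when the labelled optimal action is the one that earned reward in the context, which is the signal the pretraining objective exploits across tasks.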

A second line of work uses Decision Transformer pretraining on offline RL trajectories. In the small-transformer DPT of the prompt-tuning bandit paper, pretraining minimizes a supervised action-prediction loss over return-conditioned trajectory tokens, with no value network or actor-critic machinery (Rietz et al., 7 Feb 2025). MADT extends the offline pretraining logic to multi-agent RL by treating each agent’s interaction trace as one long token sequence and training with behavior cloning via cross-entropy on the next action, masking illegal actions to zero probability and using no explicit Q-value penalty or divergence regularizer beyond standard weight decay (Meng et al., 2021).
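The combination of illegal-action masking and cross-entropy behavior cloning can be sketched as below. The code is a generic masked softmax, not MADT's implementation; the function names and the boolean `legal` mask format are assumptions for the example.

```python
import math
from typing import List, Sequence


def masked_action_probs(logits: Sequence[float],
                        legal: Sequence[bool]) -> List[float]:
    """Softmax over legal actions only; illegal actions get exactly zero probability."""
    m = max(l for l, ok in zip(logits, legal) if ok)  # for numerical stability
    exps = [math.exp(l - m) if ok else 0.0 for l, ok in zip(logits, legal)]
    z = sum(exps)
    return [e / z for e in exps]


def bc_loss(logits: Sequence[float], legal: Sequence[bool], target: int) -> float:
    """Cross-entropy of the demonstrated action under the masked policy."""
    if not legal[target]:
        raise ValueError("demonstrated action is marked illegal")
    return -math.log(masked_action_probs(logits, legal)[target])


probs = masked_action_probs([0.0, 5.0, 0.0], [True, False, True])
loss = bc_loss([0.0, 5.0, 0.0], [True, False, True], target=0)
```

Even though the illegal action has the largest logit, it receives zero probability, and the remaining mass is split between the two legal actions.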

A third regime is unsupervised or weakly supervised pretraining. Pretrained Decision Transformer (PDT) uses reward-free offline trajectories and conditions action prediction on a latent future embedding produced from a future segment. Its total pretraining loss combines future-conditioned behavior cloning, entropy, KL regularization of the future encoder, and prior matching (Xie et al., 2023). The stated goal is to utilize generalized future conditioning to enable efficient unsupervised pretraining from reward-free and sub-optimal offline data.

Parameter-efficient specialization is another prominent interface. LPDT uses DistilGPT2 as a frozen backbone and fine-tunes only the new RL input/output projections together with LoRA adapters. Its training objective adds a prompt regularizer to the action-prediction loss, where the regularizer is either a classifier-based prompt regularizer or an InfoNCE objective that makes prompt representations more discriminative across tasks (Yang et al., 2024). In partially observable continuous control, the analogous GPT-2-based DPT optimizes mean squared error on actions while updating only LoRA parameters (Zhang et al., 2024).
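For readers unfamiliar with InfoNCE, the sketch below shows the generic form of the loss: a prompt representation is pulled toward a representation from the same task and pushed away from representations of other tasks. The cosine similarity, the temperature value, and the function signature are assumptions chosen for the example; the source does not specify LPDT's exact similarity function or sampling scheme.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(anchor: Sequence[float], positive: Sequence[float],
             negatives: List[Sequence[float]], temperature: float = 0.1) -> float:
    """-log( exp(sim(a, p)/T) / sum_k exp(sim(a, k)/T) ) over positive and negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # log-sum-exp with max subtraction for stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[0]


# Same-task prompt aligned with the anchor, other-task prompt orthogonal to it.
loss_separated = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]], temperature=1.0)
# Roles swapped: the "positive" is the orthogonal vector.
loss_confused = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]], temperature=1.0)
```

The loss is smaller when prompts from the same task are closer to each other than to prompts from other tasks, which is the sense in which the regularizer makes prompt representations more discriminative.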

Prompt-based post-pretraining adaptation introduces yet another interface. Prompting Decision Transformer prepends trajectory prompts built from task-specific demonstrations, and the bandit-based extension treats prompt construction as a contextual bandit over candidate trajectory segments. The frozen DPT backbone remains unchanged, while only small slot-wise reward models are updated online to improve prompt quality (Rietz et al., 7 Feb 2025).
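The idea of tuning prompts with a bandit while the backbone stays frozen can be illustrated with a deliberately simplified sketch. It uses a non-contextual $\epsilon$-greedy bandit with running-mean estimates over a fixed list of candidate segments, which is a simplification of the contextual, slot-wise reward models described above. The class name, the `evaluate` callback (standing in for a rollout of the frozen model with a given prompt), and the fake returns are all hypothetical.

```python
import random
from typing import Callable, List


class EpsilonGreedyPromptSelector:
    """Picks one candidate prompt segment per round; only these estimates are updated."""

    def __init__(self, n_candidates: int, epsilon: float = 0.1, seed: int = 0) -> None:
        self.epsilon = epsilon
        self.counts: List[int] = [0] * n_candidates
        self.means: List[float] = [0.0] * n_candidates
        self.rng = random.Random(seed)

    def select(self) -> int:
        untried = [i for i, c in enumerate(self.counts) if c == 0]
        if untried:                                   # try every candidate once
            return untried[0]
        if self.rng.random() < self.epsilon:          # explore
            return self.rng.randrange(len(self.counts))
        return max(range(len(self.means)), key=self.means.__getitem__)  # exploit

    def update(self, index: int, episode_return: float) -> None:
        self.counts[index] += 1
        self.means[index] += (episode_return - self.means[index]) / self.counts[index]


def tune_prompt(evaluate: Callable[[int], float], n_candidates: int,
                rounds: int, epsilon: float = 0.1, seed: int = 0) -> int:
    """Run the bandit for `rounds` episodes and return the best candidate index."""
    selector = EpsilonGreedyPromptSelector(n_candidates, epsilon, seed)
    for _ in range(rounds):
        i = selector.select()
        selector.update(i, evaluate(i))
    return max(range(n_candidates), key=selector.means.__getitem__)


fake_returns = [0.1, 0.9, 0.4]  # stand-in for rollouts of the frozen backbone
best = tune_prompt(lambda i: fake_returns[i], n_candidates=3, rounds=50)
```

No gradient reaches the sequence model in this loop; all learning happens in the per-candidate return estimates.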

Taken together, DPT research has converged on three adaptation mechanisms: gradient-based fine-tuning of a pre-trained backbone, prompt optimization over a frozen backbone, and pure in-context inference with no parameter updates.

4. Theoretical interpretation and internal mechanisms

The strongest theoretical claim in the DPT literature is that supervised in-context pretraining can implement Bayesian posterior sampling. Under realizability and compliance, the joint distribution over state-action trajectories generated by DPT in-context equals that of exact posterior sampling, and the resulting policy inherits Bayesian and frequentist regret guarantees (Lee et al., 2023). For a finite-horizon MDP with $S$ states, $A$ actions, horizon $H$, and $K$ episodes, the stated Bayesian regret bound is

$\tilde{O}\bigl(H^{3/2}S\sqrt{AK}\bigr).$

In a $d$-dimensional linear bandit, DPT recovers $\tilde{O}(d\sqrt{K})$ regret, improving over the $\tilde{O}(\sqrt{AK})$ bound of its own data generator (Lee et al., 2023).
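To make the reference algorithm concrete, the sketch below runs explicit posterior sampling (Thompson Sampling) on a Bernoulli bandit with Beta(1, 1) priors and tracks expected regret. It illustrates the behaviour that DPT is claimed to match in context; it is not DPT itself, and the function name, priors, and arm means are assumptions for the example.

```python
import random
from typing import List, Sequence, Tuple


def thompson_bernoulli(true_means: Sequence[float], horizon: int,
                       seed: int = 0) -> Tuple[float, List[int]]:
    """Posterior sampling with Beta(1, 1) priors; returns (expected regret, pull counts)."""
    rng = random.Random(seed)
    k = len(true_means)
    alpha = [1] * k   # 1 + observed successes per arm
    beta = [1] * k    # 1 + observed failures per arm
    pulls = [0] * k
    best = max(true_means)
    regret = 0.0
    for _ in range(horizon):
        # Sample one plausible mean per arm from its posterior, then act greedily.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=samples.__getitem__)
        reward = 1 if rng.random() < true_means[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
        regret += best - true_means[arm]
    return regret, pulls


regret, pulls = thompson_bernoulli([0.2, 0.8], horizon=200, seed=1)
```

Exploration here comes entirely from posterior uncertainty: as evidence accumulates, the sampled means concentrate and the better arm is pulled more often.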

This posterior-sampling interpretation is extended in the scalable multi-domain setting of Vintix II. There, DPT uses a transformer plus a flow-based policy head and is trained with a rectified-flow matching objective rather than a standard Gaussian or cross-entropy head. The paper states that Flow Matching is a natural training choice that preserves the interpretation of DPT as Bayesian posterior sampling, while also supporting expressive multi-modal continuous actions (Polubarov et al., 6 Apr 2026). The context consists of a query observation and a randomly permuted set of context tokens, with no positional encodings, again emphasizing in-context inference over recurrent state estimation.
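A rectified-flow matching objective of the generic kind mentioned above can be sketched as follows. A noise sample and a target action are linearly interpolated, the head is trained to predict the constant velocity between them, and actions are sampled by Euler integration from noise. The source does not describe Vintix II's exact parameterization, so the function names are assumptions, conditioning on the transformer output is omitted, and `point_mass_field` is a hypothetical closed-form velocity field used only to check the mechanics.

```python
from typing import Callable, List, Sequence

Vector = List[float]
VelocityFn = Callable[[Vector, float], Vector]


def flow_matching_loss(velocity_fn: VelocityFn, noise: Sequence[float],
                       action: Sequence[float], t: float) -> float:
    """MSE between predicted velocity at x_t and the straight-line target (action - noise)."""
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    target = [a - n for n, a in zip(noise, action)]
    pred = velocity_fn(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def sample_action(velocity_fn: VelocityFn, noise: Sequence[float],
                  steps: int = 4) -> Vector:
    """Euler integration of dx/dt = v(x, t) from t = 0 (noise) to t = 1 (action)."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def point_mass_field(target: Sequence[float]) -> VelocityFn:
    """Hypothetical ideal field when every context maps to one action (valid for t < 1)."""
    def velocity(x: Vector, t: float) -> Vector:
        return [(a - xi) / (1.0 - t) for a, xi in zip(target, x)]
    return velocity


field = point_mass_field([1.0, -2.0])
loss = flow_matching_loss(field, noise=[0.0, 0.0], action=[1.0, -2.0], t=0.5)
action = sample_action(field, noise=[0.3, 0.7], steps=4)
```

A learned head would replace `point_mass_field` and could represent several action modes at once, which is the expressiveness argument made for flow-based heads.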

A separate mechanistic analysis concerns what is transferred when pre-trained LLMs are reused for offline RL. “Unveiling Markov Heads in Pretrained LLMs for Offline Reinforcement Learning” identifies the Markov head as a crucial component among the attention heads of PLMs. Such a head places extreme attention on the last input token and performs well only in short-term environments (Zhao et al., 2024). The paper further states that this extreme attention cannot be changed by re-training the embedding layer or by fine-tuning, and proves preservation under random embeddings and robustness under bounded fine-tuning updates. The proposed GPT2-DTMA augments a pretrained DT with Mixture of Attention (MoA), treating head outputs as experts weighted by an input-dependent gate, thereby accommodating diverse attention requirements during fine-tuning (Zhao et al., 2024).
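The gating idea behind Mixture of Attention can be illustrated with a short sketch in which each head's output vector is treated as an expert and combined with softmax weights computed from the input. The linear gate, the function names, and the toy values are assumptions for the example; the source does not give GPT2-DTMA's exact gate architecture.

```python
import math
from typing import List, Sequence


def softmax(z: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_of_attention(head_outputs: Sequence[Sequence[float]],
                         x: Sequence[float],
                         gate_weights: Sequence[Sequence[float]]) -> List[float]:
    """Weight each head's output by an input-dependent gate, then sum.
    gate_weights has one row per head (a hypothetical linear gate)."""
    gate_logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    weights = softmax(gate_logits)
    dim = len(head_outputs[0])
    return [sum(w * h[i] for w, h in zip(weights, head_outputs)) for i in range(dim)]


heads = [[1.0, 0.0], [0.0, 1.0]]   # outputs of two attention heads (toy values)
x = [1.0, 2.0]                     # input features seen by the gate
uniform = mixture_of_attention(heads, x, [[0.0, 0.0], [0.0, 0.0]])
peaked = mixture_of_attention(heads, x, [[10.0, 0.0], [0.0, 0.0]])
```

A gate of this kind lets the model down-weight a head that attends almost exclusively to the most recent token when the input calls for longer-range information.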

These results jointly sharpen the understanding of DPT. The posterior-sampling analyses explain why in-context DPT can display exploration online and conservatism offline. The Markov-head analysis explains why PLM-initialized decision transformers may show asymmetric gains across short- and long-horizon environments. This suggests that “pretraining” in DPT is not a monolithic benefit: it can import either useful inductive structure or horizon-specific biases.

5. Empirical domains and reported performance

DPT-style methods have been evaluated across bandits, MDPs, offline multi-agent RL, partially observable control, robotics-like benchmarks, autonomous driving, traffic coordination, and building control. The reported results are heterogeneous in protocol but consistent in showing that pretrained sequence models can generalize across tasks or adapt in context.

Instantiation	Setting	Reported result
DPT (Lee et al., 2023)	Bandits and MDPs	Matches Thompson Sampling offline, cumulative regret on par with UCB and TS online
MADT (Meng et al., 2021)	SMAC offline/online MARL	Outperforms offline RL baselines; improves sample efficiency in PPO fine-tuning
GPT-initialized DPT (Zhang et al., 2024)	Partially observable continuous control	Zero-shot positive returns on all tasks; 10-shot outperforms expert PPO across the board
LPDT (Yang et al., 2024)	Few-shot prompt meta-RL	DistilGPT2 initialization improves Prompt-DT on unseen tasks
HVAC-DPT (Berkes, 2024)	Multi-zone HVAC control	45.62% reduction in HVAC energy use versus baseline
Vintix II DPT (Polubarov et al., 6 Apr 2026)	209-task multi-domain ICRL	Clear gains in generalization to the held-out test set

In the original supervised DPT study, the model’s suboptimality in Gaussian bandits matches Thompson Sampling offline, and online it achieves cumulative regret on par with UCB and Thompson Sampling. In the Dark Room environment, given a random dataset, DPT is reported to reach a higher return than Emp and AD, and within 40 episodes it performs on par with or better than Algorithm Distillation and $\text{RL}^2$, while single-task PPO fails (Lee et al., 2023). The same work reports robustness to unseen reward variances, Bernoulli bandits, unseen goal locations, action-space permutations, and the ability to “stitch” a third unseen optimal trajectory from demonstrations of two related tasks (Lee et al., 2023).

In MARL, MADT is trained on the first offline SMAC dataset with diverse quality levels. On 2s3z-good data, MADT is reported to converge to a higher mean return than BCQ-MA, and in online fine-tuning on 3s5z, MADT-PPO reaches a 20% win rate in fewer steps than MAPPO, which the paper describes as a speed-up (Meng et al., 2021). The same paper reports a higher average win rate for universal-MADT than for from-scratch MAPPO in a few-shot setting, and zero-shot performance of about 50% win on the held-out map “3s_vs_4z” (Meng et al., 2021).

In partially observable continuous control, GPT-2-initialized DPT is evaluated on five Controlgym tasks. In single-task training it exceeds expert PPO on he1, ac4, cm3, and Burgers, while remaining near PPO on CDR; in multi-task evaluation, zero-shot DPT yields positive returns on all listed in-distribution and out-of-distribution tasks, and with 10 demonstrations it outperforms expert PPO across the board (Zhang et al., 2024). The paper interprets this as evidence that DT can capture parameter-agnostic structures intrinsic to control tasks.

LPDT reports few-shot prompt improvements on unseen MuJoCo meta-RL and Meta-World ML1 tasks. For example, LPDT-NCE scores above Prompt-DT on Cheetah-dir, and LPDT-cls scores above Prompt-DT on Cheetah-vel (Yang et al., 2024). The paper also reports that with only 10% of the dataset, LPDT still outperforms full-data Prompt-DT on most tasks (Yang et al., 2024).

The prompt-tuning bandit work addresses a different weakness: uniform random trajectory prompts. On Sparse 2D, it compares PDT without tuning in two prompt configurations against $\epsilon$-greedy raw and UCB raw prompt tuning, and against standard DT without prompts (Rietz et al., 7 Feb 2025). On Half-Cheetah, $\epsilon$-greedy CMAB tuning is reported to improve the return (Rietz et al., 7 Feb 2025).

Beyond benchmark RL, DPT-style models have been used in physical control domains. The GPT-based Decision Transformer for multi-vehicle coordination at unsignalized intersections is trained on reservation-based trajectories plus mixed collision data. It is reported to outperform the training data in total travel time, to generalize to continuous 300 s traffic, 2% velocity noise, different vehicle numbers, and a 3-way intersection, and, in 91 collision-free test cases, to match or beat AIM in approximately 15 cases (Lee et al., 2024). HVAC-DPT frames multi-zone building control as in-context RL over state-action histories generated by diverse PPO agents and reports a 45.62% reduction in HVAC energy use versus a fixed-opening baseline, remaining within 5.78% of the bespoke “Expert” controller while SARL and MARL perform 74% and 70% worse than HVAC-DPT, respectively (Berkes, 2024).

At larger scale, Vintix II trains DPT across 209 training tasks in 10 domains totaling 709 M timesteps, with 46 held-out tasks for test. On unseen tasks in offline inference, DPT achieves 102% of demonstrator on MetaDrive, 78% on CityLearn, 92% on SinerGym, and 100% on ControlGym. Against scaled-up AD, improvements are reported on Bi-DexHands, MuJoCo param-shift, and Meta-World ML45, and the paper reports DPT ahead of REGENT on Meta-World ML45 (Polubarov et al., 6 Apr 2026).

6. Limitations, misconceptions, and open directions

Several limitations recur across the literature. MADT notes that pretraining is pure behavior cloning and therefore inherits suboptimal biases in the offline dataset; pure offline MADT cannot improve after loading because it simply clones old actions and has no drive to chase higher reward (Meng et al., 2021). The same paper reports that including reward-to-go degrades online fine-tuning because the offline RTG distribution mismatch misguides rollouts, and recommends using state/observation only for that setting (Meng et al., 2021). This directly counters a common misconception that more conditioning signals are always beneficial.

Future-conditioned unsupervised pretraining introduces its own trade-offs. PDT reports that the future-KL weight must be tuned per dataset: too large a weight collapses the future embedding to zero, while too small a weight lets the policy over-rely on the future embedding and ignore history. The method also encodes only a fixed number of future steps, which may miss very long-horizon structure, and the paper reports pretraining of 2–3 h per task on a 3090 GPU and finetuning of 6–8 h for 200 K steps (Xie et al., 2023).

PLM-initialized DPTs are constrained by transfer biases. The Markov-head study shows that some GPT-2 attention heads are so diagonal-dominated that they almost always attend to the most recent token; this benefits short-term environments but degrades long-term ones, and the effect cannot be removed by re-training embedding layers or standard fine-tuning (Zhao et al., 2024). The proposed MoA remedy narrows the performance gap in long-term environments while keeping comparable performance in short-term settings, but it also implies that naïve language-model transfer can encode an undesirable inductive prior for planning horizon (Zhao et al., 2024).

Scalable in-context DPT is still subject to data and interface limitations. Vintix II states that training still uses less than 1 token per parameter even though scaling laws suggest approximately 20 tokens per parameter are optimal; zero-demo exploration remains weaker than few-demo prompts; and the current architecture requires grouping by input/output dimensions, so it cannot yet handle entirely unseen state/action formats (Polubarov et al., 6 Apr 2026). In control applications, the partially observable DPT of Zhang et al. identifies context truncation, lack of explicit absolute-time embeddings, and fixed action heads for varying action dimensionality as open issues (Zhang et al., 2024). In traffic coordination, proposed future work includes safety filters, mixed-autonomy operation, and richer dynamics such as lane changes and pedestrians (Lee et al., 2024).

The term itself can also be misleading. “DPT” names both a specific in-context RL algorithm and a broader class of pretrained Decision Transformer systems; “PDT” denotes “Pretrained Decision Transformer” in one paper and “Prompting Decision Transformer” in another (Xie et al., 2023, Rietz et al., 7 Feb 2025). The underlying research program, however, is consistent: pretrained sequence models are being used to amortize decision-making structure across tasks, with deployment-time adaptation achieved by context, prompts, latent futures, or lightweight adapters rather than task-specific optimization from scratch.