Transformer-based Policies
- Transformer-based policies are neural architectures that use self-attention to fuse temporally and spatially distributed observations, enabling robust decision-making in complex tasks.
- They are applied in reinforcement learning, imitation learning, and control, offering state-of-the-art sample efficiency, generalization, and robustness across various domains.
- Architectural variants such as encoder-only, decoder-only, hybrid, graph-transformer, and diffusion-transformer designs provide tailored solutions for specific sequential decision-making challenges.
Transformer-based policies are neural policy architectures in sequential decision-making tasks (reinforcement learning, imitation learning, and control), where policy computation and/or credit assignment are mediated via self-attention mechanisms. By replacing or augmenting conventional fully connected, convolutional, or recurrent neural network policies, transformers offer increased capacity to fuse temporally and spatially distributed observations, handle variable-length inputs, and solve complex control tasks with rich partial observability, multi-agent structure, or multi-objective optimization. Recent research has demonstrated that transformer-based policies yield state-of-the-art sample efficiency, generalization, and policy robustness across robot manipulation, locomotion, flow control, multi-agent modeling, and model-based planning contexts.
1. Formal Definition and General Principles
A transformer-based policy is typically specified as a parameterized function $\pi_\theta$, mapping a sequence of observations $(o_1, \dots, o_t)$ to actions $a_t \sim \pi_\theta(\cdot \mid o_1, \dots, o_t)$. The sequence is embedded into token representations, which are processed by a stack of transformer layers implementing multi-head self-attention, $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(QK^\top/\sqrt{d_k}\right)V$, where the queries $Q$, keys $K$, and values $V$ are learned linear projections of the token embeddings.
For RL, the policy can be integrated into actor-critic (e.g. PPO, SAC) or value-based setups; for imitation learning, it is often optimized by log-likelihood or diffusion modeling objectives.
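As a concrete illustration of this mapping, the following is a minimal PyTorch sketch of a transformer policy: per-step observations are embedded as tokens, processed by a self-attention encoder, and the final token parameterizes a Gaussian action distribution. All module names, dimensions, and the single-token readout are illustrative assumptions, not an architecture from any of the cited papers.

```python
# Minimal transformer policy sketch (illustrative; names and sizes are assumptions).
import torch
import torch.nn as nn
from torch.distributions import Normal

class TransformerPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, d_model=128, n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)                    # per-step observation -> token
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))   # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.mu = nn.Linear(d_model, act_dim)                       # Gaussian action head
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs_seq):                                     # obs_seq: (B, T, obs_dim)
        T = obs_seq.shape[1]
        tokens = self.embed(obs_seq) + self.pos[:, :T]
        h = self.encoder(tokens)                                    # (B, T, d_model) after self-attention
        return Normal(self.mu(h[:, -1]), self.log_std.exp())        # act from the most recent token

policy = TransformerPolicy(obs_dim=10, act_dim=3)
dist = policy(torch.randn(8, 20, 10))      # batch of 8 observation histories of length 20
action = dist.sample()                     # plugged into a PPO/SAC actor or a log-likelihood IL loss
```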
Critically, transformer-based policies excel at modeling:
- Distributed spatio-temporal interactions: Joint fusion of sensor sequences, agent histories, spatial distributions, or multimodal signals.
- Long-range dependency: Memory over extended horizons in dynamical systems or multi-agent contexts.
- Variable-structure and input length: Handling diverse morphologies (Luo et al., 21 May 2025), prediction horizons (Wu et al., 9 Sep 2025), or multi-task settings (Lawson et al., 2023).
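The variable-length property can be made concrete with a key-padding mask: histories of different lengths are padded to a common length, and padded positions are excluded from attention. A minimal sketch with illustrative sizes:

```python
# Variable-length observation histories via a key-padding mask (illustrative sketch).
import torch
import torch.nn as nn

d_model = 64
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)

# Two episodes with different history lengths, padded to a common length T_max = 6.
lengths = torch.tensor([6, 3])
tokens = torch.randn(2, 6, d_model)
pad_mask = torch.arange(6)[None, :] >= lengths[:, None]   # True marks padded positions
out = encoder(tokens, src_key_padding_mask=pad_mask)      # padded tokens are ignored by attention
```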
2. Architectural Variants and Components
Several transformer policy designs are established in recent literature:
- Encoder-only architectures: Used for explicit sequence-to-action inference in control (TransMPC (Wu et al., 9 Sep 2025)), agent modeling (TransAM (Wallace et al., 4 Aug 2025)), or multi-modal manipulation (Tenma (Davies et al., 15 Sep 2025)).
- Decoder-only (GPT-style) architectures: For autoregressive RL/IL and offline sequence modeling (Decision Transformer (Lawson et al., 2023), Agentic Transformer (Liu et al., 2023)); the causal-masking contrast with encoder-only designs is sketched after this list.
- Hybrid/Interleaved architectures: Combining Transformer and ResNet blocks for strategic games (ResTNet (Wu et al., 7 Oct 2024)) or gated Transformer-XL for long-memory RL (GTrXL (Jain et al., 14 Nov 2025)).
- Graph-transformer hybrids: GCNT (Luo et al., 21 May 2025), Gformers (Duan et al., 4 Mar 2025) employ GCNs or structure-aware attention for morphology-agnostic control or permutation-equivariant precoding.
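The central architectural difference between the first two variants is the attention mask. A schematic contrast, built from the same PyTorch encoder block with illustrative sizes (not any cited paper's implementation):

```python
# Encoder-only (bidirectional) vs decoder-only (causal) attention (illustrative contrast).
import torch
import torch.nn as nn

T, d_model = 5, 32
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
block = nn.TransformerEncoder(layer, num_layers=1)
tokens = torch.randn(1, T, d_model)

# Encoder-only: every token attends to the full sequence (one-shot action inference).
bidirectional_out = block(tokens)

# Decoder-only: a causal mask restricts attention to past tokens (autoregressive generation).
causal_mask = nn.Transformer.generate_square_subsequent_mask(T)   # -inf above the diagonal
autoregressive_out = block(tokens, mask=causal_mask)
```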
Notable module-level designs:
- Gated embedding mechanisms for feature alignment (He et al., 2023); a minimal gated-fusion sketch follows this list.
- Cross-embodiment normalizers and slot masking for robot heterogeneity (Davies et al., 15 Sep 2025).
- Mixture-of-experts self-attention (MoDE (Reuss et al., 17 Dec 2024)), noise-conditioned token routing in diffusion transformers.
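As a rough illustration of the first item, the sketch below fuses two feature streams through a learned sigmoid gate before tokenization; the exact gating used in the cited work may differ, and all dimensions and names here are assumptions.

```python
# Hedged sketch of a gated embedding mechanism for aligning two feature streams.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim_a, dim_b, d_model):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, d_model)    # e.g., robot-state features
        self.proj_b = nn.Linear(dim_b, d_model)    # e.g., interaction/pedestrian features
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, a, b):
        ha, hb = self.proj_a(a), self.proj_b(b)
        g = torch.sigmoid(self.gate(torch.cat([ha, hb], dim=-1)))  # per-dimension gate in (0, 1)
        return g * ha + (1 - g) * hb                               # gated, aligned token embedding

fused = GatedFusion(dim_a=9, dim_b=16, d_model=64)(torch.randn(4, 9), torch.randn(4, 16))
```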
3. Training Frameworks and Optimization
Transformers can be optimized under various policy learning paradigms:
- On-policy RL: PPO (Jain et al., 14 Nov 2025, Liu et al., 11 Jun 2025, Sarkar et al., 17 Apr 2024), actor-critic with advantage estimation.
- Off-policy RL: SAC as in tactile grasping (Puang et al., 30 Jul 2024), TD3 or DDPG for morphology-agnostic control (Luo et al., 21 May 2025).
- Offline RL/Imitation Learning: Sequence modeling (DT (Lawson et al., 2023)), agentic methods with hindsight experience relabeling (AT (Liu et al., 2023)), and diffusion-based trajectory branch generation (Liu et al., 18 Nov 2024).
- Direct cost optimization: Model predictive control via gradient descent through the transformer and the system dynamics (TransMPC (Wu et al., 9 Sep 2025)); a rollout-cost sketch follows this list.
- Multi-objective optimization: Scalarized reward functions with trade-off parameters, e.g., grasp stability vs. force (Puang et al., 30 Jul 2024).
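The direct-cost paradigm can be illustrated by backpropagating a rollout cost through a differentiable dynamics model into the policy. The sketch below reuses the TransformerPolicy class from the sketch in Section 1 and assumes toy linear dynamics and a quadratic cost; it is a schematic of the idea, not the TransMPC algorithm itself.

```python
# Hedged sketch: gradient descent on a trajectory cost through differentiable dynamics.
# Assumes the TransformerPolicy class defined in the Section 1 sketch is in scope.
import torch

policy = TransformerPolicy(obs_dim=4, act_dim=2)
A, B = torch.eye(4), torch.randn(4, 2) * 0.1               # toy linear dynamics x' = Ax + Bu
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

x = torch.randn(16, 4)                                      # batch of initial states
history = x.unsqueeze(1)                                    # observation history, shape (B, 1, 4)
cost = 0.0
for _ in range(10):                                         # unroll a short horizon
    u = policy(history).mean                                # deterministic action for planning
    x = x @ A.T + u @ B.T                                   # differentiable rollout step
    cost = cost + (x ** 2).sum(-1).mean() + 1e-2 * (u ** 2).sum(-1).mean()
    history = torch.cat([history, x.unsqueeze(1)], dim=1)

opt.zero_grad()
cost.backward()                                             # gradients flow through dynamics and policy
opt.step()
```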
Recent theoretical advances (GPG Theorem (Mao et al., 11 Dec 2025)) generalize policy-gradient credit assignment for autoregressive transformer policies, bridging token-level and macro-action segmentation, with practical advantages for stable and efficient policy optimization in large models.
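The theorem itself is not reproduced here; the sketch below only illustrates, under assumed segment boundaries and advantage values, the general idea of bridging token-level and segment-level credit assignment: token log-probabilities inside a macro-action segment are summed and weighted by a single segment-level advantage, with token-level credit assignment recovered as the special case of length-one segments.

```python
# Illustrative only (not the paper's algorithm): segment-level policy-gradient surrogate
# built from token log-probs. Segment boundaries and advantages are placeholder values.
import torch

token_logp = torch.randn(12, requires_grad=True)    # log pi(a_i | context) for 12 generated tokens
segments = [(0, 5), (5, 9), (9, 12)]                 # assumed macro-action segmentation
seg_advantage = torch.tensor([0.8, -0.3, 1.1])       # one advantage estimate per segment (assumed)

# Sum token log-probs within each segment, weight by the segment advantage, and maximize.
surrogate = sum(adv * token_logp[s:e].sum() for (s, e), adv in zip(segments, seg_advantage))
loss = -surrogate
loss.backward()
```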
4. Applications and Empirical Achievements
Robotics and Control
- Dexterous manipulation: Tactile-transformer policies outperform CNN baselines and achieve zero-shot sim-to-real transfer in stable grasping (Puang et al., 30 Jul 2024).
- Crowd navigation: Spatio-temporal transformers with gated embedding enhance human-robot interaction feature fusion (He et al., 2023).
- Universal locomotion: GCNT achieves resilient control and zero-shot morphology generalization (Luo et al., 21 May 2025).
- Wave energy conversion: STrXL with gated residuals improves energy efficiency and reduces structural stress relative to FCN/LSTM controllers (Sarkar et al., 17 Apr 2024).
- Aerodynamic lift regulation: Transformer policies trained via PPO generalize to long gust sequences and exploit added-mass actuation (Liu et al., 11 Jun 2025).
Multi-task and Multi-modal Learning
- Weight-merged multi-task policies: Decision Transformers merged via Fisher averaging retain high performance, bypassing centralized training (Lawson et al., 2023).
- Cross-embodiment manipulation: Tenma’s diffusion-transformer with slot normalization yields robust manipulation across object/scene/embodiment shifts (Davies et al., 15 Sep 2025).
- Diffusion-policy scaling: MoDE achieves state-of-the-art multitask scores on CALVIN and LIBERO with a 90% FLOPs reduction via sparse expert routing (Reuss et al., 17 Dec 2024).
Strategic Reasoning
- Board games: Interleaved residual-transformer chains (ResTNet) dramatically improve global pattern recognition and adversarial robustness in Go and Hex (Wu et al., 7 Oct 2024).
- Multi-agent modeling: TransAM leverages local transformer encoding for agent belief formation and improves performance in cooperative, competitive, and mixed tasks (Wallace et al., 4 Aug 2025).
Communication and Model-based Planning
- Precoding in MU-MIMO systems: Graph-transformers exploit permutation-equivariance for low-complexity, size-generalizable policies (Duan et al., 4 Mar 2025).
- Explicit MPC: TransMPC’s transformer encoder solves for variable-horizon control actions in one pass, with constant inference time, outperforming RNN/MLP baselines (Wu et al., 9 Sep 2025).
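A hedged sketch of this one-pass, variable-horizon pattern: one learned query token per future step is concatenated with the encoded state, and a single encoder forward pass emits the whole action sequence. The query-token scheme, layer sizes, and readout are assumptions, not the TransMPC architecture.

```python
# Hedged sketch of one-pass, variable-horizon action inference with an encoder-only transformer.
import torch
import torch.nn as nn

class OneShotHorizonPolicy(nn.Module):
    def __init__(self, state_dim, act_dim, d_model=64, max_horizon=50):
        super().__init__()
        self.state_proj = nn.Linear(state_dim, d_model)
        self.horizon_tokens = nn.Parameter(torch.randn(max_horizon, d_model))  # one query per step
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, act_dim)

    def forward(self, state, horizon):
        B = state.shape[0]
        ctx = self.state_proj(state).unsqueeze(1)                              # (B, 1, d_model)
        queries = self.horizon_tokens[:horizon].unsqueeze(0).expand(B, -1, -1) # (B, horizon, d_model)
        h = self.encoder(torch.cat([ctx, queries], dim=1))                     # single forward pass
        return self.head(h[:, 1:])                                             # (B, horizon, act_dim)

u_seq = OneShotHorizonPolicy(state_dim=6, act_dim=2)(torch.randn(3, 6), horizon=20)
```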
5. Recent Theoretical Advances
The Generalized Policy Gradient (GPG) Theorem (Mao et al., 11 Dec 2025) unifies token-level policy gradients and group/segment-level optimization for transformers. This formalism accommodates macro-action segmentation, autoregressive generation, and variable-length credit assignment, which are crucial in RL with LLMs and structured decision tasks.
Diffusion transformer policies are further optimized via mixture-of-experts denoisers, noise-conditioned routing, and RL-driven acceleration policies (RAPID³ (Zhao et al., 26 Sep 2025)), which leverage small policy heads and group-based rewards for per-instance efficiency without generator fine-tuning.
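A rough sketch of noise-conditioned expert routing in a diffusion-transformer feed-forward block appears below; the hard top-1 routing, the router conditioned on the noise embedding alone, and the expert MLP shapes are illustrative assumptions and do not reproduce the MoDE or RAPID³ designs.

```python
# Hedged sketch of noise-conditioned expert routing in a diffusion-transformer layer.
import torch
import torch.nn as nn

class NoiseConditionedMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=4):
        super().__init__()
        self.noise_embed = nn.Linear(1, d_model)        # embed the diffusion noise level
        self.router = nn.Linear(d_model, n_experts)     # score experts from the noise embedding
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, tokens, sigma):                    # tokens: (B, T, d), sigma: (B, 1)
        logits = self.router(self.noise_embed(sigma))    # routing depends only on the noise level
        expert_idx = logits.argmax(dim=-1)               # hard top-1 routing (shown for simplicity;
                                                         # not differentiable through the router)
        out = torch.stack([self.experts[i](tokens[b]) for b, i in enumerate(expert_idx.tolist())])
        return tokens + out                              # residual update of the token stream

y = NoiseConditionedMoE()(torch.randn(2, 10, 64), torch.randn(2, 1).abs())
```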
6. Limitations, Design Guidelines, and Future Directions
Current transformer-based policies demand careful tuning of model depth, embedding dimension, tokenization granularity, and the alignment between structural properties of the task and the architecture (e.g., permutation equivariance in communications (Duan et al., 4 Mar 2025), morphology encoding (Luo et al., 21 May 2025)). Scaling trends suggest that increased capacity generally improves sample efficiency and transfer, provided downstream tasks are sufficiently diverse.
Limitations and open fronts include:
- Computational cost for very large models without parameter-efficient scaling (Reuss et al., 17 Dec 2024).
- Requirement of generative models or simulators for counterfactual transfer (Boustati et al., 2021).
- RL instabilities and hyperparameter sensitivity in acceleration-policy training (Zhao et al., 26 Sep 2025).
- Generalization to 3D morphologies, real-world robotic platforms, and continuous environment perturbations (Luo et al., 21 May 2025, Boustati et al., 2021).
Emergent techniques such as adaptive segmentation in policy optimization (Mao et al., 11 Dec 2025), hybrid graph-transformer architectures, and multimodal fusion are promising directions for increased efficiency, robustness, and transferability.
7. Summary Table of Transformer Policy Types
| Architecture | Key Feature | Representative Task |
|---|---|---|
| Encoder-only | Bidirectional self-attention, parallel action output | Explicit MPC, multi-agent modeling |
| Decoder-only | Autoregressive, causal self-attention | Offline RL, agentic sequence modeling |
| Hybrid (Res-Trans) | Interleaved residual + transformer blocks | Strategic games (Go, Hex) |
| Graph-transformer | Structure-aware attention, permutation equivariance | Morphology-agnostic control, communications |
| Diffusion-transformer | Score-based denoising, mixture-of-experts routing | Imitation learning, robust manipulation |
Transformers now constitute a fundamental policy class, substantially advancing the capacity, generalization, and sample efficiency of decision-making systems. The integration of attention-driven computation, macro-structural modeling, and parameter-efficient scaling is shaping future directions in sequential control and agentic intelligence.