
Transformer-based Policies

Updated 18 December 2025
  • Transformer-based policies are neural architectures that use self-attention to fuse temporally and spatially distributed observations, enabling robust decision-making in complex tasks.
  • They are applied in reinforcement learning, imitation learning, and control, offering state-of-the-art sample efficiency, generalization, and robustness across various domains.
  • Architectural variants such as encoder-only, decoder-only, hybrid, graph-transformer, and diffusion-transformer designs provide tailored solutions for specific sequential decision-making challenges.

Transformer-based policies are neural policy architectures in sequential decision-making tasks (reinforcement learning, imitation learning, and control), where policy computation and/or credit assignment are mediated via self-attention mechanisms. By replacing or augmenting conventional fully connected, convolutional, or recurrent neural network policies, transformers offer increased capacity to fuse temporally and spatially distributed observations, handle variable-length inputs, and solve complex control tasks with rich partial observability, multi-agent structure, or multi-objective optimization. Recent research has demonstrated that transformer-based policies yield state-of-the-art sample efficiency, generalization, and policy robustness across robot manipulation, locomotion, flow control, multi-agent modeling, and model-based planning contexts.

1. Formal Definition and General Principles

A transformer-based policy is typically specified as a parameterized function $\pi_\theta: O_{1:T} \rightarrow A_T$, mapping a sequence of $T$ observations $O_{1:T}$ to an action $A_T$. The sequence is embedded into token representations, which are processed by a stack of transformer layers implementing multi-head self-attention:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

For RL, the policy can be integrated into actor-critic (e.g. PPO, SAC) or value-based setups; for imitation learning, it is often optimized by log-likelihood or diffusion modeling objectives.
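For concreteness, a minimal sketch of an encoder-only policy $\pi_\theta: O_{1:T} \rightarrow A_T$ (assuming PyTorch, a discrete action space, and illustrative dimensions; not drawn from any cited paper):

```python
import torch
import torch.nn as nn

class TransformerPolicy(nn.Module):
    """Encoder-only policy: a sequence of T observation tokens -> logits for action A_T."""
    def __init__(self, obs_dim, act_dim, d_model=128, n_heads=4, n_layers=2, max_len=64):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)             # per-step observation -> token
        self.pos = nn.Embedding(max_len, d_model)            # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, act_dim)              # action logits read from the last token

    def forward(self, obs_seq, pad_mask=None):
        # obs_seq: (batch, T, obs_dim); pad_mask: (batch, T), True at padded positions
        T = obs_seq.size(1)
        pos = self.pos(torch.arange(T, device=obs_seq.device))
        h = self.encoder(self.embed(obs_seq) + pos, src_key_padding_mask=pad_mask)
        return self.head(h[:, -1])                           # logits over the action A_T

# Usage sketch: dist = torch.distributions.Categorical(logits=policy(obs_seq))
```

The padding mask illustrates how variable-length observation histories are handled within a single batched forward pass.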

Critically, transformer-based policies excel at modeling:

  • Distributed spatio-temporal interactions: fusion of sensor sequences, agent histories, spatial distributions, or multimodal signals on an equal footing.
  • Long-range dependency: Memory over extended horizons in dynamical systems or multi-agent contexts.
  • Variable-structure and input length: Handling diverse morphologies (Luo et al., 21 May 2025), prediction horizons (Wu et al., 9 Sep 2025), or multi-task settings (Lawson et al., 2023).

2. Architectural Variants and Components

Several transformer policy designs are established in recent literature, including encoder-only, decoder-only, hybrid residual-transformer, graph-transformer, and diffusion-transformer architectures (summarized in the table in Section 7).

Notable module-level designs include gated embeddings for feature fusion (He et al., 2023), gated residual connections (Sarkar et al., 17 Apr 2024), slot normalization (Davies et al., 15 Sep 2025), and sparse mixture-of-expert routing in diffusion denoisers (Reuss et al., 17 Dec 2024); a minimal sketch of a gated residual module follows below.
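As a hedged illustration of the module-level idea, a simplified gated residual connection in the spirit of the gated designs cited above (not any specific paper's implementation; PyTorch, illustrative dimensions):

```python
import torch
import torch.nn as nn

class GatedResidual(nn.Module):
    """Blend a sublayer output y with its input x through a learned sigmoid gate."""
    def __init__(self, d_model):
        super().__init__()
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, x, y):
        g = torch.sigmoid(self.gate(torch.cat([x, y], dim=-1)))  # per-feature gate in (0, 1)
        return g * y + (1.0 - g) * x                              # gated interpolation of the residual path
```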

3. Training Frameworks and Optimization

Transformers can be optimized under a range of policy learning paradigms: on-policy and off-policy actor-critic RL (e.g., PPO, SAC), value-based RL, behavioral cloning via log-likelihood maximization, and diffusion-based imitation objectives. A sketch of the standard clipped PPO surrogate, used by several of the policies cited below, is shown next.
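The clipped PPO surrogate, written generically (not specific to any paper above; tensor shapes illustrative):

```python
import torch

def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    # new_logp, old_logp, advantages: (batch,) per-action quantities from rollouts
    ratio = torch.exp(new_logp - old_logp)                          # importance ratio pi_theta / pi_theta_old
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```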

Recent theoretical advances, such as the GPG Theorem (Mao et al., 11 Dec 2025), generalize policy-gradient credit assignment for autoregressive transformer policies, bridging token-level and macro-action segmentation, with practical advantages for stable and efficient policy optimization in large models.

4. Applications and Empirical Achievements

Robotics and Control

  • Dexterous manipulation: Tactile-transformer policies outperform CNN baselines and achieve zero-shot sim-to-real transfer in stable grasping (Puang et al., 30 Jul 2024).
  • Crowd navigation: Spatio-temporal transformers with gated embedding enhance human-robot interaction feature fusion (He et al., 2023).
  • Universal locomotion: GCNT achieves resilient control and zero-shot morphology generalization (Luo et al., 21 May 2025).
  • Wave energy conversion: STrXL with gated residuals improves energy efficiency and reduces stress compared with FCN/LSTM controllers (Sarkar et al., 17 Apr 2024).
  • Aerodynamic lift regulation: Transformer policies trained via PPO generalize to long gust sequences and exploit added-mass actuation (Liu et al., 11 Jun 2025).

Multi-task and Multi-modal Learning

  • Weight-merged multi-task policies: Decision Transformers merged via Fisher averaging retain high performance, bypassing centralized training (Lawson et al., 2023).
  • Cross-embodiment manipulation: Tenma’s diffusion-transformer with slot normalization yields robust manipulation across object/scene/embodiment shifts (Davies et al., 15 Sep 2025).
  • Diffusion-policy scaling: MoDE achieves state-of-the-art multitask scores on CALVIN and LIBERO with 90% FLOPS reduction via sparse expert routing (Reuss et al., 17 Dec 2024).

Strategic Reasoning

  • Board games: Interleaved residual-transformer chains (ResTNet) dramatically improve global pattern recognition and adversarial robustness in Go and Hex (Wu et al., 7 Oct 2024).
  • Multi-agent modeling: TransAM leverages local transformer encoding for agent belief formation and improves performance in cooperative, competitive, and mixed tasks (Wallace et al., 4 Aug 2025).

Communication and Model-based Planning

  • Precoding in MU-MIMO systems: Graph-transformers exploit permutation-equivariance for low-complexity, size-generalizable policies (Duan et al., 4 Mar 2025).
  • Explicit MPC: TransMPC’s transformer encoder solves for variable-horizon control actions in one pass, with constant inference time, outperforming RNN/MLP baselines (Wu et al., 9 Sep 2025).
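For illustration, the one-pass, variable-horizon decoding pattern above can be sketched as learned per-step query tokens cross-attending to an encoded state (names and dimensions are hypothetical; this is not TransMPC's code):

```python
import torch
import torch.nn as nn

class OneShotHorizonHead(nn.Module):
    """Decode H control actions for a variable horizon H in a single forward pass."""
    def __init__(self, state_dim, act_dim, max_horizon=32, d_model=128, n_heads=4):
        super().__init__()
        self.state_proj = nn.Linear(state_dim, d_model)
        self.horizon_queries = nn.Embedding(max_horizon, d_model)   # one learned query per future step
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, act_dim)

    def forward(self, state, horizon):
        # state: (batch, state_dim); horizon: number of control steps to produce
        kv = self.state_proj(state).unsqueeze(1)                    # (batch, 1, d_model)
        q = self.horizon_queries.weight[:horizon].unsqueeze(0).expand(state.size(0), -1, -1)
        h, _ = self.attn(q, kv, kv)                                 # queries cross-attend to the encoded state
        return self.out(h)                                          # (batch, horizon, act_dim)
```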

5. Recent Theoretical Advances

The Generalized Policy Gradient (GPG) Theorem (Mao et al., 11 Dec 2025) unifies token-level policy gradients and group/segment-level optimization for transformers. This formalism accommodates macro-action segmentation, autoregressive generation, and variable-length credit assignment, crucial in RL with LLMs and structured decision tasks.
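A hedged sketch of the underlying idea, pooling token-level log-probabilities into macro-action segments before weighting by advantages (a generic segment-level surrogate, not the GPG Theorem's exact estimator; segmentation and advantage estimates assumed given):

```python
import torch

def segment_pg_loss(token_logps, segment_ids, segment_adv):
    # token_logps: (T,) log pi_theta(a_t | context) per generated token
    # segment_ids: (T,) long tensor mapping each token to one of S macro-action segments
    # segment_adv: (S,) advantage estimate per macro-action segment
    seg_logp = torch.zeros_like(segment_adv).index_add_(0, segment_ids, token_logps)
    return -(seg_logp * segment_adv).mean()                         # REINFORCE-style surrogate over segments
```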

Diffusion transformer policies are further optimized via mixture-of-expert denoisers, noise-conditioned routing, and RL-driven acceleration policies such as RAPID³ (Zhao et al., 26 Sep 2025), which leverage small policy heads and group-based rewards for per-instance efficiency without generator fine-tuning.
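A hedged sketch of noise-conditioned expert routing in a diffusion-transformer feed-forward block (soft routing shown for simplicity; the cited works rely on sparse expert selection for their FLOP savings, and all names and dimensions here are illustrative):

```python
import torch
import torch.nn as nn

class NoiseRoutedFeedForward(nn.Module):
    """Softly select among expert MLPs using only the denoising noise level."""
    def __init__(self, d_model=128, n_experts=4):
        super().__init__()
        self.router = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, n_experts))
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, tokens, sigma):
        # tokens: (batch, T, d_model); sigma: (batch, 1) noise level of the current denoising step
        weights = torch.softmax(self.router(sigma), dim=-1)               # (batch, n_experts)
        outs = torch.stack([e(tokens) for e in self.experts], dim=1)      # (batch, n_experts, T, d_model)
        return torch.einsum("be,betd->btd", weights, outs)                # convex combination of experts
```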

6. Limitations, Design Guidelines, and Future Directions

Current transformer-based policies demand careful tuning of model depth, embedding dimension, tokenization granularity, and alignment between structural properties of the task and architecture (e.g., permutation equivariance in communications (Duan et al., 4 Mar 2025), morphology encoding (Luo et al., 21 May 2025)). Scaling rules suggest that increased capacity generally increases sample efficiency and transfer, provided downstream tasks are sufficiently diverse.

Limitations and open fronts remain. Emergent techniques such as adaptive segmentation in policy optimization (Mao et al., 11 Dec 2025), hybrid graph-transformer architectures, and multimodal fusion are promising directions for increased efficiency, robustness, and transferability.

7. Summary Table of Transformer Policy Types

| Architecture | Key Feature | Representative Tasks |
|---|---|---|
| Encoder-only | Bidirectional self-attention, parallel output | Explicit MPC, multi-agent modeling |
| Decoder-only | Autoregressive, causal self-attention | Offline RL, agentic sequence modeling |
| Hybrid (residual-transformer) | Interleaved residual + transformer blocks | Strategic games (Go, Hex) |
| Graph-transformer | Structure-aware, permutation-equivariant | Morphology-agnostic control, communications |
| Diffusion-transformer | Score-based denoising, MoE routing | Imitation learning, robust manipulation |

Transformers now constitute a fundamental policy class, substantially advancing the capacity, generalization, and sample efficiency of decision-making systems. The integration of attention-driven computation, macro-structural modeling, and parameter-efficient scaling is shaping future directions in sequential control and agentic intelligence.
