Q-learning Decision Transformer (QDT)

Updated 23 December 2025
  • QDT is a hybrid reinforcement learning method that merges Q-learning with Decision Transformers to overcome limitations in stitching optimal behaviors from suboptimal data.
  • It employs techniques like return-to-go relabeling, encoder-decoder Q-decomposition, and action gradient optimization to improve policy performance in offline RL settings.
  • Empirical results show QDT outperforms standard Decision Transformers and conventional value-based methods on benchmarks like MuJoCo and Atari while offering improved interpretability.

The Q-learning Decision Transformer (QDT) refers to a class of approaches that systematically integrate Q-learning algorithms or Q-function optimization with the Decision Transformer (DT) framework in offline or deep reinforcement learning (RL). QDT methodologies address critical limitations of the plain DT—specifically, its inability to stitch together optimal behaviors from suboptimal data and its lack of value-based or dynamic-programming signals—by merging the supervised, return-conditioned sequence modeling of DT with Q-learning principles. QDT approaches include value relabeling, joint Q-function architectures, and gradient-based action adjustments, each addressing particular performance and generalization challenges in offline RL.

1. Motivation: Limitations of Decision Transformers and the Role of Q-Learning

The Decision Transformer architecture frames RL policy learning as a conditional sequence modeling problem, in which a Transformer is trained to predict the next action given a context of past states, previous actions, and a user-specified return-to-go (RTG) signal. While DT stabilizes training by avoiding bootstrapping, it is fundamentally limited in two ways:

  • Trajectory-level extrapolation ("stitching"): DT cannot assemble high-return trajectories from short, optimal subsequences dispersed across different suboptimal trajectories. Since DT operates purely via imitation, it can only mimic action sequences present in the data matching the conditioned RTG, lacking an explicit mechanism to combine or propagate rewards across trajectories (Yamagata et al., 2022).
  • State-level extrapolation (action extrapolation): DT cannot propose actions not seen in the training data, even if out-of-distribution actions would result in higher Q-values. It solely maximizes action likelihoods within the observed support (Lin et al., 6 Oct 2025).

Traditional Q-learning and dynamic programming methods can combine subsequences by propagating value estimates via Bellman backups, enabling superior stitching. However, Q-learning often suffers from unstable learning and distributional shift when applied with function approximation in offline settings.

QDT architectures address these issues by explicitly introducing Q-function estimates or value-based optimization routines into the Decision Transformer pipeline.

2. Algorithmic Variants and Architectural Approaches

Three principal QDT architectures are prevalent in the literature, reflecting distinct integration points of Q-learning and Transformer sequence modeling:

(a) Q-learning Decision Transformer via Dynamic Programming Relabeling

The Q-learning Decision Transformer framework (Yamagata et al., 2022) proceeds through three stages:

  1. Offline Q-learning: A conservative Q-function $\hat Q(s, a)$ is first learned from the dataset using a conservative objective (e.g., CQL). The state value $\hat V(s)$ is computed either as an expectation under the policy or via the greedy policy.
  2. Return-to-Go Relabeling: All trajectories in the dataset are traversed backward. At each step $t$, the RTG $R_t$ is replaced by a Bellman-relabeled value:

$$R_{t-1} \leftarrow r_{t-1} + \gamma \max\big(R_t, \hat V(s_t)\big)$$

Relabeled prefixes are recursively constructed for Transformer contexts.

  3. Training the Transformer: A standard causal Transformer is trained on the revised dataset. Given the context $(\hat G_{t-K+1:t}, s_{t-K+1:t}, a_{t-K+1:t-1})$, the next action $a_t$ is predicted using a cross-entropy loss.

This procedure enables the DT policy to inherit Q-learning's stitching capacity by training on RTG sequences closer to the optimal value function, while still leveraging DT’s robust sequence modeling (Yamagata et al., 2022).
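As an illustration, the backward relabeling pass can be summarized in a short Python sketch. This is a minimal rendering of the update rule above, assuming a learned value estimator `v_hat` is available; the function and variable names are illustrative rather than taken from the original implementation.

```python
import numpy as np

def relabel_returns_to_go(rewards, states, v_hat, gamma=1.0):
    """Backward Bellman relabeling of return-to-go (RTG) targets (sketch).

    rewards: per-step rewards r_0, ..., r_{T-1} of one trajectory
    states:  states s_0, ..., s_T (s_T is the final/next state)
    v_hat:   callable returning the conservative value estimate V_hat(s)
    gamma:   discount factor (set below 1.0 for discounted RTGs)
    Returns the relabeled RTG sequence R_0, ..., R_{T-1}.
    """
    T = len(rewards)
    rtg = np.zeros(T + 1)  # rtg[T] = 0: no return after the last transition
    # Traverse the trajectory backward, lower-bounding the observed RTG
    # with the learned value estimate before discounting.
    for t in range(T, 0, -1):
        rtg[t - 1] = rewards[t - 1] + gamma * max(rtg[t], v_hat(states[t]))
    return rtg[:T]
```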

(b) Action Q-Transformer: Encoder-Decoder Value Decomposition

In online deep RL, the Action Q-Transformer (AQT) framework incorporates a Transformer-based architectural decomposition of the Q-function (Itaya et al., 2023):

  • Architecture: A CNN feature extractor generates spatial tokens, which are passed through a Transformer encoder. The encoder's output represents a state embedding for computing the value function $V(s)$. Each action's one-hot encoding is projected via a learned linear layer to yield action queries, which are processed in the decoder with cross-attention to the encoded state.
  • Dueling Q-decomposition:

$$Q(s, a) = V(s) + \text{Adv}(s, a) - \frac{1}{n_a} \sum_{i=1}^{n_a} \text{Adv}(s, a_i)$$

The value branch reads the encoder summary, and the advantage branch reads the decoder output for each action query.

  • Training: The model is trained with the Rainbow distributional Bellman loss, augmented by a target loss to stabilize Transformer training. Standard deep RL infrastructure such as prioritized replay and target networks is used.

AQT combines spatial attention mechanisms with value-based RL and provides explicit interpretability via attention visualization (Itaya et al., 2023).
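A minimal PyTorch-style sketch of this dueling encoder-decoder decomposition follows. It is not the authors' reference architecture; the CNN stem, layer sizes, and the use of a mean-pooled encoder summary for $V(s)$ are illustrative assumptions, and positional encodings, the Rainbow distributional loss, and target-network machinery are omitted.

```python
import torch
import torch.nn as nn

class ActionQTransformerSketch(nn.Module):
    """Illustrative AQT-style dueling Q-decomposition (not the reference code).

    A CNN produces spatial tokens, a Transformer encoder summarizes the state
    for V(s), and one learned query per action is decoded with cross-attention
    to produce Adv(s, a).
    """

    def __init__(self, n_actions, d_model=128, n_heads=4, in_channels=4):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, d_model, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.encoder = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        # One learned query per discrete action (stand-in for q_i = W_q e_i).
        self.action_queries = nn.Embedding(n_actions, d_model)
        self.value_head = nn.Linear(d_model, 1)  # V(s) from the encoder summary
        self.adv_head = nn.Linear(d_model, 1)    # Adv(s, a) per decoded query

    def forward(self, obs):                         # obs: (B, C, H, W) image stack
        feat = self.cnn(obs)                        # (B, d_model, h, w)
        tokens = feat.flatten(2).transpose(1, 2)    # (B, h*w, d_model) spatial tokens
        enc = self.encoder(tokens)                  # encoded state tokens
        value = self.value_head(enc.mean(dim=1))    # (B, 1)
        queries = self.action_queries.weight.unsqueeze(0).expand(obs.size(0), -1, -1)
        dec = self.decoder(queries, enc)            # queries cross-attend to the state
        adv = self.adv_head(dec).squeeze(-1)        # (B, n_actions)
        # Dueling aggregation: Q(s, a) = V(s) + Adv(s, a) - mean_a' Adv(s, a')
        return value + adv - adv.mean(dim=1, keepdim=True)
```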

(c) Action Gradient: Q-Guided Local Action Optimization

A more recent line leverages a Q-value critic, often trained via Implicit Q-Learning (IQL), to refine the DT policy’s actions at inference time using the action gradient (AG) (Lin et al., 6 Oct 2025):

  • Action Gradient Update: For a DT-proposed action $a^0 = \pi_\theta(\text{context})$, gradient ascent steps are performed:

$$a^{i+1} = a^i + \alpha \nabla_{a^i} Q_\phi(s_t, a^i)$$

After $n$ steps, the action among $\{a^0, \ldots, a^n\}$ maximizing $Q_\phi$ is chosen.

  • Modularity: AG operates entirely at inference, with Q-learning signals injected without altering the Transformer’s training objective.

This approach provides DT with state-level extrapolation while maintaining the stability of sequence modeling (Lin et al., 6 Oct 2025).
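A minimal inference-time sketch of this refinement loop is given below. It assumes a differentiable critic `q_phi` (e.g., an IQL-trained Q-network) mapping a state-action pair to a scalar; the step size and number of steps are placeholder values, not those used in the cited paper.

```python
import torch

def action_gradient_refine(q_phi, state, a0, alpha=0.03, n_steps=10):
    """Refine a DT-proposed action by gradient ascent on a learned critic (sketch).

    q_phi:  callable Q_phi(state, action) -> scalar tensor
    state:  current state s_t as a tensor
    a0:     action proposed by the Decision Transformer, shape (action_dim,)
    """
    candidates = [a0.detach()]
    a = a0.clone().detach().requires_grad_(True)
    for _ in range(n_steps):
        q = q_phi(state, a)
        (grad,) = torch.autograd.grad(q, a)                    # dQ/da at the current action
        a = (a + alpha * grad).detach().requires_grad_(True)   # a^{i+1} = a^i + alpha * grad
        candidates.append(a.detach())
    # Choose the candidate in {a^0, ..., a^n} with the highest Q-value.
    with torch.no_grad():
        q_values = torch.stack([q_phi(state, c) for c in candidates])
    return candidates[int(torch.argmax(q_values))]
```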

3. Mathematical Formalism and Pseudocode

  1. Q-function Learning (CQL loss):

$$\mathcal{L}_{\text{CQL}} = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \left[ \big(r + \gamma \max_{a'} \hat Q(s', a') - \hat Q(s, a)\big)^2 \right] + \text{penalty}$$

  2. Bellman RTG Relabeling:

$$R_{t-1} \leftarrow r_{t-1} + \gamma \max\big(R_t, \hat V(s_t)\big)$$

  3. Transformer Training Loss:

$$\mathcal{L}(\theta) = - \sum_t \sum_{k=0}^{K-1} \log \pi_\theta\big(a_{t-k} \mid \hat G_{t-k}, s_{t-k}, \ldots\big)$$

  4. AQT Encoder-Decoder Decomposition (Itaya et al., 2023):
  • Encoder: Multi-head self-attention on spatial tokens (with positional encodings).
  • Action Query: $q_i = W_q e_i$, where $e_i$ is the one-hot action vector.
  • Decoder: Each query attends to the encoder output, yielding a per-action embedding.
  • Q-Aggregation:

$$Q(s, a) = V(s) + \text{Adv}(s, a) - \frac{1}{n_a} \sum_{i=1}^{n_a} \text{Adv}(s, a_i)$$

  5. Critic Learning (for Action Gradient):

$$Q_\phi,\ V_\psi \quad \text{via IQL or CQL}$$

  6. Gradient Ascent:

$$a^{i+1} = a^i + \alpha \nabla_{a^i} Q_\phi(s_t, a^i)$$

Select $\hat{a}_t = \arg\max_{i=0,\ldots,n} Q_\phi(s_t, a^i)$.
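Since the formulas above are not accompanied by code, a minimal sketch of the supervised training step in item 3 follows. It assumes discrete actions and a `policy` module that returns per-position action logits from relabeled (RTG, state, action) context windows; the interface and batch keys are illustrative, not taken from the cited papers.

```python
import torch.nn.functional as F

def dt_training_step(policy, optimizer, batch):
    """One supervised step on relabeled context windows (sketch, discrete actions).

    batch: dict with 'rtg' (B, K), 'states' (B, K, obs_dim), 'actions' (B, K) long,
           where the RTG tokens have already been Bellman-relabeled (item 2).
    policy is assumed to be causally masked so position t cannot see a_t.
    """
    logits = policy(batch["rtg"], batch["states"], batch["actions"])  # (B, K, n_actions)
    # Cross-entropy between predicted and logged actions at every position.
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        batch["actions"].reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```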

4. Empirical Evaluation and Quantitative Results

  • In offline RL, QDT with RTG relabeling outperforms both standard DT (which fails to stitch) and vanilla CQL (which can be unstable) in settings requiring the assembly of high-reward trajectories from suboptimal parts (Yamagata et al., 2022). For example, in a toy gridworld: CQL ≈ 40.0, DT ≈ 16.0, QDT ≈ 42.0; in Maze2D (sparse reward): DT fails while CQL and QDT succeed; in MuJoCo with delayed rewards: DT and QDT succeed while CQL fails.
  • The Action Q-Transformer achieves higher normalized cumulative reward than Rainbow in Atari 2600 tasks such as Breakout (AQT: 130.5 vs. Rainbow: 100), with improved interpretability via attention mechanism visualization (Itaya et al., 2023).
  • Action Gradient (AG) QDT variants set a new DT-based state of the art on D4RL benchmarks, notably in hopper-medium (RF+AG: 98.9 vs. best baselines <97) and walker2d-medium (RF+AG: 86.0) (Lin et al., 6 Oct 2025). AG also improved vanilla DT in environments where state-level extrapolation is critical.

5. Key Insights, Benefits, and Limitations

  • Stitching Ability: QDTs employing Q-learning-based supervision, either through return-to-go relabeling or critic-guided inference, are able to synthesize optimal behavior from suboptimal trajectories—an ability lacking in unmodified DTs (Yamagata et al., 2022).
  • Modular Enhancement: Action Gradient approaches use a Q-function for post-hoc action refinement without destabilizing the core DT training, preserving modularity (Lin et al., 6 Oct 2025).
  • Interpretability: Encoder-decoder architectures such as AQT provide fine-grained, per-action rationales via attention visualization, facilitating examination of agent focus with respect to $V(s)$ and $\text{Adv}(s, a)$ (Itaya et al., 2023).
  • Stability and Hyperparameters: QDTs depend on reliable, often conservative Q-functions. Inaccurate value estimation can degrade relabeling or action refinement. Integration introduces new hyperparameters (e.g., $\alpha$ for Action Gradient; CQL penalties for relabeling) requiring tuning for stable performance (Yamagata et al., 2022, Lin et al., 6 Oct 2025).

6. Open Problems and Future Research Directions

  • Robustness of Q-functions: Improving the reliability of Q-values for RTG relabeling or gradient refinement, especially in environments with distributional shift or high-dimensional observations.
  • Automatic Hyperparameter Selection: Procedures for adapting learning rates, penalties, clipping thresholds, or AG step sizes dynamically during training or inference.
  • Extension to Rich Observations and Language: Scalability of QDT methods to vision-based RL or tasks involving language instructions and compositional generalization (Yamagata et al., 2022).
  • Uncertainty-aware Relabeling: Selectively masking or adapting relabeling based on statistical uncertainty in Q-values or model ensembles.
  • Hybridization with Hierarchical and Latent Methods: Integration of QDT principles with hierarchical token prediction, latent variable modeling, or distributional Q-learning to further enhance stitching and extrapolation (Lin et al., 6 Oct 2025).

7. Relation to Broader Q-learning-Transformer Literature

QDT encompasses a spectrum of methods unifying transformers and Q-learning for RL: from direct Q-value sequence predictors (Stein et al., 2020), encoder-decoder dueling decompositions (Itaya et al., 2023), RTG relabeling (Yamagata et al., 2022), to action-space refinement with an offline critic (Lin et al., 6 Oct 2025). These methods have demonstrated that, with appropriate stabilization and modular Q-function integration, transformer-based RL agents can match or surpass traditional DQN and contemporary value-based baselines in both online and offline settings, while enabling interpretability and higher compositional capacity.

A plausible implication is that future RL systems will increasingly rely on hybrid sequence modeling with value-based refinement, leveraging the compositional power and recall ability of transformers matched with classical dynamic programming’s propagation and extrapolation strengths.
