Q-learning Decision Transformer: Leveraging Dynamic Programming for Conditional Sequence Modelling in Offline RL

Published 8 Sep 2022 in cs.LG | (2209.03993v4)

Abstract: Recent works have shown that tackling offline reinforcement learning (RL) with a conditional policy produces promising results. The Decision Transformer (DT) combines the conditional policy approach and a transformer architecture, showing competitive performance against several benchmarks. However, DT lacks stitching ability -- one of the critical abilities for offline RL to learn the optimal policy from sub-optimal trajectories. This issue becomes particularly significant when the offline dataset only contains sub-optimal trajectories. On the other hand, the conventional RL approaches based on Dynamic Programming (such as Q-learning) do not have the same limitation; however, they suffer from unstable learning behaviours, especially when they rely on function approximation in an off-policy learning setting. In this paper, we propose the Q-learning Decision Transformer (QDT) to address the shortcomings of DT by leveraging the benefits of Dynamic Programming (Q-learning). It utilises the Dynamic Programming results to relabel the return-to-go in the training data to then train the DT with the relabelled data. Our approach efficiently exploits the benefits of these two approaches and compensates for each other's shortcomings to achieve better performance. We empirically show these in both simple toy environments and the more complex D4RL benchmark, showing competitive performance gains.



Summary

The paper introduces a hybrid method that uses Q-learning to relabel the return-to-go values on which a Decision Transformer is trained, addressing the transformer's inability to stitch sub-optimal trajectories in offline RL.
It follows a three-step methodology: learning a value function via conservative Q-learning, relabeling the return-to-go values in the offline data with the learned values, and training the Decision Transformer on the relabeled data.
Empirical evaluations show significant improvements in maze navigation and delayed reward tasks, highlighting QDT’s potential for robust offline reinforcement learning.

Overview of Q-learning Decision Transformer: Leveraging Dynamic Programming for Conditional Sequence Modeling in Offline Reinforcement Learning

The paper "Q-learning Decision Transformer: Leveraging Dynamic Programming for Conditional Sequence Modelling in Offline RL" introduces a method that combines the Decision Transformer (DT) framework with Q-learning to address a specific challenge in offline reinforcement learning (RL). DT pairs a transformer-based model with a conditional policy, but it struggles to learn an optimal policy when the dataset contains only sub-optimal trajectories. In contrast, Dynamic Programming approaches such as Q-learning can extract an optimal policy by propagating the value function backward through the data, but they suffer from unstable learning when they rely on function approximation in an off-policy setting.
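The contrast above can be made concrete with a toy example. The sketch below is not from the paper: the two trajectories, the state and action numbering, and the helper names `q_iteration` and `state_value` are all hypothetical. It runs tabular Q-learning backups over a fixed dataset (restricting the max to actions seen in the data, which is an assumption made here to avoid querying unseen actions) and shows that the learned value of the start state exceeds the return of the trajectory that actually began there.

```python
from collections import defaultdict

# Hypothetical offline dataset: (state, action, reward, next_state, done).
# Trajectory 1 starts in state 0 and ends with a small reward.
traj_1 = [(0, 0, 0.0, 1, False), (1, 1, 1.0, 3, True)]
# Trajectory 2 starts elsewhere, passes through state 1, and ends with a large reward.
traj_2 = [(2, 0, 0.0, 1, False), (1, 0, 10.0, 4, True)]
dataset = traj_1 + traj_2


def q_iteration(dataset, gamma=1.0, sweeps=10):
    """Tabular Q-learning backups over a fixed dataset (deterministic toy case)."""
    q = defaultdict(float)
    seen_actions = defaultdict(set)
    for s, a, _, _, _ in dataset:
        seen_actions[s].add(a)
    for _ in range(sweeps):
        for s, a, r, s_next, done in dataset:
            if done or not seen_actions[s_next]:
                target = r
            else:
                # Assumption: maximise only over actions observed in the data.
                target = r + gamma * max(q[(s_next, b)] for b in seen_actions[s_next])
            q[(s, a)] = target  # learning rate 1 is enough for a deterministic toy
    return q, seen_actions


def state_value(q, seen_actions, s):
    """Greedy state value over the actions observed in state s."""
    return max(q[(s, a)] for a in seen_actions[s])


q, seen_actions = q_iteration(dataset)
observed_return = sum(r for _, _, r, _, _ in traj_1)
stitched_value = state_value(q, seen_actions, 0)
print(observed_return, stitched_value)
```

The value backup joins the first step of trajectory 1 to the second step of trajectory 2, which is the stitching behaviour that a model conditioned only on observed returns cannot reproduce.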

To address these limitations, the proposed Q-learning Decision Transformer (QDT) algorithm draws on the strengths of both Q-learning and the transformer architecture. QDT applies Dynamic Programming to the training data rather than to the model: it replaces the return-to-go (RTG) values in the offline data with values learned through Q-learning. This relabeling targets the weakness DT shows in tasks that require trajectory stitching, where the optimal policy must be pieced together from segments of different sub-optimal trajectories.
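A rough sketch of what such relabeling could look like follows. It assumes a simple rule: walk each trajectory backward, recompute the RTG from the reward and the next RTG, and raise it to the learned state value whenever that value is larger. The exact rule used in the paper may differ, and the function names and the numbers in the example are hypothetical.

```python
def returns_to_go(rewards):
    """Undiscounted return-to-go: the sum of rewards from step t to the end."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg


def relabel_rtg(rewards, state_values):
    """Backward pass that treats the learned state value as a lower bound on RTG.

    Assumption: state_values[t] is the value a Q-learning method assigns to the
    state visited at step t. A raised RTG propagates to all earlier steps.
    """
    relabeled = [0.0] * len(rewards)
    next_rtg = 0.0
    for t in reversed(range(len(rewards))):
        candidate = rewards[t] + next_rtg
        relabeled[t] = max(candidate, state_values[t])
        next_rtg = relabeled[t]
    return relabeled


rewards = [1.0, 0.0, 1.0]          # hypothetical trajectory rewards
state_values = [0.0, 5.0, 0.0]     # hypothetical learned values
original = returns_to_go(rewards)
relabeled = relabel_rtg(rewards, state_values)
print(original, relabeled)
```

In this example only the middle state has a learned value above its observed RTG, yet the first step's RTG also rises because the backward pass carries the higher value through the preceding reward.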

Technical Contributions and Results

QDT proceeds in three sequential steps:

Learning of a value function through Q-learning.
Relabeling the RTG in the offline data with the learned values. The values come from the conservative Q-learning (CQL) framework, which yields a lower bound on the true values.
Training the Decision Transformer on the relabeled data.
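For the final step, the sketch below shows one plausible way to turn a relabeled trajectory into a training example for a return-conditioned sequence model. It assumes the usual Decision Transformer layout, in which each timestep contributes an (RTG, state, action) triple and the model predicts the action at the last step. The function name, the token tuples, and the sample data are hypothetical and are not taken from the paper's code.

```python
def build_dt_example(rtgs, states, actions, t, context_len):
    """Return (input tokens, target action) for predicting the action at step t.

    The context holds at most `context_len` timesteps ending at t. The action at
    step t is left out of the inputs because it is the prediction target.
    """
    start = max(0, t - context_len + 1)
    tokens = []
    for i in range(start, t + 1):
        tokens.append(("rtg", rtgs[i]))
        tokens.append(("state", states[i]))
        if i < t:
            tokens.append(("action", actions[i]))
    return tokens, actions[t]


# Hypothetical relabeled trajectory.
rtgs = [6.0, 5.0, 1.0]
states = ["s0", "s1", "s2"]
actions = ["a0", "a1", "a2"]

tokens, target = build_dt_example(rtgs, states, actions, t=2, context_len=2)
print(tokens, target)
```

Because relabeling changes only the `rtgs` list, a pipeline built this way would feed the transformer exactly as before, with different conditioning values.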

The paper reports empirical evaluations in simple toy environments and on the more complex D4RL benchmark, covering both datasets of sub-optimal trajectories and harder task settings. In maze environments, QDT improves significantly over DT by drawing on Q-learning's capacity for trajectory stitching. In delayed-reward scenarios, where Q-learning algorithms typically struggle, QDT's conditioning on the relabeled RTG offers substantial performance improvements, closing the gap with standard Q-learning results.

This hybrid methodology improves the quality of the training data rather than the model: because only the RTG labels change, the Decision Transformer architecture needs no modification. The empirical results and the adaptability of the framework suggest it merits further exploration and refinement as a way to improve offline RL algorithms.

Implications and Future Directions

Combining Q-learning with the transformer-based DT shows how sequence modeling can be extended to more demanding RL applications. Although results vary across environments, QDT's blend of forward-looking conditioning on returns and backward-looking value propagation lays the groundwork for further work on combining the strengths of the two approaches.

Future work might explore improved Q-learning algorithms that prioritize accurate value estimation, since the relabeling process depends on it. A better understanding of how state-action value functions behave under different trajectory lengths and degrees of reward sparsity would also improve robustness and could extend the methodology to larger, more complex environments. With its data-centric approach, QDT reflects a broader shift towards improving data quality in machine learning.
