- The paper introduces CompoFormer, a novel model that uses adaptive policy self-composition to overcome challenges in continual offline reinforcement learning.
- It employs a modular strategy with LoRA-based growth and pruning to dynamically compose policy networks and mitigate interference between tasks.
- Experimental results on the OCW benchmark demonstrate significant reductions in forgetting and improved task transfer performance.
Continual Task Learning Through Adaptive Policy Self-Composition
The paper "Continual Task Learning Through Adaptive Policy Self-Composition" outlines a novel approach to continual offline reinforcement learning (CORL) by introducing a structure-based model called CompoFormer. This model seeks to address the intricate challenge of adapting to new tasks while maintaining performance on previously learned ones, thereby tackling the stability-plasticity dilemma pervasive in continual learning scenarios.
Background and Motivation
Continual learning (CL) is a critical aspect of machine learning, aiming for systems that can learn a sequence of tasks over time without catastrophic forgetting. In offline reinforcement learning (RL), the absence of online interaction with the environment accentuates the distribution shift between tasks, making this challenge even more formidable. Traditional CL approaches in offline settings often fall short due to cumulative interference from sequential task learning and inadequate handling of the stability-plasticity tradeoff. The motivation behind CompoFormer is to leverage the Transformer architecture's sequential modeling capabilities to facilitate knowledge transfer and mitigate the distributional shifts inherent to offline RL.
Methodology
CompoFormer introduces a meta-policy network to address these challenges. The architecture relies on a modular growing strategy, whereby structural modifications are made within the Transformer model to accommodate new tasks. This is realized through two variants: CompoFormer-Grow, which expands the architecture with Low-Rank Adaptation (LoRA) adapters, and CompoFormer-Prune, which prunes the shared network to create efficient task-specific subnetworks, as sketched below.
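To make the growth variant concrete, here is a minimal sketch of the general LoRA-per-task idea: a frozen base projection layer accumulates one low-rank adapter per task, and only the current task's adapter is trained. The class name `LoRALinear`, the rank value, and the task-indexed forward pass are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch (not the paper's code): growing a LoRA adapter per task on top of
# a frozen linear projection, in the spirit of the CompoFormer-Grow variant.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus one low-rank adapter per task."""
    def __init__(self, in_dim: int, out_dim: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        self.base.weight.requires_grad_(False)   # shared weights stay fixed
        self.base.bias.requires_grad_(False)
        self.rank = rank
        self.adapters = nn.ModuleList()           # one (A, B) pair per task

    def grow_task(self):
        """Add a new low-rank adapter when a new task arrives."""
        in_dim, out_dim = self.base.in_features, self.base.out_features
        adapter = nn.ParameterDict({
            "A": nn.Parameter(torch.randn(self.rank, in_dim) * 0.01),
            "B": nn.Parameter(torch.zeros(out_dim, self.rank)),  # zero init: no drift at start
        })
        self.adapters.append(adapter)

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        out = self.base(x)
        a = self.adapters[task_id]
        return out + x @ a["A"].t() @ a["B"].t()  # low-rank, task-specific update
```

Because only the newest adapter receives gradients, parameters belonging to earlier tasks are never overwritten, which is how growth-based methods structurally prevent forgetting.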
An integral component of CompoFormer is the self-composing policy module, which embeds natural-language task descriptions with a frozen Sentence-BERT encoder. These embeddings drive an attention mechanism that dynamically composes the outputs of previously learned policies that are relevant to the new task. This selective knowledge transfer reduces cross-task interference, improving learning efficiency and mitigating forgetting.
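The sketch below illustrates one plausible reading of this mechanism: the new task's Sentence-BERT embedding acts as a query over the embeddings of earlier tasks, and the resulting attention weights mix the frozen previous policies' outputs with the new task's own head. The module name `SelfComposingHead`, the additive combination, and the tensor shapes are assumptions for illustration rather than the paper's exact module.

```python
# Illustrative sketch (assumptions, not the paper's exact module): attention over
# previous policies' outputs, keyed by frozen Sentence-BERT task embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfComposingHead(nn.Module):
    def __init__(self, task_emb_dim: int, hidden_dim: int, act_dim: int):
        super().__init__()
        self.query_proj = nn.Linear(task_emb_dim, hidden_dim)
        self.key_proj = nn.Linear(task_emb_dim, hidden_dim)
        self.own_head = nn.Linear(hidden_dim, act_dim)   # new task's own policy head

    def forward(self, h, new_task_emb, prev_task_embs, prev_actions):
        """
        h:              (B, hidden_dim) Transformer features for the current state.
        new_task_emb:   (task_emb_dim,) frozen Sentence-BERT embedding of the new task.
        prev_task_embs: (K, task_emb_dim) embeddings of the K previous tasks.
        prev_actions:   (B, K, act_dim) outputs of the K frozen previous policies on h.
        """
        own = self.own_head(h)                                   # (B, act_dim)
        if prev_task_embs.numel() == 0:
            return own                                           # first task: nothing to reuse
        q = self.query_proj(new_task_emb)                        # (hidden_dim,)
        k = self.key_proj(prev_task_embs)                        # (K, hidden_dim)
        scores = k @ q / (k.shape[-1] ** 0.5)                    # (K,) similarity scores
        w = F.softmax(scores, dim=0)                             # how much each old policy contributes
        composed = torch.einsum("k,bka->ba", w, prev_actions)    # (B, act_dim)
        return own + composed
```

Semantically related tasks receive larger attention weights, so their policies contribute more to the composed action, while unrelated policies are effectively ignored.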
Experimental Results
The experimental evaluation on the proposed Offline Continual World (OCW) benchmark showcases CompoFormer's advances over existing methods. The results demonstrate significant reductions in forgetting across diverse task sequences, especially in longer sequences such as OCW20. CompoFormer's ability to balance plasticity (learning new tasks) and stability (retaining prior knowledge) is evident, and it handles the stability-plasticity tradeoff more comprehensively than regularization-based, structure-based, and rehearsal-based baselines. Notably, the LoRA and pruning mechanisms within the Transformer architecture are a key differentiator behind this performance.
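For readers unfamiliar with how forgetting is scored, Continual World-style benchmarks typically report the average drop from a task's performance just after it was learned to its performance at the end of the whole sequence. The sketch below assumes this standard formulation and a success-rate matrix; the variable names and example numbers are illustrative, not results from the paper.

```python
# Hedged sketch of the standard continual-learning forgetting metric
# (Continual World-style); matrix name, shapes, and values are illustrative.
import numpy as np

def average_forgetting(perf: np.ndarray) -> float:
    """
    perf[i, j] = performance on task j measured right after training on task i.
    Forgetting for task j is perf[j, j] - perf[T-1, j]: how much the score earned
    just after learning task j has eroded by the end of the sequence.
    """
    num_tasks = perf.shape[0]
    drops = [perf[j, j] - perf[num_tasks - 1, j] for j in range(num_tasks - 1)]
    return float(np.mean(drops))

# Example: three tasks whose success rates decay as later tasks are learned.
perf = np.array([[0.9, 0.0, 0.0],
                 [0.7, 0.8, 0.0],
                 [0.6, 0.7, 0.85]])
print(average_forgetting(perf))  # (0.9-0.6 + 0.8-0.7) / 2 = 0.2
```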
Implications and Future Directions
The paper marks a significant step toward more generalizable and efficient continual learning agents within the offline RL paradigm. CompoFormer's framework invites further exploration of modular architectures and their capacity to grow dynamically while maintaining efficiency and performance. The model's effectiveness in leveraging semantic task correlations also opens avenues for more advanced attention mechanisms and meta-learning strategies that improve adaptability across diverse task distributions.
Future research could focus on reducing the computational cost of such growing architectures, particularly for real-world applications that require near-constant adaptation and robustness. Additionally, exploring the transferability of these adaptive structures to online environments, or integrating them with multimodal tasks, could further broaden their applicability and scope.
In conclusion, "Continual Task Learning Through Adaptive Policy Self-Composition" presents a compelling approach to overcoming the intrinsic challenges of continual task learning in offline settings, with findings that hold significant implications for the development of robust, lifelong learning systems in artificial intelligence.