Policy Distillation (1511.06295v2)

Published 19 Nov 2015 in cs.LG

Abstract: Policies for complex visual tasks have been successfully learned with deep reinforcement learning, using an approach called deep Q-networks (DQN), but relatively large (task-specific) networks and extensive training are needed to achieve good performance. In this work, we present a novel method called policy distillation that can be used to extract the policy of a reinforcement learning agent and train a new network that performs at the expert level while being dramatically smaller and more efficient. Furthermore, the same method can be used to consolidate multiple task-specific policies into a single policy. We demonstrate these claims using the Atari domain and show that the multi-task distilled agent outperforms the single-task teachers as well as a jointly-trained DQN agent.

Citations (654)

Summary

  • The paper’s main contribution is the use of policy distillation to compress and consolidate multiple deep RL policies into a single effective model.
  • It demonstrates up to 15-fold reduction in model size and enhanced multi-task efficiency in challenging environments like Atari.
  • The study employs a KL divergence loss with a softened target to enable robust online distillation, facilitating efficient RL deployments on limited-resource devices.

Policy Distillation

The paper "Policy Distillation" introduces a significant advancement in the optimization of deep reinforcement learning (RL) models through a method known as policy distillation. This approach addresses several challenges prevalent in reinforcement learning, particularly those encountered when using deep Q-networks (DQN).

Methodology

The authors propose policy distillation as a solution to condense the policy of a reinforcement learning agent into a more compact form without losing its effectiveness. The method focuses on extracting a policy from a DQN and training a smaller network to replicate the expert-level performance of the original, larger network. Moreover, this distilled network can integrate multiple task-specific policies into a single, multi-task capable policy.
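
In essence, the student is trained with simple supervised learning on targets generated by the teacher rather than by reinforcement learning. The sketch below illustrates the idea for a single task in PyTorch; the layer sizes, the dummy replay data, and the temperature value are illustrative assumptions, not the paper's actual Atari setup.

```python
# Minimal single-task policy-distillation sketch (illustrative, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, NUM_ACTIONS, TAU = 128, 4, 0.01  # TAU: temperature applied to teacher targets

# Stand-ins for a trained DQN teacher and a much smaller student network.
teacher = nn.Sequential(nn.Linear(STATE_DIM, 256), nn.ReLU(), nn.Linear(256, NUM_ACTIONS))
student = nn.Sequential(nn.Linear(STATE_DIM, 32), nn.ReLU(), nn.Linear(32, NUM_ACTIONS))
optimizer = torch.optim.RMSprop(student.parameters(), lr=1e-4)

# "Replay memory": states visited under the teacher's policy, stored together
# with the teacher's Q-value outputs as distillation targets.
states = torch.randn(1024, STATE_DIM)
with torch.no_grad():
    teacher_q = teacher(states)

for step in range(100):
    idx = torch.randint(0, states.size(0), (32,))      # sample a minibatch
    student_q = student(states[idx])
    # Match the student to the teacher's temperature-softened action distribution.
    loss = F.kl_div(F.log_softmax(student_q, dim=1),
                    F.softmax(teacher_q[idx] / TAU, dim=1),
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```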

Experimental Results

The research highlights the application of policy distillation in the Atari domain. Key findings are:

  • Compression: Networks can be reduced in size by up to 15 times with no performance degradation. This compression is particularly relevant when deploying models where computational resources are limited.
  • Multi-task Distillation: By consolidating multiple policies into a single network, the distilled agent not only matched but outperformed the original task-specific DQN agents, and it also outpaced a single DQN agent trained jointly on multiple tasks (see the sketch after this list). This result suggests improved efficiency and generalization when learning multiple tasks concurrently.
  • Online Learning: The paper also discusses online distillation, in which the student continuously tracks the teacher's evolving policy during training, offering an efficient way to follow and compress the best-performing strategy as it improves.
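
The consolidation step can be pictured as a single student with a shared trunk and one output head per game, trained on minibatches drawn in turn from task-specific replay memories filled with each teacher's outputs. The sketch below is an assumed PyTorch rendering of that setup; the head layout, layer sizes, and random stand-in data are illustrative rather than taken from the paper.

```python
# Illustrative multi-task distillation sketch: one student, several teachers.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, TAU = 128, 0.01
ACTIONS_PER_TASK = [4, 6, 3]            # e.g. different action sets per game

class MultiTaskStudent(nn.Module):
    """Shared trunk with a separate output head for each task."""
    def __init__(self):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(64, a) for a in ACTIONS_PER_TASK])

    def forward(self, x, task_id):
        return self.heads[task_id](self.trunk(x))

student = MultiTaskStudent()
optimizer = torch.optim.RMSprop(student.parameters(), lr=1e-4)

# One replay memory per task: states plus that task's teacher Q-values
# (random tensors stand in for data gathered by each single-task teacher).
memories = [(torch.randn(512, STATE_DIM), torch.randn(512, a))
            for a in ACTIONS_PER_TASK]

for step in range(300):
    task_id = step % len(memories)                     # cycle through the tasks
    states, teacher_q = memories[task_id]
    idx = torch.randint(0, states.size(0), (32,))
    student_q = student(states[idx], task_id)
    loss = F.kl_div(F.log_softmax(student_q, dim=1),   # KL to softened teacher targets
                    F.softmax(teacher_q[idx] / TAU, dim=1),
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```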

Architecture and Loss Functions

The research compares several loss functions for matching the student to the teacher, establishing that the Kullback-Leibler (KL) divergence against a temperature-"softened" teacher distribution trains most effectively. This choice transfers the teacher's action preferences to the student more faithfully and stabilizes training.
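
Concretely, the paper weighs three candidate distillation losses: negative log likelihood on the teacher's greedy action, mean-squared error on the Q-values, and KL divergence against a temperature-scaled teacher distribution. The PyTorch sketch below shows one plausible way to write them, assuming teacher_q and student_q are (batch, num_actions) tensors; the default temperature is illustrative.

```python
# Sketch of the three candidate distillation losses (shapes and values are illustrative).
import torch
import torch.nn.functional as F

def nll_loss(student_q, teacher_q):
    """Negative log likelihood of the teacher's greedy (highest-valued) action."""
    best_action = teacher_q.argmax(dim=1)
    return F.cross_entropy(student_q, best_action)

def mse_loss(student_q, teacher_q):
    """Direct regression of the student's Q-values onto the teacher's."""
    return F.mse_loss(student_q, teacher_q)

def kl_loss(student_q, teacher_q, tau=0.01):
    """KL divergence to the temperature-softened teacher distribution,
    the variant the paper finds most effective."""
    return F.kl_div(F.log_softmax(student_q, dim=1),
                    F.softmax(teacher_q / tau, dim=1),
                    reduction="batchmean")
```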

Implications

The implications of policy distillation extend to both theoretical and practical aspects of AI development:

  • Theoretical Implications: The work demonstrates that distillation, a technique developed for supervised learning, generalizes effectively to sequential decision-making tasks.
  • Practical Implications: The ability to distill and compress models opens avenues for deploying RL agents on devices with constrained computing resources. It also simplifies the maintenance of RL systems by consolidating multiple policies into a single, more manageable model.

Future Directions

Future research could explore further enhancements to distillation techniques, potentially integrating them with RL algorithms beyond DQN. Extending the approach to continuous action spaces and more complex environments also remains a promising direction.

In summary, this paper provides valuable insights into optimizing RL through policy distillation, presenting a robust and efficient framework for improving policy representation and deployment in real-world applications.
