Device Placement Optimization with Reinforcement Learning (1706.04972v2)

Published 13 Jun 2017 in cs.LG and cs.AI

Abstract: The past few years have witnessed a growth in size and computational requirements for training and inference with neural networks. Currently, a common approach to address these requirements is to use a heterogeneous distributed environment with a mixture of hardware devices such as CPUs and GPUs. Importantly, the decision of placing parts of the neural models on devices is often made by human experts based on simple heuristics and intuitions. In this paper, we propose a method which learns to optimize device placement for TensorFlow computational graphs. Key to our method is the use of a sequence-to-sequence model to predict which subsets of operations in a TensorFlow graph should run on which of the available devices. The execution time of the predicted placements is then used as the reward signal to optimize the parameters of the sequence-to-sequence model. Our main result is that on Inception-V3 for ImageNet classification, and on RNN LSTM for language modeling and neural machine translation, our model finds non-trivial device placements that outperform hand-crafted heuristics and traditional algorithmic methods.

Citations (426)

Summary

  • The paper introduces a novel RL method using a sequence-to-sequence model with attention to optimize task placement across CPUs and GPUs.
  • The methodology employs policy gradients with asynchronous distributed training and a co-location heuristic, using measured execution time as the reward signal to be minimized.
  • The results show significant speed-ups, with placements up to 23.5% faster than expert-crafted ones on multi-GPU setups.

Device Placement Optimization with Reinforcement Learning: An Overview

The paper "Device Placement Optimization with Reinforcement Learning" investigates an automated approach to optimize the allocation of computational tasks across diverse hardware devices like CPUs and GPUs. This problem arises from the increasing complexity of neural networks, which require significant computational resources for both training and inference. Traditionally, the task of device placement has been handled by human experts using heuristic methods. This paper proposes a novel method utilizing reinforcement learning (RL), specifically a sequence-to-sequence model, to enhance device placement efficiency for TensorFlow computational graphs.

Methodology

The core of the paper's methodology is a sequence-to-sequence model with an attentional mechanism that predicts placements of a neural network's operations across the available devices. The model's parameters are optimized with the REINFORCE policy-gradient algorithm, using the measured execution time of each sampled placement as the reward signal to be minimized. The approach also incorporates a co-location heuristic that groups certain operations together, shortening the sequence the model must place.
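Concretely, if $\pi_\theta(\mathcal{P} \mid \mathcal{G})$ denotes the probability the sequence-to-sequence policy assigns to a placement $\mathcal{P}$ of computational graph $\mathcal{G}$, the REINFORCE estimator this approach relies on can be sketched as follows; the exact reward shaping applied to the measured runtime in the paper is abstracted into $R$ here.

```latex
J(\theta) = \mathbb{E}_{\mathcal{P} \sim \pi_\theta(\cdot \mid \mathcal{G})}\left[ R(\mathcal{P}) \right],
\qquad
\nabla_\theta J(\theta) \approx \frac{1}{K} \sum_{k=1}^{K}
  \left( R(\mathcal{P}_k) - B \right)\, \nabla_\theta \log \pi_\theta(\mathcal{P}_k \mid \mathcal{G})
```

Here $R(\mathcal{P})$ is a reward that decreases as the measured execution time of the sampled placement grows (so maximizing $J$ minimizes runtime), $K$ is the number of placements sampled per update, and $B$ is a baseline, typically a moving average of recent rewards, used only to reduce the variance of the gradient estimate.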

Key Procedural Details:

  • Training Process: The model samples different placements and evaluates them using a reward signal based on their execution time. A baseline adjustment technique reduces variance in the policy gradient estimates (a minimal sketch of this loop appears after this list).
  • Architecture: The encoder RNN ingests operation sequences with embedded information about operation type, output size, and adjacency. The decoder RNN, through an attentional mechanism, outputs device assignments.
  • Distributed Training: Asynchronous distributed training is leveraged to expedite the learning process, employing multiple controllers and workers in a parallelized setup.
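To make the training process concrete, below is a minimal Python sketch of the outer REINFORCE loop. It is not the authors' implementation: the policy is hidden behind a hypothetical Seq2SeqPlacementPolicy interface, measure_runtime stands in for executing and timing the placed TensorFlow graph, and co-location is reduced to a toy grouping function. Only the overall control flow and the moving-average baseline mirror the procedure described above.

```python
import random

class Seq2SeqPlacementPolicy:
    """Placeholder for the attentional seq2seq policy over device assignments."""

    def __init__(self, num_devices):
        self.num_devices = num_devices

    def sample_placement(self, groups):
        """Return one device index per op group plus the log-probability of the sample."""
        placement = [random.randrange(self.num_devices) for _ in groups]
        log_prob = 0.0  # a real policy would sum log p(device_i | prefix) over the sequence
        return placement, log_prob

    def reinforce_update(self, log_prob, advantage, lr=1e-3):
        """Apply the policy-gradient step  lr * advantage * grad(log_prob)."""
        pass  # framework-specific; left abstract in this sketch


def colocate(ops, max_groups):
    """Toy stand-in for the co-location heuristic: chunk ops into at most max_groups groups."""
    chunk = max(1, len(ops) // max_groups)
    return [ops[i:i + chunk] for i in range(0, len(ops), chunk)]


def measure_runtime(groups, placement):
    """Stand-in for running the placed graph on real hardware and timing one step."""
    return 1.0 + 0.01 * sum(placement)  # dummy value; a real system measures wall-clock time


def train_placer(ops, num_devices=4, max_groups=32, steps=1000, samples_per_step=8):
    groups = colocate(ops, max_groups)
    policy = Seq2SeqPlacementPolicy(num_devices)
    baseline = None
    best_runtime, best_placement = float("inf"), None

    for _ in range(steps):
        rewards, log_probs = [], []
        for _ in range(samples_per_step):
            placement, log_prob = policy.sample_placement(groups)
            runtime = measure_runtime(groups, placement)
            rewards.append(-runtime)          # lower runtime => higher reward
            log_probs.append(log_prob)
            if runtime < best_runtime:
                best_runtime, best_placement = runtime, placement

        # Moving-average baseline reduces the variance of the REINFORCE gradient.
        mean_reward = sum(rewards) / len(rewards)
        baseline = mean_reward if baseline is None else 0.9 * baseline + 0.1 * mean_reward
        for reward, log_prob in zip(rewards, log_probs):
            policy.reinforce_update(log_prob, advantage=reward - baseline)

    return best_placement, best_runtime
```

In the system described in the paper, many such loops run in parallel: several controllers asynchronously drive pools of workers that execute and time sampled placements, which is what the Distributed Training item above refers to.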

Results and Comparisons

The method was tested on three prominent deep learning models: a Recurrent Neural Network Language Model (RNNLM), Neural Machine Translation (NMT) with attention, and Inception-V3. The RL-based strategy was compared with baselines such as single-device placements, heuristic graph-partitioning methods like MinCut, and expert-crafted setups.

Numerical Findings:

  • For RNNLM, the RL-based placement matched the efficiency of the best single-GPU placement while achieving a significant speed-up over hand-crafted multi-device placements.
  • For NMT, the RL model achieved up to 23.5% speed increase on 2 GPUs and 20.6% on 4 GPUs, surpassing expert-designed placements.
  • Inception-V3 results showed up to 19.0% improvement, highlighting the RL approach's superior balancing of computational and communication costs.

Implications and Future Directions

These results demonstrate that the proposed RL-based approach can significantly improve the efficiency of device placement, reducing reliance on heuristics and expert knowledge. This could lead to broader applications in optimizing neural networks across diverse architectures and hardware configurations. Future research could explore extensions to more complex architectures and ways to reduce noise in the training signal.

By automating device placements, this paper paves the way for more adaptive, autonomous systems capable of optimizing their computational resource use. This aligns with ongoing advancements in artificial intelligence, where systems increasingly self-improve through environmental feedback. Future exploration may delve into more sophisticated RL models or the integration of cost models that better capture real-time hardware constraints and dynamic execution environments.
