The paper introduces a novel 2-step pruning framework designed to enhance device-edge cooperative inference for Deep Neural Networks (DNNs). The key idea is to strategically prune the DNN model so that both the computation overhead on the resource-constrained mobile device and the wireless transmission overhead are minimized.
The authors address the challenge of offloading intermediate data between DNN layers, which can lead to high transmission latency due to bandwidth limitations. They propose a framework that prunes the DNN model in two steps, focusing on reducing computation workload and wireless transmission workload independently. The method iteratively removes unimportant convolutional filters during the training phase, generating a series of pruned models that can be automatically selected based on latency and accuracy requirements. Furthermore, the inclusion of intermediate data coding offers additional reduction in transmission workload.
The proposed framework consists of three main stages:
- Offline Training and Pruning:
- The authors adopt an iterative pruning workflow inspired by NVIDIA's channel pruning technique [molchanov2016pruning].
- In the first pruning step, the entire network is pruned to reduce the overall computation workload: filters are ranked using a first-order Taylor expansion of the network loss function, and the least important ones are removed (a minimal sketch of this ranking criterion follows the stage list below).
- The second pruning step focuses on reducing the transmission workload by individually pruning each layer of the pruned network obtained after the first step. This step generates a series of pruned network models, each corresponding to a specific partition point.
- Online Model and Partition Point Selection:
- This stage involves selecting the best pruned network model and its corresponding partition point that yield the lowest end-to-end latency under a given accuracy constraint (a minimal selection sketch follows the latency model below).
- The selection process takes into account the layer-level output data sizes and computation latency profiles of the pruned models, the tolerable accuracy loss, and system factors such as the wireless channel conditions and the computation capabilities of the mobile device and the edge server.
The computation capability ratio, $\gamma$, is defined as:
$\gamma = \frac{t^{\mathrm{mobile}}_{i}}{t^{\mathrm{edge}}_{i}}$
where:
- $t^{\mathrm{mobile}}_{i}$ is the computation latency of the $i$-th layer on the mobile device.
- $t^{\mathrm{edge}}_{i}$ is the computation latency of the $i$-th layer on the edge server.
- The average upload rate, $R$, is used to calculate the transmission latency when partitioning at the $i$-th layer:
$t^{\mathrm{transmission}}_{i} = \frac{D_i}{R}$
where:
- $D_i$ is the volume of the $i$-th layer output data to be transmitted.
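Putting the latency model together, a pruned model and partition point could be selected as in the sketch below. The per-layer latency and data-size profiles, the assumption that $\gamma$ is constant across layers, and the helper names are illustrative choices rather than the authors' implementation.

```python
from typing import List

def end_to_end_latency(t_edge: List[float], D: List[float],
                       gamma: float, R: float, i: int) -> float:
    """End-to-end latency when partitioning after the i-th layer.

    t_edge[k] : computation latency of layer k on the edge server (s)
    D[k]      : output data volume of layer k (bits)
    gamma     : computation capability ratio (assumed constant per layer)
    R         : average upload rate (bits/s)
    """
    t_mobile = gamma * sum(t_edge[:i + 1])   # layers 0..i run on the mobile device
    t_tx = D[i] / R                          # transmit the i-th layer's output
    t_server = sum(t_edge[i + 1:])           # remaining layers run on the edge server
    return t_mobile + t_tx + t_server

def select_model_and_partition(models, gamma: float, R: float,
                               max_accuracy_loss: float):
    """Return (latency, model_index, partition_point) with the lowest
    end-to-end latency among pruned models whose accuracy loss is tolerable.

    Each entry of `models` is a dict with illustrative profiling data:
    {'accuracy_loss': float, 't_edge': [...], 'D': [...]}."""
    best = None
    for m_idx, m in enumerate(models):
        if m['accuracy_loss'] > max_accuracy_loss:
            continue
        for i in range(len(m['t_edge'])):
            lat = end_to_end_latency(m['t_edge'], m['D'], gamma, R, i)
            if best is None or lat < best[0]:
                best = (lat, m_idx, i)
    return best
```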
- Deployment:
- The mobile device downloads the front-end part of the selected pruned model.
- The device performs local computation up to the partition point and offloads the remaining computation to the edge server via the wireless channel (a minimal split sketch is given below).
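For the first pruning step referenced above, a minimal sketch of the first-order Taylor ranking criterion is given below, assuming access to each convolutional layer's activation and its loss gradient; the layer-wise normalization and the toy loss are illustrative choices, not the authors' exact procedure.

```python
import torch
import torch.nn as nn

def taylor_filter_importance(activation: torch.Tensor,
                             grad: torch.Tensor) -> torch.Tensor:
    """First-order Taylor importance score for each filter of one conv layer.

    activation : layer output, shape (batch, channels, H, W)
    grad       : dLoss/dActivation, same shape
    Returns one non-negative score per output channel (filter)."""
    per_sample = (activation * grad).mean(dim=(2, 3)).abs()  # average over spatial dims
    scores = per_sample.mean(dim=0)                          # average over the batch
    return scores / (scores.norm() + 1e-8)                   # layer-wise L2 normalization

# Illustrative usage on a single conv layer with random data.
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
x = torch.randn(8, 3, 32, 32)
act = conv(x)
act.retain_grad()                 # keep dLoss/dActivation for the ranking
loss = act.pow(2).mean()          # stand-in for the task loss
loss.backward()
print(taylor_filter_importance(act, act.grad))  # lowest-scoring filters are pruned first
```

The lowest-scoring filters are removed, the network is fine-tuned, and this rank-prune-fine-tune cycle repeats until the target workload reduction is reached.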
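For the deployment stage, a pruned model expressed as an `nn.Sequential` could be split at the selected partition point as sketched below; the transport between the device and the edge server is left abstract, and the helper name is an assumption.

```python
import torch.nn as nn

def split_at_partition(model: nn.Sequential, partition_point: int):
    """Split a sequential model into the device-side front end
    (layers 0..partition_point) and the server-side back end."""
    layers = list(model.children())
    front = nn.Sequential(*layers[:partition_point + 1])
    back = nn.Sequential(*layers[partition_point + 1:])
    return front, back

# Mobile device: download `front`, run it locally, and offload the intermediate data.
#   intermediate = front(x)        # local computation up to the partition point
#   upload(intermediate)           # hypothetical transfer over the wireless link
# Edge server: prediction = back(received_intermediate)
```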
The authors conducted experiments using PyTorch with the VGG (Very Deep Convolutional Networks) model [simonyan2014very] and the CIFAR-10 dataset [krizhevsky2014cifar]. They simulated various system configurations by varying the computation capability ratio ($\gamma$) between 0.1 and 100 and using typical average upload rates ($R$) for 3G, 4G, and WiFi networks. The results demonstrate that the proposed 2-step pruning framework can achieve up to a 25.6× reduction in transmission workload and a 6.01× acceleration in computation compared to partitioning the original DNN model without pruning. The framework also achieves up to a 4.81× reduction in end-to-end latency in WiFi environments.
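The layer-level output data sizes that feed the selection stage can be collected with simple forward hooks; the sketch below uses a small VGG-style stack on CIFAR-sized inputs as a stand-in for the actual model used in the experiments.

```python
import torch
import torch.nn as nn

# Illustrative VGG-style feature extractor for 32x32 CIFAR-10 inputs
# (a stand-in for the full VGG model used in the paper's experiments).
features = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

output_bits = []
def record_output_size(module, inputs, output):
    # Assume the intermediate data is quantized to 8 bits per element before upload.
    output_bits.append(output[0].numel() * 8)

hooks = [m.register_forward_hook(record_output_size) for m in features]
with torch.no_grad():
    features(torch.randn(1, 3, 32, 32))
for h in hooks:
    h.remove()

print(output_bits)  # candidate D_i values (bits), one per candidate partition point
```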
Additionally, the authors investigated bandwidth-accuracy tradeoffs by assuming that the network is partitioned at one of the max-pooling layers. They showed that layers in the front-end part of the network are more sensitive to pruning than those in the back-end. They also evaluated the extra compression provided by adding a lossless PNG encoder and decoder at each max-pooling layer. Finally, they compared the proposed 2-step pruning framework with the feature coding approach [ko2018edge], demonstrating that the proposed approach outperforms feature coding, especially when partitioning at the back-end part of the network or when a high compression ratio is desired.
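The extra compression from adding a lossless encoder at a max-pooling layer can be approximated by quantizing the feature map to 8 bits, tiling its channels into a single grayscale image, and PNG-encoding it. The sketch below uses NumPy and Pillow and is an illustrative approximation rather than the authors' coder.

```python
import io
import numpy as np
from PIL import Image

def png_compressed_bits(feature_map: np.ndarray) -> int:
    """Approximate transmitted size (bits) of a (C, H, W) feature map after
    8-bit quantization and lossless PNG encoding of the channel-tiled image."""
    c, h, w = feature_map.shape
    lo, hi = feature_map.min(), feature_map.max()
    q = (255 * (feature_map - lo) / (hi - lo + 1e-8)).astype(np.uint8)  # quantize to uint8
    tiled = q.reshape(c * h, w)            # stack channels vertically into one 2-D image
    buf = io.BytesIO()
    Image.fromarray(tiled, mode="L").save(buf, format="PNG")            # lossless coding
    return 8 * buf.getbuffer().nbytes

# Toy comparison against sending the raw 8-bit feature map; real post-ReLU
# feature maps are sparse and compress far better than random data.
fmap = np.random.rand(64, 8, 8).astype(np.float32)
print(8 * fmap.size, png_compressed_bits(fmap))
```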