The paper introduces a novel 2-step pruning framework designed to enhance device-edge cooperative inference for Deep Neural Networks (DNNs). The key idea is to strategically prune the DNN model so that both the computation overhead on the resource-constrained mobile device and the wireless transmission overhead are minimized.
The authors address the challenge of offloading intermediate data between DNN layers, which can lead to high transmission latency due to bandwidth limitations. They propose a framework that prunes the DNN model in two steps, focusing on reducing computation workload and wireless transmission workload independently. The method iteratively removes unimportant convolutional filters during the training phase, generating a series of pruned models that can be automatically selected based on latency and accuracy requirements. Furthermore, the inclusion of intermediate data coding offers additional reduction in transmission workload.
The proposed framework consists of three main stages:
- Offline Training and Pruning:
- The authors adopt an iterative pruning workflow inspired by NVIDIA's channel pruning technique [molchanov2016pruning].
- In the first pruning step, the entire network is pruned to reduce the overall computation workload: filters are ranked using a first-order Taylor expansion of the network loss function, and the least important ones are removed (a minimal sketch of this ranking criterion follows the stage list below).
- The second pruning step focuses on reducing the transmission workload by individually pruning each layer of the pruned network obtained after the first step. This step generates a series of pruned network models, each corresponding to a specific partition point.
- Online Model and Partition Point Selection:
- This stage involves selecting the best pruned network model and its corresponding partition point that yield the lowest end-to-end latency under a given accuracy constraint (a minimal selection sketch follows the latency model below).
- The selection process takes into account the layer-level output data sizes and computation latency profiles of the pruned models, the tolerable accuracy loss, and system factors such as the wireless channel conditions and the computation capabilities of the mobile device and the edge server.
The computation capability ratio, $\gamma$, is defined as:
$\gamma = \frac{t^{\mathrm{mobile}}_{i}}{t^{\mathrm{edge}}_{i}}$
where:
- $t^{\mathrm{mobile}}_{i}$ is the computation latency of the $i$-th layer on the mobile device.
- $t^{\mathrm{edge}}_{i}$ is the computation latency of the $i$-th layer on the edge server.
- The average upload rate, $R$, is used to calculate the transmission latency when partitioning at the $i$-th layer:
$t^{\mathrm{transmission}}_{i} = \frac{D_i}{R}$
where:
- $D_i$ is the volume of the $i$-th layer output data to be transmitted.
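Putting the latency model together, a pruned model and partition point could be selected as in the sketch below. The per-layer latency and data-size profiles, the assumption that $\gamma$ is constant across layers, and the helper names are illustrative choices rather than the authors' implementation.

```python
from typing import List

def end_to_end_latency(t_edge: List[float], D: List[float],
                       gamma: float, R: float, i: int) -> float:
    """End-to-end latency when partitioning after the i-th layer.

    t_edge[k] : computation latency of layer k on the edge server (s)
    D[k]      : output data volume of layer k (bits)
    gamma     : computation capability ratio (assumed constant per layer)
    R         : average upload rate (bits/s)
    """
    t_mobile = gamma * sum(t_edge[:i + 1])   # layers 0..i run on the mobile device
    t_tx = D[i] / R                          # transmit the i-th layer's output
    t_server = sum(t_edge[i + 1:])           # remaining layers run on the edge server
    return t_mobile + t_tx + t_server

def select_model_and_partition(models, gamma: float, R: float,
                               max_accuracy_loss: float):
    """Return (latency, model_index, partition_point) with the lowest
    end-to-end latency among pruned models whose accuracy loss is tolerable.

    Each entry of `models` is a dict with illustrative profiling data:
    {'accuracy_loss': float, 't_edge': [...], 'D': [...]}."""
    best = None
    for m_idx, m in enumerate(models):
        if m['accuracy_loss'] > max_accuracy_loss:
            continue
        for i in range(len(m['t_edge'])):
            lat = end_to_end_latency(m['t_edge'], m['D'], gamma, R, i)
            if best is None or lat < best[0]:
                best = (lat, m_idx, i)
    return best
```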
- Deployment:
- The mobile device downloads the front-end part of the selected pruned model.
- The device performs local computation up to the partition point and offloads the remaining computation to the edge server via the wireless channel (a minimal split sketch is given below).
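For the first pruning step referenced above, a minimal sketch of the first-order Taylor ranking criterion is given below, assuming access to each convolutional layer's activation and its loss gradient; the layer-wise normalization and the toy loss are illustrative choices, not the authors' exact procedure.

```python
import torch
import torch.nn as nn

def taylor_filter_importance(activation: torch.Tensor,
                             grad: torch.Tensor) -> torch.Tensor:
    """First-order Taylor importance score for each filter of one conv layer.

    activation : layer output, shape (batch, channels, H, W)
    grad       : dLoss/dActivation, same shape
    Returns one non-negative score per output channel (filter)."""
    per_sample = (activation * grad).mean(dim=(2, 3)).abs()  # average over spatial dims
    scores = per_sample.mean(dim=0)                          # average over the batch
    return scores / (scores.norm() + 1e-8)                   # layer-wise L2 normalization

# Illustrative usage on a single conv layer with random data.
conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
x = torch.randn(8, 3, 32, 32)
act = conv(x)
act.retain_grad()                 # keep dLoss/dActivation for the ranking
loss = act.pow(2).mean()          # stand-in for the task loss
loss.backward()
print(taylor_filter_importance(act, act.grad))  # lowest-scoring filters are pruned first
```

The lowest-scoring filters are removed, the network is fine-tuned, and this rank-prune-fine-tune cycle repeats until the target workload reduction is reached.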
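For the deployment stage, a pruned model expressed as an `nn.Sequential` could be split at the selected partition point as sketched below; the transport between the device and the edge server is left abstract, and the helper name is an assumption.

```python
import torch.nn as nn

def split_at_partition(model: nn.Sequential, partition_point: int):
    """Split a sequential model into the device-side front end
    (layers 0..partition_point) and the server-side back end."""
    layers = list(model.children())
    front = nn.Sequential(*layers[:partition_point + 1])
    back = nn.Sequential(*layers[partition_point + 1:])
    return front, back

# Mobile device: download `front`, run it locally, and offload the intermediate data.
#   intermediate = front(x)        # local computation up to the partition point
#   upload(intermediate)           # hypothetical transfer over the wireless link
# Edge server: prediction = back(received_intermediate)
```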
The authors conducted experiments using PyTorch with the VGG (Very Deep Convolutional Networks) model [simonyan2014very] and the CIFAR-10 dataset [krizhevsky2014cifar]. They simulated various system configurations by varying the computation capability ratio ($\gamma$) between 0.1 and 100 and using typical average upload rates ($R$) for 3G, 4G, and WiFi networks. The results demonstrate that the proposed 2-step pruning framework can achieve up to a 25.6× reduction in transmission workload and a 6.01× acceleration in computation compared to partitioning the original DNN model without pruning. The framework also achieves up to a 4.81× reduction in end-to-end latency in WiFi environments.
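The layer-level output data sizes that feed the selection stage can be collected with simple forward hooks; the sketch below uses a small VGG-style stack on CIFAR-sized inputs as a stand-in for the actual model used in the experiments.

```python
import torch
import torch.nn as nn

# Illustrative VGG-style feature extractor for 32x32 CIFAR-10 inputs
# (a stand-in for the full VGG model used in the paper's experiments).
features = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

output_bits = []
def record_output_size(module, inputs, output):
    # Assume the intermediate data is quantized to 8 bits per element before upload.
    output_bits.append(output[0].numel() * 8)

hooks = [m.register_forward_hook(record_output_size) for m in features]
with torch.no_grad():
    features(torch.randn(1, 3, 32, 32))
for h in hooks:
    h.remove()

print(output_bits)  # candidate D_i values (bits), one per candidate partition point
```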
Additionally, the authors investigated bandwidth-accuracy tradeoffs by assuming that the network is partitioned at one of the max-pooling layers. They showed that layers in the front-end part of the network are more sensitive to pruning than those in the back-end. They also evaluated the extra compression provided by adding a lossless PNG encoder and decoder at each max-pooling layer. Finally, they compared the proposed 2-step pruning framework with the feature coding approach [ko2018edge], demonstrating that the proposed approach outperforms feature coding, especially when partitioning at the back-end part of the network or when a high compression ratio is desired.
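The extra compression from adding a lossless encoder at a max-pooling layer can be approximated by quantizing the feature map to 8 bits, tiling its channels into a single grayscale image, and PNG-encoding it. The sketch below uses NumPy and Pillow and is an illustrative approximation rather than the authors' coder.

```python
import io
import numpy as np
from PIL import Image

def png_compressed_bits(feature_map: np.ndarray) -> int:
    """Approximate transmitted size (bits) of a (C, H, W) feature map after
    8-bit quantization and lossless PNG encoding of the channel-tiled image."""
    c, h, w = feature_map.shape
    lo, hi = feature_map.min(), feature_map.max()
    q = (255 * (feature_map - lo) / (hi - lo + 1e-8)).astype(np.uint8)  # quantize to uint8
    tiled = q.reshape(c * h, w)            # stack channels vertically into one 2-D image
    buf = io.BytesIO()
    Image.fromarray(tiled, mode="L").save(buf, format="PNG")            # lossless coding
    return 8 * buf.getbuffer().nbytes

# Toy comparison against sending the raw 8-bit feature map; real post-ReLU
# feature maps are sparse and compress far better than random data.
fmap = np.random.rand(64, 8, 8).astype(np.float32)
print(8 * fmap.size, png_compressed_bits(fmap))
```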