
Industrial AI blog

Improving device-edge cooperative inference of deep learning via 2-step pruning

19 October 2020

Yang Zhang
Hitachi (China) Research & Development Corporation

Lu Geng
Hitachi (China) Research & Development Corporation

Deep neural networks (DNNs) are state-of-the-art solutions for many machine learning applications and are increasingly run on mobile devices. Because mobile devices are resource-constrained, DNN computation is often offloaded to edge servers. Offloading, however, presents its own challenge: transmitting data over the bandwidth-limited wireless link between the mobile device and the edge server is time-consuming. Previous research on this problem has focused on selecting the partition position, i.e., the layer at which the DNN is split so that the two parts are executed on the mobile device and the edge server, respectively (Figure 1). When the output of the chosen layer is larger than the raw input data, however, transmission latency remains high. Based on our studies, summarized below, we recommend a 2-step pruning framework for partitioning DNNs between mobile devices and edge servers that improves both efficiency and flexibility.
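The partition-position trade-off described above can be made concrete with a small sketch. This is not the authors' code; the per-layer compute costs, output sizes, raw input size, and link bandwidth below are all hypothetical numbers chosen for illustration. The total latency for a candidate cut is the device-side compute up to the cut, plus the time to transmit that layer's output, plus the edge-side compute for the remaining layers.

```python
# Per-layer compute cost on the mobile device (ms) and output size (KB)
# for a hypothetical 5-layer network. All values are illustrative.
device_ms = [4.0, 6.0, 60.0, 50.0, 30.0]
output_kb = [600.0, 300.0, 80.0, 20.0, 4.0]
raw_input_kb = 500.0        # sent if nothing runs on the device
edge_speedup = 5.0          # edge server assumed 5x faster than the device
bandwidth_kb_per_ms = 1.0   # ~8 Mbps wireless link (assumed)

def latency_at_cut(cut):
    """End-to-end latency if layers [0, cut) run on the device, the rest on the edge.

    cut == 0 offloads everything (raw input is transmitted);
    cut == len(device_ms) runs everything on the device (nothing transmitted).
    """
    device = sum(device_ms[:cut])
    edge = sum(device_ms[cut:]) / edge_speedup
    if cut == len(device_ms):
        tx = 0.0
    else:
        tx = (output_kb[cut - 1] if cut > 0 else raw_input_kb) / bandwidth_kb_per_ms
    return device + edge + tx

# Pick the partition point with the lowest total latency.
best = min(range(len(device_ms) + 1), key=latency_at_cut)
for cut in range(len(device_ms) + 1):
    print(cut, round(latency_at_cut(cut), 1))
print("best partition point:", best)
```

With these numbers the early layers have large outputs, so the best cut falls late in the network, where the feature maps have shrunk below the raw input size.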

Pruning reduces the size of a neural network. In our framework, the DNN model is pruned during the training phase by removing unimportant convolutional filters. This can greatly reduce either the transmission workload or the computation workload, and pruned models are automatically selected to satisfy latency and accuracy requirements. Our experiments demonstrate the effectiveness of the framework.
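To illustrate what "removing unimportant convolutional filters" means, here is a minimal sketch that ranks filters by the L1 norm of their weights and keeps only the strongest fraction. The ranking criterion and toy 2×2 filters are assumptions for illustration, not the paper's exact importance measure; in practice the filters would be framework tensors.

```python
def l1_norm(filt):
    """Sum of absolute weights of one (nested-list) filter."""
    return sum(abs(w) for row in filt for w in row)

def prune_filters(filters, keep_ratio):
    """Keep the keep_ratio fraction of filters with the largest L1 norm."""
    ranked = sorted(filters, key=l1_norm, reverse=True)
    n_keep = max(1, int(len(filters) * keep_ratio))
    return ranked[:n_keep]

# Four toy 2x2 "filters"; the two near-zero ones are dropped at keep_ratio=0.5.
layer = [
    [[0.9, -0.8], [0.7, 0.6]],
    [[0.01, 0.02], [-0.01, 0.0]],
    [[-0.5, 0.4], [0.3, -0.6]],
    [[0.0, 0.03], [0.02, -0.01]],
]
kept = prune_filters(layer, keep_ratio=0.5)
print(len(kept), "filters kept out of", len(layer))
```

Removing a filter shrinks both the layer's compute and its output feature map, which is why pruning can target either the computation workload or the transmission workload.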

Figure 1: An illustration of device-edge cooperative inference.

The recommended framework consists of three stages: an offline training-and-pruning stage, an online selection stage, and a deployment stage. In the first stage, two pruning steps are carried out: step 1 prunes the whole network to reduce its computation workload, and step 2 further prunes the layers at candidate partition points to reduce their output (transmission) size. In the second stage, a pruned model and a partition point are selected according to the latency and accuracy constraints. In the third stage, the two resulting parts of the network are deployed on the mobile device and the edge server, respectively.
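The online selection stage can be sketched as a simple lookup over candidates profiled offline. The table below is entirely hypothetical (model names, partition points, accuracies, and latencies are made up); the point is the selection rule: among candidates meeting the accuracy requirement, pick the one with the lowest end-to-end latency.

```python
candidates = [
    # (model variant, partition layer, accuracy %, end-to-end latency ms)
    ("original",     3, 93.5, 120.0),
    ("step1-pruned", 3, 92.8,  60.0),
    ("step2-pruned", 3, 91.9,  25.0),
    ("step2-pruned", 5, 90.4,  18.0),
]

def select(candidates, min_accuracy):
    """Return the lowest-latency candidate meeting the accuracy requirement."""
    feasible = [c for c in candidates if c[2] >= min_accuracy]
    if not feasible:
        return None  # no pruned model satisfies the requirement
    return min(feasible, key=lambda c: c[3])

print(select(candidates, min_accuracy=91.0))
```

Loosening the accuracy requirement lets the selector pick a more aggressively pruned model with lower latency, which is how users trade accuracy against speed.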

Figure 2: Proposed 2-step pruning framework.

The pruning experiments were carried out on VGG [14]. Figure 3 shows the transmission workload and cumulative computation time at each layer for the original VGG model, the model after pruning step 1, and the model after pruning step 2. Compared with the original model, we achieved a 25.6× reduction in transmission workload and a 6.01× acceleration in computation.

Figure 3: Layer level transmission and computation characteristics of the original, step 1 pruned and step 2 pruned VGG.

Table 1 shows the end-to-end latency improvement for three typical mobile networks (3G, 4G and WiFi), with the edge-to-device computation capability ratio set to 5. With our 2-step pruning framework, a 4.8× acceleration can be achieved in WiFi environments.
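Why the speedup differs across networks can be seen with a rough back-of-the-envelope sketch. The bandwidths, the original transmission size, and the compute time below are hypothetical and do not reproduce Table 1; only the workload-reduction factors (25.6× transmission, 6.01× computation) come from the results above. On slow links transmission dominates, so the transmission reduction drives the speedup; on fast links computation dominates, so the speedup approaches the computation factor.

```python
# Assumed link speeds (Mbps); illustrative, not measurements from the paper.
bandwidth_mbps = {"3G": 1.1, "4G": 5.85, "WiFi": 18.88}

def end_to_end_ms(tx_kb, compute_ms, mbps):
    # 1 Mbps == 1 kilobit/ms, so (KB * 8) / Mbps gives transmission time in ms.
    tx_ms = tx_kb * 8 / mbps
    return tx_ms + compute_ms

original = (600.0, 200.0)                  # (transmitted KB, compute ms), assumed
pruned = (600.0 / 25.6, 200.0 / 6.01)      # reduction factors reported above

for net, mbps in bandwidth_mbps.items():
    before = end_to_end_ms(*original, mbps)
    after = end_to_end_ms(*pruned, mbps)
    print(net, "speedup:", round(before / after, 1))
```

The ordering (largest speedup on the slowest link) is the robust observation here; the exact factors depend on the assumed workload numbers.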

Table 1: End-to-end latency improvement under 3 typical mobile networks with selected computation capability ratio.

Based on the above analysis and simulations, we recommend the 2-step pruning framework, through which users can balance their preference between computation workload and transmission workload.

For more details, we encourage you to read our paper, “Improving Device-Edge Cooperative Inference of Deep Learning via 2-Step Pruning”.


Thanks to our co-authors Wenqi Shi, Yunzhong Hou, Sheng Zhou and Zhisheng Niu from Tsinghua University, with whom this research was jointly conducted.


References

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA: MIT Press, 2016, vol. 1.
[2] N. Abbas, Y. Zhang, A. Taherkordi, and T. Skeie, “Mobile edge computing: A survey,” IEEE Internet of Things Journal, vol. 5, no. 1, pp. 450–465, 2018.
[3] Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. Mudge, J. Mars, and L. Tang, “Neurosurgeon: Collaborative intelligence between the cloud and mobile edge,” SIGOPS Oper. Syst. Rev., vol. 51, no. 2, pp. 615–629, Apr. 2017.
[4] S. Teerapittayanon, B. McDanel, and H. Kung, “Distributed deep neural networks over the cloud, the edge and end devices,” in Proc. 37th IEEE International Conference on Distributed Computing Systems (ICDCS), 2017, pp. 328–339.
[5] A. E. Eshratifar, M. S. Abrishami, and M. Pedram, “JointDNN: An efficient training and inference engine for intelligent mobile cloud computing services,” arXiv preprint arXiv:1801.08618, 2018.
[6] E. Li, Z. Zhou, and X. Chen, “Edge intelligence: On-demand deep learning model co-inference with device-edge synergy,” in Proc. 2018 Workshop on Mobile Edge Communications (MECOMM ’18), New York, NY, USA: ACM, 2018, pp. 31–36.
[7] J. H. Ko, T. Na, M. F. Amir, and S. Mukhopadhyay, “Edge-host partitioning of deep neural networks with feature space encoding for resource-constrained internet-of-things platforms,” arXiv preprint arXiv:1802.03835, 2018.
[8] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
[9] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning convolutional neural networks for resource efficient inference,” in International Conference on Learning Representations (ICLR), 2017.
[10] J.-H. Luo, J. Wu, and W. Lin, “ThiNet: A filter level pruning method for deep neural network compression,” in Proc. IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5068–5076.
[11] Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep neural networks,” in Proc. IEEE International Conference on Computer Vision (ICCV), 2017.
[12] K. Ovtcharov, O. Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, and E. S. Chung, “Accelerating deep convolutional neural networks using specialized hardware,” Microsoft Research Whitepaper, vol. 2, no. 11, 2015.
[13] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: Efficient inference engine on compressed deep neural network,” in Proc. 43rd ACM/IEEE Annual International Symposium on Computer Architecture (ISCA), 2016, pp. 243–254.
[14] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[15] A. Krizhevsky, V. Nair, and G. Hinton, “The CIFAR-10 dataset,” online: http://www.cs.toronto.edu/kriz/cifar.html, 2014.
[16] “State of mobile networks: USA,” 2018, accessed: 2018-12-27.
[17] “United States speedtest market report,” 2018, accessed: 2018-12-27.