FIRE: A Failure-Adaptive Reinforcement Learning Framework for Edge Computing Migrations (2209.14399v3)
Abstract: In edge computing, users' service profiles must be migrated as users move. Reinforcement learning (RL) frameworks have been proposed for this task, but they are typically trained on simulated data and overlook occasional server failures. Although rare, such failures severely impact latency-sensitive applications like autonomous driving and real-time obstacle detection. Because these rare events are poorly represented in historical training data, they pose a challenge for data-driven RL algorithms, and deliberately increasing failure frequency in real deployments for training purposes is impractical. We therefore introduce FIRE, a framework that adapts to rare events by training an RL policy in an edge computing digital twin environment. We propose ImRE, an importance-sampling-based Q-learning algorithm that samples rare events in proportion to their impact on the value function. FIRE accounts for delay, migration, failure, and backup placement costs across both individual and shared service profiles. We prove ImRE's boundedness and convergence to optimality. We then introduce deep Q-learning (ImDQL) and actor-critic (ImACRE) variants of our algorithm to enhance scalability, and extend the framework to accommodate users with varying risk tolerances. Through trace-driven experiments, we show that FIRE reduces costs compared to vanilla RL and a greedy baseline in the event of failures.
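The core idea behind ImRE, as the abstract describes it, is to oversample rare failure events during simulated training while re-weighting their updates so the learned values stay unbiased. The sketch below illustrates that mechanism with a standard likelihood-ratio-weighted tabular Q-learning update; the probabilities, constants, and function names are illustrative assumptions, not the paper's exact algorithm.

```python
from collections import defaultdict

# Illustrative sketch: rare server failures occur with probability
# P_FAIL_TRUE in the real system, but the digital-twin simulator
# draws them with the inflated probability P_FAIL_SIM so they appear
# often enough to learn from. Each TD update is re-weighted by the
# likelihood ratio (true prob / simulated prob) to correct the bias.
ALPHA, GAMMA = 0.1, 0.9        # learning rate, discount (assumed)
P_FAIL_TRUE = 0.01             # real-world failure probability (assumed)
P_FAIL_SIM = 0.3               # inflated probability used in simulation

def td_update(Q, s, a, r, s_next, actions, failed):
    """One importance-weighted Q-learning step on transition (s, a, r, s_next)."""
    # Likelihood ratio of the sampled event under true vs. simulated dynamics.
    if failed:
        w = P_FAIL_TRUE / P_FAIL_SIM
    else:
        w = (1 - P_FAIL_TRUE) / (1 - P_FAIL_SIM)
    target = r + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += ALPHA * w * (target - Q[(s, a)])
```

Failure transitions are drawn roughly 30 times more often than in reality here, but each contributes only 1/30th of the usual update weight, so the expected update matches the true dynamics while the rare event's cost is still observed frequently.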