Asymptotic efficiency of restart and checkpointing (1802.07455v2)

Published 21 Feb 2018 in math.PR and cs.PF

Abstract: Many tasks are subject to failure before completion. Two of the most common failure recovery strategies are restart and checkpointing. Under restart, once a failure occurs, the task is restarted from the beginning. Under checkpointing, the task is resumed from the preceding checkpoint after the failure. We study the asymptotic efficiency of restart for an infinite sequence of tasks whose sizes form a stationary sequence. We define asymptotic efficiency as the limit of the ratio of the total time to completion in the absence of failures to the total time to completion when failures take place. Whether the asymptotic efficiency is positive depends on how the tail of the task-size distribution compares with the tail of the distribution of the random variables governing failures. Our framework allows for variations in the failure rates and dependencies between task sizes. We also study a similar notion of asymptotic efficiency for checkpointing when the task is infinite a.s. and the inter-checkpoint times are i.i.d. Moreover, for checkpointing with exponentially distributed failures, we prove the existence of an infinite sequence of universal checkpoints, which are always used whenever the system starts from any checkpoint that precedes them.
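
The efficiency ratio defined in the abstract can be illustrated with a small Monte Carlo sketch for the restart strategy. This is not the paper's method. It assumes, purely for illustration, that times to failure are i.i.d. exponential with a fixed rate, that a fresh failure clock starts after every restart, and that restarting costs no extra time. The function names `restart_completion_time` and `empirical_efficiency` are hypothetical. The finite-sample ratio it returns only approximates the limit in the definition.

```python
import random


def restart_completion_time(size, failure_rate, rng):
    """Time to finish one task of length `size` under restart.

    Assumption (not from the paper): times to failure are i.i.d.
    exponential with rate `failure_rate`, and each failure discards
    all progress on the task at no additional cost.
    """
    elapsed = 0.0
    while True:
        time_to_failure = rng.expovariate(failure_rate)
        if time_to_failure >= size:
            # The attempt runs to completion before the next failure.
            return elapsed + size
        # The attempt is lost; the time spent on it still counts.
        elapsed += time_to_failure


def empirical_efficiency(sizes, failure_rate, seed=0):
    """Ratio of failure-free total time to total time with failures."""
    rng = random.Random(seed)
    ideal_total = sum(sizes)
    actual_total = sum(
        restart_completion_time(size, failure_rate, rng) for size in sizes
    )
    return ideal_total / actual_total


if __name__ == "__main__":
    # Illustrative bounded task sizes, chosen arbitrarily.
    size_rng = random.Random(1)
    sizes = [size_rng.uniform(0.5, 1.5) for _ in range(10_000)]
    for rate in (0.1, 1.0, 2.0):
        print(rate, empirical_efficiency(sizes, rate))
```

Bounded task sizes are used here so that the empirical ratio settles quickly. With heavy-tailed sizes the same ratio can drift toward zero as more tasks are added, which is the regime the abstract refers to when it says positivity depends on a comparison of tails.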