Abstract
This paper presents a fault tolerance algorithm for a home-based lazy release consistency distributed shared memory (DSM) system based on volatile logging and independent checkpointing. The proposed approach targets large-scale distributed shared-memory computing on local-area clusters of computers as well as collaborative shared-memory applications on wide-area meta-clusters over the Internet. The challenge in building such systems lies in controlling the size of the logs and garbage-collecting unnecessary checkpoints in the absence of global coordination. In this paper we define a set of rules for lazy log trimming (LLT) and checkpoint garbage collection (CGC) and prove that they do not affect the recoverability of the system. We have implemented our logging algorithm in a home-based DSM system and shown on three representative applications that our scheme effectively bounds the size of the logs and the number of checkpointed page versions kept in stable storage.
Rights
This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).