I work in a research group in the HPC field. Our group develops many
tools that use process checkpoint/restart (CR). Basically, the people here have found
three CR mechanisms that actually work: BLCR, DMTCP and linux-cr.
– BLCR: probably the most robust CR framework for Linux. It is a hybrid
– You can compile the OpenMPI message-passing library with BLCR support to checkpoint distributed applications. This is very convenient in HPC.
– BLCR development seems to be slowing down. The last
official release is 0.8.2 (June 16, 2009), which supports kernel 2.6.30
(pretty old). There are some patches floating around on the development
mailing list for compiling against newer kernels, but I think they only
extend support up to 2.6.34.
– You need root permissions to insert the blcr kernel module. One of
our tools used BLCR and we couldn’t run it on many clusters because the
sysadmins were skeptical about inserting a kernel module with a few
random patches published on a mailing list.
– DMTCP: a completely user-space solution. You don’t need to bother the
sysadmins to install kernel modules.
– It can checkpoint distributed computations (we have already tried it with
OpenMPI, and it also checkpoints the orte daemon).
– There is ongoing development to add DMTCP to OpenMPI as an alternative to BLCR for checkpointing parallel applications.
– Since DMTCP is implemented in user space, it needs a lot of workarounds to
maintain process state there.
– It duplicates process information that the kernel already keeps.
– It only works with socket-based communication (it doesn’t work with
proprietary InfiniBand protocols, for example).
– linux-cr: the checkpoint/restart mechanism is implemented in the kernel as
syscalls, plus some user-space tools.
– The developers intend to push the mechanism upstream for kernel inclusion.
– Since the implementation is kernel-based, it is very robust.
– The linux-cr patch set has still not been accepted into the kernel, and the
whole subject is complicated. Not all kernel developers agree that
implementing CR in the kernel is a good idea.
– You need a custom kernel that has linux-cr support.
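The kernel-related constraints listed for BLCR above (a loaded kernel module, and a kernel no newer than what the patches cover) can be checked up front. Here is a minimal Python sketch, assuming the module shows up under the name `blcr` in `/proc/modules` and taking 2.6.34 as the newest patched kernel, as guessed above. The function names are hypothetical and not part of BLCR itself.

```python
import re

# Newest kernel the mailing-list patches are said to cover (taken from the text above).
MAX_PATCHED_KERNEL = (2, 6, 34)


def parse_kernel_release(release):
    """Turn a release string such as '2.6.32-5-amd64' into a tuple like (2, 6, 32)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    # A missing third component (e.g. '3.2') is treated as 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def module_is_loaded(proc_modules_text, name="blcr"):
    """Check whether `name` is listed in the contents of /proc/modules."""
    # Each non-empty line of /proc/modules starts with the module name.
    return any(
        line.split()[0] == name
        for line in proc_modules_text.splitlines()
        if line.strip()
    )


def blcr_looks_usable(release, proc_modules_text):
    """Rough check: kernel within the patched range and the module already loaded."""
    return (
        parse_kernel_release(release) <= MAX_PATCHED_KERNEL
        and module_is_loaded(proc_modules_text)
    )
```

On a live Linux machine you could feed it `platform.release()` and the contents of `/proc/modules`. It only tells you whether the stated preconditions hold, not whether a checkpoint will actually succeed.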
So which CR mechanism you choose will depend on many factors (whether you have
control over the machine, whether you only use sockets for communication, whether you can run a custom kernel, etc.).
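To make that decision concrete, here is a small Python sketch that maps the three factors above onto the mechanisms whose stated requirements they satisfy. The `Site` class and `candidate_mechanisms` function are hypothetical names made up for illustration, and the rules only encode the constraints mentioned in this answer, nothing more.

```python
from dataclasses import dataclass


@dataclass
class Site:
    can_load_kernel_modules: bool  # root access, or a cooperative sysadmin
    sockets_only: bool             # the application communicates over sockets only
    can_run_custom_kernel: bool    # able to boot a kernel with the linux-cr patches


def candidate_mechanisms(site):
    """Return the CR mechanisms whose requirements (as described above) are met."""
    candidates = []
    if site.can_load_kernel_modules:
        candidates.append("BLCR")      # needs its kernel module inserted
    if site.sockets_only:
        candidates.append("DMTCP")     # user space, but socket-based traffic only
    if site.can_run_custom_kernel:
        candidates.append("linux-cr")  # needs a patched kernel
    return candidates
```

For example, on a cluster where you have no root access and the application only uses sockets, `candidate_mechanisms(Site(False, True, False))` returns `['DMTCP']`.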