
Checkpointing saves a job’s current state to disk so that you can restart the job from that saved point later. This can protect against job failures caused by bugs, network errors, full disks, or node failures. Checkpointing is especially useful for jobs that run for multiple days, but should not be used for short jobs.

The checkpointing software DMTCP (https://dmtcp.sourceforge.io) is available on O2.

Note

DMTCP software is NOT guaranteed to work or to support all applications and languages.

A given process might fail to run or restart from a saved checkpoint. We strongly encourage you to test DMTCP on your workflow before depending on it.

Note

MPI jobs will not checkpoint with the version of dmtcp currently installed on O2.
GPU jobs will only checkpoint in limited circumstances. See below.

What is checkpointing?

Checkpointing consists of creating periodic snapshots of a running process and its active memory (RAM). A snapshot can then be used to restart the process from the recorded point. This is similar to creating manual restart points inside your code by saving important data at a regular interval, except that it is handled outside your code by the DMTCP software.
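For comparison, the manual restart points mentioned above can be sketched as follows. This is only an illustration and does not use DMTCP: the function name `run`, the state layout, and the `save_every` interval are hypothetical, and the summation stands in for real work.

```python
import os
import pickle


def run(total_steps, state_path, save_every=100):
    """Sum the integers below total_steps, saving a restart point periodically."""
    # Resume from the last restart point if one exists.
    if os.path.exists(state_path):
        with open(state_path, "rb") as fh:
            state = pickle.load(fh)
    else:
        state = {"step": 0, "total": 0}

    while state["step"] < total_steps:
        state["total"] += state["step"]  # stand-in for the real work
        state["step"] += 1
        if state["step"] % save_every == 0:
            # Write to a temporary file first, then rename it, so that a
            # crash in the middle of a write cannot leave a truncated file.
            tmp_path = state_path + ".tmp"
            with open(tmp_path, "wb") as fh:
                pickle.dump(state, fh)
            os.replace(tmp_path, state_path)
    return state["total"]
```

If the job is killed and `run` is called again with the same `state_path`, it continues from the last saved step instead of from zero. DMTCP aims to provide the same effect without any changes to the code.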

...

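Code Block
import os


def main():
    # something here
    some_number_here = 10  # placeholder: number of long-running iterations
    for it in range(some_number_here):
        print(it)
        # do something here that takes
        # a very long time

        # ask DMTCP to write a checkpoint at this point
        os.system('dmtcp_command --checkpoint')


if __name__ == '__main__':
    main()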

Note

CAUTION:

Creating a checkpoint is a potentially time-consuming process that can also generate very large files, depending on the RAM (memory) used by the running processes.

When a checkpoint is created, DMTCP writes all data currently loaded in RAM to a file. A job using ~100GB of RAM will therefore create a checkpoint of similar size, which could fill up your storage quota.
Checkpointing 100 jobs that each use only 1GB of RAM is also enough to fill your $HOME storage quota. We therefore encourage you to write checkpoint data to your scratch folder when possible.
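The arithmetic behind this warning can be sketched as a rough upper-bound estimate. The function names and the `images_kept_per_job` parameter are hypothetical, and the quota value must be supplied by you; the 100GB used in the usage line below is only inferred from the example above, so check your actual quota.

```python
def checkpoint_footprint_gb(ram_gb_per_job, n_jobs, images_kept_per_job=1):
    """Rough upper bound on the disk space used by checkpoint images, in GB.

    Assumes each image is about as large as the RAM used by the job.
    """
    return ram_gb_per_job * n_jobs * images_kept_per_job


def fits_in_quota(ram_gb_per_job, n_jobs, quota_gb, images_kept_per_job=1):
    """Return True only if the estimated footprint stays below the quota."""
    footprint = checkpoint_footprint_gb(ram_gb_per_job, n_jobs, images_kept_per_job)
    return footprint < quota_gb
```

For example, `fits_in_quota(1, 100, quota_gb=100)` returns `False`, matching the case of 100 jobs at 1GB each described above.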

We strongly advise checkpointing only jobs that are expected to run for more than one day. Checkpointing short jobs may significantly slow them down without substantial benefit.

How to restart your checkpoint job

...

If you start a job with the above template and you don’t want it to restart from the saved checkpoints, make sure to delete any local files created by DMTCP.
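One possible way to find and remove those files is sketched below. The file name patterns are an assumption based on DMTCP's usual default naming (`ckpt_*.dmtcp` images and `dmtcp_restart_script*.sh` scripts); confirm what your own job actually wrote before deleting anything. The helper names are hypothetical.

```python
import glob
import os

# Assumed default DMTCP file names; verify against your job directory.
DMTCP_PATTERNS = ("ckpt_*.dmtcp", "dmtcp_restart_script*.sh")


def find_dmtcp_files(directory="."):
    """Return the files in `directory` that match the assumed DMTCP patterns."""
    found = []
    for pattern in DMTCP_PATTERNS:
        found.extend(glob.glob(os.path.join(directory, pattern)))
    return sorted(found)


def remove_dmtcp_files(directory=".", dry_run=True):
    """List matching files and delete them only when dry_run is False."""
    files = find_dmtcp_files(directory)
    for path in files:
        if dry_run:
            print("would remove", path)
        else:
            os.remove(path)
    return files
```

Calling `remove_dmtcp_files()` with the default `dry_run=True` only prints what would be deleted, which makes it easy to review the list first.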

...

Checkpointing GPU jobs

...

Note

...

DMTCP cannot be used to checkpoint GPU processes.

However, you might be able to checkpoint a GPU job by executing custom checkpointing in your code where you know the GPU is not being used, and data currently stored on GPU memory (VRAM) is not required to restart the job.