
The CheckPointing software DMTCP https://dmtcp.sourceforge.io is now available in O2.

Note

DMTCP software is NOT guaranteed to work or to support all applications and languages.

A given process might fail to run or to restart from a saved checkpoint. We strongly encourage you to test DMTCP on your workflow before depending on it.

What is CheckPointing?

The process of CheckPointing consists of creating periodic snapshots of the running process and its active memory (RAM). Those snapshots can then be used to restart the execution of that process from the recorded point. The process is somewhat similar to creating manual restart points inside your code by saving important data at a given interval. However, it is handled outside your code by the DMTCP software. It attempts to save all relevant data, allowing you to restart (e.g. after a crash) and get the same results you would have gotten by simply running the original program to completion.
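For comparison, the "manual restart points" approach mentioned above can be sketched in plain Python. This is only an illustration of the idea and does not use DMTCP; the function name, the state file, and the `save_every` parameter are all hypothetical.

```python
import os
import pickle
import tempfile


def run_with_restart_points(total_steps, state_path, save_every=10):
    """Sum the integers 0..total_steps-1, saving progress periodically."""
    # Resume from the last saved state if a restart file exists.
    if os.path.exists(state_path):
        with open(state_path, "rb") as fh:
            state = pickle.load(fh)
    else:
        state = {"next_step": 0, "total": 0}

    for step in range(state["next_step"], total_steps):
        state["total"] += step  # stand-in for the real, expensive work
        state["next_step"] = step + 1
        if state["next_step"] % save_every == 0:
            # Write to a temporary file and rename it, so a crash during
            # the write cannot leave a half-written restart file behind.
            tmp_path = state_path + ".tmp"
            with open(tmp_path, "wb") as fh:
                pickle.dump(state, fh)
            os.replace(tmp_path, state_path)
    return state["total"]


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "state.pkl")
    print(run_with_restart_points(25, path))  # first run, saves at steps 10 and 20
    print(run_with_restart_points(40, path))  # resumes from the saved step 20
```

With this approach you must decide yourself which variables to save. DMTCP instead attempts to capture the whole process state from outside the program.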

How does it work?

The two most common approaches for using DMTCP are to either checkpoint your execution at a given constant interval or to manually initiate checkpointing from within the code (when possible).

...

where CKP_FREQ is the checkpointing frequency in seconds and your_program is the command you need to run within the job. In this case DMTCP will create a memory checkpoint every CKP_FREQ seconds. (See the CAUTION below about choosing CKP_FREQs that are too small.)
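To get a feeling for how small is "too small", the cost of a given CKP_FREQ can be estimated roughly. The sketch below assumes that the job is paused while each checkpoint is written, that the checkpoint is about as large as the RAM in use, and that the interval is counted from the end of the previous checkpoint. The function name and the write-speed parameter are hypothetical, so measure the real write time for your own job.

```python
def checkpoint_overhead(ram_gb, ckp_freq_s, write_gb_per_s):
    """Estimate the fraction of wall time spent writing checkpoints.

    Assumes the process is paused for the whole write and that each
    checkpoint is roughly as large as the RAM in use (ram_gb).
    """
    write_time_s = ram_gb / write_gb_per_s
    return write_time_s / (ckp_freq_s + write_time_s)


if __name__ == "__main__":
    # Hypothetical numbers: 10 GB of RAM, storage writing at 1 GB/s.
    for freq in (90, 3600, 21600):
        share = checkpoint_overhead(ram_gb=10, ckp_freq_s=freq, write_gb_per_s=1.0)
        print(f"CKP_FREQ={freq:>6} s -> about {share:.1%} of wall time spent checkpointing")
```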

Custom CheckPointing:

It is also possible to manually create checkpoints by starting the desired command without specifying an interval

...

Code Block
import os

def main():
    # something here
    for it in range(0, some_number_here):
        print(it)
        # do something here that takes
        # a very long time
        # then ask DMTCP to write a checkpoint now
        os.system('dmtcp_command --checkpoint')

if __name__ == '__main__':
    main()
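A slightly more defensive variant of the loop above is sketched below. It checkpoints only every few iterations and checks whether the `dmtcp_command --checkpoint` call succeeded. The helper names `request_checkpoint` and `checkpoint_due` are hypothetical, and the sketch assumes the script was started under DMTCP so that `dmtcp_command` can reach the running job.

```python
import shutil
import subprocess


def request_checkpoint(command="dmtcp_command", run=subprocess.run, which=shutil.which):
    """Ask DMTCP to write a checkpoint; return True only if the call succeeded."""
    if which(command) is None:
        # dmtcp_command is not on PATH (e.g. DMTCP is not loaded): skip quietly.
        return False
    result = run([command, "--checkpoint"], check=False)
    return result.returncode == 0


def checkpoint_due(iteration, every):
    """True after every `every`-th completed iteration (iteration is 0-based)."""
    return every > 0 and (iteration + 1) % every == 0


if __name__ == "__main__":
    for it in range(10):
        # do something here that takes a very long time
        if checkpoint_due(it, every=5):
            ok = request_checkpoint()
            print(f"iteration {it}: checkpoint requested, success={ok}")
```

Checkpointing less often than once per iteration keeps the time and disk cost described in the caution below under control.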

Note

CAUTION:

The creation of a checkpoint is a potentially time-consuming process that can also generate very large files, depending on the RAM (memory) used by the running processes.

When a checkpoint is created, DMTCP will write to file all data currently loaded in RAM. Therefore, a job using ~100GB of RAM will create a similar amount of data, which could fill up your storage quota.
Checkpointing 100 jobs that each use only 1GB of RAM will also be enough to fill your $HOME storage quota.

We strongly advise checkpointing only long jobs that are expected to run for more than one day.
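The arithmetic behind this caution can be sketched as follows. The 100 GB quota is an assumption inferred from the example above, and the function names are hypothetical. Check your actual quota and the real size of your checkpoint files before relying on the estimate.

```python
# Assumed quota, inferred from the "100 jobs x 1GB" example; check your own.
ASSUMED_HOME_QUOTA_GB = 100


def checkpoint_footprint_gb(ram_per_job_gb, n_jobs, images_kept_per_job=1):
    """Rough disk usage, assuming each checkpoint is about as large as the job's RAM."""
    return ram_per_job_gb * n_jobs * images_kept_per_job


def fits_in_quota(footprint_gb, quota_gb=ASSUMED_HOME_QUOTA_GB, used_gb=0.0):
    """True if the checkpoints leave some free space under the quota."""
    return used_gb + footprint_gb < quota_gb


if __name__ == "__main__":
    for ram, jobs in ((100, 1), (1, 100), (4, 10)):
        size = checkpoint_footprint_gb(ram, jobs)
        print(f"{jobs} job(s) x {ram} GB -> ~{size} GB, fits: {fits_in_quota(size)}")
```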

...