The CheckPointing software DMTCP (https://dmtcp.sourceforge.io) is now available in O2.

DMTCP is NOT guaranteed to work with, or to support, all applications and languages.

A given process might fail to run under DMTCP, or fail to restart from a saved checkpoint.

What is CheckPointing

CheckPointing consists of creating periodic snapshots of a running process and its active memory (RAM). Those snapshots can then be used to restart the execution of the process from the recorded point. This is somewhat similar to creating manual restart points inside your code, but it is handled outside your code by the DMTCP software.

How does it work

The two most common approaches for using DMTCP are to either checkpoint your execution at a given constant interval or to manually initiate checkpointing from within the code (when possible).

In both cases the first step is to load the dmtcp module with either module load gcc/6.2.0 dmtcp or module load gcc/9.2.0 dmtcp.

Constant Interval CheckPointing:

After loading the dmtcp module you should be able to start your command with:

dmtcp_launch --interval CKP_FREQ your_program

where CKP_FREQ is the checkpointing interval in seconds and your_program is the command you need to run within the job. In this case DMTCP will create a memory checkpoint every CKP_FREQ seconds.

Custom CheckPointing:

It is also possible to manually create checkpoints by starting the desired command without specifying an interval:

dmtcp_launch your_program

and by executing the shell command

dmtcp_command --checkpoint 

directly from within your code, placing it at strategic points. In the Python example below, dmtcp_command is called at the beginning of a loop in which each iteration takes a very long time:

import os

def main():
    # something here
    for it in range(0, some_number_here):
        print(it)
        # ask DMTCP to write a checkpoint before the expensive step
        os.system('dmtcp_command --checkpoint')
        # do something here that takes
        # a very long time

if __name__ == '__main__':
    main()
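
If each loop iteration is short compared with the time needed to write a checkpoint, calling dmtcp_command on every iteration can become expensive. The sketch below is one possible way to limit manual checkpoints to at most one every min_interval seconds. The class name CheckpointThrottle and its parameters are hypothetical and not part of DMTCP; the only DMTCP call used is the dmtcp_command --checkpoint command shown above, and the program is assumed to have been started with dmtcp_launch.

```python
import shutil
import subprocess
import time


class CheckpointThrottle:
    """Attempt a DMTCP checkpoint at most once every `min_interval` seconds.

    Hypothetical helper for illustration; not part of DMTCP.
    """

    def __init__(self, min_interval, clock=time.monotonic, runner=None):
        self.min_interval = min_interval
        self.clock = clock
        # `runner` can be replaced (for example in tests) by any callable.
        self.runner = runner or self._run_dmtcp_command
        # Start counting from the moment the throttle is created.
        self.last = clock()

    @staticmethod
    def _run_dmtcp_command():
        # Skip quietly when dmtcp_command is not on PATH
        # (for example when the dmtcp module is not loaded).
        if shutil.which("dmtcp_command") is None:
            return False
        result = subprocess.run(["dmtcp_command", "--checkpoint"])
        return result.returncode == 0

    def maybe_checkpoint(self):
        """Return True if a checkpoint was attempted, False if skipped."""
        now = self.clock()
        if now - self.last < self.min_interval:
            return False
        self.runner()
        self.last = now
        return True
```

Inside the loop of the example above, os.system('dmtcp_command --checkpoint') would then be replaced by a call such as throttle.maybe_checkpoint(), with throttle = CheckpointThrottle(min_interval=3600) created once before the loop. The value 3600 is only an illustration.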

CAUTION:

Creating a checkpoint is a potentially time-consuming process that can also generate very large files, depending on the RAM (memory) used by the running processes.

When a checkpoint is created, DMTCP writes to file all data currently loaded in RAM. A job using ~100GB of RAM will therefore create a similar amount of data, which could fill up your storage quota.
Checkpointing 100 jobs that each use only 1GB of RAM can also be enough to fill your $HOME storage quota.

We strongly advise checkpointing only long jobs that are expected to run for more than one day.
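
The storage arithmetic above can be sketched as a quick estimate before submitting a batch of checkpointed jobs. This is a rough illustration only: the function names are hypothetical, the estimate assumes one checkpoint of roughly the job's RAM size per job, and the 100GB quota used in the usage note is an assumed figure, so check your actual quota.

```python
def checkpoint_footprint_gb(ram_gb_per_job, n_jobs):
    """Rough total checkpoint size, assuming one checkpoint of about
    the job's RAM size is stored for each job."""
    return ram_gb_per_job * n_jobs


def fits_in_quota(ram_gb_per_job, n_jobs, quota_gb, used_gb=0.0):
    """True if the estimated checkpoints fit in the remaining quota."""
    needed = checkpoint_footprint_gb(ram_gb_per_job, n_jobs)
    return used_gb + needed <= quota_gb
```

For example, with an assumed quota of 100GB of which 10GB is already in use, fits_in_quota(1, 100, quota_gb=100, used_gb=10) returns False, because 100 jobs of 1GB each need about 100GB for their checkpoints.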

How to restart your checkpoint job

After loading the dmtcp module you can restart a job from its last checkpoint using the commands:

export DMTCP_COORD_HOST=$( hostname ) ### this might change depending on your shell
./dmtcp_restart_script.sh

from the folder where the job’s checkpoint was created.

You can also automatically restart a program from its last checkpoint, if one is available, using an sbatch script like:

#!/bin/bash

#SBATCH -p a_partition
#SBATCH -c the_num_of_CPUs
#SBATCH -t the_walltime
#SBATCH --mem=the_memory

export program="python3 my_code.py input1 input2"
export CKP_FREQ=86400

if [ -f dmtcp_restart_script.sh ]
then
 export DMTCP_COORD_HOST=$( hostname )
 ./dmtcp_restart_script.sh
else
 dmtcp_launch --interval $CKP_FREQ ${program}
fi 

with the assumption that each job is executed in a separate folder and therefore no dmtcp_restart_script.sh file is present when a job is dispatched the first time.
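
The restart-or-launch decision made by the sbatch script can also be written out in Python, for example when jobs are submitted from a Python driver. This is a minimal sketch under the same assumption of one folder per job; the function name plan_run and its parameters are hypothetical, and only the commands and the DMTCP_COORD_HOST variable already shown on this page are used.

```python
import socket
from pathlib import Path

RESTART_SCRIPT = "dmtcp_restart_script.sh"


def plan_run(job_dir, program_argv, interval_s):
    """Return (argv, extra_env) for a restart or a fresh launch.

    If the restart script exists in job_dir, restart from the last
    checkpoint; otherwise start the program under dmtcp_launch.
    """
    script = Path(job_dir) / RESTART_SCRIPT
    if script.is_file():
        # Same as: export DMTCP_COORD_HOST=$( hostname )
        return [str(script)], {"DMTCP_COORD_HOST": socket.gethostname()}
    argv = ["dmtcp_launch", "--interval", str(interval_s), *program_argv]
    return argv, {}
```

The returned argv and extra_env could then be passed to subprocess.run, merging extra_env into a copy of os.environ. In a fresh job folder, plan_run(".", ["python3", "my_code.py", "input1", "input2"], 86400) returns the same dmtcp_launch command as the else branch of the script above.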

Important additional information

The version of dmtcp currently installed in O2 does not support MPI.

DMTCP cannot be used to checkpoint GPU processes.

However, you might be able to checkpoint a GPU job by triggering custom checkpoints at points in your code where you know the GPU is not being used and any data currently stored in GPU memory (VRAM) is not required to restart the job.
