NOTICE: FULL O2 Cluster Outage, January 3 - January 10
O2 will be completely offline for a planned HMS IT data center relocation from Friday, Jan 3, 6:00 PM, through Friday, Jan 10.
- On Jan 3 (5:30-6:00 PM): O2 login access will be turned off.
- On Jan 3 (6:00 PM): O2 systems will start being powered off.
This project will relocate existing services, consolidate servers, reduce power consumption, and decommission outdated hardware to improve efficiency, enhance resiliency, and lower costs.
Specifically:
- The O2 Cluster will be completely offline, including O2 Portal.
- All data on O2 will be inaccessible.
- Any jobs still pending when the outage begins will need to be resubmitted after O2 is back online.
- Websites on O2 will be completely offline, including all web content.
More details at: https://harvardmed.atlassian.net/l/cp/1BVpyGqm and https://it.hms.harvard.edu/news/upcoming-data-center-relocation
Get more informative Slurm email notifications
Here is an example. (Note: '-n 1' is needed for the srun command. It makes sure that only one copy of the commands runs.)
Create a file myJob.sh with the following content (you can replace "echo firstCommand; echo secondCommand" with your own commands):
#!/bin/bash
srun -n 1 -t $SRUNTIME --mem $SRUNMEM bash -c "{ echo I am running on:; hostname; echo firstCommand; echo secondCommand; } && touch myJob.success"
sleep 5 # wait for Slurm to record the job status in its database
echo Job done. Summary:
sacct --format=JobID,Submit,Start,End,State,Partition,ReqTRES%30,CPUTime,MaxRSS,NodeList%30 --units=M -j $SLURM_JOBID
bash sendJobFinishEmail.sh myJob
[ -f myJob.success ] && exit 0 || exit 1
Create a file sendJobFinishEmail.sh with:
#!/bin/bash
to=`cat ~/.forward`
flag=$1
minimumsize=9000
actualsize=`wc -c $flag.out`
[ ! -f $flag.success ] && s="Subject: Failed: job id:$SLURM_JOBID name:$SLURM_JOB_NAME\n" || s="Subject: Success: job id:$SLURM_JOBID name:$SLURM_JOB_NAME\n"
stat=`tail -n 1 $flag.out`
[[ "$stat" == *COMPLETED* ]] && echo "*Notice the sacct report above: sacct still shows the main job as running because sacct runs inside it, but the user task has completed." >> $flag.out
if [ "${actualsize% *}" -ge "$minimumsize" ]; then
toSend=`echo Job script content:; cat $flag.sh`
toSend="$s\n$toSend\nOutput is too big for email. Please find output in: $flag.out"
toSend="$toSend\n...\n`tail -n 6 $flag.out`"
else
toSend=`echo Job script content:; cat $flag.sh; echo Job output:; cat $flag.out`
toSend="$s\n$toSend"
fi
echo -e "$toSend" | sendmail $to
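Before submitting, it may help to confirm that the address file the script reads actually exists, because sendJobFinishEmail.sh takes the recipient from ~/.forward and would otherwise call sendmail with no address. The function below is only a sketch: the name check_forward is hypothetical and not part of the original recipe, and it only checks that the file is readable and not blank, not that the address is valid.

```shell
#!/bin/bash
# check_forward is a hypothetical helper, not part of the original scripts.
# It returns 0 if the forward file is readable and contains non-blank text.
# The default path is the ~/.forward file read by sendJobFinishEmail.sh.
check_forward() {
    local file=${1:-$HOME/.forward}
    if [ ! -r "$file" ]; then
        echo "Cannot read $file" >&2
        return 1
    fi
    # Look for at least one non-whitespace character in the file.
    if ! grep -q '[^[:space:]]' "$file"; then
        echo "$file is empty" >&2
        return 1
    fi
    echo "Notifications will go to: $(cat "$file")"
}
```

Running check_forward with no argument checks ~/.forward; a non-zero return status means the finish email would have no recipient.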
Then submit with the command below. (Note that SRUNTIME is 1 minute less than the sbatch time and SRUNMEM is 1M less than the sbatch memory. This makes sure srun does not use up all of the job's resources, so the sacct and email commands can still run.):
export SRUNTIME=0:9:0; export SRUNMEM=500M; sbatch -p short -t 0:10:0 --mem 501M -o myJob.out -e myJob.out myJob.sh
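To avoid working out the two srun values by hand, the margin can be computed from the sbatch limits. This is a minimal sketch, assuming the limits are given as whole minutes and whole megabytes; the function name set_srun_limits is hypothetical and not part of the original recipe. It relies on Slurm accepting a bare number for -t as a number of minutes.

```shell
#!/bin/bash
# set_srun_limits is a hypothetical helper, not part of the original scripts.
#   $1 = sbatch wall time in whole minutes
#   $2 = sbatch memory in megabytes
# It exports SRUNTIME and SRUNMEM, each one unit below the sbatch limit.
set_srun_limits() {
    local minutes=$1 mem_mb=$2
    if [ "$minutes" -lt 2 ] || [ "$mem_mb" -lt 2 ]; then
        echo "Need at least 2 minutes and 2M to leave a margin" >&2
        return 1
    fi
    # A bare number passed to -t is read by Slurm as minutes.
    export SRUNTIME=$((minutes - 1))
    export SRUNMEM="$((mem_mb - 1))M"
}

# Same limits as the sbatch command above: 10 minutes and 501M.
set_srun_limits 10 501
echo "srun will get: -t $SRUNTIME --mem $SRUNMEM"
```

With these values exported, the matching submission would be: sbatch -p short -t 10 --mem 501M -o myJob.out -e myJob.out myJob.sh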
Let us know if you have any questions. Please include your working folder and the commands you used in your email. Comments and suggestions are welcome!