How to connect to your running sbatch job
Slurm allows you to connect, with a standard terminal shell, to the nodes where your jobs are running. From those shell connections you can use the same computational resources allocated to your jobs. You can also run a command directly from a login node against the resources allocated to a separate, already running job.
This feature can be useful for:
- monitoring your running sbatch jobs in real time (you can use commands like "top" or "ps" to check your processes, or look at the contents of the local /tmp folder on the compute node)
- using resources allocated to a running sbatch job that you know might be idle at a given time, for example GPU or CPU computing power that is temporarily unused
- getting immediate access to some computational resources if you are in urgent need, for example when you are running a GPU job and the card still has enough free VRAM (GPU memory) to run a separate process
The syntax to use is srun --jobid=<jobid_number> followed by your desired command. The shell or commands started via srun --jobid will be constrained to use only the resources allocated to that job.
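For example, run from a login node, a few typical monitoring commands might look like this (the job ID 12345678 is just a placeholder):
@login03:~ srun --jobid=12345678 top -b -n 1
@login03:~ srun --jobid=12345678 ps -u $USER -o pid,pcpu,pmem,cmd
@login03:~ srun --jobid=12345678 ls -lh /tmp
Each command runs as a job step inside the existing allocation of job 12345678 and prints its output back to your login-node terminal.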
Below is an example of how to connect from a login node to a running sbatch job. First, we submit a standard sbatch job, in this example called my_sbatch_job.sh:
#!/bin/bash
#SBATCH -c 1 # Number of cores requested
#SBATCH -t 4:00:00 # Wall-time
#SBATCH -p priority # Partition
#SBATCH --mem=2G # memory per node
# your sbatch job commands here
python3 python_sleep.py
@login03:SLURM sbatch my_sbatch_job.sh
Submitted batch job 10610731
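You can check whether the job has started, for example with squeue (the exact output columns depend on your site configuration):
@login03:~ squeue -j 10610731
In the default output, the ST column shows R once the job is running.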
Once the sbatch job starts running, it is possible to start a shell as a Slurm job step using the same resources allocated for the sbatch job (10610731 in this example).
@login03:~ srun --jobid=10610731 --pty bash
@compute-e-16-233:~
Everything executed within that srun shell runs using the same resources already allocated to the sbatch job.
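For instance, from within that shell you could check your processes or inspect the node's local /tmp folder, and then leave with exit (the commands below are just illustrative):
@compute-e-16-233:~ top -u $USER
@compute-e-16-233:~ ls -lh /tmp
@compute-e-16-233:~ exit
@login03:~
Exiting the shell only ends that interactive job step; the sbatch job itself keeps running.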
It is possible to connect multiple times to the same sbatch job, as long as the sbatch job is in the RUNNING state. However, it is not possible to have concurrent srun --jobid connections to the same running sbatch job.
It is also possible to run non-interactive commands directly as an srun job step, for example:
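@login03:~ srun --jobid=10610731 hostname
compute-e-16-233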
Here the Linux command hostname was executed on the compute node where job 10610731 was dispatched, using the resources allocated for that job.
Note:
Any command you run via srun --jobid=<jobid_number> will share the resources allocated to the running sbatch job, so your command will compete for the same CPU, RAM and GPU resources used by the sbatch job.
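For example, to check whether a GPU job still has free VRAM before starting an extra process on its card, you could run nvidia-smi against that job's allocation (the job ID 12345678 is just a placeholder, the job must have a GPU allocated, and depending on the Slurm configuration you may need to request the GPU explicitly for the step):
@login03:~ srun --jobid=12345678 nvidia-smi
The Memory-Usage column in the output shows how much of the card's VRAM is currently in use.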