MATLAB Parallel jobs using the custom O2 cluster profile

It is possible to configure MATLAB so that it interacts with the SLURM scheduler. This allows MATLAB to submit parallel jobs directly to the SLURM scheduler and to use CPU and memory resources across different nodes (distributed memory).

To do so, you first need to configure the O2 cluster profile in the MATLAB version being used, which is done by running the command configCluster.

 

NOTE: 

It is strongly recommended to use MATLAB version 2019a or later when submitting multi-node jobs (mpi partition) with the MATLAB O2 cluster profile. Earlier versions of MATLAB start the MATLAB workers with a mechanism that is not fully compatible with our existing SLURM epilog and could cause jobs to be killed.

Setting up the O2 MATLAB Cluster Profile 

 

    >> configCluster

        Must set QueueName and WallTime before submitting jobs to O2.  E.g.

        >> c = parcluster;
        >> c.AdditionalProperties.QueueName = 'queue-name';
        >> % 5 hours
        >> c.AdditionalProperties.WallTime = '05:00:00';
        >> c.saveProfile

    Complete.  Default cluster profile set to "o2 R2023a".

Your default cluster profile is now set to o2 R2023a, and you can verify this by running the command parcluster:

    >> parcluster

    ans =

     Generic Cluster

        Properties:

                          Profile: o2 R2023a
                         Modified: false
                             Host: compute-a-16-161
                       NumWorkers: 100000
                       NumThreads: 1

               JobStorageLocation: /home/abc/.matlab/3p_cluster_jobs/o2/R2023a/shared
                ClusterMatlabRoot: /n/app/matlab/2023a-v2
                  OperatingSystem: unix

          RequiresOnlineLicensing: false
          PreferredPoolNumWorkers: 32
            PluginScriptsLocation: /n/app/matlab/support-packages/matlab-parallel-server/scripts/IntegrationScripts/o2
             AdditionalProperties: List properties

        Associated Jobs:

                   Number Pending: 0
                    Number Queued: 0
                   Number Running: 0
                  Number Finished: 0

    >>



Note 1: The configCluster command needs to be executed only once for a given MATLAB version.

Note 2: After running the configCluster command, the default cluster profile is set to the O2 cluster. If you want to go back to the "local" cluster profile, you can change the default profile with the command parallel.defaultClusterProfile('local').

Note 3: Running the configCluster command sets the cluster profile only for the currently used MATLAB version. If later on you use a different version of MATLAB you will need to run configCluster again.
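
As a minimal sketch of Note 2, the default profile can be inspected and switched from the MATLAB prompt. The profile name 'o2 R2023a' is taken from the configCluster output shown earlier and is an assumption for any other MATLAB version; the name of the local profile may also differ between releases.

```matlab
% Show the name of the current default cluster profile.
current = parallel.defaultClusterProfile

% Switch the default to the local profile (name as given in Note 2).
parallel.defaultClusterProfile('local');

% Switch back to the O2 profile. The name is assumed to match the one
% printed by configCluster for the MATLAB version in use.
parallel.defaultClusterProfile('o2 R2023a');
```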



Setting the submission parameters for the O2 MATLAB cluster profile



To use the O2 MATLAB cluster profile, you must define at least two submission parameters: the partition to be used and the desired wall-time. In MATLAB 2016b this is done with commands of the form ClusterInfo.set+Property (that is, ClusterInfo.set followed by the property name), for example:

    >> ClusterInfo.setQueueName('mpi')
    >> ClusterInfo.setWallTime('48:00')
    >>

Note: The above example passes the partition "mpi" to ClusterInfo.setQueueName; however, the MATLAB O2 Cluster Profile can be used with any of the partitions available on the O2 cluster.

Several other parameters can be defined in a similar way.

The command ClusterInfo.setUserDefinedOptions can be used to pass additional flags to the scheduler. For example, ClusterInfo.setUserDefinedOptions('-o output.log') will pass the flag -o output.log to the scheduler when submitting a job from within MATLAB. Similarly, commands of the form ClusterInfo.get+Property can be used to check the value assigned to a property.

Note that, once assigned, each property is saved in the user's ~/.matlab profile folder and does not need to be redefined unless a change is desired (e.g. a different wall-time, partition, or amount of memory).
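
A short sketch of the set/get pattern described above, for MATLAB 2016b only. The setter calls and values are the ones quoted in this section; the getter names are assumed to follow the ClusterInfo.get+Property pattern and have not been checked against the installed ClusterInfo class.

```matlab
% Set the two required submission parameters (values from the example above).
ClusterInfo.setQueueName('mpi')
ClusterInfo.setWallTime('48:00')

% Pass an extra flag through to the scheduler.
ClusterInfo.setUserDefinedOptions('-o output.log')

% Read the values back. Getter names are assumed to mirror the setters.
ClusterInfo.getQueueName
ClusterInfo.getWallTime
ClusterInfo.getUserDefinedOptions
```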





Define job submission flags for Version ≥ R2017a 

To use the O2 MATLAB cluster profile, you must define at least two submission parameters: the partition to be used and the desired wall-time. This can be done by assigning the properties directly to a parcluster object c, as shown in the example below:
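
A minimal sketch of that assignment, following the hint printed by configCluster. The partition name and wall-time are placeholder values, and the default profile is assumed to be the O2 profile.

```matlab
% Get a handle to the default cluster profile (the O2 profile after configCluster).
c = parcluster;

% Required: partition and wall-time (placeholder values).
c.AdditionalProperties.QueueName = 'mpi';
c.AdditionalProperties.WallTime = '05:00:00';   % 5 hours

% Display the properties currently set on this object.
c.AdditionalProperties

% Optional: store the settings in the profile so they persist across sessions.
c.saveProfile
```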

Note that parameters set this way are not retained by default and must be re-entered if the c object is deleted. To save the submission parameters permanently, run the command c.saveProfile.

Important: Use --mem-per-cpu (or the property c.AdditionalProperties.MemPerCPU) instead of --mem to request a custom amount of memory when using the mpi partition. The SLURM flag --mem requests a given amount of memory per node, so unless you enforce a balanced distribution of tasks (MATLAB workers) per node, a given node might end up with too much or too little memory, depending on how the tasks are allocated.
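
As an illustration of the per-CPU memory request, the sketch below sets MemPerCPU alongside the required parameters. The value format ('4G') is an assumption and should be checked against the O2 integration scripts; the other values are placeholders.

```matlab
c = parcluster;
c.AdditionalProperties.QueueName = 'mpi';
c.AdditionalProperties.WallTime = '05:00:00';

% Memory is requested per worker, so each worker gets the same amount
% no matter how SLURM spreads the workers across nodes.
% The '4G' value format is an assumption, not confirmed by this page.
c.AdditionalProperties.MemPerCPU = '4G';
```

With this setting, a job with 10 workers would request 10 x 4G in total, split across nodes in proportion to the number of workers placed on each node.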



Using the O2 MATLAB Cluster Profile 

Parpool() command

One way to use the O2 MATLAB cluster profile is to request a parallel pool of N_c workers with the command parpool(N_c). MATLAB will submit a SLURM job request for the requested number of cores N_c and will start the parallel pool once the requested cores are allocated. For example, requesting a parallel pool of 3 cores would look like:
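
A minimal sketch of such a request, assuming the O2 profile is the default profile and that QueueName and WallTime have already been saved in it:

```matlab
% Request a pool of 3 workers through the default (O2) cluster profile.
% MATLAB submits a SLURM job and returns once the workers are connected.
p = parpool(3);

% Check how many workers were actually started.
p.NumWorkers

% Release the workers (and the SLURM allocation) when finished.
delete(p);
```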

Any parallel part of a script will then be executed on the parallel workers allocated with the parpool command, for example:
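
The following is an illustrative parfor loop, not the page's original example. It assumes a pool has already been opened with parpool; the workload itself is arbitrary.

```matlab
n = 12;
results = zeros(1, n);

% Each iteration is independent, so iterations are distributed
% across the workers of the open parallel pool.
parfor i = 1:n
    results(i) = sum(svd(rand(200)));
end

disp(results);
```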

Note 1: If you run a non-interactive parallel job using the parpool() command with the O2 cluster profile, you will actually be dispatching two jobs: first a serial job (1 core) to start your MATLAB script (i.e. matlab -nodesktop -r "my_function"), and then a second, parallel job that is submitted directly from within MATLAB once the execution of my_function.m reaches the parpool() command. To avoid this double-job condition you can use the command batch, described later on this page.

Note 2: The command parpool cannot be executed if MATLAB has been started on a login node. Always make sure to start interactive MATLAB sessions from within interactive jobs (on actual compute nodes instead of login nodes).

Batch Command

Similarly, it is also possible to dispatch a batch of parallel jobs directly from within MATLAB using the command batch. In the example below we submit the simple parallel sleep function to the cluster:
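
The function below is a hypothetical stand-in for the page's sleep function, saved here under the assumed name parallel_sleep.m. It pauses one second per loop iteration, so with n = 20 a serial run takes about 20 seconds and a parallel run takes less, depending on the number of workers.

```matlab
function elapsed = parallel_sleep(n)
% PARALLEL_SLEEP  Hypothetical example: pause 1 second in each of n
% parfor iterations and return the real elapsed time in seconds.
t0 = tic;
parfor i = 1:n
    pause(1);
end
elapsed = toc(t0);
end
```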

The example starts 3 parallel jobs with 2, 5 and 10 cores. Each job runs the above function with an input parameter of 20 (i.e. sleep for 20 seconds) and measures the real elapsed time:
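
A sketch of how these submissions could look. It assumes a function file on the MATLAB path, hypothetically named parallel_sleep.m, that takes the number of seconds and returns the elapsed time, and it assumes QueueName and WallTime are already saved in the O2 profile. Note that batch with a 'Pool' of N workers uses one additional worker to run the function itself, so whether "2, 5 and 10 cores" corresponds exactly to these pool sizes is an assumption.

```matlab
c = parcluster;
poolSizes = [2 5 10];
jobs = cell(size(poolSizes));

% Submit one batch job per pool size; each call returns immediately.
for k = 1:numel(poolSizes)
    jobs{k} = batch(c, @parallel_sleep, 1, {20}, 'Pool', poolSizes(k));
end

% Wait for each job and report the elapsed time it measured.
for k = 1:numel(jobs)
    wait(jobs{k});
    out = fetchOutputs(jobs{k});
    fprintf('Pool of %d workers: %.1f s elapsed\n', poolSizes(k), out{1});
end
```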

Note 1: When using the function batch to run a parallel function, you do not need to explicitly add the command parpool inside the parallel function being executed (see the sleep.m example above).

Note 2: The function batch can also be used to submit non-parallel jobs, for example:
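
A minimal sketch of a non-parallel submission, using the built-in rand function as an arbitrary workload and assuming QueueName and WallTime are already saved in the O2 profile. Omitting the 'Pool' option makes batch request a single worker.

```matlab
c = parcluster;

% Run rand(3) on one worker, asking for one output argument.
j = batch(c, @rand, 1, {3});

% Block until the job finishes, then collect the result.
wait(j);
out = fetchOutputs(j);
A = out{1}

% Remove the job and its data from the job storage location.
delete(j);
```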