Tensorflow on O2

Due to recent developments in deep learning and related topics, Tensorflow and its components have been widely requested by the user community. Given the nature of the package, however, we have decided it is best for users to manage their own installations, so that they can quickly modify or upgrade Tensorflow to suit their needs without waiting for Research Computing to handle version changes. This page therefore provides basic instructions for installing Tensorflow without elevated privileges, into a local directory owned by the user.

Basic Installation

Early on, Tensorflow was quite difficult to install in a shared computing environment such as O2. The developers have since made it far easier to set up for users who are not computing locally: all that needs to be done is to invoke pip to complete the installation.

Tensorflow is compatible with Python 3 as of the writing of this document. To install it, we strongly recommend setting up a virtual environment.

First, request an interactive session:

$ srun --pty -t 2:0:0 --mem=2G -p interactive bash

If you plan to use Tensorflow inside this interactive session right after installing it, consider requesting more memory (a larger --mem value) when you submit the request.

Once you're on a compute node, load the prerequisite module:

$ module load gcc/9.2.0

This will expose the available python 3 modules to you. It is strongly recommended to use a version that is at least 3.8; we will use 3.9.14 in this example:

$ module load python/3.9.14

If you are planning to use GPU resources, you also need to load the (latest) cuda module. For example:

Once you have confirmed that gcc, python, and possibly CUDA are loaded, create a virtual environment (instructions replicated here from Personal Python Packages):

where /path/to/nameofenv is your chosen name and location of the environment you'd like to install to. (/path/to should already exist.) Once the environment is created, you'll want to turn it on.
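The creation and activation steps can be sketched as a small shell helper. This is a minimal sketch, not the page's original commands: it assumes the `virtualenv` command is available once the python module is loaded (if it is not, `python3 -m venv` from the standard library plays the same role). The function name `make_tf_env` is hypothetical, and `/path/to/nameofenv` remains a placeholder for your own location.

```shell
# Hypothetical helper: create a virtual environment and activate it in the current shell.
# Assumes the gcc and python modules are already loaded and `virtualenv` is on the PATH.
make_tf_env() {
    env_dir="${1:-}"    # e.g. /path/to/nameofenv

    if [ -z "$env_dir" ]; then
        echo "usage: make_tf_env /path/to/nameofenv" >&2
        return 1
    fi

    # The parent directory (/path/to) must already exist.
    if [ ! -d "$(dirname "$env_dir")" ]; then
        echo "parent directory of $env_dir does not exist" >&2
        return 1
    fi

    # Create the environment, then turn it on for this shell session.
    virtualenv "$env_dir" && . "$env_dir/bin/activate"
}
```

Define the function in your current shell (paste it in, or source a file containing it) and call it as `make_tf_env /path/to/nameofenv`; running it in a subshell would not leave the environment active.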

After this, your prompt should be prefixed with the environment name in parentheses, for example (nameofenv).

From here, you should be ready to install Tensorflow.
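The pip step mentioned earlier can be sketched as follows. This is an illustrative helper under stated assumptions: the function name `install_tf` is hypothetical, the package is installed under its PyPI name `tensorflow`, and the check on `VIRTUAL_ENV` relies on that variable being set by the environment's activate script.

```shell
# Hypothetical helper: install Tensorflow with pip into the active virtual environment.
install_tf() {
    # Refuse to run outside a virtual environment; activate scripts set VIRTUAL_ENV.
    if [ -z "${VIRTUAL_ENV:-}" ]; then
        echo "no virtual environment is active; activate one first" >&2
        return 1
    fi

    # Upgrade pip first (older pip releases may not recognize current wheels),
    # then install Tensorflow into the environment.
    python3 -m pip install --upgrade pip && python3 -m pip install tensorflow || return 1

    # Quick check: import the package and print its version.
    python3 -c 'import tensorflow as tf; print(tf.__version__)'
}
```

Calling `install_tf` with the environment active installs into that environment only, which keeps later upgrades or removals local to the directory you own.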

Recent builds of Tensorflow bundle both CPU and GPU components in a single package.

If you plan to use the gpu (or related GPU-enabled) partition, a couple more steps are required for you to set up your code to leverage GPUs.

Tensorflow and the gpu partition

If you installed Tensorflow with GPU support (historically the separate tensorflow-gpu package), your first order of business should be to familiarize yourself with Using O2 GPU resources. That page explains how to submit jobs to the gpu partition and request GPUs for your jobs. Once you are ready to submit your job, say, model.py, you need to make sure two additional resources are loaded: CUDA and CuDNN. These are libraries that allow Tensorflow to interface with the GPU and leverage its capabilities. Currently, O2's GPU nodes support CUDA 9.0. When Tensorflow supports CUDA 10.0, we will upgrade the drivers on the GPU nodes to work with CUDA 10.0.

When you submit your job to the gpu partition, make sure GCC and the correct python module are loaded, and your virtual environment is active. If not, run the corresponding commands in that exact order:

Then, you also need to load the same CUDA module as the one you used to build Tensorflow, e.g.:

(CuDNN is included with each of our CUDA modules.)

If you are in an interactive session (on a GPU node), you can now start running your code. If you plan to submit a batch job, place all of the above commands (with your choice of python module) into your submission script. From here, you should be all set! Make sure you fully understand the Using O2 GPU resources page before submitting to the partition, as resources are limited and pend time varies widely with current demand.
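A submission script that collects these commands might look like the sketch below. It is an illustration under assumptions rather than a tested O2 recipe: the file name `tf_gpu_job.sh`, the single-GPU request, the time limit, and the memory value are all hypothetical choices, `/path/to/nameofenv` is a placeholder, and `CUDA_VERSION` is a variable you would set yourself to the CUDA module version you used during installation.

```shell
# Write a hypothetical submission script, tf_gpu_job.sh, for a job that runs model.py.
cat > tf_gpu_job.sh <<'EOF'
#!/bin/bash
# Partition named on this page.
#SBATCH -p gpu
# Assumption: one GPU, a two-hour limit, and 8G of memory; adjust to your job.
#SBATCH --gres=gpu:1
#SBATCH -t 2:00:00
#SBATCH --mem=8G

# Load the modules in the same order used for the installation.
module load gcc/9.2.0
module load python/3.9.14
# CUDA_VERSION is a placeholder; the job stops here with a message if it is unset.
module load "cuda/${CUDA_VERSION:?set CUDA_VERSION to the CUDA module version you installed against}"

# Activate the virtual environment (placeholder path), then run the job.
source /path/to/nameofenv/bin/activate
python3 model.py
EOF

# Submit the script with: sbatch tf_gpu_job.sh
```

With Tensorflow 2.x, adding `python3 -c 'import tensorflow as tf; print(tf.config.list_physical_devices("GPU"))'` before the `model.py` line prints the GPUs Tensorflow can see, which is a quick way to confirm that the CUDA module and the GPU request took effect.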

Installation with (ana)conda

If you are familiar with the conda package manager (or the Anaconda environment system), it is also possible to install Tensorflow this way (and depending on your project requirements, this may be the best way to handle your Tensorflow installation). If you have your own ana/conda installation, skip to after the module load command.

First, create an interactive session as before:

To access the O2 conda module, simply type

Then, it is highly recommended that you create a new environment for Tensorflow (or your project that happens to use Tensorflow):

Now, we activate the environment:

Your prompt should now show the environment name in parentheses, just as with virtualenv above. If this command fails, you may need to specify the full path to the environment (especially if you used --prefix instead of --name to create it). From here, you can either use pip as above, or install Tensorflow with conda:
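The conda route can be sketched as a helper that creates a named environment and installs Tensorflow into it. This is a sketch under assumptions: the function name `make_conda_tf_env` and any environment name you pass (for example `tf_env`) are hypothetical, `python=3.9` is chosen only to match the python module used earlier on this page, and the `tensorflow` package is assumed to be available from the channels your conda is configured to use.

```shell
# Hypothetical helper: create a conda environment and install Tensorflow into it.
# Assumes a conda module (or your own conda installation) is already on the PATH.
make_conda_tf_env() {
    env_name="${1:-}"    # e.g. tf_env (hypothetical name)

    if [ -z "$env_name" ]; then
        echo "usage: make_conda_tf_env <env_name>" >&2
        return 1
    fi
    if ! command -v conda >/dev/null 2>&1; then
        echo "conda not found; load the conda module first" >&2
        return 1
    fi

    # Create the environment with a Python version matching the rest of this page.
    conda create --yes --name "$env_name" python=3.9 || return 1

    # Install Tensorflow into the named environment; --name targets it
    # without requiring the environment to be active.
    conda install --yes --name "$env_name" tensorflow
}
```

After `make_conda_tf_env tf_env` completes, activate the environment as described above before running your code.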

Once this completes, you should be ready to use Tensorflow! Keep in mind that everything said above about using hardware resources with the virtualenv installation (the gpu partition, CUDA, and CuDNN) still applies to this method of installation and usage.

Basic Troubleshooting

Depending on when you installed your copy of Tensorflow, you may see something like this when you run your code:

Loaded runtime CuDNN library: 7.0.4 but source was compiled with: 7.2.1. CuDNN ... Segmentation Fault

This is because the version of Tensorflow you installed used a newer version of CuDNN than the one that was found on the cluster. To fix this, you'll need to download a newer version of CuDNN and place it somewhere in a directory you own. You can submit a ticket with us if you'd like assistance with this issue.
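Once a newer CuDNN sits in a directory you own, Tensorflow still has to find it ahead of the cluster's copy. One common way, sketched here under assumptions, is to prepend that directory to `LD_LIBRARY_PATH` after loading the CUDA module. The location `$HOME/cudnn` is hypothetical, and the `lib64` subdirectory is an assumption; some CuDNN archives unpack their libraries into `lib` instead.

```shell
# Hypothetical location of a CuDNN copy you downloaded and unpacked yourself.
CUDNN_DIR="$HOME/cudnn"

# Prepend its library directory so the dynamic linker searches it first.
# Assumption: the shared libraries sit under lib64; adjust if your archive uses lib.
export LD_LIBRARY_PATH="$CUDNN_DIR/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

Run these lines after `module load` for CUDA, since loading a module may itself prepend entries to `LD_LIBRARY_PATH`.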

Newer CUDA modules include the correspondingly newer CuDNN libraries.