TensorFlow on O2

Due to recent developments in deep learning and related fields, TensorFlow and its components have been widely requested by the user community. Because of the nature of the package, however, we have decided it is best for users to manage their own installations, so that they can quickly modify or upgrade TensorFlow to suit their needs without waiting for Research Computing to handle version changes. This page therefore provides basic instructions on how to install TensorFlow, without elevated privileges, into a local directory owned by the user.

Basic Installation

A year ago, TensorFlow was quite difficult to install into a shared computing environment such as O2. However, the developers have since made it far friendlier to set up for the average user who is not computing locally. All that needs to be done is to invoke pip to complete the installation.

TensorFlow is compatible with both Python 2 and 3 as of this writing, and comes in two flavors: tensorflow and tensorflow-gpu. Before installing either of them, it is strongly recommended to set up a virtual environment.

First, request an interactive session:

$ srun --pty -t 2:0:0 --mem=2G -p interactive bash

If you are planning to use TensorFlow immediately after installing it inside the interactive session, it may be wise to request more memory when submitting the request.

Once you're on a compute node, load the prerequisite modules:

$ module load gcc/6.2.0

This will expose the available python modules to you. From here, you have two choices.

For Python 2:

$ module load python/2.7.12-ucs4

The other Python 2 module will NOT work. This means that even if you already have an existing Python virtual environment, you may still need to create a new one for TensorFlow.

For Python 3, you can use any version we offer. For example:

$ module load python/3.6.0

If you are planning to install or use the GPU version, you also need to load the latest CUDA module. For example:

$ module load cuda/10.2

Once you have confirmed that gcc, one of the two Python modules above, and (if applicable) CUDA are loaded, create a virtual environment (instructions replicated here from Personal Python Packages):

$ virtualenv /path/to/nameofenv --system-site-packages

where /path/to/nameofenv is your chosen name and location for the environment. (/path/to should already exist.) If you'd like to install fresh copies of TensorFlow's dependencies, feel free to omit the --system-site-packages flag. Once the environment is created, activate it:

$ source /path/to/nameofenv/bin/activate

After this, your prompt should look something like this:

(nameofenv)$

From here, you should be ready to install TensorFlow.

As mentioned above, there are two flavors: tensorflow and tensorflow-gpu. tensorflow-gpu can be thought of as a superset of tensorflow: tensorflow runs only on CPU, while tensorflow-gpu runs on both CPU and GPU, giving the GPU priority depending on how your code is written. For ease of use (and/or if you have no plans to use a GPU), install the regular version. For full functionality (including planned usage of the gpu partition), install the GPU-enabled version, but be aware that if you have installed tensorflow-gpu and plan to use only CPU resources (for whatever reason), you will still need to submit to the gpu partition. Depending on your requirements, it may be worth maintaining two separate installations, one GPU-enabled and one CPU-only, so that you can submit to other partitions for the parts of your work that do not need GPU resources.

So, choose one:

(nameofenv)$ pip install tensorflow

for the CPU-only version, or:

(nameofenv)$ pip install tensorflow-gpu

for the GPU- and CPU-enabled version.

If you chose the first option, you're done! You can immediately begin coding with TensorFlow.
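To confirm the installation worked, you can try importing the package from inside the activated environment. This is just a quick sanity check, not part of the official instructions:

```shell
# Run inside the activated environment. Prints the installed version,
# or a hint if the import fails (e.g. the environment is not active).
python - <<'EOF'
try:
    import tensorflow as tf
    print("TensorFlow", tf.__version__)
except ImportError:
    print("TensorFlow is not importable; is the virtual environment active?")
EOF
```

If this prints a version number, pip installed the package into your environment correctly.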

If you chose the second option, and plan to use the gpu partition, a couple more steps are required for you to set up your code to leverage GPUs.

TensorFlow and the gpu partition

If you installed tensorflow-gpu, your first order of business should be to familiarize yourself with Using O2 GPU resources. That page explains how to submit jobs to the gpu partition and request GPUs for your jobs. Once you are ready to submit your job, say, model.py, you need to make sure a couple of additional resources are loaded: namely, CUDA and cuDNN. These are the libraries that allow TensorFlow to interface with the GPU and leverage its capabilities. Currently, O2's GPU nodes support CUDA 9.0. When TensorFlow supports CUDA 10.0, we will upgrade the drivers on the GPU nodes to work with CUDA 10.0.

When you submit your job to the gpu partition, make sure GCC and the correct Python module are loaded, and that your virtual environment is active. If not, run the corresponding commands in this exact order:

$ module load gcc/6.2.0
# then either of
$ module load python/2.7.12-ucs4
# OR
$ module load python/3.6.0
# then
$ source /path/to/nameofenv/bin/activate

Then, you also need to load the same CUDA module that you used when installing TensorFlow, e.g.:

(nameofenv)$ module load cuda/10.2

If you are in an interactive session (on a GPU node), you can now start running your code. If you plan to submit a batch job, place all of the above commands (with your choice of Python module) into your submission script. From here, you should be all set! Make sure you fully understand the Using O2 GPU resources page before submitting to the partition, as there is a limited number of GPUs available, and pend time is highly variable depending on current demand.
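As an illustration only, a batch submission script might look like the following sketch. The resource requests (GPU count, walltime, memory) are placeholders, and the exact flags for requesting GPUs are documented on the Using O2 GPU resources page:

```shell
#!/bin/bash
#SBATCH -p gpu                # gpu partition
#SBATCH --gres=gpu:1          # request one GPU (see Using O2 GPU resources)
#SBATCH -t 2:00:00            # placeholder walltime
#SBATCH --mem=8G              # placeholder memory request

module load gcc/6.2.0
module load python/3.6.0      # or python/2.7.12-ucs4
module load cuda/10.2         # same CUDA module used at install time
source /path/to/nameofenv/bin/activate

python model.py
```

Adjust the module versions and environment path to match the choices you made during installation.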

Installation with (ana)conda

If you are familiar with the conda package manager (or the Anaconda environment system), it is also possible to install TensorFlow this way, and depending on your project requirements this may be the best way to manage your TensorFlow installation. If you have your own Anaconda/Miniconda installation, skip ahead to just after the module load command.

First, create an interactive session as before:

$ srun --pty -t 2:0:0 --mem=2G -p interactive bash

To access the O2 conda module, simply type

# for safety, dump your current modules first
$ module purge
$ module load conda2

Then, it is highly recommended that you create a new environment for TensorFlow (or for the project that happens to use TensorFlow).

$ conda create --name tensorflowenv python=3.6

You can substitute the name of the environment with any name of your choosing, as well as pick an appropriate version of Python. Alternatively, you can use --prefix instead of --name, and specify the exact location of the environment, as well.
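For example, with --prefix the same environment can be created at an explicit location (the path below is a placeholder):

```shell
# create the environment at an explicit path instead of by name
$ conda create --prefix /path/to/envs/tensorflowenv python=3.6
# environments created with --prefix are then activated by path:
$ source activate /path/to/envs/tensorflowenv
```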

Now, we activate the environment:

$ source activate tensorflowenv

You should see the environment name appear in parentheses in your prompt, just as with virtualenv above. If this command fails, you may need to specify the full path to the environment (especially if you created it with --prefix instead of --name). From here, you can either use pip as above, or install TensorFlow with conda:

(tensorflowenv)$ conda install tensorflow
# OR
(tensorflowenv)$ conda install tensorflow-gpu

Once this completes, you should be ready to use TensorFlow! Keep in mind that all of the above stipulations about hardware resources from the virtualenv process still apply to this method of installation and usage.

Basic Troubleshooting

Depending on when you installed your copy of TensorFlow, you may see something like this when you run your code:

Loaded runtime CuDNN library: 7.0.4 but source was compiled with: 7.2.1. CuDNN ... Segmentation Fault

This happens because the version of TensorFlow you installed was built against a newer version of cuDNN than the one found on the cluster. To fix this, you'll need to download a newer version of cuDNN and place it in a directory you own. You can submit a ticket with us if you'd like assistance with this issue.