NOTICE: FULL O2 Cluster Outage, January 3 - January 10th

O2 will be completely offline for a planned HMS IT data center relocation from Friday, Jan 3, 6:00 PM, through Friday, Jan 10.

  • on Jan 3 (5:30-6:00 PM): O2 login access will be turned off.
  • on Jan 3 (6:00 PM): O2 systems will start being powered off.

This project will relocate existing services, consolidate servers, reduce power consumption, and decommission outdated hardware to improve efficiency, enhance resiliency, and lower costs.

Specifically:

  • The O2 Cluster will be completely offline, including O2 Portal.
  • All data on O2 will be inaccessible.
  • Any jobs still pending when the outage begins will need to be resubmitted after O2 is back online.
  • Websites on O2 will be completely offline, including all web content.

More details at: https://harvardmed.atlassian.net/l/cp/1BVpyGqm & https://it.hms.harvard.edu/news/upcoming-data-center-relocation

RoseTTAFold All-Atom on O2

HMS IT has received numerous requests to install RoseTTAFold All-Atom as a global module available through module load. Unfortunately, because of the way this application is configured, and its general unfriendliness to shared environments, we are unable to offer it at this time.

Below is an outline for how to install RoseTTAFold All-Atom locally (i.e., into a home directory or group directory). We can make no guarantees as to runtime correctness (or even that it will run at all), but these instructions are confirmed to result in a “complete” installation of RoseTTAFold All-Atom. Any errors related to post-installation configuration may need to be raised with the developers, but users are welcome to contact rchelp@hms.harvard.edu to initially triage any issues with the process.

This document is designed to be read and followed from top to bottom. Skipping around makes troubleshooting difficult. If you get stuck midway, we recommend deleting everything and starting over from the very beginning to see whether the error persists. (The Table of Contents above is provided as a courtesy.)

Prerequisites to the prerequisites

These instructions are adapted from those provided via the RoseTTAFold All-Atom README file.

First, determine where to install RoseTTAFold All-Atom. This location will involve two main components:

  1. the conda environment

  2. a git clone of the repository.

We will use /path/to/install as a placeholder for this location. First, create (if necessary) and navigate to this directory:

mkdir -p /path/to/install

cd /path/to/install

From here, clone the repository and navigate into it:

git clone https://github.com/baker-laboratory/RoseTTAFold-All-Atom.git

cd RoseTTAFold-All-Atom

Now, make sure you have a conda distribution available, with mamba installed. We will use our miniconda3/23.1.0 module here.

module load miniconda3/23.1.0

Note that if you have additional modules loaded (such as modules that load on login), you may encounter conflicts and errors. We recommend running module purge to unload all modules before loading the miniconda module.
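The purge-then-load sequence can be wrapped in a small helper, sketched below. The function name load_conda_module is hypothetical; module purge and module load miniconda3/23.1.0 are the commands named above. The sketch assumes the module command is available in the current shell, as it normally is on O2.

```shell
# Hypothetical helper: start from a clean module state, then load miniconda.
load_conda_module() {
    # 'module' is usually a shell function provided by the cluster environment.
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not found; are you on O2?" >&2
        return 1
    fi
    module purge                      # unload everything, including login defaults
    module load miniconda3/23.1.0     # the conda distribution used in this guide
}
```

Calling load_conda_module once at the start of a session should leave only the miniconda module (and whatever it pulls in) loaded.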

Installation of the conda environment

Currently, you should be in /path/to/install/RoseTTAFold-All-Atom.

Before creating the environment, you need to open the environment.yaml file in a text editor, such as nano, and comment out all instances of tensorflow. This includes:

  • tensorflow-base
  • tensorflow-estimator
  • tensorflow

Navigate to the lines in the file where these three packages are specified, and insert a # at the front of each line, then save and exit the text editor.
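If you would rather not edit the file by hand, the same change can be sketched with sed. The helper name comment_out_tensorflow is hypothetical, and the pattern assumes each dependency is listed on its own line in the form "  - tensorflow...", which may not match every version of environment.yaml. A backup copy is kept as environment.yaml.bak.

```shell
# Hypothetical helper: prefix every dependency line that starts with
# "tensorflow" with a '#', keeping a backup copy as <file>.bak.
comment_out_tensorflow() {
    local yaml_file="$1"
    # Assumes entries look like "  - tensorflow=...", "  - tensorflow-base=...".
    # Lines that are already commented out are left unchanged.
    sed -E -i.bak 's/^([[:space:]]*-[[:space:]]*tensorflow)/#\1/' "$yaml_file"
}

# Usage (from /path/to/install/RoseTTAFold-All-Atom):
#   comment_out_tensorflow environment.yaml
#   grep -n tensorflow environment.yaml   # every match should now start with '#'
```

Review the grep output before continuing; if any tensorflow line is still uncommented, fall back to editing the file manually.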

Now, with access to a mamba-enabled conda distribution, run the following command:

mamba env create -f environment.yaml

By default, this creates a conda environment named RFAA in $HOME/.conda/envs/RFAA. If you would rather the environment live elsewhere (such as /path/to/install), you can run this command instead:

mamba env create -p /path/to/install/RFAA -f environment.yaml

And this will create the conda environment at /path/to/install/RFAA instead.

Activate the environment:

source activate /path/to/install/RFAA

if installed to /path/to/install (via the -p flag) or just

source activate RFAA

if installed to the default location (using the first command).
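Before installing anything with pip3, it can help to confirm that the RFAA environment really is the active one. The check below is a minimal sketch: the function name rfaa_env_is_active is hypothetical, and it assumes the environment directory is named RFAA, as in both commands above, so that $CONDA_PREFIX ends in /RFAA once activated.

```shell
# Hypothetical check: succeed only if the active conda environment is RFAA.
rfaa_env_is_active() {
    case "${CONDA_PREFIX:-}" in
        */RFAA)
            echo "RFAA environment active: $CONDA_PREFIX"
            ;;
        *)
            echo "RFAA environment is not active (CONDA_PREFIX='${CONDA_PREFIX:-}')" >&2
            return 1
            ;;
    esac
}
```

If the check fails, rerun the appropriate source activate command before continuing.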

Next, manually pip3 install the tensorflow components that were commented out of the environment.yaml file. At the time of writing, these packages are version 2.11.0, so that’s what we install:

pip3 install tensorflow==2.11.0 tensorflow-estimator==2.11.0

You may see that they’ve already been installed (along with the requisite dependencies). You can confirm this by running the following:

python3 -c "from tensorflow import estimator"

This command will verify that 1) tensorflow is installed, and 2) that the estimator module is accessible. tensorflow-base is just an anaconda-specific method of packaging only the primary components of tensorflow, but since we installed it via pip3 instead, we don’t need to worry about it.

Installation of the prerequisites

Now we install the separate dependencies specified by the README.md file.

We follow the example (at the time of writing) and grab the “fast” variant of the SignalP 6.0 tarball from this location:

https://services.healthtech.dtu.dk/services/SignalP-6.0/

(Click “Downloads”, click “Fast”, then accept the license to download the package.)

Move this package onto O2, into /path/to/install, using whichever file transfer method you are familiar with. We have a page that outlines popular methods of file transfer. In the terminal session, navigate back to this folder:

cd ..

(If you have been working elsewhere in this terminal, you may need to run cd /path/to/install explicitly.)

As instructed, you should have a file here that is called something like signalp-6.0h.fast.tar.gz. Run the following command:

signalp-register signalp-6.0h.fast.tar.gz

Then:

mv $CONDA_PREFIX/lib/python3.10/site-packages/signalp/model_weights/distilled_model_signalp6.pt $CONDA_PREFIX/lib/python3.10/site-packages/signalp/model_weights/ensemble_model_signalp6.pt

to rename the weights. Finally, you can run the following:

bash RoseTTAFold-All-Atom/install_dependencies.sh
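If you end up repeating these steps, the weight-renaming mv above fails on a second run because the source file no longer exists. A rerun-safe variant might look like the sketch below. The function name rename_signalp_weights is hypothetical; the two file names are the ones used in the mv command above.

```shell
# Hypothetical helper: rename the SignalP "fast" weights only when needed.
rename_signalp_weights() {
    local weights_dir="$1"
    local src="$weights_dir/distilled_model_signalp6.pt"
    local dst="$weights_dir/ensemble_model_signalp6.pt"
    if [ -f "$src" ]; then
        mv "$src" "$dst"
    elif [ -f "$dst" ]; then
        echo "weights already renamed: $dst"
    else
        echo "no SignalP weights found in $weights_dir" >&2
        return 1
    fi
}

# Usage, with the RFAA environment active:
#   rename_signalp_weights "$CONDA_PREFIX/lib/python3.10/site-packages/signalp/model_weights"
```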

The model weights are already available centrally at:

/n/shared_db/RoseTTAFold/All-Atom_weights

You may choose to download them yourself anyway, with the following command:

wget http://files.ipd.uw.edu/pub/RF-All-Atom/weights/RFAA_paper_weights.pt

See the Databases section below for database locations.

Finally, follow the instructions to install the specified version of BLAST:

wget https://ftp.ncbi.nlm.nih.gov/blast/executables/legacy.NOTSUPPORTED/2.2.26/blast-2.2.26-x64-linux.tar.gz

mkdir -p blast-2.2.26

tar -xf blast-2.2.26-x64-linux.tar.gz -C blast-2.2.26

cp -r blast-2.2.26/blast-2.2.26/ blast-2.2.26_bk

rm -r blast-2.2.26

mv blast-2.2.26_bk/ blast-2.2.26

At this point, /path/to/install should have several directories/files:

  1. the conda environment (RFAA), if installed using the -p flag (otherwise, this lives at $HOME/.conda/envs/RFAA instead)

  2. the git repository (RoseTTAFold-All-Atom)

  3. a cs-blast-2.2.3 directory, created by install_dependencies.sh above

  4. a blast-2.2.26 directory, created by following the instructions above

  5. RFAA_paper_weights.pt (if you chose to download this yourself instead of using the one in /n/shared_db)
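A quick way to compare your directory against this list is sketched below. The function name check_rfaa_layout is hypothetical, and the directory names are copied from the list above; if install_dependencies.sh produces a differently named directory on your system, adjust the names accordingly. The conda environment and the weights file are optional in the list, so the sketch only checks the three directories that should always be present.

```shell
# Hypothetical check: report which expected directories exist under an install root.
check_rfaa_layout() {
    local root="$1"
    local missing=0
    local item
    # Names taken from the list above; RFAA and RFAA_paper_weights.pt are
    # optional there, so they are not checked here.
    for item in RoseTTAFold-All-Atom cs-blast-2.2.3 blast-2.2.26; do
        if [ -d "$root/$item" ]; then
            echo "found:   $root/$item"
        else
            echo "missing: $root/$item" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   check_rfaa_layout /path/to/install
```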

Databases

Databases do not need to be manually installed - they are accessible at:

/n/shared_db/RoseTTAFold

Feel free to browse this folder to confirm specific database types.

Post-installation configuration

HMS IT will not be able to offer much guidance here - configuration details will be dependent on the user workflow and data to be processed.

We can provide some suggestions for defaults, however. The main configuration files are located at /path/to/install/RoseTTAFold-All-Atom/rf2aa/config/inference.

We provide the following suggested parameters for the base.yaml file specifically:

checkpoint_path: RFAA_paper_weights.pt

should become:

checkpoint_path: "/path/to/install/RFAA_paper_weights.pt"

if downloaded locally, or

checkpoint_path: "/n/shared_db/RoseTTAFold/All-Atom_weights/RFAA_paper_weights.pt"

if using the central file.

sequencedb:

should point to the appropriate database in /n/shared_db/RoseTTAFold.

hhdb: "pdb100_2021Mar03/pdb100_2021Mar03"

should become:

hhdb: "/n/shared_db/RoseTTAFold/pdb100_2021Mar03/pdb100_2021Mar03"
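These edits can be made in a text editor, or scripted along the lines of the sketch below. The helper name set_yaml_value is hypothetical, and the approach is a plain sed substitution rather than a real YAML parser: it assumes each key appears on a single line of its own, only once in the file, and that the new value contains no '|' or '&' characters. A backup is written to base.yaml.bak.

```shell
# Hypothetical helper: replace the value of a "key: value" line in a YAML file.
# Keeps the original indentation and writes a backup to <file>.bak.
# Assumes each key appears once and the value contains no '|' or '&' characters.
set_yaml_value() {
    local yaml_file="$1"
    local key="$2"
    local value="$3"
    sed -E -i.bak "s|^([[:space:]]*)${key}:.*|\1${key}: \"${value}\"|" "$yaml_file"
}

# Usage (central weights and database shown; substitute local paths if preferred):
#   cd /path/to/install/RoseTTAFold-All-Atom/rf2aa/config/inference
#   set_yaml_value base.yaml checkpoint_path /n/shared_db/RoseTTAFold/All-Atom_weights/RFAA_paper_weights.pt
#   set_yaml_value base.yaml hhdb /n/shared_db/RoseTTAFold/pdb100_2021Mar03/pdb100_2021Mar03
```

Comparing base.yaml against base.yaml.bak afterwards (for example with diff) is a simple way to confirm that only the intended lines changed.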

It should be noted that the default working directory for invoking the All-Atom workflow appears to be the root of the GitHub repository (that is, /path/to/install/RoseTTAFold-All-Atom). The other configuration files may need to be modified accordingly. The above changes are necessary because the default instructions set up everything inside this root level, whereas this guide installs alongside it. If you would rather follow the default layout, you can cd into this folder before executing the commands in the Installation of the prerequisites section. The following directories and files will then exist within /path/to/install/RoseTTAFold-All-Atom instead of /path/to/install as specified previously:

  1. the cs-blast-2.2.3 directory

  2. the blast-2.2.26 directory

  3. RFAA_paper_weights.pt

The databases would live here as well, but due to the large storage footprint required, we strongly recommend using the files living at /n/shared_db/RoseTTAFold instead.

Execution

Note the values of num_cpus and mem in base.yaml; these dictate the SLURM resources requested for O2 jobs. You may wish to adjust them to suit your workload.
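To see the current values without opening the file, a small grep wrapper such as the following can be used. The function name show_slurm_resources is hypothetical, and the pattern assumes the keys appear as "num_cpus:" and "mem:" at the start of a line, optionally indented.

```shell
# Hypothetical helper: print the num_cpus and mem lines, with line numbers.
show_slurm_resources() {
    grep -nE '^[[:space:]]*(num_cpus|mem):' "$1"
}

# Usage:
#   show_slurm_resources /path/to/install/RoseTTAFold-All-Atom/rf2aa/config/inference/base.yaml
```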

At this point, there are many ways for things to go wrong; please contact rchelp@hms.harvard.edu with your questions (and please provide terminal output as well as installation locations). We can attempt to assist, but may ultimately point you toward creating an issue on the GitHub repository.