OpenFold On O2

OpenFold (GitHub: aqlaboratory/openfold) is a “faithful but trainable PyTorch reproduction of DeepMind’s AlphaFold 2”. It is a tool for predicting protein structures. OpenFold is now available on O2 as an experimental module. To install a local copy for yourself, please see Installing OpenFold Locally on O2.

Compared to AlphaFold (see Using AlphaFold on O2), OpenFold takes a similar set of parameters and can use either jackhmmer or mmseqs2 for the multiple sequence alignment (MSA) step of the analysis. OpenFold is generally faster than AlphaFold (up to 2x) and handles extremely long chains (4000+ residues) well.

Note: If you’re new to Slurm or O2, please see Using Slurm Basic for lots of information on submitting jobs.

 


Before starting: if you are working with a single protein, check whether its structure has already been computed. A full database can be found at the AlphaFold Protein Structure Database. It may save you a lot of time!

Loading the OpenFold Module

OpenFold has many features and modes that are more thoroughly described on the repository site. Below, we will focus on a simple (single protein) folding example to show how it can be run on O2. To access the module, run:

$ module load gcc/9.2.0 openfold/1.0.1

Once you have loaded this module, you’ll want to submit your job to the gpu partition (or gpu_quad, if you have access) so that you can leverage GPU resources (see Using O2 GPU resources). To see all the parameters of OpenFold’s main script, run the following after loading the module:

$ python3 $OPENFOLDDIR/openfold/run_pretrained_openfold.py -h

At this time, running proteins with relaxation causes errors. Always include the option --skip_relaxation when running OpenFold.

.fasta files are analyzed in batches, meaning that whole directories can be processed in one run. On O2, .fasta files will be aligned using jackhmmer by default. OpenFold can accept MSAs from either jackhmmer or mmseqs2, as long as they are in the format produced by the steps in the MSA generation section below. A typical .fasta file will contain only one protein sequence:

>header1
(insert amino acid sequence here)

Multimer .fasta files are created by adding a header for each protein sequence. They are run by passing the option --multimer_ri_gap 200 to the run_pretrained_openfold.py command, as described in the Running OpenFold section below.
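The monomer and multimer file layouts can be sketched with a short shell snippet. This is only an illustration: the file names, headers, residues, and the helper count_fasta_records are made up for the example and are not part of OpenFold.

```shell
# Hypothetical helper (not part of OpenFold): count the sequences in a FASTA
# file by counting its header lines, which start with '>'.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Scratch directory for the example files.
fasta_dir=$(mktemp -d)

# Monomer input: one header, one sequence (placeholder residues).
cat > "$fasta_dir/monomer.fasta" <<'EOF'
>header1
MKTAYIAKQRQISFVKSHFSRQ
EOF

# Multimer input: one header per chain, all in the same file.
cat > "$fasta_dir/complex.fasta" <<'EOF'
>chainA
MKTAYIAKQRQISFVKSHFSRQ
>chainB
GSHMSLEERLGLIEVQAPILSR
EOF

count_fasta_records "$fasta_dir/monomer.fasta"   # prints 1
count_fasta_records "$fasta_dir/complex.fasta"   # prints 2
```

A count above 1 marks a file as a complex, which is the case that needs --multimer_ri_gap 200.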

Complexes are run using AlphaFold-Gap, a hack described in more detail in a thread here, which uses the stock AlphaFold/OpenFold parameters. The OpenFold developers also provide an experimental multimer branch of OpenFold; it has to be installed manually using the instructions in Installing OpenFold Locally on O2.

Generating MSAs with Jackhmmer or MMSeqs2 using OpenFold Scripts

OpenFold provides two options for generating MSAs. Below, we adapt the code presented in the README to O2. The first option uses jackhmmer. Note that this script prints its help information when run with the -h option.

We can submit an sbatch job to generate the MSAs from input.fasta.
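A minimal sketch of such a job follows. It writes the job script to a file so it can be edited before submission. Several things here are assumptions rather than O2-specific facts: the script location (inferred by analogy with run_pretrained_openfold.py), the flag names (taken from the upstream OpenFold README, which passes a directory of .fasta files), the short partition and resource requests, and the idea that the module puts the alignment binaries on your PATH. Every /PATH/TO/... value is a placeholder. Confirm the options with -h before relying on them.

```shell
# Write a job script for the jackhmmer alignment step. Review and edit it,
# then submit it with: sbatch msa_jackhmmer.sh
cat > msa_jackhmmer.sh <<'EOF'
#!/bin/bash
# Partition, cores, time, and memory are illustrative assumptions.
#SBATCH -p short
#SBATCH -c 8
#SBATCH -t 0-12:00
#SBATCH --mem=32G
#SBATCH -o msa_jackhmmer_%j.out

module load gcc/9.2.0 openfold/1.0.1

# Assumed script location; flag names follow the upstream README.
# All /PATH/TO/... values are placeholders for the databases you use.
python3 "$OPENFOLDDIR/openfold/scripts/precompute_alignments.py" \
    /PATH/TO/FASTA/DIR/ \
    /PATH/TO/MSA/OUTPUT/DIR/ \
    --uniref90_database_path /PATH/TO/uniref90.fasta \
    --mgnify_database_path /PATH/TO/mgnify.fa \
    --pdb70_database_path /PATH/TO/pdb70 \
    --uniclust30_database_path /PATH/TO/uniclust30 \
    --bfd_database_path /PATH/TO/bfd \
    --jackhmmer_binary_path "$(command -v jackhmmer)" \
    --hhblits_binary_path "$(command -v hhblits)" \
    --hhsearch_binary_path "$(command -v hhsearch)" \
    --kalign_binary_path "$(command -v kalign)"
EOF
```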

This script should generate a series of files in the output directory, including the MSA files. The second method uses mmseqs2 to generate the MSA. As with the jackhmmer script, help information is available with the -h option.

Below is an example of a job script that uses mmseqs2 to run the alignment step.
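One possible shape for that job script is sketched here, written to a file for review before submission. The script location, the order of the positional arguments (as in the upstream OpenFold README), the resource requests, and the availability of an mmseqs binary on your PATH are all assumptions. The database directory and basename are placeholders. Check the script's -h output before use.

```shell
# Write a job script for the mmseqs2 alignment step. Review and edit it,
# then submit it with: sbatch msa_mmseqs.sh
cat > msa_mmseqs.sh <<'EOF'
#!/bin/bash
# Partition, cores, time, and memory are illustrative assumptions.
#SBATCH -p short
#SBATCH -c 8
#SBATCH -t 0-12:00
#SBATCH --mem=64G
#SBATCH -o msa_mmseqs_%j.out

module load gcc/9.2.0 openfold/1.0.1

# Positional arguments, in the order used by the upstream README:
#   input FASTA, MMseqs2 database directory, UniRef database basename,
#   output directory, path to the mmseqs binary.
# Assumed script location; uppercase values are placeholders.
python3 "$OPENFOLDDIR/openfold/scripts/precompute_alignments_mmseqs.py" \
    /PATH/TO/input.fasta \
    /PATH/TO/MMSEQS/DB/DIR/ \
    UNIREF_DB_BASENAME \
    /PATH/TO/MSA/OUTPUT/DIR/ \
    "$(command -v mmseqs)"
EOF
```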

These outputs can be used in the next step by adding the option --use_precomputed_alignments with the path to your MSA directory.

Running OpenFold

We have public databases available in /n/shared_db/ (see Public Databases), and we can use these in the OpenFold run command.

Note: The input line /PATH/TO/FASTA/DIR/ must point to a directory that contains only .fasta files.
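Because the input directory must hold only .fasta files, it can help to check it before submitting. A minimal sketch follows; list_non_fasta is a hypothetical helper name, not an OpenFold or O2 command, and the demo directory exists only for illustration.

```shell
# Hypothetical helper: print every entry in a directory that does not end in
# .fasta. No output means the directory contains only .fasta files.
list_non_fasta() {
    find "$1" -mindepth 1 -maxdepth 1 ! -name '*.fasta'
}

# Demo on a scratch directory with one valid input and one stray file.
demo_dir=$(mktemp -d)
touch "$demo_dir/protein.fasta" "$demo_dir/notes.txt"

stray=$(list_non_fasta "$demo_dir")
if [ -n "$stray" ]; then
    echo "Move these out of the input directory before running:"
    echo "$stray"
fi
```

On your own data, call list_non_fasta with the directory you intend to pass to OpenFold.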

Monomer (single protein) Job Example:
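The sketch below writes a monomer job script to a file for review. The flag names follow the upstream OpenFold README, and the resource requests are illustrative. The database and template paths are placeholders to be filled in from /n/shared_db/. Depending on how the module is configured, extra options (for example, binary or weight paths) may be needed; the -h output lists them.

```shell
# Write a monomer job script. Review and edit it, then submit it with:
#   sbatch openfold_monomer.sh
cat > openfold_monomer.sh <<'EOF'
#!/bin/bash
# Cores, time, and memory are illustrative assumptions.
#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH -c 4
#SBATCH -t 0-06:00
#SBATCH --mem=32G
#SBATCH -o openfold_monomer_%j.out

module load gcc/9.2.0 openfold/1.0.1

# Positional arguments: FASTA directory, then template mmCIF directory.
# All /PATH/TO/... values are placeholders.
python3 "$OPENFOLDDIR/openfold/run_pretrained_openfold.py" \
    /PATH/TO/FASTA/DIR/ \
    /PATH/TO/TEMPLATE/MMCIF/DIR/ \
    --uniref90_database_path /PATH/TO/uniref90.fasta \
    --mgnify_database_path /PATH/TO/mgnify.fa \
    --pdb70_database_path /PATH/TO/pdb70 \
    --uniclust30_database_path /PATH/TO/uniclust30 \
    --bfd_database_path /PATH/TO/bfd \
    --output_dir /PATH/TO/OUTPUT/DIR/ \
    --model_device "cuda:0" \
    --skip_relaxation
# To reuse alignments from an earlier MSA job, add:
#   --use_precomputed_alignments /PATH/TO/MSA/DIR/
EOF
```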

Multimer (protein complex) Job Example:
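A complex can be sketched the same way. This version assumes each .fasta file in the input directory contains one header per chain, and it adds --multimer_ri_gap 200. As before, the flag names follow the upstream README, the resource requests are illustrative, and all /PATH/TO/... values are placeholders.

```shell
# Write a multimer job script. Review and edit it, then submit it with:
#   sbatch openfold_multimer.sh
cat > openfold_multimer.sh <<'EOF'
#!/bin/bash
# Cores, time, and memory are illustrative assumptions; long complexes
# may need more of each.
#SBATCH -p gpu
#SBATCH --gres=gpu:1
#SBATCH -c 4
#SBATCH -t 0-12:00
#SBATCH --mem=64G
#SBATCH -o openfold_multimer_%j.out

module load gcc/9.2.0 openfold/1.0.1

# The FASTA directory holds multi-header files, one header per chain.
python3 "$OPENFOLDDIR/openfold/run_pretrained_openfold.py" \
    /PATH/TO/FASTA/DIR/ \
    /PATH/TO/TEMPLATE/MMCIF/DIR/ \
    --uniref90_database_path /PATH/TO/uniref90.fasta \
    --mgnify_database_path /PATH/TO/mgnify.fa \
    --pdb70_database_path /PATH/TO/pdb70 \
    --uniclust30_database_path /PATH/TO/uniclust30 \
    --bfd_database_path /PATH/TO/bfd \
    --output_dir /PATH/TO/OUTPUT/DIR/ \
    --model_device "cuda:0" \
    --multimer_ri_gap 200 \
    --skip_relaxation
EOF
```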

The output will contain temporary files (.fasta, .json), along with a predictions directory that contains .pdb files.
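To collect the predicted structures after a run, something like the following may be convenient. list_predictions is a hypothetical helper, and it assumes only the predictions directory layout described above.

```shell
# Hypothetical helper: list the .pdb files under the predictions directory of
# an OpenFold output directory, sorted by name.
list_predictions() {
    find "$1/predictions" -name '*.pdb' | sort
}

# Usage (placeholder path):
#   list_predictions /PATH/TO/OUTPUT/DIR/
```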