OpenFold On O2
OpenFold (https://github.com/aqlaboratory/openfold) is a “faithful but trainable PyTorch reproduction of DeepMind’s AlphaFold 2”. It is a tool for predicting protein structures. OpenFold is now available on O2 as an experimental module. To install a local copy for yourself, please see Installing OpenFold Locally on O2.
Compared to AlphaFold (Using AlphaFold on O2), OpenFold takes roughly the same set of parameters and can use either jackhmmer or mmseqs2 for the multiple sequence alignment (MSA) step of analysis. OpenFold is generally faster than AlphaFold (up to 2x) and adept at processing extremely long chains (4000+ residues).
Note: If you’re new to Slurm or O2, please see Using Slurm Basic for lots of information on submitting jobs.
Before starting: if you are working with a single protein, check whether its structure has already been computed. A full database can be found at https://alphafold.ebi.ac.uk/. It may save you a lot of time!
Loading the OpenFold Module
OpenFold has many features and modes that are more thoroughly described on the repository site. Below, we will focus on a simple (single protein) folding example to show how it can be run on O2. To access the module, run:
$ module load gcc/9.2.0 openfold/1.0.1
Once you have loaded this module, you’ll want to submit your job to the gpu (or gpu_quad, if you have access) partition so that you can leverage GPU resources (Using O2 GPU resources). To see all the parameters related to OpenFold’s main function, try running the following after loading the module:
$ python3 $OPENFOLDDIR/openfold/run_pretrained_openfold.py -h
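Before submitting a job, it can help to confirm that the module is actually loaded in the current shell. The sketch below assumes only what the command above implies, namely that the module sets OPENFOLDDIR and that the script lives under $OPENFOLDDIR/openfold/. The function name check_openfold_env is a hypothetical helper, not part of OpenFold or O2.

```shell
# Sketch: verify that the OpenFold module is loaded before submitting a job.
# Assumes the module sets OPENFOLDDIR, as used by the help command above.
# check_openfold_env is a hypothetical helper name.
check_openfold_env() {
    if [ -z "${OPENFOLDDIR:-}" ]; then
        echo "OPENFOLDDIR is not set; load the openfold module first" >&2
        return 1
    fi
    script="$OPENFOLDDIR/openfold/run_pretrained_openfold.py"
    if [ ! -f "$script" ]; then
        echo "Cannot find $script" >&2
        return 1
    fi
    # Print the full path of the main script on success
    echo "$script"
}
```

Calling check_openfold_env at the top of a job script makes the job fail early, with a readable message, if the module load line was forgotten.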
At this time, running proteins with relaxation causes errors. Always include the option --skip_relaxation when running OpenFold.
.fasta files are analyzed in a batched fashion, meaning that whole directories can be processed in one run. On O2, .fasta files will be aligned using jackhmmer by default. OpenFold can accept MSAs from either jackhmmer or mmseqs2, as long as they are presented in a format similar to the output of the steps in the MSA generation section below. A typical .fasta file will contain only one protein sequence:
>header1
(insert amino acid sequence here)
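To check how many sequences a .fasta file holds before submitting it, one option is to count its header lines. This is a minimal sketch that assumes every record starts with a line beginning with ">"; count_fasta_records is a hypothetical helper name.

```shell
# Sketch: count the records in a FASTA file by counting '>' header lines.
# count_fasta_records is a hypothetical helper, not part of OpenFold.
count_fasta_records() {
    # grep -c exits non-zero when the count is 0, so mask that with || true
    grep -c '^>' "$1" || true
}
```

A result of 1 indicates a single-protein file; a larger number means the file holds several sequences and would be treated as a complex.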
Multimer .fasta files are created by adding a header for each protein sequence and passing the option --multimer_ri_gap 200 along with the run_pretrained_openfold.py command, described in more detail in the Running OpenFold section below.
Complexes are run using AlphaFold-Gap, a hack described in more detail in a thread here, which uses the stock AlphaFold/OpenFold parameters. The OpenFold developers also provide an experimental multimer branch of OpenFold; it has to be installed manually using the instructions in Installing OpenFold Locally on O2.
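To make the role of the gap concrete: in the AlphaFold-Gap approach the chains are joined into one sequence, and the residue numbering jumps by the gap value at each chain boundary so that the model treats the chains as far apart in sequence. The sketch below only illustrates that numbering scheme under this assumption; it is not OpenFold's own code, and gapped_residue_indices is a hypothetical helper name.

```shell
# Sketch: residue indices for concatenated chains with a gap between them.
# Usage: gapped_residue_indices GAP LEN1 [LEN2 ...]
# Illustrative only; gapped_residue_indices is a hypothetical helper.
gapped_residue_indices() {
    gap=$1
    shift
    offset=0
    out=""
    for len in "$@"; do
        i=0
        while [ "$i" -lt "$len" ]; do
            out="$out $((offset + i))"
            i=$((i + 1))
        done
        # The next chain starts after this chain's residues plus the gap
        offset=$((offset + len + gap))
    done
    # Strip the leading space before printing
    echo "${out# }"
}
```

For example, gapped_residue_indices 200 3 2 prints 0 1 2 203 204: the second chain starts 200 positions later than it would in a plain concatenation.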
Generating MSAs with Jackhmmer or MMSeqs2 using OpenFold Scripts
OpenFold presents two options for generating MSAs; below, we adapt the code presented in the README to O2. The first example uses jackhmmer. Note that this script outputs help information by running:
We can submit an sbatch job to generate the MSAs from input.fasta:
This script should generate a series of files in the output directory, including the MSA files. The second method uses mmseqs2 to generate the MSA. As with the jackhmmer method, we can get help information by running:
Below is an example of a job script that uses mmseqs2 to run the alignment step.
These outputs can be used in the next step by adding the option --use_precomputed_alignments with the path to your MSA directory, for example:
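As a hedged sketch of how such a command might be assembled, the function below prints (but does not run) a run_pretrained_openfold.py invocation that reuses precomputed alignments. The options --use_precomputed_alignments and --skip_relaxation come from this page. The second positional argument (a template mmCIF directory) and --output_dir are assumptions based on the upstream OpenFold README, so confirm them against the -h output. All paths are placeholders, and openfold_precomputed_cmd is a hypothetical helper name.

```shell
# Sketch: print an OpenFold command line that reuses precomputed MSAs.
# Nothing is executed; the command is only echoed for inspection.
# openfold_precomputed_cmd is a hypothetical helper; paths are placeholders.
openfold_precomputed_cmd() {
    fasta_dir=$1      # directory containing only .fasta files
    template_dir=$2   # template mmCIF directory (assumed second positional argument)
    msa_dir=$3        # output directory of the MSA generation step
    out_dir=$4        # where predictions should be written (assumed --output_dir)
    echo "python3 \$OPENFOLDDIR/openfold/run_pretrained_openfold.py" \
        "$fasta_dir" "$template_dir" \
        "--output_dir $out_dir" \
        "--use_precomputed_alignments $msa_dir" \
        "--skip_relaxation"
}
```

Printing the command first makes it easy to review the paths before pasting the line into a job script.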
Running OpenFold
We have public databases available in /n/shared_db/ (Public Databases), and we can use these in the OpenFold run command:
Note: The input path /PATH/TO/FASTA/DIR/ must point to a directory that contains only .fasta files.
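One way to check this requirement before submitting is to list anything in the directory that does not end in .fasta. This is a minimal sketch using find; list_non_fasta is a hypothetical helper name.

```shell
# Sketch: list entries in a directory that are not .fasta files.
# Empty output means the directory is safe to pass to OpenFold.
# list_non_fasta is a hypothetical helper, not part of OpenFold.
list_non_fasta() {
    find "$1" -mindepth 1 -maxdepth 1 ! -name '*.fasta'
}
```

If the function prints any paths, move those files elsewhere (or copy the .fasta files into a clean directory) before running.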
Monomer (single protein) Job Example:
Multimer (protein complex) Job Example:
The output will contain temporary files (.fasta, .json), along with a predictions directory that contains .pdb files.
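To collect the predicted structures after a run, the .pdb files under the predictions directory can be listed with find. The sketch assumes the layout described above (a predictions directory inside the output directory); list_predictions is a hypothetical helper name.

```shell
# Sketch: list predicted structures from an OpenFold output directory.
# Assumes .pdb files live under OUTPUT_DIR/predictions, as described above.
# list_predictions is a hypothetical helper, not part of OpenFold.
list_predictions() {
    find "$1/predictions" -type f -name '*.pdb' | sort
}
```

The sorted list can be piped into other tools, for example to copy the structures to a results folder.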