OpenFold On O2

OpenFold (https://github.com/aqlaboratory/openfold) is a “faithful but trainable PyTorch reproduction of DeepMind’s AlphaFold 2”. It is a tool for predicting protein structures. OpenFold is now available on O2 as an experimental module. To install a local copy for yourself, please see https://harvardmed.atlassian.net/wiki/spaces/O2/pages/2435809318.

Compared to AlphaFold (https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1995177985), OpenFold accepts a similar set of parameters and can use either jackhmmer or mmseqs2 for the multiple sequence alignment (MSA) step of the analysis. OpenFold is generally faster than AlphaFold (up to 2x) and is adept at processing extremely long chains (4000+ residues).

Note: If you’re new to Slurm or O2, please see https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793632 for lots of information on submitting jobs.

Before starting: if you are working with a single protein, check whether its structure has already been computed. A full database of predicted structures can be found at https://alphafold.ebi.ac.uk/. It may save you a lot of time!

Loading the OpenFold Module

OpenFold has many features and modes that are more thoroughly described on the repository site. Below, we will focus on a simple (single protein) folding example to show how it can be run on O2. To access the module, run:

$ module load gcc/9.2.0 openfold/1.0.1

Once you have loaded this module, you’ll want to submit your job to the gpu (or gpu_quad, if you have access) partition so that you can leverage GPU resources (https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1629290761). To see all the parameters related to OpenFold’s main function, run the following after loading the module:

$ python3 $OPENFOLDDIR/openfold/run_pretrained_openfold.py -h

At this time, running proteins with relaxation causes errors. Always include the option --skip_relaxation when running OpenFold.

.fasta files are analyzed in a batched fashion, meaning that whole directories can be processed in one run. On O2, .fasta files will be aligned using jackhmmer by default. OpenFold can accept MSAs from either jackhmmer or mmseqs2, as long as they are in a format similar to that produced by the steps in the “Generating MSAs with Jackhmmer or MMSeqs2 using OpenFold Scripts” section. A typical .fasta file will contain only one protein sequence:

>header1
(insert amino acid sequence here)
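Before submitting a job, it can help to confirm how many sequences a .fasta file actually holds. The following is a minimal sketch, not part of OpenFold: the function name count_fasta_records, the temporary file, and the placeholder sequence are all made up for illustration. It simply counts the header lines (those starting with ">").

```shell
#!/bin/bash
# Count the sequence records (lines starting with ">") in a FASTA file.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Hypothetical example file holding one placeholder sequence.
example_fasta="$(mktemp)"
printf '>header1\nMKTAYIAKQR\n' > "$example_fasta"

n_records="$(count_fasta_records "$example_fasta")"
if [ "$n_records" -eq 1 ]; then
    echo "One sequence found: monomer input"
else
    echo "$n_records sequences found: treat as multimer input"
fi

rm -f "$example_fasta"
```

A file that reports more than one sequence should be run with the multimer option described next.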

.fasta multimer files are created by adding a header for each protein sequence. To run them, pass the option --multimer_ri_gap 200 to the run_pretrained_openfold.py command, as described in https://harvardmed.atlassian.net/wiki/spaces/O2/pages/2274623492/OpenFold+On+O2#Running-OpenFold and shown below.

Complexes are run using AlphaFold-Gap, a hack described in more detail in an online thread, which uses stock AlphaFold/OpenFold parameters. The OpenFold developers also provide an experimental multimer branch of OpenFold; it must be installed manually using the instructions at https://harvardmed.atlassian.net/wiki/spaces/O2/pages/2435809318.

>header1
COMPLEXPROTEINSEQUENCE1
>header2
COMPLEXPROTEINSEQUENCE2
>header3
COMPLEXPROTEINSEQUENCE3
>header4
COMPLEXPROTEINSEQUENCE4
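If each chain of a complex is already stored in its own single-sequence .fasta file, a multimer file in the layout above can be assembled by concatenation. This is a minimal sketch under that assumption; combine_fasta, the file names, and the sequences are hypothetical and not part of OpenFold.

```shell
#!/bin/bash
# Concatenate single-sequence FASTA files into one multimer FASTA file.
# Usage: combine_fasta OUTPUT.fasta CHAIN1.fasta CHAIN2.fasta ...
combine_fasta() {
    local out="$1"
    shift
    : > "$out"    # create or empty the output file
    local chain_file
    for chain_file in "$@"; do
        # awk 'NF' drops blank lines and ends every printed line with a
        # newline, so a header never gets glued to the previous sequence.
        awk 'NF' "$chain_file" >> "$out"
    done
}

# Hypothetical example with two placeholder chains.
work_dir="$(mktemp -d)"
printf '>header1\nCOMPLEXPROTEINSEQUENCE1' > "$work_dir/chain1.fasta"
printf '>header2\nCOMPLEXPROTEINSEQUENCE2\n' > "$work_dir/chain2.fasta"

combine_fasta "$work_dir/complex.fasta" "$work_dir/chain1.fasta" "$work_dir/chain2.fasta"
cat "$work_dir/complex.fasta"

rm -rf "$work_dir"
```

Write the combined file into a directory of its own, so that the per-chain files are not picked up as separate inputs in the same run.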

Generating MSAs with Jackhmmer or MMSeqs2 using OpenFold Scripts

OpenFold presents two options for generating MSAs. Below, we adapt the code presented in the README to O2. The first example uses jackhmmer. To print the help information for this script, run:

$ python3 $OPENFOLDDIR/openfold/scripts/precompute_alignments.py -h

We can submit an sbatch job to generate the MSAs from input.fasta:

#!/bin/bash
#SBATCH -c 8                               # Number of cores
#SBATCH -t 0-8:00                          # Runtime in D-HH:MM format
#SBATCH -p short                           # Partition to run in
#SBATCH --mem=32G                          # Memory total (for all cores)
#SBATCH --mail-type=ALL                    # ALL email notification type
#SBATCH --mail-user=<email_address>        # Email to which notifications will be sent
#SBATCH -o %j.out                          # File to which STDOUT will be written, including job ID (%j)
#SBATCH -e %j.err                          # File to which STDERR will be written, including job ID (%j)

module load gcc/9.2.0 openfold/1.0.1

python3 $OPENFOLDDIR/openfold/scripts/precompute_alignments.py \
    --uniref90_database_path /n/shared_db/alphafold/uniref90/uniref90.fasta \
    --mgnify_database_path /n/shared_db/alphafold/mgnify/mgy_clusters_2018_12.fa \
    --pdb70_database_path /n/shared_db/alphafold/pdb70/pdb70 \
    --uniclust30_database_path /n/shared_db/alphafold/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --bfd_database_path /n/shared_db/alphafold/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --jackhmmer_binary_path $OPENFOLDDIR/openfold-conda/bin/jackhmmer \
    --hhblits_binary_path $OPENFOLDDIR/openfold-conda/bin/hhblits \
    --hhsearch_binary_path $OPENFOLDDIR/openfold-conda/bin/hhsearch \
    --cpus_per_task 2 \
    --kalign_binary_path $OPENFOLDDIR/openfold-conda/bin/kalign \
    --mmcif_cache /n/shared_db/alphafold/pdb_mmcif/mmcif_files/ \
    --raise_errors /PATH/TO/INPUT.FASTA /PATH/TO/OUTPUT/DIR/

This script should generate a series of files in the output directory, including the MSA files. The second method uses mmseqs2 to generate the MSA. As with the jackhmmer method, we can get help information by running:

$ python3 $OPENFOLDDIR/openfold/scripts/precompute_alignments_mmseqs.py -h

Below is an example of a job script that uses mmseqs2 to run the alignment step.

#!/bin/bash
#SBATCH -c 1                               # Number of cores
#SBATCH -t 0-8:00                          # Runtime in D-HH:MM format
#SBATCH -p short                           # Partition to run in
#SBATCH --mem=128G                         # Memory total (for all cores)
#SBATCH --mail-type=ALL                    # ALL email notification type
#SBATCH --mail-user=<email_address>        # Email to which notifications will be sent
#SBATCH -o %j.out                          # File to which STDOUT will be written, including job ID (%j)
#SBATCH -e %j.err                          # File to which STDERR will be written, including job ID (%j)

module load gcc/9.2.0 openfold/1.0.1 mmseqs2/14-7e284

python3 $OPENFOLDDIR/openfold/scripts/precompute_alignments_mmseqs.py \
    /PATH/TO/INPUT.FASTA \
    /n/shared_db/misc/mmseqs2/14-7e284 \
    uniref30_2202_db \
    /PATH/TO/OUTPUT/DIR/ \
    mmseqs \
    --hhsearch_binary_path $OPENFOLDDIR/openfold-conda/bin/hhsearch \
    --env_db colabfold_envdb_202108_db \
    --pdb70 /n/shared_db/alphafold/pdb70/pdb70

These outputs can be used in the next step by adding the option --use_precomputed_alignments with the path to your MSA directory, for example:

#!/bin/bash
#SBATCH --partition=gpu                    # Partition to run in
#SBATCH --gres=gpu:1                       # GPU resources requested
#SBATCH -c 4                               # Requested cores
#SBATCH --time=0-8:00                      # Runtime in D-HH:MM format
#SBATCH --mem=32GB                         # Requested Memory
#SBATCH -o %j.out                          # File to which STDOUT will be written, including job ID (%j)
#SBATCH -e %j.err                          # File to which STDERR will be written, including job ID (%j)
#SBATCH --mail-type=ALL                    # ALL email notification type
#SBATCH --mail-user=<email_address>        # Email to which notifications will be sent

module load gcc/9.2.0 openfold/1.0.1

python3 $OPENFOLDDIR/openfold/run_pretrained_openfold.py \
    --skip_relaxation \
    --output_dir /PATH/TO/OUTPUT/DIR/ \
    --model_device "cuda:0" \
    --use_precomputed_alignments /PATH/TO/MSA/DIR \
    --config_preset "model_1_ptm" \
    --openfold_checkpoint_path $OPENFOLDDIR/openfold/openfold/resources/openfold_params/finetuning_ptm_2.pt \
    --cpus 4 \
    /PATH/TO/INPUT/DIR \
    /n/shared_db/alphafold/pdb_mmcif/mmcif_files/ \
    --uniref90_database_path /n/shared_db/alphafold/uniref90/uniref90.fasta \
    --mgnify_database_path /n/shared_db/alphafold/mgnify/mgy_clusters_2018_12.fa \
    --pdb70_database_path /n/shared_db/alphafold/pdb70/pdb70 \
    --uniclust30_database_path /n/shared_db/alphafold/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --bfd_database_path /n/shared_db/alphafold/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --jackhmmer_binary_path $OPENFOLDDIR/openfold-conda/bin/jackhmmer \
    --hhblits_binary_path $OPENFOLDDIR/openfold-conda/bin/hhblits \
    --hhsearch_binary_path $OPENFOLDDIR/openfold-conda/bin/hhsearch \
    --kalign_binary_path $OPENFOLDDIR/openfold-conda/bin/kalign

Running OpenFold

We have public databases available in /n/shared_db/ (https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1616511789), and we can use these in the OpenFold run command.

Note: The input argument /PATH/TO/INPUT/DIR/ must point to a directory that contains only .fasta files.
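The directory requirement in the note above can be checked before submitting. This is a minimal sketch rather than an O2-provided tool: list_non_fasta is a hypothetical helper, and the input path is a placeholder to replace with your own. It lists every entry directly inside the directory whose name does not end in .fasta.

```shell
#!/bin/bash
# List every entry directly inside a directory whose name does not end in
# .fasta. Prints nothing when only .fasta entries are present.
list_non_fasta() {
    find "$1" -mindepth 1 -maxdepth 1 ! -name '*.fasta'
}

# Placeholder path; replace with your own input directory.
input_dir="/PATH/TO/INPUT/DIR"

if [ -d "$input_dir" ]; then
    extras="$(list_non_fasta "$input_dir")"
    if [ -n "$extras" ]; then
        echo "Move these entries out of $input_dir before submitting:"
        echo "$extras"
    else
        echo "$input_dir contains only .fasta files"
    fi
fi
```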

Monomer (single protein) Job Example:

#!/bin/bash
#SBATCH --partition=gpu                    # Partition to run in
#SBATCH --gres=gpu:1                       # GPU resources requested
#SBATCH -c 4                               # Requested cores
#SBATCH --time=0-8:00                      # Runtime in D-HH:MM format
#SBATCH --mem=32GB                         # Requested Memory
#SBATCH -o %j.out                          # File to which STDOUT will be written, including job ID (%j)
#SBATCH -e %j.err                          # File to which STDERR will be written, including job ID (%j)
#SBATCH --mail-type=ALL                    # ALL email notification type
#SBATCH --mail-user=<email_address>        # Email to which notifications will be sent

module load gcc/9.2.0 openfold/1.0.1

python3 $OPENFOLDDIR/openfold/run_pretrained_openfold.py \
    --skip_relaxation \
    --output_dir /PATH/TO/OUTPUT/DIR/ \
    --model_device "cuda:0" \
    --config_preset "model_1_ptm" \
    --openfold_checkpoint_path $OPENFOLDDIR/openfold/openfold/resources/openfold_params/finetuning_ptm_2.pt \
    --cpus 4 \
    /PATH/TO/INPUT/DIR/ \
    /n/shared_db/alphafold/pdb_mmcif/mmcif_files/ \
    --uniref90_database_path /n/shared_db/alphafold/uniref90/uniref90.fasta \
    --mgnify_database_path /n/shared_db/alphafold/mgnify/mgy_clusters_2018_12.fa \
    --pdb70_database_path /n/shared_db/alphafold/pdb70/pdb70 \
    --uniclust30_database_path /n/shared_db/alphafold/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --bfd_database_path /n/shared_db/alphafold/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --jackhmmer_binary_path $OPENFOLDDIR/openfold-conda/bin/jackhmmer \
    --hhblits_binary_path $OPENFOLDDIR/openfold-conda/bin/hhblits \
    --hhsearch_binary_path $OPENFOLDDIR/openfold-conda/bin/hhsearch \
    --kalign_binary_path $OPENFOLDDIR/openfold-conda/bin/kalign

Multimer (protein complex) Job Example:

#!/bin/bash
#SBATCH --partition=gpu                    # Partition to run in
#SBATCH --gres=gpu:1                       # GPU resources requested
#SBATCH -c 4                               # Requested cores
#SBATCH --time=0-8:00                      # Runtime in D-HH:MM format
#SBATCH --mem=32GB                         # Requested Memory
#SBATCH -o %j.out                          # File to which STDOUT will be written, including job ID (%j)
#SBATCH -e %j.err                          # File to which STDERR will be written, including job ID (%j)
#SBATCH --mail-type=ALL                    # ALL email notification type
#SBATCH --mail-user=<email_address>        # Email to which notifications will be sent

module load gcc/9.2.0 openfold/1.0.1

python3 $OPENFOLDDIR/openfold/run_pretrained_openfold.py \
    --skip_relaxation \
    --output_dir /PATH/TO/OUTPUT/DIR/ \
    --model_device "cuda:0" \
    --config_preset "model_1_ptm" \
    --openfold_checkpoint_path $OPENFOLDDIR/openfold/openfold/resources/openfold_params/finetuning_ptm_2.pt \
    --cpus 4 \
    --multimer_ri_gap 200 \
    /PATH/TO/INPUT/DIR/ \
    /n/shared_db/alphafold/pdb_mmcif/mmcif_files/ \
    --uniref90_database_path /n/shared_db/alphafold/uniref90/uniref90.fasta \
    --mgnify_database_path /n/shared_db/alphafold/mgnify/mgy_clusters_2018_12.fa \
    --pdb70_database_path /n/shared_db/alphafold/pdb70/pdb70 \
    --uniclust30_database_path /n/shared_db/alphafold/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --bfd_database_path /n/shared_db/alphafold/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --jackhmmer_binary_path $OPENFOLDDIR/openfold-conda/bin/jackhmmer \
    --hhblits_binary_path $OPENFOLDDIR/openfold-conda/bin/hhblits \
    --hhsearch_binary_path $OPENFOLDDIR/openfold-conda/bin/hhsearch \
    --kalign_binary_path $OPENFOLDDIR/openfold-conda/bin/kalign

The output will contain temporary files (.fasta, .json), along with a predictions directory that contains .pdb files.
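Once a job finishes, the predicted structures can be listed with a short helper. This is a minimal sketch that assumes only the layout described above (a predictions directory holding .pdb files inside the output directory); list_predictions is a hypothetical name and the output path is a placeholder.

```shell
#!/bin/bash
# Print the .pdb files under <output_dir>/predictions, one per line, sorted.
list_predictions() {
    find "$1/predictions" -type f -name '*.pdb' | sort
}

# Placeholder path; replace with the --output_dir used in your job.
output_dir="/PATH/TO/OUTPUT/DIR"

if [ -d "$output_dir/predictions" ]; then
    list_predictions "$output_dir"
    echo "Total structures: $(list_predictions "$output_dir" | wc -l)"
fi
```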