AlphaFold (https://github.com/deepmind/alphafold ) is a new tool for predicting protein structures from DeepMind. It is available as an experimental module on O2. This means that some features may not work as expected, as the code itself was not designed with standard HPC environments in mind.

Compared to ColabFold (see https://harvardmed.atlassian.net/wiki/spaces/O2/pages/2180546561/Using+Local+ColabFold+on+O2 ), AlphaFold takes fewer parameters, and uses jackhmmer as an MSA generator instead of mmseqs2, which can make it slower than ColabFold for certain inputs. It may also require more resources to run. If you are unsure about which to use, feel free to try both tools and compare results.

Note: If you’re new to Slurm or O2, please see Using Slurm Basic for lots of information on submitting jobs.

AlphaFold is also offered by BioGrids as part of their software suite (Using Software Provided by BioGrids ). If you need assistance with using the BioGrids offering, please contact help@biogrids.org.

Before starting: if you are working with a single protein, check whether its structure has already been predicted. The full database of precomputed predictions can be found at https://alphafold.ebi.ac.uk/ . It may save you a lot of time!
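As a small convenience, the lookup can be sketched as a shell helper that prints the database page for a given protein. The function name alphafold_db_url is hypothetical, and the sketch assumes that entry pages follow the pattern https://alphafold.ebi.ac.uk/entry/<UniProt accession>; if that pattern does not work for you, search the site directly.

```shell
#!/bin/bash
# Hypothetical helper: print the AlphaFold DB entry page for a UniProt accession.
# Assumption: entry pages live at https://alphafold.ebi.ac.uk/entry/<ACCESSION>.
alphafold_db_url() {
    printf 'https://alphafold.ebi.ac.uk/entry/%s\n' "$1"
}

# Example accession (human hemoglobin subunit alpha); replace with your own.
alphafold_db_url "P69905"
```

Open the printed URL in a browser to see whether a prediction already exists.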

How to load and use the AlphaFold module

Here are the instructions to run AlphaFold in an interactive session. Since AlphaFold takes hours to run, you will more likely want to submit a batch job, which is described later.

The following flags are mandatory for invoking AlphaFold:

The --data_dir flag is not mandatory. By default (for versions before 2.3.1), it points to /n/shared_db/alphafold, where RC has centrally downloaded the databases. If you would rather use your own copy of the databases (not recommended, since it requires approximately 2 TB of free space), set this flag to the corresponding location.

If you are using version 2.3.1, please use /n/shared_db/alphafold-2.3 for --data_dir, as the model parameters (as well as some databases) have changed for this release. If you are copying and pasting any of the templates below, please make sure to edit them accordingly.
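The version rule above can be sketched as a small shell helper. The function name alphafold_data_dir and the variable names are hypothetical; only the two database paths come from this page. The sketch treats every version other than 2.3.1 as an older release, which is an assumption you should revisit if other versions are installed.

```shell
#!/bin/bash
# Hypothetical helper: print the shared database directory that matches
# a given AlphaFold module version, following the rule on this page.
alphafold_data_dir() {
    local version="$1"
    case "$version" in
        2.3.1) echo "/n/shared_db/alphafold-2.3" ;;
        # Assumption: anything else is a release older than 2.3.1.
        *)     echo "/n/shared_db/alphafold" ;;
    esac
}

ALPHAFOLD_VERSION="2.3.1"   # the module version you intend to load
DATA_DIR="$(alphafold_data_dir "$ALPHAFOLD_VERSION")"
echo "Use --data_dir=${DATA_DIR} with alphafold/${ALPHAFOLD_VERSION}"
```

Keeping the version in one variable makes it harder to load one module version while pointing --data_dir at the databases for another.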

You can invoke alphafold.py -h for more information about these, and other optional flags and their options.

The following is an example invocation of alphafold.py with a placeholder output path, including the module load step:

module load alphafold/2.2.0

alphafold.py --fasta_paths=/path/to/fastafile --max_template_date=2020-05-14 --db_preset=full_dbs --output_dir=/path/to/output --data_dir=/n/shared_db/alphafold/ 

Note that you MUST provide full paths for any fasta files passed to alphafold.py.
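One way to avoid passing a relative path by accident is to expand it before building the command. The following is a minimal sketch; the function name to_full_path and the file name my_protein.fasta are hypothetical. It only prefixes the current working directory and does not resolve symlinks or ".." components.

```shell
#!/bin/bash
# Hypothetical helper: turn a relative path into a full path by prefixing
# the current working directory. Paths that already start with / are kept.
to_full_path() {
    case "$1" in
        /*) printf '%s\n' "$1" ;;
        *)  printf '%s\n' "$PWD/$1" ;;
    esac
}

FASTA_INPUT="my_protein.fasta"               # hypothetical file name
FASTA_FULL="$(to_full_path "$FASTA_INPUT")"

# Pass the expanded value to alphafold.py, for example:
echo "--fasta_paths=${FASTA_FULL}"
```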

Example Submission Template

The following (abstracted) example was graciously submitted by collaborators at the Center for Computational Biomedicine.

For a single input FASTA file, INPUT.fasta, that hypothetically contains two proteins as separate entries (a protein complex) in the following format:

>header1
(insert amino acid sequence here)
>header2
(insert amino acid sequence here)

the following example job submission script, SUBMIT.sh, can be created (replace the indicated items, including the paths in --fasta_paths and --output_dir, and remove the angle brackets where applicable):

#!/bin/bash

#SBATCH --partition=<INSERT NAME OF GPU PARTITION HERE>
#SBATCH --gres=gpu:1
#SBATCH -c 8
#SBATCH --time=5-0:00:00
#SBATCH --mem=50G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<YOUR_EMAIL_ADDRESS>

###

module load alphafold/2.1.1 gcc/6.2.0 cuda/11.2

alphafold.py --fasta_paths=</PATH/TO/>INPUT.fasta \
--is_prokaryote_list=false \
--max_template_date=2022-01-01 \
--db_preset=full_dbs \
--model_preset=multimer \
--output_dir=</PATH/TO/OUTPUT/DIRECTORY> \
--data_dir=/n/shared_db/alphafold/

AlphaFold does NOT support multiple GPUs. Please refrain from requesting more than one GPU per alphafold.py invocation: it will not speed up your run, and it will likely make your job wait longer in the queue before it is dispatched.

For more sbatch customization options, you can refer to https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793632/Using+Slurm+Basic#sbatch-options-quick-reference . You can then submit this script via sbatch SUBMIT.sh from the terminal on O2.

Note that this is for version 2.1.1. If you are running a newer version, you may want to refer to the help output (e.g. invoke alphafold.py -h) for any notable nuances with flags that are version-specific. If you are interested in a single protein sequence, you might want to set --model_preset to monomer, for example. Also check the https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1995177985/Using+AlphaFold+on+O2#Other-Important-Details section for some tips.

Your output directory should contain several .pdb files, of which the “best” one should be called ranked_0.pdb. You should also find a .json file that contains the rankings, and .pkl files that contain metrics for each prediction. If one of these types of files is missing, it is possible that your run did not complete correctly.
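The check described above can be sketched as a shell function. The name check_alphafold_output is hypothetical, and the sketch assumes you point it at the directory that actually contains the .pdb files. It only tests that each expected file type is present, not that the contents are valid.

```shell
#!/bin/bash
# Hypothetical helper: report which of the expected AlphaFold output file
# types are missing from a directory. Returns 0 only if all are present.
check_alphafold_output() {
    local dir="$1"
    local status=0
    local ext

    # The top-ranked structure is expected to be named ranked_0.pdb.
    if [ ! -f "$dir/ranked_0.pdb" ]; then
        echo "missing: ranked_0.pdb"
        status=1
    fi

    # At least one .json (rankings) and one .pkl (metrics) file is expected.
    for ext in json pkl; do
        if ! ls "$dir"/*."$ext" >/dev/null 2>&1; then
            echo "missing: *.$ext"
            status=1
        fi
    done

    return "$status"
}

# Example usage (replace the path with your own output location):
#   check_alphafold_output /path/to/output && echo "all expected file types found"
```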

Other Important Details

If you have any questions, you can email us at rchelp@hms.harvard.edu.