Using AlphaFold on O2

AlphaFold (https://github.com/deepmind/alphafold ) is a tool from DeepMind for predicting protein structures. It is available as an experimental module on O2. This means that some features may not work as expected, because the code was not designed with standard HPC environments in mind.

Compared to ColabFold (see https://harvardmed.atlassian.net/wiki/spaces/O2/pages/2180546561/Using+Local+ColabFold+on+O2 ), AlphaFold takes fewer parameters, and uses jackhmmer as an MSA generator instead of mmseqs2, which can make it slower than ColabFold for certain inputs. It may also require more resources to run. If you are unsure about which to use, feel free to try both tools and compare results.

Note: If you’re new to Slurm or O2, please see Using Slurm Basic for lots of information on submitting jobs.

How to load and use the AlphaFold module

Here are the instructions to run AlphaFold in an interactive session. Since AlphaFold takes hours to run, you will more likely want to submit a batch job, which is described later.

$ module load alphafold/2.2.0

$ alphafold.py --fasta_paths=/path/to/fastafile --max_template_date=2020-05-14 --db_preset=full_dbs --output_dir=/path/to/output --data_dir=/n/shared_db/alphafold/

You MUST pass the --data_dir=/n/shared_db/alphafold/ flag, as in the above example.

You MUST provide full (absolute) paths for any files passed to alphafold.py.
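
One way to satisfy the full-path requirement is to convert paths before building the command line. The following is a minimal sketch, not part of the alphafold module: the helper name to_abs_path and the file and directory names are assumptions made for illustration. It only prefixes the current directory to relative paths and does not check that the file exists or normalize components such as "..".

```shell
#!/bin/bash
# Hypothetical helper (not part of the alphafold module): print an absolute
# version of a path. Paths that already start with "/" are left unchanged;
# anything else is interpreted relative to the current directory.
to_abs_path() {
    case "$1" in
        /*) printf '%s\n' "$1" ;;
        *)  printf '%s\n' "$PWD/$1" ;;
    esac
}

# Example: resolve the paths once, then reuse the variables.
FASTA=$(to_abs_path "my_protein.fasta")   # placeholder file name
OUTDIR=$(to_abs_path "af_output")         # placeholder directory name
echo "FASTA=$FASTA"
echo "OUTDIR=$OUTDIR"
```

The resulting variables can then be passed as --fasta_paths="$FASTA" and --output_dir="$OUTDIR".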

Example Submission Template

The following (abstracted) example was graciously submitted by collaborators at the Center for Computational Biomedicine.

For a single input FASTA file, INPUT.fasta, that hypothetically contains two proteins as separate entries (a protein complex) and takes the format:

>header1
(insert amino acid sequence here)
>header2
(insert amino acid sequence here)

the following example job submission script, SUBMIT.sh, can be created (replace the indicated items, including the paths in --fasta_paths and --output_dir, and remove the angle brackets where applicable):

#!/bin/bash

#SBATCH --partition=<INSERT NAME OF GPU PARTITION HERE>
#SBATCH --gres=gpu:2
#SBATCH -c 8
#SBATCH --time=5-0:00:00
#SBATCH --mem=50G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<YOUR_EMAIL_ADDRESS>

###

module load alphafold/2.1.1 gcc/6.2.0 cuda/11.2

alphafold.py --fasta_paths=</PATH/TO/>INPUT.fasta \
--is_prokaryote_list=false \
--max_template_date=2022-01-01 \
--db_preset=full_dbs \
--model_preset=multimer \
--output_dir=</PATH/TO/OUTPUT/DIRECTORY> \
--data_dir=/n/shared_db/alphafold/

For more sbatch customization options, you can refer to https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793632/Using+Slurm+Basic#sbatch-options-quick-reference . You can then submit this script via sbatch SUBMIT.sh from the terminal on O2.

Note that this example is for version 2.1.1. If you are running a newer version, check the help output (e.g. alphafold.py -h) for version-specific differences in flags. For example, if you are interested in a single protein sequence, you might want to set --model_preset to monomer. Also see the Other Important Details section (https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1995177985/Using+AlphaFold+on+O2#Other-Important-Details ) for some tips.
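
If you are unsure which --model_preset fits an input file, counting the FASTA records is a quick first check. The sketch below is a rough heuristic only, assuming that one record means a single protein and more than one means a complex. The function names are hypothetical and are not provided by the module.

```shell
#!/bin/bash
# Hypothetical helper: count FASTA records (header lines starting with ">").
count_fasta_records() {
    # grep -c exits non-zero when the count is 0, so "|| true" keeps this
    # usable in scripts that run with "set -e".
    grep -c '^>' "$1" || true
}

# Heuristic (an assumption, not an AlphaFold rule): suggest "multimer" when
# the file holds more than one record, otherwise "monomer".
suggest_model_preset() {
    n=$(count_fasta_records "$1")
    n=${n:-0}   # fall back to 0 if the file could not be read
    if [ "$n" -gt 1 ]; then
        echo "multimer"
    else
        echo "monomer"
    fi
}
```

For example, suggest_model_preset /path/to/INPUT.fasta prints multimer for the two-record file shown above. Treat the result as a suggestion and confirm the available presets with alphafold.py -h.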

Your output directory should contain several .pdb files, of which the “best” one should be called ranked_0.pdb. You should also find a .json file that contains the rankings, and .pkl files that contain metrics for each prediction. If one of these types of files is missing, it is possible that your run did not complete correctly.
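
The check described above can be scripted. The following is a minimal sketch with hypothetical function names. It only tests whether the three file types are present in the directory where the .pdb files were written, and says nothing about the quality of the predictions.

```shell
#!/bin/bash
# Hypothetical helper: succeed if directory $1 holds at least one *.$2 file.
has_file_with_ext() {
    for f in "$1"/*."$2"; do
        [ -e "$f" ] && return 0
    done
    return 1
}

# Report which of the expected output types are missing from directory $1.
# Returns 0 when everything is present, 1 otherwise.
check_alphafold_output() {
    rc=0
    if [ ! -f "$1/ranked_0.pdb" ]; then
        echo "missing: ranked_0.pdb"
        rc=1
    fi
    for ext in json pkl; do
        if ! has_file_with_ext "$1" "$ext"; then
            echo "missing: *.$ext"
            rc=1
        fi
    done
    return "$rc"
}
```

Calling check_alphafold_output /path/to/output prints one line per missing file type and returns a non-zero status if anything is missing, so it can be used at the end of a batch script to flag incomplete runs.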

Other Important Details

The number of threads/cores is hardcoded in AlphaFold's internal code, so you should request ~8 CPU cores when submitting AlphaFold jobs. Requesting more cores will not improve performance, and it will make your job pend longer before it starts.
The required database was downloaded into a centralized location (i.e., /n/shared_db/alphafold/) for the benefit of all O2 users. Please don’t download the 2 terabytes yourself, as that would be a waste of space.
Submit your AlphaFold jobs to a GPU partition. For more about the GPU partitions, please see our wiki page, Using O2 GPU resources.
For version 2.2.0, submissions assume by default that you are running on a GPU. If you need to run on CPU instead, please specify the --use_cpu flag.
As of version 2.2.0, the base implementation has changed how the amber relaxation step is requested by the user. If you would like to run your analysis WITHOUT the relaxed models with the 2.2.0 module, please include the --no_run_relax flag.
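
As a sketch of how the two 2.2.0-specific flags above fit into a command line, the following hypothetical helper only prints an alphafold.py command and does not run it. The function name and its yes/no arguments are illustrative assumptions. The remaining flags are copied from the interactive example earlier on this page.

```shell
#!/bin/bash
# Hypothetical helper: print (not run) an alphafold.py command for the
# alphafold/2.2.0 module.
#   $1 = full path to the FASTA file
#   $2 = full path to the output directory
#   $3 = "yes" to skip the relaxation step (adds --no_run_relax)
#   $4 = "yes" to run on CPU (adds --use_cpu)
build_alphafold_cmd() {
    cmd="alphafold.py --fasta_paths=$1 --max_template_date=2020-05-14"
    cmd="$cmd --db_preset=full_dbs --output_dir=$2"
    cmd="$cmd --data_dir=/n/shared_db/alphafold/"
    if [ "${3:-no}" = "yes" ]; then
        cmd="$cmd --no_run_relax"
    fi
    if [ "${4:-no}" = "yes" ]; then
        cmd="$cmd --use_cpu"
    fi
    printf '%s\n' "$cmd"
}

# Example: a GPU run without relaxed models.
build_alphafold_cmd /path/to/fastafile /path/to/output yes no
```

Review the printed command before pasting it into a job script, and adjust --max_template_date and the other flags to suit your analysis.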

If you have any questions, you can email us at rchelp@hms.harvard.edu.