Using AlphaFold on O2

AlphaFold (https://github.com/deepmind/alphafold ) is a new tool from DeepMind for predicting protein structures. It is available as an experimental module on O2. This means that some features may not work as expected, as the code itself was not designed with standard HPC environments in mind.

Compared to ColabFold (see https://harvardmed.atlassian.net/wiki/spaces/O2/pages/2180546561/Using+Local+ColabFold+on+O2 ), AlphaFold takes fewer parameters, and uses jackhmmer as an MSA generator instead of mmseqs2, which can make it slower than ColabFold for certain inputs. It may also require more resources to run. If you are unsure about which to use, feel free to try both tools and compare results.

Note: If you’re new to Slurm or O2, please see https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793632 for lots of information on submitting jobs.

AlphaFold is also offered by BioGrids as part of their software suite. If you need assistance with using the BioGrids offering, please contact help@biogrids.org.

Before starting: if you are working with a single protein, check whether its structure has already been computed. A full database can be found at . It may save you a lot of time!

How to load and use the AlphaFold module

Here are the instructions to run AlphaFold in an interactive session. Since AlphaFold takes hours to run, you will more likely want to submit a batch job, which is described later.

The following flags are mandatory for invoking AlphaFold:

  • --fasta_paths specifies the location of your fasta files (this cannot be a directory, but it can be a comma-separated list of full paths).

  • --max_template_date specifies the latest date to reference when matching against templates. As of 2.2.0, there is no way to “turn off” templates; instead, you can provide a very early date to make sure no templates survive the date filter, e.g. --max_template_date=1950-01-01 .

  • --output_dir specifies the directory where your output will be written to.
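Because --fasta_paths only accepts full paths to files, it can be worth checking the value before starting a long run. The following is a minimal sketch of such a check; check_fasta_paths is a hypothetical helper name and is not part of AlphaFold or the O2 module.

```shell
# check_fasta_paths: hypothetical helper (not part of AlphaFold or the O2 module).
# $1 is the comma-separated value you intend to pass to --fasta_paths.
# Exits non-zero if any entry is not a full path to a regular file.
# The body runs in a subshell so the IFS and globbing changes do not leak out.
check_fasta_paths() (
    if [ -z "$1" ]; then
        echo "no FASTA paths given" >&2
        exit 1
    fi
    status=0
    IFS=,        # split the argument on commas only
    set -f       # disable globbing while splitting
    for path in $1; do
        case "$path" in
            /*) ;;   # full (absolute) path, keep checking
            *)  echo "not a full path: $path" >&2
                status=1
                continue ;;
        esac
        if [ ! -f "$path" ]; then
            echo "not a regular file: $path" >&2
            status=1
        fi
    done
    exit "$status"
)
```

For example, check_fasta_paths "/path/to/a.fasta,/path/to/b.fasta" succeeds only if both files exist; a directory or a relative path is reported on standard error.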

The --data_dir flag is not mandatory; by default it points to /n/shared_db/alphafold (for versions before 2.3.1), where RC has centrally downloaded the databases. If you would rather use your own copy (not recommended, since it requires approximately 2 TB of free space), set this flag to the corresponding location.

If you are using version 2.3.1, please use /n/shared_db/alphafold-2.3 for --data_dir, as the model parameters (as well as some databases) have changed for this release. If you are copy/pasting any of the below templates, please make sure to edit them accordingly.
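The version-to-database mapping described above can be captured in a small shell function so that scripts do not hardcode the wrong path. This is only a sketch: alphafold_data_dir is a hypothetical helper name, and treating any 2.0.x to 2.2.x version as "before 2.3.1" is an assumption based on the versions mentioned on this page.

```shell
# alphafold_data_dir: hypothetical helper, not provided by the O2 module.
# Prints the shared database location for a given AlphaFold module version.
alphafold_data_dir() {
    case "$1" in
        2.3.1)
            echo "/n/shared_db/alphafold-2.3" ;;
        2.[0-2].*)
            # assumption: all 2.0.x-2.2.x modules use the original database copy
            echo "/n/shared_db/alphafold" ;;
        *)
            echo "unknown AlphaFold version: $1" >&2
            return 1 ;;
    esac
}
```

It could then be used as --data_dir="$(alphafold_data_dir 2.3.1)" in an alphafold.py invocation.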

You can invoke alphafold.py -h for more information about these, and other optional flags and their options.

The following is an example invocation of alphafold.py with a placeholder output path, including the module load step:

$ module load alphafold/2.2.0
$ alphafold.py --fasta_paths=/path/to/fastafile --max_template_date=2020-05-14 --db_preset=full_dbs --output_dir=/path/to/output --data_dir=/n/shared_db/alphafold/

As mentioned above, you MUST provide full paths for any fasta files passed to alphafold.py.
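If your FASTA files are referenced by relative names, one way to build a valid --fasta_paths value is to resolve each name to a full path and join the results with commas. The sketch below does this with standard POSIX utilities; abs_path and join_fasta_paths are hypothetical helper names.

```shell
# abs_path: hypothetical helper; prints the full path of file $1.
# The directory containing the file must exist. Runs in a subshell so the
# caller's working directory is not changed.
abs_path() (
    cd "$(dirname "$1")" >/dev/null || exit 1
    printf '%s/%s\n' "$(pwd -P)" "$(basename "$1")"
)

# join_fasta_paths: hypothetical helper; prints the full paths of all its
# arguments as one comma-separated string suitable for --fasta_paths.
join_fasta_paths() (
    joined=""
    for f in "$@"; do
        p=$(abs_path "$f") || exit 1
        joined="${joined:+$joined,}$p"   # add a comma only between entries
    done
    printf '%s\n' "$joined"
)
```

For example, running join_fasta_paths a.fasta b.fasta from the directory that holds both files prints their full paths separated by a comma, which can be passed as --fasta_paths="$(join_fasta_paths a.fasta b.fasta)".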

Example Submission Template

The following (abstracted) example was graciously submitted by collaborators at the Center for Computational Biomedicine.

For a single input FASTA file, INPUT.fasta, that hypothetically contains two proteins as separate entries (a protein complex) and takes the format:

>header1
(insert amino acid sequence here)
>header2
(insert amino acid sequence here)

the following example job submission script, SUBMIT.sh, can be created (replace the items indicated, including paths in --fasta_paths and --output_dir, without angle brackets where applicable):

#!/bin/bash
#SBATCH --partition=<INSERT NAME OF GPU PARTITION HERE>
#SBATCH --gres=gpu:1
#SBATCH -c 8
#SBATCH --time=5-0:00:00
#SBATCH --mem=50G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<YOUR_EMAIL_ADDRESS>

###

module load alphafold/2.1.1 gcc/6.2.0 cuda/11.2

alphafold.py --fasta_paths=</PATH/TO/>INPUT.fasta \
  --is_prokaryote_list=false \
  --max_template_date=2022-01-01 \
  --db_preset=full_dbs \
  --model_preset=multimer \
  --output_dir=</PATH/TO/OUTPUT/DIRECTORY> \
  --data_dir=/n/shared_db/alphafold/

AlphaFold does NOT support multiple GPUs. Please refrain from requesting more than one GPU per alphafold.py invocation, as this will not speed up your run time, and will inhibit your ability to have your job dispatched in a timely manner.

For more sbatch customization options, you can refer to . You can then submit this script via sbatch SUBMIT.sh from the terminal on O2.

Note that this is for version 2.1.1. If you are running a newer version, you may want to refer to the help output (e.g. invoke alphafold.py -h) for any notable nuances with flags that are version-specific. If you are interested in a single protein sequence, you might want to set --model_preset to monomer, for example. Also check the section for some tips.
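As an illustration of the monomer suggestion, the following sketch generates a variant of the 2.1.1 template above with --model_preset=monomer. The function name write_monomer_job and the file name SUBMIT_monomer.sh are hypothetical. Dropping --is_prokaryote_list for a single-sequence run and omitting the mail options are choices made for this sketch, so confirm the flags with alphafold.py -h for your version.

```shell
# write_monomer_job: hypothetical helper; prints a job script to standard output.
#   $1 = name of a GPU partition
#   $2 = full path to the input FASTA file
#   $3 = full path to the output directory
# Resource requests and the module line are copied from the 2.1.1 template above.
write_monomer_job() {
    cat <<EOF
#!/bin/bash
#SBATCH --partition=$1
#SBATCH --gres=gpu:1
#SBATCH -c 8
#SBATCH --time=5-0:00:00
#SBATCH --mem=50G

module load alphafold/2.1.1 gcc/6.2.0 cuda/11.2

alphafold.py --fasta_paths="$2" \\
  --max_template_date=2022-01-01 \\
  --db_preset=full_dbs \\
  --model_preset=monomer \\
  --output_dir="$3" \\
  --data_dir=/n/shared_db/alphafold/
EOF
}
```

You could write the result to a file with write_monomer_job <partition> /path/to/INPUT.fasta /path/to/output > SUBMIT_monomer.sh, review it, and then submit it with sbatch SUBMIT_monomer.sh.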

Your output directory should contain several .pdb files, of which the “best” one should be called ranked_0.pdb. You should also find a .json file that contains the rankings, and .pkl files that contain metrics for each prediction. If one of these types of files is missing, it is possible that your run did not complete correctly.
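A quick way to apply this check is sketched below. check_alphafold_output is a hypothetical helper name; it only tests for the file types listed above (ranked_0.pdb, at least one .json file, and at least one .pkl file) and says nothing about the quality of the prediction. Pass it the directory that actually holds the .pdb files.

```shell
# check_alphafold_output: hypothetical helper, not part of AlphaFold.
# $1 is the directory holding the prediction files.
# Exits non-zero and reports what is missing if the run looks incomplete.
check_alphafold_output() (
    dir=$1
    status=0
    if [ ! -f "$dir/ranked_0.pdb" ]; then
        echo "missing: ranked_0.pdb" >&2
        status=1
    fi
    for ext in json pkl; do
        # ls fails when the pattern matches no file
        if ! ls "$dir"/*."$ext" >/dev/null 2>&1; then
            echo "missing: any .$ext file" >&2
            status=1
        fi
    done
    exit "$status"
)
```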

Other Important Details

  • The number of threads/cores is hardcoded within the AlphaFold internal code, so you should request ~8 CPU cores when submitting AlphaFold jobs. Requesting more cores will not improve performance, and it will make your job pend longer before it starts.

  • The required database was downloaded into a centralized location (i.e., /n/shared_db/alphafold/ for < 2.3.1, and /n/shared_db/alphafold-2.3 for 2.3.1) for the benefit of all O2 users. Please don't download the 2 terabytes yourself, as that would be a waste of space.

  • Submit your AlphaFold jobs to a GPU partition. For more about the GPU partitions, please visit our wiki page - .

  • For version 2.2.0, submissions will assume you are running on a GPU by default. If for some reason you desire to run explicitly on CPU, please specify the --use_cpu flag.

  • As of version 2.2.0, the base implementation has changed how the amber relaxation step is requested by the user. If you would like to run your analysis WITHOUT the relaxed models with the 2.2.0 module, please include the --no_run_relax flag.

  • There is a known issue with AlphaFold (all versions) where the relaxation step fails when it is run on GPU (the default option); we can only recommend that users refrain from using the relaxation option until the developers address this. However, we have received user reports that CPU relaxation (--enable_cpu_relax) is successful, so users can try this flag if relaxation is required.

  • AlphaFold jobs may fail with an Out of Memory message in the .out or .err file of the job. This refers to VRAM, i.e., the RAM on the GPU cards. You can try running these larger complexes CPU-only. This will make them run slower, but they won't be bottlenecked by the maximum VRAM available on a GPU node. To view the amount of VRAM available on any one card, try a command like:
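One possible command for inspecting GPU memory is nvidia-smi with its query options, run from inside a job on a GPU node. The wrapper below is a sketch and not necessarily the command this page originally showed; show_gpu_memory is a hypothetical helper name.

```shell
# show_gpu_memory: hypothetical wrapper around nvidia-smi.
# Prints the name, total memory, and used memory of each visible GPU as CSV.
# nvidia-smi is only expected to exist on GPU nodes, so check for it first.
show_gpu_memory() {
    if ! command -v nvidia-smi >/dev/null 2>&1; then
        echo "nvidia-smi not found; run this inside a job on a GPU node" >&2
        return 1
    fi
    nvidia-smi --query-gpu=name,memory.total,memory.used --format=csv
}
```

The memory.total column is the VRAM on each card allocated to your job.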

 

If you have any questions, you can email us at rchelp@hms.harvard.edu.