NOTICE: FULL O2 Cluster Outage, January 3 - January 10
O2 will be completely offline for a planned HMS IT data center relocation from Friday, Jan 3, 6:00 PM, through Friday, Jan 10.
- on Jan 3 (5:30-6:00 PM): O2 login access will be turned off.
- on Jan 3 (6:00 PM): O2 systems will start being powered off.
This project will relocate existing services, consolidate servers, reduce power consumption, and decommission outdated hardware to improve efficiency, enhance resiliency, and lower costs.
Specifically:
- The O2 Cluster will be completely offline, including O2 Portal.
- All data on O2 will be inaccessible.
- Any jobs still pending when the outage begins will need to be resubmitted after O2 is back online.
- Websites on O2 will be completely offline, including all web content.
More details at: https://harvardmed.atlassian.net/l/cp/1BVpyGqm & https://it.hms.harvard.edu/news/upcoming-data-center-relocation
Using AlphaFold on O2
AlphaFold (https://github.com/google-deepmind/alphafold) is a tool from DeepMind for predicting protein structures. It is available as an experimental module on O2. This means that some features may not work as expected, as the code itself was not designed with standard HPC environments in mind.
Compared to ColabFold (see Using (Local)ColabFold on O2), AlphaFold takes fewer parameters and uses jackhmmer as an MSA generator instead of mmseqs2, which can make it slower than ColabFold for certain inputs. It may also require more resources to run. If you are unsure which to use, feel free to try both tools and compare results.
Note: If you’re new to Slurm or O2, please see Using Slurm Basic for lots of information on submitting jobs.
AlphaFold is also offered by BioGrids as part of their software suite (see Using Software Provided by BioGrids). If you need assistance with using the BioGrids offering, please contact help@biogrids.org.
Before starting: if you are working with a single protein, check whether its structure has already been computed. The full database can be found at the AlphaFold Protein Structure Database. It may save you a lot of time!
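For example, if you know your protein's UniProt accession, you can check for (and download) an existing prediction from the command line. The accession below (P69905, human hemoglobin subunit alpha) is just an illustration, and the URL patterns are assumed from the database's published download links:
$ curl -s https://alphafold.ebi.ac.uk/api/prediction/P69905
$ curl -O https://alphafold.ebi.ac.uk/files/AF-P69905-F1-model_v4.pdb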
How to load and use the AlphaFold module
Here are the instructions to run AlphaFold in an interactive session. Since AlphaFold takes hours to run, you will more likely want to submit a batch job, which is described later.
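If you do want to experiment interactively first, you will need a session with GPU access. A minimal request might look like the following (the partition name is a placeholder; see Using O2 GPU resources for the actual GPU partition names on O2):
$ srun --pty -p <GPU_PARTITION> --gres=gpu:1 -c 8 --mem=50G -t 0-12:00 /bin/bash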
The following flags are mandatory for invoking AlphaFold:
- --fasta_paths specifies the location of your fasta files (this cannot be a directory, but it can be a comma-separated list of full paths).
- --max_template_date specifies the latest date to reference when matching against templates. As of 2.2.0, there is no way to “turn off” templates - you can simply provide a very early date to make sure no templates survive the date filter, e.g. --max_template_date=1950-01-01.
- --output_dir specifies the directory where your output will be written.
The --data_dir flag is not mandatory; for versions before 2.3.1 it points to /n/shared_db/alphafold by default, where RC has centrally downloaded the databases. If you would rather use your own copy (not recommended, as it requires approximately 2T of free space), feel free to set this flag to the corresponding location.
If you are using version 2.3.1, please use /n/shared_db/alphafold-2.3 for --data_dir, as the model parameters (as well as some databases) have changed for this release. If you are copy/pasting any of the below templates, please make sure to edit them accordingly.
You can invoke alphafold.py -h for more information about these and other optional flags and their options.
The following is an example invocation of alphafold.py with a placeholder output path, including the module load step:
$ module load alphafold/2.2.0
alphafold.py --fasta_paths=/path/to/fastafile --max_template_date=2020-05-14 --db_preset=full_dbs --output_dir=/path/to/output --data_dir=/n/shared_db/alphafold/
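If you are using the 2.3.1 module instead (assuming it follows the same naming scheme), the equivalent invocation would point at the newer database directory:
$ module load alphafold/2.3.1
alphafold.py --fasta_paths=/path/to/fastafile --max_template_date=2020-05-14 --db_preset=full_dbs --output_dir=/path/to/output --data_dir=/n/shared_db/alphafold-2.3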
As mentioned above, you MUST provide full paths for any fasta files passed to alphafold.py.
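If you are unsure of a file's full path, the standard realpath utility will print it for you, and the result can be pasted into --fasta_paths:
$ realpath INPUT.fasta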
Example Submission Template
The following (abstracted) example was graciously submitted by collaborators at the Center for Computational Biomedicine.
For a single input FASTA file, INPUT.fasta, that hypothetically contains two proteins as separate entities (a protein complex) in the format:
>header1
(insert amino acid sequence here)
>header2
(insert amino acid sequence here)
the following example job submission script, SUBMIT.sh, can be created (replace the items indicated, including the paths in --fasta_paths and --output_dir, dropping the angle brackets where applicable):
#!/bin/bash
#SBATCH --partition=<INSERT NAME OF GPU PARTITION HERE>
#SBATCH --gres=gpu:1
#SBATCH -c 8
#SBATCH --time=5-0:00:00
#SBATCH --mem=50G
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<YOUR_EMAIL_ADDRESS>
###
module load alphafold/2.1.1 gcc/6.2.0 cuda/11.2
alphafold.py --fasta_paths=</PATH/TO/>INPUT.fasta \
--is_prokaryote_list=false \
--max_template_date=2022-01-01 \
--db_preset=full_dbs \
--model_preset=multimer \
--output_dir=</PATH/TO/OUTPUT/DIRECTORY> \
--data_dir=/n/shared_db/alphafold/
AlphaFold does NOT support multiple GPUs. Please refrain from requesting more than one GPU per alphafold.py invocation, as this will not speed up your run time, and will inhibit your ability to have your job dispatched in a timely manner.
For more sbatch customization options, you can refer to Using Slurm Basic | sbatch options quick reference. You can then submit this script via sbatch SUBMIT.sh from the terminal on O2.
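After submitting, you can monitor the job with standard Slurm commands, for example (the sacct format fields shown are common choices; adjust as needed):
$ squeue -u $USER
$ sacct -j <JOBID> --format=JobID,State,Elapsed,MaxRSS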
Note that this is for version 2.1.1. If you are running a newer version, you may want to refer to the help output (e.g. invoke alphafold.py -h) for any notable nuances with flags that are version-specific. If you are interested in a single protein sequence, you might want to set --model_preset to monomer, for example. Also check the Using AlphaFold on O2 | Other Important Details section for some tips.
Your output directory should contain several .pdb files, of which the “best” one should be called ranked_0.pdb. You should also find a .json file that contains the rankings, and .pkl files that contain metrics for each prediction. If one of these types of files is missing, it is possible that your run did not complete correctly.
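As a quick sanity check, you can list the per-target output directory and look for the file types described above. The listing below is illustrative of a typical run; exact file names vary by version and model preset:
$ ls </PATH/TO/OUTPUT/DIRECTORY>/<INPUT_NAME>/
features.pkl  ranked_0.pdb ... ranking_debug.json  result_model_1.pkl ... timings.json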
Other Important Details
- The number of threads/cores is hardcoded within AlphaFold's internal code, so you should request ~8 CPU cores when submitting AlphaFold jobs. Requesting more cores will not improve performance, and it will make your job pend longer before it starts.
- The required databases were downloaded into a centralized location (i.e., /n/shared_db/alphafold/ for versions before 2.3.1, and /n/shared_db/alphafold-2.3 for 2.3.1) for the benefit of all O2 users. Please don't download the ~2 terabytes yourself, as that would be a waste of space.
- Submit your AlphaFold jobs to a GPU partition. For more about the GPU partitions, please visit our wiki page - Using O2 GPU resources.
- For version 2.2.0, submissions will assume you are running on a GPU by default. If for some reason you desire to run explicitly on CPU, please specify the --use_cpu flag.
- As of version 2.2.0, the base implementation has changed how the amber relaxation step is requested by the user. If you would like to run your analysis WITHOUT the relaxed models with the 2.2.0 module, please include the --no_run_relax flag.
- There is a known issue with AlphaFold (all versions) not being able to successfully perform the relaxation step if requested via GPU (the default option); we can only recommend that users refrain from utilizing the relaxation option until the developers address this. We have received user reports that CPU relaxation is successful, however (--enable_cpu_relax), so users can attempt this flag if relaxation is required.
- AlphaFold jobs may fail with Out of Memory in the .out or .err of the job. This refers to the VRAM automatically allocated on GPU cards. You can try running these larger complexes CPU-only; they will run slower, but they won't be bottlenecked by the maximum VRAM available on a GPU node. To view the amount of VRAM available on any one card, try a command like the one shown below.
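For example, nvidia-smi, run on a GPU node, reports each card's total memory (the query flags below are standard nvidia-smi options):
$ nvidia-smi --query-gpu=name,memory.total --format=csv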
Troubleshooting
My jobs are unable to detect GPUs and crash, CUDA_ERROR_NOT_FOUND error
If you are seeing an error similar to the one described at Using (Local)ColabFold on O2 | My jobs are unable to detect GPUs and crash, CUDA_ERROR_NOT_FOUND error, under the same runtime conditions, please use the same workaround (excluding those compute nodes from your submission script).
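In practice, the workaround is one additional line in your submission script (the node names here are placeholders; use the nodes identified on the ColabFold page):
#SBATCH --exclude=<NODE1>,<NODE2>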
HMS IT is currently working on installing and testing a newer AlphaFold module that will hopefully not have this issue. This section will be updated once that module is available.
If you have any questions, you can email us at rchelp@hms.harvard.edu.