ColabFold (https://github.com/sokrypton/ColabFold ) is an emerging protein structure prediction tool based on Google DeepMind’s AlphaFold (see Using AlphaFold on O2 ). LocalColabFold (https://github.com/YoshitakaMo/localcolabfold ) is a packaging of ColabFold for use on local machines; we provide instructions on how to leverage LocalColabFold on O2 below. LocalColabFold uses MMseqs2 for the alignment step (considerably faster than jackhmmer), and runs AlphaFold2 for single-protein modeling and AlphaFold-Multimer for protein-complex modeling. If you are unsure which to use, feel free to try both tools and compare results.
...
Via internal testing, Research Computing has discovered that even though install_colabbatch_linux.sh
installs local copies of GCC and CUDA, LocalColabFold is unable to leverage them, and we were unable to provide access to these local copies. The present workaround is to load the O2 modules in their place, as above.
...
Generating MSAs Using Local MMseqs2
Generating MSAs locally using MMseqs2 reduces the load on the remote servers managed by the ColabFold developers, and allows users to run “larger” batches without risk of being rate-limited (see https://harvardmed.atlassian.net/wiki/spaces/O2/pages/2180546561/Using+Local+ColabFold+on+O2#Caveats ). MMseqs2 can be loaded from within the LocalColabFold module:
Code Block |
---|
$ module load gcc/9.2.0 cuda/11.2 localcolabfold/latest |
Parameters for using MMseqs2 through the command colabfold_search
are revealed by loading the modules above in an interactive session:
Code Block |
---|
$ colabfold_search -h |
MMseqs2 accepts .fasta
files containing amino acid sequences as input, including complexes where proteins are separated with a colon (:). These inputs can contain a single sequence or a "batch" of several proteins. The path to this file should be included in a colabfold_search
command. We have public databases available in /n/shared_db/
(Public Databases), and we use these for the database paths in the simplified example below:
Code Block |
---|
colabfold_search --db1 uniref30/uniref30_2103_db \
--db3 colabfold/colabfold_envdb_202108_db \
--mmseqs mmseqs \
--use-env 1 \
--use-templates 0 \
--threads 3 \
/PATH/TO/INPUT.fasta /n/shared_db/misc/mmseqs2 /PATH/TO/OUTPUT/DIRECTORY |
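To illustrate the expected input format, here is a hypothetical batch FASTA file containing one single protein and one two-chain complex; the record names and sequences are placeholders for illustration, not real proteins:

```shell
# Write a hypothetical batch input file (names and sequences are placeholders)
cat > INPUT.fasta <<'EOF'
>single_protein
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
>two_chain_complex
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ:MADEEKLPPGWEKRMSRSSGRVYYFNHITNASQWERPSG
EOF

# The colon in the second record separates the two chains of the complex.
# Count the records in the batch:
grep -c '^>' INPUT.fasta   # prints 2
```

The path to a file like this replaces /PATH/TO/INPUT.fasta in the colabfold_search command above.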
These commands can be combined in an sbatch
(https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793632/Using+Slurm+Basic#The-sbatch-command ) script. The resources required to complete an sbatch
job may vary by structure and complexity. It is generally best to start with a relatively conservative request for resources, then increase as needed based on information from past jobs. This information can be found using commands like O2sacct
(Get information about current and past jobs ). Below is a simplified example of an sbatch
script that runs the file INPUT.fasta
against colabfold_search
on the short
partition:
Code Block |
---|
#!/bin/bash
#SBATCH -c 4 # Requested cores
#SBATCH --time=0-12:00 # Runtime in D-HH:MM format
#SBATCH --partition=short # Partition to run in
#SBATCH --mem=100GB # Requested Memory
#SBATCH -o %j.out # File to which STDOUT will be written, including job ID (%j)
#SBATCH -e %j.err # File to which STDERR will be written, including job ID (%j)
#SBATCH --mail-type=ALL # ALL email notification type
#SBATCH --mail-user=<email_address> # Email to which notifications will be sent
module load gcc/9.2.0 cuda/11.2 localcolabfold/latest
colabfold_search --db1 uniref30/uniref30_2103_db \
--db3 colabfold/colabfold_envdb_202108_db \
--mmseqs mmseqs \
--use-env 1 \
--use-templates 0 \
--threads 3 \
/PATH/TO/INPUT.fasta /n/shared_db/misc/mmseqs2 /PATH/TO/OUTPUT/DIRECTORY |
The output should include MSAs in .a3m
format. These can be used as input to LocalColabFold in the next section, in place of a FASTA file.
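The .a3m format is FASTA-like: the query sequence comes first, followed by aligned homologs in which lowercase letters mark insertions relative to the query. Below is a minimal hand-written sketch of the layout; the sequences are illustrative placeholders, not real colabfold_search output:

```shell
# A minimal illustration of the .a3m alignment format (contents are placeholders)
cat > example.a3m <<'EOF'
>query
MKTAYIAKQRQISFVK
>homolog_1
MKT-YIgsAKQRQISFVK
EOF

# '-' marks a deletion and lowercase letters mark insertions relative to the query.
# Files like this, as produced by colabfold_search, can be passed to
# colabfold_batch in place of a .fasta file.
head -n 1 example.a3m
```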
Executing LocalColabFold On O2
LocalColabFold can be loaded as a module by running:
Code Block |
---|
$ module load gcc/9.2.0 cuda/11.2 localcolabfold/latest |
or, if LocalColabFold has been installed locally, make sure it is visible in your PATH
variable. Once you have loaded these modules, you’ll want to submit your job to the gpu
partition (or gpu_quad
if you have access) so that you can leverage GPU resources (Using O2 GPU resources ). Parameters for using LocalColabFold through the command colabfold_batch
can be shown by loading the modules above in an interactive session (https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793632/Using+Slurm+Basic#Interactive-Sessions ) and running:
Code Block |
---|
$ colabfold_batch -h |
Note |
---|
At the moment, we do not recommend invoking the --amber or --templates flags, since they cause some jobs to fail |
LocalColabFold accepts both .a3m
and .fasta
files containing amino acid sequences, including complexes where proteins are separated with a colon (:). These inputs can contain a single sequence or a "batch" of several proteins. The path to this file should be included in a colabfold_batch
command. Below is a simplified example of a colabfold_batch
command (graciously provided by the Center for Computational Biomedicine):
Code Block |
---|
$ colabfold_batch --num-recycle 5 \
--model-type AlphaFold2-multimer \
--rank ptmscore \
/PATH/TO/INPUT.fasta \
/PATH/TO/OUTPUT/DIRECTORY |
These commands can be combined in an sbatch
(https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793632/Using+Slurm+Basic#The-sbatch-command ) script. The resources required to complete an sbatch
job may vary by structure and complexity. It is generally best to start with a relatively conservative request for resources, then increase as needed based on information from past jobs. This information can be found using commands like O2sacct
(Get information about current and past jobs ). Below is a simplified example of an sbatch
script that runs the file INPUT.fasta
against colabfold_batch
on the gpu
partition:
Code Block |
---|
#!/bin/bash
#SBATCH --partition=gpu # Partition to run in
#SBATCH --gres=gpu:1 # GPU resources requested
#SBATCH -c 1 # Requested cores
#SBATCH --time=0-12:00 # Runtime in D-HH:MM format
#SBATCH --mem=25GB # Requested Memory
#SBATCH -o %j.out # File to which STDOUT will be written, including job ID (%j)
#SBATCH -e %j.err # File to which STDERR will be written, including job ID (%j)
#SBATCH --mail-type=ALL # ALL email notification type
#SBATCH --mail-user=<email_address> # Email to which notifications will be sent
module load gcc/9.2.0 cuda/11.2 localcolabfold/latest
colabfold_batch --num-recycle 5 \
--model-type AlphaFold2-multimer \
--rank ptmscore \
/PATH/TO/INPUT.fasta \
/PATH/TO/OUTPUT/DIRECTORY |
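colabfold_batch writes one PDB file per predicted model into the output directory, with the rank embedded in the file name. The exact naming scheme varies by ColabFold version, so the names below are assumptions for illustration (mocked up with empty files); the pattern of pulling the rank-001 model is the useful part:

```shell
# Mock up an output directory with typical colabfold_batch-style file names
# (names are illustrative assumptions; real names vary by ColabFold version)
mkdir -p OUTPUT
touch OUTPUT/input_unrelaxed_rank_001_model_3.pdb
touch OUTPUT/input_unrelaxed_rank_002_model_1.pdb

# rank_001 is the best-scoring model under the chosen --rank metric (ptmscore above)
ls OUTPUT/*rank_001*.pdb
```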
...
LocalColabFold is a repackaging of ColabFold for local use. This means that LocalColabFold requires all the same local hardware resources and connections that ColabFold would require (but without the Google Colab soft dependency). By default, this includes shipping the protein sequence to a remote server maintained by the ColabFold developers for processing during the alignment step. This server is shared by all users of ColabFold, and is not an HPC environment to our knowledge. This means that MSAs for LARGE BATCHES OF PROTEINS MUST BE GENERATED LOCALLY USING MMSEQS2 (see above), regardless of whether you are using the O2 module or your own installation on O2. Do note that we are unable to quantify “large”, as this is at the discretion of the system administrators maintaining the remote server.
Large volumes of submissions to the remote server may cause the submitting compute node’s IP address to be rate-limited, or even blacklisted, which would impact all users of LocalColabFold on O2 who land on that compute node. Furthermore, because there are a limited number of compute nodes with GPU resources, if volume is high enough, all of O2’s GPU compute nodes could easily be blacklisted in a short amount of time. Performing the alignment step locally with MMseqs2, as described above, avoids this risk, though it may result in longer processing times per protein.
Troubleshooting/FAQ
Errors with using --amber
or --templates
...