Using (Local)ColabFold on O2
ColabFold (GitHub - sokrypton/ColabFold: Making Protein folding accessible to all! ) is an emerging protein folding prediction tool based on Google DeepMind's AlphaFold (see Using AlphaFold 2 on O2 ). LocalColabFold (GitHub - YoshitakaMo/localcolabfold: ColabFold on your local PC ) is a packaging of ColabFold for use on local machines; we provide instructions on how to leverage LocalColabFold on O2 below. LocalColabFold uses MMseqs2 (conditionally faster than jackhmmer) for MSA generation, and runs AlphaFold2 for single protein modeling and AlphaFold-Multimer for protein complex modeling. If you are unsure which to use, feel free to try both tools and compare results. O2's colabfold module is loosely based on LocalColabFold's deployment methods.
Note: If you're new to Slurm or O2, please see Using Slurm Basic for more information on submitting jobs.
Using the Module
As of the current module (colabfold/1.5.5-8c55fc2), ColabFold does NOT support AlphaFold 3. For information on how to access and use AlphaFold 3, please refer to Using AlphaFold 3 on O2 .
ColabFold is available via our LMOD module system (Using Applications on O2 ).
By default, the tool sends proteins to a remote server for alignment before returning to leverage local GPU resources; this workflow should only be used for small queries. We recommend processing "large" volumes of proteins by creating MSAs locally (see Using (Local)ColabFold on O2 | Generating MSAs Using Local MMseqs2 below). See the Caveats section for more information.
To identify available versions:
$ module spider colabfold
To access the module (for example, colabfold/1.5.5-8c55fc2):
$ module load colabfold/1.5.5-8c55fc2
A snapshot of the help text follows:
$ module help colabfold/1.5.5-8c55fc2
--------------------------------- Module Specific Help for "colabfold/1.5.5-8c55fc2" ---------------------------------
This module is loosely based on https://github.com/YoshitakaMo/localcolabfold
For instructions, refer to:
https://github.com/sokrypton/colabfold
This module is based on the 8c55fc2548823d57bb23c24a6b2573348e6d51c8 commit of the colabfold repository.
Databases are located at `/n/shared_db/colabfold/1.5.5-8c55fc2`, and were generated using scripts provided by colabfold running mmseqs from mmseqs2/17-b804f.
This output shows which ColabFold commit the module is based on (i.e., when the module was last updated). ColabFold has developed quickly at times; if you are looking for a bleeding-edge version of ColabFold, you can install your own copy (Installing LocalColabFold Locally | Installing LocalColabFold) and manually keep it up to date.
Generating MSAs Using Local MMseqs2
Generating MSAs locally using MMseqs2 reduces the load on the remote servers managed by the ColabFold developers, and allows users to run larger batches without risk of being rate-limited (see Using (Local)ColabFold on O2 | Caveats ). MMseqs2 can be loaded from within the ColabFold module:
$ module load colabfold/1.5.5-8c55fc2
$ module list
Currently Loaded Modules:
1) mmseqs2/17-b804f 2) colabfold/1.5.5-8c55fc2 (E)
Where:
E: Experimental
Parameters for using MMseqs2 through the colabfold_search command can be shown by loading the modules above in an interactive session:
$ colabfold_search -h
MMseqs2 accepts .fasta files containing multiple amino acid sequences as input,
>firstfastasequencename
MWELRSIAFSRAVFAEFLATLLFVFFGLGSALNWPQALPS
>secondfastasequencename
CSMNPARSLAPAVVTGKFDDHWVFWIGPLVGAILGSLLYN
including complexes where proteins are separated with a colon (:).
>firstxsecondcomplex
MWELRSIAFSRAVFAEFLATLLFVFFGLGSALNWPQALPS:CSMNPARSLAPAVVTGKFDDHWVFWIGPLVGAILGSLLYN
These inputs can contain a single sequence, or a "batch" of several proteins. The path to this file should be included in a colabfold_search command. We have public databases available in /n/shared_db/ (Public Databases), and we will use these for the database paths in the simplified examples below:
# Version 17-b804f (loaded with colabfold/1.5.5-8c55fc2)
colabfold_search \
--db-load-mode 2 \
--mmseqs mmseqs \
--use-env 1 \
--use-templates 0 \
--threads 3 \
/PATH/TO/INPUT.fasta /n/shared_db/colabfold/1.5.5-8c55fc2 /PATH/TO/OUTPUT/DIRECTORY
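Because each FASTA record is treated as a separate query, it can help to confirm how many sequences a batch file contains before submitting. A quick check using standard shell tools (the path is a placeholder):
$ grep -c "^>" /PATH/TO/INPUT.fasta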
If you are trying to access older versions of the databases, investigate the use of the --db1, --db2, --db3, and --db4 flags. You can view what these flags do by invoking colabfold_search -h. If your older database still will not parse, you may need to install an older version of mmseqs2 - Research Computing can assist with this as necessary.
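To locate these flags in the help text without scrolling through all of it, you can filter the output (a simple sketch using grep):
$ colabfold_search -h | grep -A 1 -E "db[1-4]"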
These commands can be combined with an sbatch (Using Slurm Basic | The sbatch command ) script. The resources required to complete a LocalColabFold job may vary by structure and complexity. It is generally best to start with a relatively conservative resource request, then increase as needed based on information from past jobs. This information can be found using commands like O2_jobs_report (Get information about current and past jobs | O2_jobs_report ). Below is a simplified example of an sbatch script that runs the file INPUT.fasta against colabfold_search on the short partition:
#!/bin/bash
#SBATCH -c 4 # Requested cores
#SBATCH --time=0-12:00 # Runtime in D-HH:MM format
#SBATCH --partition=short # Partition to run in
#SBATCH --mem=24GB # Requested Memory
#SBATCH -o %j.out # File to which STDOUT will be written, including job ID (%j)
#SBATCH -e %j.err # File to which STDERR will be written, including job ID (%j)
#SBATCH --mail-type=ALL # ALL email notification type
#SBATCH --mail-user=<email_address> # Email to which notifications will be sent
module load colabfold/1.5.5-8c55fc2
colabfold_search \
--db-load-mode 2 \
--mmseqs mmseqs \
--use-env 1 \
--use-templates 0 \
--threads 4 \
/PATH/TO/INPUT.fasta /n/shared_db/colabfold/1.5.5-8c55fc2 /PATH/TO/OUTPUT/DIRECTORY
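Save the script to a file (for example, colabfold_search.sh - this name is just for illustration) and submit it from a login node:
$ sbatch colabfold_search.sh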
The output should include MSAs in .a3m format. These can be submitted to LocalColabFold as input in the next section, similar to a FASTA file.
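As a quick sanity check, you can count the MSAs produced; colabfold_search generally writes one .a3m file per query sequence (exact file naming may vary by version):
$ ls /PATH/TO/OUTPUT/DIRECTORY/*.a3m | wc -l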
Additional Considerations
As of the mmseqs2/17-b804f module version, mmseqs2 supports generating MSAs on GPU resources, which may result in significant performance improvements (as well as eliminating the wait for the remote server to return a payload). For more information, please refer to GitHub - sokrypton/ColabFold: Making Protein folding accessible to all! . O2-specific usage examples are forthcoming. Please contact CCB (Core for Computational Biomedicine, ccbhelp@hms.harvard.edu) if you have difficulty adapting the instructions on the repository website.
Executing ColabFold On O2
ColabFold can be loaded as a module by running:
$ module load colabfold/1.5.5-8c55fc2
Once you have loaded these modules, you'll want to submit your job to the gpu partition (or gpu_quad, if you have access) so that you can leverage GPU resources (Using O2 GPU resources ). Parameters for using LocalColabFold through the colabfold_batch command can be shown by loading the modules above in an interactive session (Using Slurm Basic | Interactive Sessions ) and running:
$ colabfold_batch -h
At the moment, we do not recommend invoking the --amber or --templates flags, since these cause some jobs to fail.
ColabFold accepts both .a3m and .fasta files containing amino acid sequences, including complexes where proteins are separated with a colon (:). These inputs can contain a single sequence, or a "batch" of several proteins. The path to this file should be included in a colabfold_batch command. Below is a simplified example of a colabfold_batch command (graciously provided by the Center for Computational Biomedicine):
$ colabfold_batch --num-recycle 5 \
--model-type auto \
--rank auto \
/PATH/TO/INPUT \
/PATH/TO/DATABASES \
/PATH/TO/OUTPUT/DIRECTORY
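If you generated MSAs locally in the previous section, the resulting .a3m file (or a directory containing several of them) can generally be supplied as the input path in place of a FASTA file, keeping the rest of the command the same. A minimal sketch with placeholder paths:
$ colabfold_batch --num-recycle 5 \
--model-type auto \
--rank auto \
/PATH/TO/MSA/OUTPUT/DIRECTORY \
/PATH/TO/DATABASES \
/PATH/TO/OUTPUT/DIRECTORY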
Similar to the previous scripts, it is best to start with a modest resource request and slowly increase as needed. Below is a simplified example of an sbatch script that runs the file INPUT.fasta against colabfold_batch on the gpu partition:
#!/bin/bash
#SBATCH --partition=gpu # Partition to run in
#SBATCH --gres=gpu:1 # GPU resources requested
#SBATCH -c 1 # Requested cores
#SBATCH --time=0-12:00 # Runtime in D-HH:MM format
#SBATCH --mem=128GB # Requested Memory
#SBATCH -o %j.out # File to which STDOUT will be written, including job ID (%j)
#SBATCH -e %j.err # File to which STDERR will be written, including job ID (%j)
#SBATCH --mail-type=ALL # ALL email notification type
#SBATCH --mail-user=<email_address> # Email to which notifications will be sent
module load colabfold/1.5.5-8c55fc2
colabfold_batch --num-recycle 5 \
--model-type auto \
--rank auto \
/PATH/TO/INPUT \
/n/shared_db/colabfold/1.5.5-8c55fc2 \
/PATH/TO/OUTPUT/DIRECTORY
ColabFold does NOT support multiple GPUs. Please refrain from requesting more than one GPU per colabfold_batch invocation, as this will not speed up your run time, and will only delay the dispatch of your job.
The output directory will contain several .pdb, .json, and .png files for the predicted complex structure. These include pLDDT and PAE metrics that assess the accuracy of each prediction. The 'best' ranked structure will be called {sample_id}_unrelaxed_rank_1_model_{i}.pdb.
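For example, to list the top-ranked model and its associated score files in the output directory (exact file names can vary between ColabFold versions):
$ ls /PATH/TO/OUTPUT/DIRECTORY | grep rank_1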
Caveats
LocalColabFold is a repackaging of ColabFold for local use. This means that LocalColabFold requires all the same local hardware resources and connections that ColabFold would require (but without the Google Colab soft dependency). This includes shipping the protein sequence to a remote server maintained by the ColabFold developers for processing during the alignment step. This server is shared by all users of ColabFold, and to our knowledge is not an HPC environment. This means that LARGE BATCHES OF PROTEIN ALIGNMENTS MUST BE GENERATED LOCALLY USING MMSEQS2, regardless of whether you are using the O2 module or your own installation on O2. At this time, the developers define large as "a few thousand" sequences. This could change, and is at the discretion of the system administrators maintaining the remote server. Please be considerate of other ongoing analysis on O2 when submitting large queries.
Large volumes of submissions to the remote server may cause the submitting compute node's IP address to be rate-limited, or even blacklisted, which will impact all users of LocalColabFold on O2 who land on that compute node. Furthermore, because there are a limited number of compute nodes with GPU resources, a high enough submission volume could blacklist all of O2's GPU compute nodes in a short amount of time.
Troubleshooting/FAQ
Errors when using --amber or --templates
As noted above, jobs will occasionally fail if either of these flags is enabled - this is a known issue and requires action from the ColabFold developers. For now, simply resubmit the job without these flags; if these functions are required for your work, you can also try submitting your sequences against AlphaFold (and adjust your resource requirements accordingly).
Please contact rchelp@hms.harvard.edu with any questions regarding the module, or about troubleshooting the installation process, that this section does not address or addresses insufficiently. Depending on the question, we may need to refer you to the developers, but we will do our best to assist.
My jobs are unable to detect GPUs and crash with a CUDA_ERROR_NOT_FOUND error
This section refers to errors that look something like this:
2024-07-15 08:39:57.833390: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:628] failed to get PTX kernel "shift_right_logical" from module: CUDA_ERROR_NOT_FOUND: named symbol not found
2024-07-15 08:39:57.833441: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2153] Execution of replica 0 failed: INTERNAL: Could not find the corresponding function
Make sure you are submitting to a GPU-enabled partition and are requesting a GPU resource (see Using O2 GPU resources ).
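If you are unsure whether your session actually has a GPU attached, you can start an interactive session on a GPU partition and check with nvidia-smi (the partition, time, and GPU count below are examples only):
$ srun --pty -p gpu --gres=gpu:1 -t 0-01:00 /bin/bash
$ nvidia-smi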