
ColabFold (https://github.com/sokrypton/ColabFold ) is an emerging protein folding prediction tool based on Google DeepMind’s AlphaFold (see Using AlphaFold on O2 ). LocalColabFold (https://github.com/YoshitakaMo/localcolabfold ) is a packaging of ColabFold for use on local machines; we provide instructions on how to leverage LocalColabFold on O2 below. LocalColabFold uses MMseqs2 (conditionally faster than jackhmmer), and runs AlphaFold2 for single protein modeling and AlphaFold-Multimer for protein complex modeling. If you are unsure about which to use, feel free to try both tools and compare results.

...

LocalColabFold is available via our LMOD module system (Using Applications on O2 ). As development of ColabFold is relatively fast, we have labeled the last updated time stamp in the help output of the module. We cannot guarantee that this module will be updated in a timely manner, so if you are interested in keeping up to date, please refer to the Installing LocalColabFold Locally section to learn how to set up a local installation you can maintain.


The default behavior of the tool is to send proteins to the remote server for alignment before returning to leverage the local GPU resource; this should only be used for small queries. We recommend processing “large” volumes of proteins by creating MSAs locally (see https://harvardmed.atlassian.net/wiki/spaces/O2/pages/2180546561/Using+Local+ColabFold+on+O2#Generating-MSAs-Using-Local-MMseqs2 below). See the Caveats section for more information.

To access the module:

Code Block
$ module load localcolabfold/1.5.2

A snapshot of the help text follows:

Code Block
$ module help localcolabfold/1.5.2

------------------------------------ Module Specific Help for "localcolabfold/1.5.2" ------------------------------------
For detailed usage instructions, go to: https://github.com/YoshitakaMo/localcolabfold

This module was created loosely based on the process outlined in the YoshitakaMo/localcolabfold repository (release 1.5.1). However, this module packages colabfold release 1.5.2.

This module currently requires gcc/9.2.0 to be loaded due to requiring external cuda libraries.
If you are working under a different compiler stack (e.g. gcc/6.2.0), you may want to install this yourself
until we offer an updated version of the cuda module under a different compiler. Visit the repository website
for more information about how to install this yourself.

This output shows the last time the module was updated. ColabFold has developed quickly at times. If you are looking for a bleeding edge version of ColabFold, it may be preferable to install it yourself and manually keep your installation up to date with developments that are pertinent to your work.

Installing LocalColabFold Locally

If you would like to maintain control over when LocalColabFold is updated, you may choose to install it to a local folder under your direct control. We outline instructions on how to do so below.

First, begin an interactive session (https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793632/Using+Slurm+Basic#Interactive-Sessions ) and load the conda module (preferably with nothing else loaded; run module purge to remove all active modules from your current environment):

Code Block
$ module load miniconda3/4.10.3

Then, cd to a location where you would like to clone the LocalColabFold repository (https://github.com/YoshitakaMo/localcolabfold ), then do so:

Code Block
$ git clone https://github.com/YoshitakaMo/localcolabfold.git

This command will create a folder in your working directory called localcolabfold. cd into this directory; if you ls this location, you should see the same files that appear in the repository view on GitHub, now accessible via your terminal. We are interested in the install_colabbatch_linux.sh file. Remember the path to this file (we will refer to this path as /path/to/install_colabbatch_linux.sh from now on; replace it with your own path). Now, cd to a location where you would like the environment to live, then invoke this script:

Code Block
$ cd /path/to/desired/location
$ sh /path/to/install_colabbatch_linux.sh

This should create a directory at /path/to/desired/location called colabfold_batch. Wait for the installation to finish.

Keeping Your LocalColabFold Installation Updated

You are probably maintaining a personal LocalColabFold installation because you would like to use the newest features as they are implemented in ColabFold, without waiting for the module to be updated (which may happen erratically, or not at all, depending on stability). To keep your installation updated, return to the location where you cloned the repository (recall that it was called localcolabfold when it was created via git clone):

Code Block
$ cd /path/to/localcolabfold

Now that you’re here, you’ll want to make sure that these scripts are updated:

Code Block
$ git pull

Now, you’ll want to run the update_linux.sh script, and pass the path to your environment as an argument:

Code Block
$ sh update_linux.sh /path/to/desired/location/colabfold_batch

After this is complete, you should have the newest version of ColabFold baked into your LocalColabFold installation.

Executing LocalColabFold Locally

Now that you have installed LocalColabFold locally, there are a couple of remaining hurdles to using this installation.

First, make sure the bin subdirectory is added to your PATH variable:

Code Block
$ export PATH=/path/to/desired/location/colabfold_batch/bin:$PATH

If you prefer, you can paste this line into your ~/.bashrc file instead so that it is automatically set up each time you log in to O2.
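As a minimal sketch of what that setup looks like (the install prefix below is a placeholder; substitute the location you chose earlier), you can confirm the bin directory actually appears in your PATH before launching jobs:

```shell
# Placeholder install prefix; substitute the location you chose earlier.
PREFIX=/path/to/desired/location/colabfold_batch
export PATH="$PREFIX/bin:$PATH"

# Verify the entry is present in PATH before relying on colabfold_batch:
case ":$PATH:" in
  *":$PREFIX/bin:"*) echo "PATH configured" ;;
  *)                 echo "PATH missing entry" ;;
esac
```

If the check fails after a fresh login, make sure the export line was actually added to your ~/.bashrc.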

In order to leverage GPU resources, you will need to load a CUDA module, which in turn requires you to load a GCC module:

Code Block
$ module load gcc/9.2.0 cuda/11.2

Via internal testing, Research Computing discovered that even though install_colabbatch_linux.sh installs local copies of GCC and CUDA, the installation is unable to leverage them, and we were unable to provide access to these local copies. The present workaround is to use the O2 modules in their place, as above.

Generating MSAs Using Local MMseqs2

Generating MSAs locally using MMseqs2 reduces the load on the remote servers managed by the ColabFold developers, and allows users to run larger batches without risk of being rate-limited (see https://harvardmed.atlassian.net/wiki/spaces/O2/pages/2180546561/Using+Local+ColabFold+on+O2#Caveats ). MMseqs2 can be loaded from within the LocalColabFold module:

Code Block
$ module load gcc/9.2.0 localcolabfold/1.5.2

Parameters for using MMseqs2 through the command colabfold_search are revealed by loading the modules above in an interactive session:

Code Block
$ colabfold_search -h

MMseqs2 accepts .fasta files containing multiple amino acid sequences as input,

Code Block
>firstfastasequencename
MWELRSIAFSRAVFAEFLATLLFVFFGLGSALNWPQALPS
>secondfastasequencename
CSMNPARSLAPAVVTGKFDDHWVFWIGPLVGAILGSLLYN

including complexes where proteins are separated with a colon (:).

Code Block
>firstxsecondcomplex
MWELRSIAFSRAVFAEFLATLLFVFFGLGSALNWPQALPS:CSMNPARSLAPAVVTGKFDDHWVFWIGPLVGAILGSLLYN

These inputs can contain a single sequence, or a "batch" of several proteins as input. The path to this file should be included in a colabfold_search command. We have public databases available in /n/shared_db/ (Public Databases) and we will use these for database paths in the simplified examples below:

Code Block
#Version 14-7e284 (loaded with localcolabfold/1.5.2)
colabfold_search \
--db-load-mode 2 \
--mmseqs mmseqs \
--use-env 1 \
--use-templates 0 \
--threads 3 \
/PATH/TO/INPUT.fasta /n/shared_db/misc/mmseqs2/14-7e284 /PATH/TO/OUTPUT/DIRECTORY

#Previous Versions
colabfold_search --db1 uniref30/uniref30_2103_db \
--db3 colabfold/colabfold_envdb_202108_db \
--mmseqs mmseqs \
--use-env 1 \
--use-templates 0 \
--threads 3 \
/PATH/TO/INPUT.fasta /n/shared_db/misc/mmseqs2 /PATH/TO/OUTPUT/DIRECTORY
Note

If you are using MMseqs2 version 14-7e284, please use /n/shared_db/misc/mmseqs2/14-7e284

These commands can be combined in an sbatch (https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793632/Using+Slurm+Basic#The-sbatch-command ) script. The resources required to complete a LocalColabFold job may vary by structure and complexity. It is generally best to start with a relatively conservative request for resources, then increase as needed based on information from past jobs. This information can be found using commands like O2_jobs_report (https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1601699912/Get+information+about+current+and+past+jobs#O2_jobs_report ). Below is a simplified example of an sbatch script that runs the file INPUT.fasta against colabfold_search on the short partition:

Code Block
#!/bin/bash

#SBATCH -c 4                                 # Requested cores
#SBATCH --time=0-12:00                       # Runtime in D-HH:MM format
#SBATCH --partition=short                    # Partition to run in
#SBATCH --mem=24GB                           # Requested Memory
#SBATCH -o %j.out                            # File to which STDOUT will be written, including job ID (%j)
#SBATCH -e %j.err                            # File to which STDERR will be written, including job ID (%j)
#SBATCH --mail-type=ALL                      # ALL email notification type
#SBATCH --mail-user=<email_address>          # Email to which notifications will be sent

module load gcc/9.2.0 localcolabfold/1.5.2

colabfold_search \
--db-load-mode 2 \
--mmseqs mmseqs \
--use-env 1 \
--use-templates 0 \
--threads 4 \
/PATH/TO/INPUT.fasta /n/shared_db/misc/mmseqs2/14-7e284 /PATH/TO/OUTPUT/DIRECTORY

The output should include MSAs in .a3m format. These can be submitted to LocalColabFold as input in the next section, similar to a FASTA file.

Executing LocalColabFold On O2

LocalColabFold can be loaded as a module by running:

Code Block
$ module load gcc/9.2.0 localcolabfold/1.5.2

Once you have loaded these modules, you’ll want to submit your job to the gpu (or gpu_quad if you have access) partition so that you can leverage GPU resources (Using O2 GPU resources ). Parameters for using LocalColabFold through the command colabfold_batch can be shown by loading the modules above in an interactive session (https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793632/Using+Slurm+Basic#Interactive-Sessions ) and running:

Code Block
$ colabfold_batch -h
Note

At the moment, we do not recommend invoking the ‘--amber’ or ‘--templates’ flags, since these cause some jobs to fail.

LocalColabFold accepts both .a3m and .fasta files containing amino acid sequences, including complexes where proteins are separated with a colon (:). These inputs can contain a single sequence, or a "batch" of several proteins as input. The path to this file should be included in a colabfold_batch command. Below is a simplified example of a colabfold_batch command (graciously provided by the Center for Computational Biomedicine):

Code Block
$ colabfold_batch --num-recycle 5 \
--model-type auto \
--rank auto \
/PATH/TO/INPUT \
/PATH/TO/OUTPUT/DIRECTORY

Similar to the previous scripts mentioned, it is best to start with a modest resource request and increase as needed. Below is a simplified example of an sbatch script that runs the file INPUT.fasta against colabfold_batch on the gpu partition:

Code Block
#!/bin/bash

#SBATCH --partition=gpu                      # Partition to run in
#SBATCH --gres=gpu:1                         # GPU resources requested
#SBATCH -c 1                                 # Requested cores
#SBATCH --time=0-12:00                    # Runtime in D-HH:MM format
#SBATCH --mem=25GB                           # Requested Memory
#SBATCH -o %j.out                            # File to which STDOUT will be written, including job ID (%j)
#SBATCH -e %j.err                            # File to which STDERR will be written, including job ID (%j)
#SBATCH --mail-type=ALL                      # ALL email notification type
#SBATCH --mail-user=<email_address>          # Email to which notifications will be sent

module load gcc/9.2.0 localcolabfold/1.5.2

colabfold_batch --num-recycle 5 \
--model-type auto \
--rank auto \
/PATH/TO/INPUT \
/PATH/TO/OUTPUT/DIRECTORY
Note

ColabFold does NOT support multiple GPUs. Please refrain from requesting more than one GPU per colabfold_batch invocation, as this will not speed up your run time, and will inhibit your ability to have your job dispatched in a timely manner.
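Since each colabfold_batch invocation should use a single GPU, one way to process many inputs in parallel is a SLURM job array with one FASTA file per array task, each task requesting its own GPU. The sketch below only assembles the command (echo stands in for the real colabfold_batch call, and all paths are illustrative):

```shell
# Outside of SLURM, emulate the task id an array job would receive:
: "${SLURM_ARRAY_TASK_ID:=0}"

# Illustrative inputs; in a real job these are your query FASTA files:
mkdir -p inputs
touch inputs/a.fasta inputs/b.fasta

# Each array task (e.g. #SBATCH --array=0-1) handles exactly one file:
FILES=(inputs/*.fasta)
INPUT="${FILES[$SLURM_ARRAY_TASK_ID]}"
echo "colabfold_batch $INPUT /PATH/TO/OUTPUT/DIRECTORY"
```

In an actual sbatch script you would replace the echo with the real colabfold_batch invocation and keep --gres=gpu:1, so each task still uses exactly one GPU.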

The output directory will contain several .pdb, .json, and .png files for the predicted complex structure. This includes pLDDT and PAE metrics that assess the accuracy of each prediction. The 'best' ranked structure will be called {sample_id}_unrelaxed_rank_1_model_{i}.pdb .
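Because {i} (the model number) varies from run to run, a glob is a convenient way to grab the top-ranked file for a sample. The sketch below fabricates two placeholder output files just to demonstrate the pattern (all names are illustrative):

```shell
# Fabricate an output directory resembling a finished run (names illustrative):
OUTDIR=outdir
SAMPLE=mysample
mkdir -p "$OUTDIR"
touch "$OUTDIR/${SAMPLE}_unrelaxed_rank_1_model_3.pdb" \
      "$OUTDIR/${SAMPLE}_unrelaxed_rank_2_model_1.pdb"

# The model number varies, so match rank_1 with a wildcard on model_{i}:
best=$(ls "$OUTDIR/${SAMPLE}"_unrelaxed_rank_1_model_*.pdb)
echo "$best"   # prints outdir/mysample_unrelaxed_rank_1_model_3.pdb
```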

Caveats

LocalColabFold is a repackaging of ColabFold for local use. This means that LocalColabFold requires all the same local hardware resources and connections that ColabFold would require (but without the Google Colab soft dependency). This includes the ability to ship protein sequences to a remote server maintained by the ColabFold developers for processing during the alignment step. This server is shared by all users of ColabFold, and to our knowledge it is not an HPC environment. This means that LARGE BATCHES OF PROTEIN ALIGNMENTS MUST BE GENERATED LOCALLY USING MMSEQS2, regardless of whether you are using the O2 module or your own installation on O2. At this time, the developers define large as “a few thousand” sequences. This could change, and is at the discretion of the system administrators maintaining the remote server. Please be considerate of other ongoing analysis on O2 when submitting large queries.

Large volumes of submissions to the remote server may cause the submitting compute node’s IP address to be rate-limited, or even blacklisted, which will impact all users of LocalColabFold on O2 that land on that compute node. Furthermore, because there are a limited number of compute nodes with GPU resources, if volume is high enough, all of O2’s GPU compute nodes can easily be blacklisted in a short amount of time.

Troubleshooting/FAQ

Errors with using --amber or --templates

As noted above, jobs will occasionally fail if either of the above flags is enabled; this is a known issue and requires action from the ColabFold developers. For now, simply resubmit the job without these flags. If these functions are required for your work, you can also try submitting your sequences against AlphaFold (see Using AlphaFold on O2 ) and adjust your resource requirements accordingly.

Please contact rchelp@hms.harvard.edu with any questions regarding the module or troubleshooting the installation process that this section does not address or addresses insufficiently. Depending on the question, we may need to refer you to the developers, but we will do our best to assist.

“I was using localcolabfold/latest, and now it’s gone! What happened to it?”

The module formerly known as localcolabfold/latest is now localcolabfold/1.3.0. At the time this module was first made available, the versioning was not as clear-cut (primarily, the relationship between LocalColabFold versions and ColabFold releases was not standardized). Now that it is, it would be improper for a latest module to not actually be the latest version, so we have renamed the module to match its true associated ColabFold release version. If you have workflows or pipelines that load this module, please change them to load 1.3.0 instead. Note that 1.3.0 and 1.5.2 do NOT use the same AlphaFold models, so you should consider them incompatible with each other in terms of research reproducibility. If you were using localcolabfold/latest for an existing project, we recommend continuing with 1.3.0, or restarting entirely with 1.5.2, not mixing and matching.

localcolabfold/1.3.0 (formerly localcolabfold/latest) sometimes hangs or crashes on the alignment step

We have recently discovered that the developers changed the way they handle alignment requests to the remote mmseqs2 server. This includes how the request is structured and sent/received. Unfortunately, this means that alignment functionality with versions older than 1.5.2 may be impacted. Presently, the workaround for this is to generate your alignments locally with our mmseqs modules. For reference, you may check to see which mmseqs2 module is used with each LocalColabFold installation by loading the corresponding module and typing module list.