ColabFold (https://github.com/sokrypton/ColabFold ) is an emerging protein folding prediction tool based on Google DeepMind’s AlphaFold (see Using AlphaFold on O2). LocalColabFold (https://github.com/YoshitakaMo/localcolabfold ) is a packaging of ColabFold for use on local machines; we provide instructions on how to leverage LocalColabFold on O2 below. LocalColabFold uses MMseqs2 (considerably faster than jackhmmer), and runs AlphaFold2 for single-protein modeling and AlphaFold-Multimer for protein complex modeling. If you are unsure about which to use, feel free to try both tools and compare results.

...

Via internal testing, Research Computing has discovered that even though install_colabbatch_linux.sh installs local copies of GCC and CUDA, the resulting installation is unable to leverage them, and we were unable to provide access to these local copies. The current workaround is to use the O2 modules in their place, as shown above.

Generating MSAs Using Local MMseqs2

Generating MSAs locally using MMseqs2 reduces the load on the remote servers managed by the ColabFold developers, and allows users to run “larger” batches without risk of being rate-limited (see https://harvardmed.atlassian.net/wiki/spaces/O2/pages/2180546561/Using+Local+ColabFold+on+O2#Caveats ). MMseqs2 can be loaded from within the LocalColabFold module:

Code Block
$ module load gcc/9.2.0 cuda/11.2 localcolabfold/latest

Parameters for using MMseqs2 through the colabfold_search command can be shown by loading the modules above in an interactive session and running:

Code Block
$ colabfold_search -h

MMseqs2 accepts .fasta files containing amino acid sequences as input, including complexes in which the individual proteins are separated with a colon (:); an illustrative input file is shown after the command below. An input file can contain a single sequence or a "batch" of several proteins. The path to this file should be included in the colabfold_search command. We have public databases available in /n/shared_db/ (Public Databases), and we will use these for the database paths in the simplified example below:

Code Block
colabfold_search --db1 uniref30/uniref30_2103_db \
--db3 colabfold/colabfold_envdb_202108_db \
--mmseqs mmseqs \
--use-env 1 \
--use-templates 0 \
--threads 3 \
/PATH/TO/INPUT.fasta /n/shared_db/misc/mmseqs2 /PATH/TO/OUTPUT/DIRECTORY
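
For reference, an input file for colabfold_search might look like the following; the headers and sequences below are illustrative placeholders only, not real targets. A single entry models one protein, while an entry containing colon-separated sequences models a complex:

Code Block
>example_monomer illustrative placeholder sequence
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
>example_complex illustrative placeholder complex, chains separated by a colon
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ:GSHMLKEEAKRLVEQAGDKLV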

These commands can be combined with an sbatch (https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793632/Using+Slurm+Basic#The-sbatch-command ) script. The resources required to complete an sbatch job may vary by structure and complexity. It is generally best to start with a relatively conservative resource request, then increase as needed based on information from past jobs. This information can be found using commands like O2sacct (Get information about current and past jobs ). Below is a simplified example of an sbatch script that runs the file INPUT.fasta through colabfold_search on the short partition:

Code Block
#!/bin/bash

#SBATCH -c 4                                 # Requested cores
#SBATCH --time=0-12:00                       # Runtime in D-HH:MM format
#SBATCH --partition=short                    # Partition to run in
#SBATCH --mem=100GB                           # Requested Memory
#SBATCH -o %j.out                            # File to which STDOUT will be written, including job ID (%j)
#SBATCH -e %j.err                            # File to which STDERR will be written, including job ID (%j)
#SBATCH --mail-type=ALL                      # ALL email notification type
#SBATCH --mail-user=<email_address>          # Email to which notifications will be sent

module load gcc/9.2.0 cuda/11.2 localcolabfold/latest

colabfold_search --db1 uniref30/uniref30_2103_db \
--db3 colabfold/colabfold_envdb_202108_db \
--mmseqs mmseqs \
--use-env 1 \
--use-templates 0 \
--threads 3 \
/PATH/TO/INPUT.fasta /n/shared_db/misc/mmseqs2 /PATH/TO/OUTPUT/DIRECTORY
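
Assuming the script above has been saved as, for example, colabfold_search.sh (the file name is arbitrary), it can be submitted with sbatch, and its actual resource usage reviewed once it finishes so future requests can be tuned; the example below uses the standard sacct command, and the O2sacct wrapper referenced above provides a similar summary:

Code Block
# Submit the MSA-generation job (script name is hypothetical)
$ sbatch colabfold_search.sh

# After the job completes, review how much time and memory it actually used
$ sacct -j <jobid> --format=JobID,Elapsed,MaxRSS,TotalCPU,State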

The output should include MSAs in .a3m format. These can be submitted to LocalColabFold as input in the next section, in place of a FASTA file.
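
As a sketch of this hand-off, assuming colabfold_search wrote an MSA such as 0.a3m into the output directory (the actual file names depend on your input), that .a3m file can be passed to colabfold_batch in place of a FASTA file:

Code Block
# The .a3m file name below is illustrative; use whatever colabfold_search produced
$ colabfold_batch /PATH/TO/OUTPUT/DIRECTORY/0.a3m /PATH/TO/PREDICTIONS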

Executing LocalColabFold On O2

LocalColabFold can be loaded as a module by running:

Code Block
$ module load gcc/9.2.0 cuda/11.2 localcolabfold/latest

or, if you have installed LocalColabFold yourself, make sure its location is visible in your PATH variable. Once you have loaded these modules, you’ll want to submit your job to the gpu (or gpu_quad, if you have access) partition so that you can leverage GPU resources (Using O2 GPU resources ). Parameters for using LocalColabFold through the colabfold_batch command can be shown by loading the modules above in an interactive session (https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793632/Using+Slurm+Basic#Interactive-Sessions ) and running:

Code Block
$ colabfold_batch -h
Note

At the moment, we do not recommend invoking the --amber or --templates flags, since these cause some jobs to fail.

LocalColabFold accepts both .a3m and .fasta files containing amino acid sequences, including complexes in which the individual proteins are separated with a colon (:). An input file can contain a single sequence or a "batch" of several proteins. The path to this file should be included in the colabfold_batch command. Below is a simplified example of a colabfold_batch command (graciously provided by the Center for Computational Biomedicine):

Code Block
$ colabfold_batch --num-recycle 5 \
--model-type AlphaFold2-multimer \
--rank ptmscore \
/PATH/TO/INPUT.fasta \
/PATH/TO/OUTPUT/DIRECTORY
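
For quick tests, the same command can also be run from an interactive GPU session; the request below is a minimal sketch, and the partition, time, and memory should be adjusted to your job (use gpu_quad if you have access):

Code Block
# Request an interactive session with one GPU, then load the modules and run
$ srun --pty -p gpu --gres=gpu:1 -c 1 -t 0-02:00 --mem=25G /bin/bash
$ module load gcc/9.2.0 cuda/11.2 localcolabfold/latest
$ colabfold_batch --num-recycle 5 --model-type AlphaFold2-multimer --rank ptmscore \
    /PATH/TO/INPUT.fasta /PATH/TO/OUTPUT/DIRECTORY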

These commands can be combined into an sbatch (https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1586793632/Using+Slurm+Basic#The-sbatch-command ) script. The resources required to complete an sbatch job may vary by structure and complexity. It is generally best to start with a relatively conservative resource request, then increase as needed based on information from past jobs. This information can be found using commands like O2sacct (Get information about current and past jobs ). Below is a simplified example of an sbatch script that runs the file INPUT.fasta through colabfold_batch on the gpu partition:

Code Block
#!/bin/bash

#SBATCH --partition=gpu                      # Partition to run in
#SBATCH --gres=gpu:1                         # GPU resources requested
#SBATCH -c 1                                 # Requested cores
#SBATCH --time=0-12:00                       # Runtime in D-HH:MM format
#SBATCH --mem=25GB                           # Requested Memory
#SBATCH -o %j.out                            # File to which STDOUT will be written, including job ID (%j)
#SBATCH -e %j.err                            # File to which STDERR will be written, including job ID (%j)
#SBATCH --mail-type=ALL                      # ALL email notification type
#SBATCH --mail-user=<email_address>          # Email to which notifications will be sent

module load gcc/9.2.0 cuda/11.2 localcolabfold/latest

colabfold_batch --num-recycle 5 \
--model-type AlphaFold2-multimer \
--rank ptmscore \
/PATH/TO/INPUT.fasta \
/PATH/TO/OUTPUT/DIRECTORY
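
Assuming the script above has been saved as, for example, colabfold_batch.sh (the file name is arbitrary), submission and basic monitoring might look like the following; the STDOUT/STDERR file names come from the #SBATCH -o/-e lines above, with %j replaced by the job ID:

Code Block
# Submit the prediction job (script name is hypothetical)
$ sbatch colabfold_batch.sh

# Check on the job while it is pending or running
$ squeue -u $USER

# After it finishes, inspect the log files written by the script
$ less <jobid>.out
$ less <jobid>.err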

...

LocalColabFold is a repackaging of ColabFold for local use. This means that LocalColabFold requires all of the same local hardware resources and connections that ColabFold would require (but without the Google Colab soft dependency). In particular, by default the protein sequence is shipped to a remote server maintained by the ColabFold developers for processing during the alignment step. This server is shared by all users of ColabFold and, to our knowledge, is not an HPC environment. This means that MSAs FOR LARGE BATCHES OF PROTEINS MUST BE GENERATED LOCALLY USING MMSEQS2 (see Generating MSAs Using Local MMseqs2 above), regardless of whether you are using the O2 module or your own installation on O2. Do note that we are unable to quantify “large”, as this is at the discretion of the system administrators maintaining the remote server.

Large volumes of submissions to the remote server may cause the submitting compute node’s IP address to be rate-limited, or even blacklisted, which will impact all users of LocalColabFold on O2 who land on that compute node. Furthermore, because there are a limited number of compute nodes with GPU resources, a high enough submission volume could get all of O2’s GPU compute nodes blacklisted in a short amount of time. To avoid this, perform the alignment step locally with MMseqs2 as described above; note that local alignment may result in longer processing times per protein.

Troubleshooting/FAQ

Errors with using --amber or --templates

...