Nextflow (`nf-core`) on O2

Nextflow is a workflow system for creating scalable, portable, and reproducible workflows. It is based on the dataflow programming model, which greatly simplifies the writing of parallel and distributed pipelines, allowing you to focus on the flow of data and computation. Due to software requirements, HMS IT is unable to provide global support for Nextflow-based workflows. However, general guidance and procedures for setting up a local pipeline are outlined below.

Installing Nextflow (and `nf-core`)

Users generally have a choice to make here. Nextflow is the generic option that allows for workflow creation and customization from scratch. nf-core is both a repository of existing pipelines and a utility for executing those pipelines, acting as a drop-in replacement for (most of) Nextflow itself.

Installing Nextflow does NOT install nf-core, but installing nf-core DOES install Nextflow.

Nextflow

To install Nextflow alone, Java is required. It is sufficient to load one of the existing Java modules:

$ module load java/jdk-21.0.2

(java/jdk-21.0.2 is the newest available Java module at the time of writing.)

Next, navigate to where you would like the nextflow executable to be generated, then run the following (in an interactive session):

$ curl -s https://get.nextflow.io | bash

If at this point you would like to relocate the executable to somewhere more convenient, you may do so with mv. Make sure the directory containing the nextflow executable is on your PATH variable, and you are ready to use Nextflow.

$ export PATH=/path/to/nextflow:$PATH
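
If you add this export line to a startup file, the same directory can end up on PATH several times. The following is a minimal sketch of a guard against that; the function name add_to_path is hypothetical and is not part of Nextflow or O2.

```shell
# Prepend a directory to PATH only if it is not already present.
# add_to_path is a hypothetical helper name used for illustration.
add_to_path() {
    dir=$1
    case ":$PATH:" in
        *":$dir:"*) ;;                          # already on PATH; do nothing
        *) PATH="$dir:$PATH"; export PATH ;;    # otherwise prepend it
    esac
}
```

After defining the function, a check that the executable is found might look like this:

$ add_to_path /path/to/nextflow
$ command -v nextflow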

`nf-core`

To install nf-core and Nextflow together, the easiest method is to create a conda environment. See Conda on O2 for generic instructions to create a conda environment. The command to use is below:

$ module load miniconda3/23.1.0
$ conda create --name nf-core python=3.12 nf-core nextflow

Next, activate the environment:

$ source activate nf-core

Both nf-core and Nextflow should now be available to use.
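
To confirm that the environment exposes both tools, a small check such as the following sketch can help. The helper name require_tools is hypothetical; it relies only on the standard `command -v` builtin.

```shell
# Report which of the named commands are missing from PATH.
# require_tools is a hypothetical helper name used for illustration.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    # Exit status is 0 when every tool was found, 1 otherwise.
    return "$missing"
}
```

With the environment active, it could be used like this:

(nf-core)$ require_tools nextflow nf-core && echo "environment looks ready"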

Creating Custom Nextflow Pipelines

Unfortunately, HMS IT is unable to support custom workflow creation beyond a surface level due to the high degree of customization involved. If you are interested in creating your own Nextflow workflow, please see the Nextflow documentation for guidance on how to set up the structure correctly. HMS IT may be able to make recommendations related to resource requirements on O2.

Executing `nf-core` Pipelines

Users who are interested in leveraging existing nf-core workflows may do so using the nf-core utility installed via the instructions above. Generally, these workflows are invoked with the singularity profile for reproducibility purposes, though the conda profile is also supported on O2.

O2 does not support software execution profiles other than singularity and conda at this time.

If using the singularity profile, it is necessary to move the associated containers to a whitelisted directory, per O2 containerization policy.

With the nf-core conda environment active, download the containers associated with the pipeline with this command (while in an interactive session):

(nf-core)$ nf-core download -x none --container-system singularity --parallel-downloads 8 nf-core/PIPELINENAME

Here, PIPELINENAME is the name of the nf-core pipeline as listed on the nf-core website or in the associated GitHub repository.

At this point, please contact HMS IT at rchelp@hms.harvard.edu for assistance with moving these containers to the whitelisted location.

Once the containers are in the appropriate location, the NXF_SINGULARITY_CACHEDIR variable needs to be set before executing the pipeline:

(nf-core)$ export NXF_SINGULARITY_CACHEDIR=/n/app/singularity/containers/HMSID/nf-core/PIPELINENAME/PIPELINEVERSION

This allows you to execute the containers associated with PIPELINEVERSION of PIPELINENAME without re-downloading them locally. This directory may change, as desired paths can be negotiated on a per-request basis. However, we recommend the nf-core/PIPELINENAME/PIPELINEVERSION layout so that users and labs can maintain multiple pipelines and versions side by side. If a user or lab no longer needs an older version, we ask that they request removal of those containers.

This variable needs to be reset whenever you switch to a different pipeline or pipeline version.
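
Because the path follows a fixed pattern, resetting it can be wrapped in a small function. This is only a sketch: the helper name set_nf_cache is hypothetical, and the base path is the example location shown above, which may differ for your lab.

```shell
# Build the cache path from its three components and export it.
# set_nf_cache is a hypothetical helper; the base path below is the example
# location from this page and is an assumption for any given lab.
set_nf_cache() {
    hmsid=$1; pipeline=$2; version=$3
    NXF_SINGULARITY_CACHEDIR="/n/app/singularity/containers/$hmsid/nf-core/$pipeline/$version"
    export NXF_SINGULARITY_CACHEDIR
    # Warn (but do not fail) if the directory is not there yet.
    if [ ! -d "$NXF_SINGULARITY_CACHEDIR" ]; then
        echo "warning: $NXF_SINGULARITY_CACHEDIR does not exist" >&2
    fi
}
```

Switching pipelines then becomes a single call:

(nf-core)$ set_nf_cache HMSID PIPELINENAME PIPELINEVERSION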

From there, modification of the pipeline configuration files may be necessary. To start, there should be a nextflow.config file located at /path/to/nf-core-PIPELINENAME_VERSION/SLIGHTLYDIFFERENTLOOKINGVERSION/nextflow.config. This file will contain parameter settings associated with various steps in the workflow, as well as global maximum resource requirements.
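
Since the version subdirectory name can differ slightly from the version you requested, it may be easier to search for the file than to guess the path. A minimal sketch, assuming the layout described above (the helper name find_pipeline_config is hypothetical):

```shell
# Print every nextflow.config found at most two levels below a directory.
# find_pipeline_config is a hypothetical helper name used for illustration.
find_pipeline_config() {
    find "$1" -maxdepth 2 -type f -name nextflow.config
}
```

For example:

$ find_pipeline_config /path/to/nf-core-PIPELINENAME_VERSION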

Integrating Pipelines with `slurm`

Nextflow/nf-core does not submit work to an HPC scheduler out of the box with its standard workflow configurations, but it can be configured manually to do so.

Boilerplate O2 Configuration File

The following is an example of a configuration file that will allow you to submit individual steps of the pipeline as jobs to O2’s slurm scheduler. Presently, only short, medium, and long partitions are represented. If you would like to leverage a different partition (such as a contributed partition), please make edits to this file accordingly.

Presently there is no boilerplate configuration available for GPU utilization. Please contact rchelp@hms.harvard.edu for inquiries about leveraging GPUs with your Nextflow/nf-core pipeline via slurm.

Paste this configuration into a text file on O2 (for example, in the current working directory) and save it as something like nextflow_slurm.config. You can then invoke your pipeline using nextflow ... -c nextflow_slurm.config -profile cluster,singularity so that this configuration file is applied in addition to, and takes priority over, the existing workflow configurations.

//Use the params to define reference files, directories, and CLI options
params {

    config_profile_description = 'HMS RC test nextflow/nf-core config'
    config_profile_contact = 'rchelp@hms.harvard.edu'
    config_profile_url = 'rc.hms.harvard.edu'
    max_memory = 250.GB
    // maximum number of cpus and time for slurm jobs
    max_cpus = 20
    max_time = 30.d

}

profiles {

    singularity {
        singularity.enabled = true
        singularity.autoMounts = true
    }

    cluster {
        process {
            executor = 'slurm'
            cache = 'lenient'
            queue = { task.time > 5.d ? 'long' : task.time <= 12.h ? 'short' : 'medium' }
        }
    }

    local {
        process.executor = 'local'
    }
}

executor {
    $slurm {
        queueSize = 1900
        submitRateLimit = '20 sec'
    }
}

//Miscellaneous CLI flags
resume = true
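
With the configuration above saved as nextflow_slurm.config, it can help to print the full launch command and review it before running. The sketch below only assembles the command as text. The helper name nf_launch_cmd is hypothetical, and `--outdir` is the output parameter commonly used by nf-core pipelines; whether a given pipeline accepts it is an assumption to verify.

```shell
# Print (rather than run) a launch command so it can be reviewed first.
# nf_launch_cmd is a hypothetical helper; --outdir is an assumed pipeline
# parameter and should be checked against the pipeline's documentation.
# Arguments containing spaces are not quoted in the printed command.
nf_launch_cmd() {
    pipeline_dir=$1; config=$2; outdir=$3
    printf 'nextflow run %s -c %s -profile cluster,singularity --outdir %s\n' \
        "$pipeline_dir" "$config" "$outdir"
}
```

For example:

(nf-core)$ nf_launch_cmd /path/to/nf-core-PIPELINENAME_VERSION/SLIGHTLYDIFFERENTLOOKINGVERSION nextflow_slurm.config results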

Configuration File Specification

The following is a brief summary of each section of this configuration file and its function:

params designates global maximum job allocation parameters, as well as configuration metadata.
- max_memory is a global limit on how much memory any single job can request of the scheduler.
- max_cpus is a global limit on how many cores any single job can request of the scheduler.
- max_time is a global limit on how much wall time (real-life duration) any single job can request of the scheduler. 30.d (30 days) is the hard limit.

If you have access to resources that may allow you more than these values, you can consider modifying them accordingly.

profiles describes methods by which the pipeline can be invoked. This is specified at execution time via nextflow ... -profile profilename1,profilename2,.... At least one profile name must be specified. The profile names in this file are in addition to the default profiles (the singularity profile in this file augments the default singularity profile implemented by Nextflow, etc.).
- the singularity profile sets parameters to allow usage of Singularity containers on O2 to execute pipeline steps. You shouldn’t need to mess with this profile.
- the cluster profile sets parameters to allow submission of pipeline steps via O2’s slurm scheduler.
  - the only parameter you may be interested in is the queue parameter, which governs which partition a pipeline step is submitted to.
    - If a pipeline step requires 12 hours or less, it is submitted to short. If it requires more than 12 hours but no more than 5 days, it is submitted to medium. Otherwise, it is submitted to long.
    - If you have access to additional partitions (such as mpi, highmem, contributed partitions, etc.), set queue accordingly.
      - Keep in mind that such special partitions do not have the same time limits (other than the 30 day limit), so if you would like to integrate one or more of these partitions with the existing short / medium / long paradigm, you will likely need to modify one or more of the pipeline-specific configuration files as well. Please contact rchelp@hms.harvard.edu with inquiries about this.
      - If you are planning to use a specialized partition exclusively, then simply overwrite the queue specification with that partition name.
- the local profile causes the pipeline to be executed within your existing resource allocation (e.g., inside the active interactive session). You need to make sure you have requested the MAXIMUM number of cores and amount of memory required by any one step of the pipeline in order for this profile to execute successfully.
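
To check by hand which partition a given time request would land in, the queue rule from the cluster profile can be mirrored in a few lines of shell. This is an illustration only and plays no part in how Nextflow evaluates the rule; the helper name pick_partition is hypothetical, and it accepts whole hours only.

```shell
# Mirror the cluster profile's queue rule for a wall time given in whole hours.
# pick_partition is a hypothetical helper name used for illustration.
pick_partition() {
    hours=$1
    if [ "$hours" -gt 120 ]; then       # more than 5 days
        echo long
    elif [ "$hours" -le 12 ]; then      # 12 hours or less
        echo short
    else                                # over 12 hours, up to 5 days
        echo medium
    fi
}
```

For example, pick_partition 12 prints short, pick_partition 13 prints medium, and pick_partition 121 prints long.
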
executor describes how the pipeline processes will be run (such as on local compute resources, on cloud resources, or by interacting with a cluster compute scheduler). The executor keeps track of each process and whether it succeeds or fails.
- When using the slurm executor, Nextflow can submit each process in the workflow as an sbatch job.
  - Additional parameters that govern the Slurm job submission process are queueSize and submitRateLimit. The queueSize is how many tasks can be processed at one time; here we use 1900 tasks. The submitRateLimit is the maximum number of jobs that will be submitted in a specified time interval. In our file, '20 sec' limits submission to 20 jobs per second.
- When using the local executor, Nextflow will run each process using the resources available on the current compute node.

Cleaning Up After Execution

After your pipeline completes, there will be a work directory at the location where you executed the workflow (not to be confused with your output directory). You may find it useful to occasionally delete this directory, especially if you find that you are using far more space than anticipated. You can keep track of your utilization with the quota-v2 tool (see https://harvardmed.atlassian.net/wiki/spaces/O2/pages/1588662343/Filesystem+Quotas#Checking-Usage ).

Note that there will be a work directory at every location where you have ever executed a pipeline, so you may need to remove multiple work directories if you do not have an established means of organization for juggling multiple workflows.

Also note that resume (checkpointing) functionality will not work if you remove the work directory for a given workflow. Nextflow will treat the next run as starting over from the beginning.
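
If you have lost track of where pipelines were launched, a search like the following sketch can help locate candidate work directories and their sizes. It relies on a heuristic: Nextflow creates a .nextflow directory in the launch directory, so only directories named work that sit beside one are reported. The helper name list_work_dirs is hypothetical.

```shell
# List directories named "work" that sit beside a .nextflow directory
# and print the disk usage of each.
# list_work_dirs is a hypothetical helper; the .nextflow check is a heuristic
# and may miss or misidentify directories in unusual layouts.
list_work_dirs() {
    find "$1" -type d -name work -prune 2>/dev/null | while IFS= read -r w; do
        if [ -d "$(dirname "$w")/.nextflow" ]; then
            du -sh "$w"
        fi
    done
}
```

For example:

$ list_work_dirs /path/to/your/project/space

Review the output before deleting anything, keeping in mind that removing a work directory disables resume for that workflow.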