GATK4 Mutect2 using Singularity Container
This page shows you how to run GATK4 using our recently installed Singularity GATK4 container. The runAsPipeline script, accessible through the rcbio module (the commands below load rcbio/1.3.3), converts a bash script into a pipeline and submits its steps to the Slurm scheduler for you.
Features of this pipeline:
Given a sample sheet, generate folder structure for data processing
Submit each step as a cluster job using sbatch.
Automatically arrange dependencies among jobs.
Email notifications are sent when each job fails or succeeds.
If a job fails, all of its downstream jobs are automatically killed.
When the pipeline is re-run on the same data folder, the user is asked whether to kill any unfinished jobs.
When the pipeline is re-run on the same data folder, the user is asked to confirm whether to re-run each step that already finished successfully.
Please read below for an example.
The workflows are downloaded from: https://github.com/gatk-workflows/gatk4-data-processing and https://github.com/gatk-workflows/gatk4-somatic-snvs-indels
Jumpstart
Here are the commands to test out the workflow using example data. The whole run needs a few hours if the cluster is not busy.
ssh user123@o2.hms.harvard.edu
# set up screen software: https://wiki.rc.hms.harvard.edu/pages/viewpage.action?pageId=20676715
cp /n/shared_db/misc/rcbio/data/screenrc.template.txt ~/.screenrc
screen
srun --pty -p interactive -t 0-12:0:0 --mem 16000MB -n 2 /bin/bash
mkdir -p /n/scratch/users/${USER:0:1}/$USER/testGATK4
cd /n/scratch/users/${USER:0:1}/$USER/testGATK4
module load gcc/6.2.0 python/2.7.12 rcbio/1.3.3
export PATH=/n/shared_db/singularity/hmsrc-gatk/bin:/opt/singularity/bin:$PATH
# Set up the database. You only need to run this once. It sets up the database in your home directory, so make sure you have at least 5 GB of free space there.
setupDB.sh
cp /n/shared_db/singularity/hmsrc-gatk/scripts/* .
buildSampleFoldersFromSampleSheet.py sampleSheet.xlsx
runAsPipeline fastqToBam.sh "sbatch -p short --mem 4G -t 1:0:0 -n 1" noTmp run 2>&1 | tee output.log
# check email or use this command to see the workflow progress
squeue -u $USER -o "%.18i %.9P %.28j %.8u %.8T %.10M %.9l %.6D %R %S"
# After all jobs have finished, run this command to start the database and keep it running in the background
runDB.sh &
java -XX:+UseSerialGC -Dconfig.file=your.conf -jar /n/shared_db/singularity/hmsrc-gatk/cromwell-43.jar run processing-for-variant-discovery-gatk4.wdl -i unmappedBams/group1/in.json 2>&1 | tee -a group1.log
java -XX:+UseSerialGC -Dconfig.file=your.conf -jar /n/shared_db/singularity/hmsrc-gatk/cromwell-43.jar run processing-for-variant-discovery-gatk4.wdl -i unmappedBams/group2/in.json 2>&1 | tee -a group2.log
setupJson.sh
java -XX:+UseSerialGC -Dconfig.file=your.conf -jar /n/shared_db/singularity/hmsrc-gatk/cromwell-43.jar run mutect2.wdl -i unmappedBams/exon.json && findVCF.sh
# Stop database
killall runDB.sh
Details
Log on to O2
If you need help connecting to O2, please review the Using Slurm Basic wiki page.
From Windows, use the graphical PuTTY program to connect to o2.hms.harvard.edu and make sure the port is set to the default value of 22.
From a Mac Terminal, use the ssh command, inserting your HMS ID instead of user123:
ssh user123@o2.hms.harvard.edu
# set up screen software: https://wiki.rc.hms.harvard.edu/pages/viewpage.action?pageId=20676715
cp /n/shared_db/misc/rcbio/data/screenrc.template.txt ~/.screenrc
screen # start a screen session; see the link above for details
Start an interactive job, and create a working folder
srun --pty -p interactive -t 0-12:0:0 --mem 16000MB -n 2 /bin/bash
mkdir -p /n/scratch/users/${USER:0:1}/$USER/testGATK4
cd /n/scratch/users/${USER:0:1}/$USER/testGATK4
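The scratch path above relies on bash substring expansion: ${USER:0:1} is the first letter of your user name. As a small sketch of how the working folder name is built (user123 is the placeholder HMS ID used on this page, and the variable names are only for illustration):

```shell
#!/usr/bin/env bash
# Sketch: derive the scratch working folder the same way the commands above do.
# "user123" is the placeholder HMS ID from this page; on O2 you would use $USER.
hms_id="user123"
first_letter="${hms_id:0:1}"     # bash substring expansion: offset 0, length 1
work_dir="/n/scratch/users/${first_letter}/${hms_id}/testGATK4"
echo "$work_dir"                 # prints /n/scratch/users/u/user123/testGATK4
```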
Load the pipeline-related modules
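The commands for this step are the ones listed in the Jumpstart section. They are repeated here as a sketch; the command -v check is an addition, assumed only so that the snippet does nothing harmful when pasted on a machine without the module system.

```shell
# Load the modules only where the "module" command exists (that is, on O2).
if command -v module >/dev/null 2>&1; then
    module load gcc/6.2.0 python/2.7.12 rcbio/1.3.3
fi
# Put the GATK4 container wrapper scripts and Singularity on the PATH.
export PATH=/n/shared_db/singularity/hmsrc-gatk/bin:/opt/singularity/bin:$PATH
```

The Jumpstart then runs setupDB.sh once to set up the database in your home directory.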
Build some testing data in the current folder
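In the Jumpstart section, this step copies the example scripts into the working folder and builds the sample folders from the sample sheet. A guarded sketch of those two commands follows; the directory check is an assumption added so that nothing is copied when the shared folder is not available, and it assumes sampleSheet.xlsx is among the copied files, as the Jumpstart order suggests.

```shell
# Location of the example scripts, as used in the Jumpstart section.
scripts_dir=/n/shared_db/singularity/hmsrc-gatk/scripts

if [ -d "$scripts_dir" ]; then
    # Copy the example files into the current (working) folder.
    cp "$scripts_dir"/* .
    # Generate the folder structure described by the sample sheet.
    buildSampleFoldersFromSampleSheet.py sampleSheet.xlsx
else
    echo "not on O2: $scripts_dir not found, skipping" >&2
fi
```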
Take a look at the example files
The original bash script
How does this bash script work?
A loop goes through the two group folders and their samples, converting the fastq files into unmapped BAM files.
These comments are recognized by our pipeline runner, and the commands following them are submitted as Slurm jobs:
This command generates a new bash script named slurmPipeLine.201907200946.sh in the flag folder (201907200946 is the timestamp at which runAsPipeline was invoked). It then test-runs the script: no jobs are actually submitted, and a fake job ID, 123, is created for each step. If you append run at the end of the command, the pipeline is actually submitted to the Slurm scheduler.
Ideally, with 'useTmp', the software should run faster because it reads the database/reference from local /tmp disk space rather than from network storage. This workflow does not need it, so the example uses noTmp.
Sample output from the test run
Note that only step 2 used -t 50:0, and all other steps used the default -t 10:0. The default walltime limit was set in the runAsPipeline command, and the walltime parameter for step 2 was set in bash_script_v2.sh.
Run the pipeline
Thus far in the example, we have not actually submitted any jobs to the scheduler. To submit the pipeline, you will need to append the run parameter to the command. If run is not specified, test mode will be used, which does not submit jobs and uses the placeholder 123 for job IDs in the command's output.
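To make the difference between the two modes concrete, here is a small sketch that only prints the two command lines and submits nothing. The runAsPipeline arguments are taken from the Jumpstart section; the helper function pipeline_cmd is hypothetical and exists only for this illustration.

```shell
# Print the runAsPipeline command line for a given mode (nothing is executed).
pipeline_cmd() {
    mode="$1"    # "test" (the default behavior) or "run"
    cmd='runAsPipeline fastqToBam.sh "sbatch -p short --mem 4G -t 1:0:0 -n 1" noTmp'
    if [ "$mode" = "run" ]; then
        # Appending "run" is what makes the runner really submit the jobs.
        cmd="$cmd run"
    fi
    printf '%s\n' "$cmd"
}

pipeline_cmd test   # test mode: placeholder job ID 123, no jobs submitted
pipeline_cmd run    # run mode: jobs go to the Slurm scheduler
```

In the Jumpstart, the run-mode command is also piped through tee output.log so that the runner's messages are kept in a file.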
Monitoring the jobs
You can use the squeue command (see the Jumpstart section above) to see the job status (running, pending, etc.). You also get two emails for each step, one at the start of the step and one at the end.
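The Jumpstart section uses squeue with a custom output format. The sketch below repeats that command and annotates the format fields; the variable name squeue_format and the check for whether squeue is installed are additions for illustration only.

```shell
# Output columns requested from squeue, in order:
#   %i job ID, %P partition, %j job name, %u user, %T job state,
#   %M time used, %l time limit, %D node count,
#   %R pending reason or node list, %S start time
# A leading ".N" right-justifies the column in a field of width N.
squeue_format="%.18i %.9P %.28j %.8u %.8T %.10M %.9l %.6D %R %S"

if command -v squeue >/dev/null 2>&1; then
    squeue -u "$USER" -o "$squeue_format"
else
    echo "squeue not found; run this on O2" >&2
fi
```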
Check job logs
You can use the command:
This command lists all the logs created by the pipeline runner. *.sh files are the Slurm scripts for each step, *.out files are the output files for each step, *.success files mean the step finished successfully, and *.failed files mean the step failed.
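As a rough sketch of how these marker files can be summarized, the snippet below counts *.success and *.failed files in a folder. It runs against a temporary stand-in folder with made-up file names; on O2 you would point it at the folder that holds the pipeline logs, which is assumed here to be the flag folder mentioned earlier. The function count_matching is hypothetical.

```shell
# Count the files in directory $1 whose names end with suffix $2.
count_matching() {
    n=0
    for f in "$1"/*"$2"; do
        # If nothing matches, the pattern stays literal and this test fails.
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Stand-in for the log folder, filled with hypothetical step files.
flag_dir="$(mktemp -d)"
touch "$flag_dir/step1.sh" "$flag_dir/step1.out" "$flag_dir/step1.success" \
      "$flag_dir/step2.sh" "$flag_dir/step2.out" "$flag_dir/step2.failed"

echo "succeeded: $(count_matching "$flag_dir" .success)"
echo "failed:    $(count_matching "$flag_dir" .failed)"
```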
Re-run the pipeline in case some jobs fail
You can rerun this command in the same folder
This command checks whether the earlier run has finished. If it has not, you are asked whether to kill the running jobs, and then whether to rerun the steps that already finished successfully. Type 'y' to rerun a step, or press the Enter key to skip it.
Call variants:
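The Jumpstart section runs this stage as a sequence of Cromwell commands: start the database, pre-process each group, set up the JSON input, run Mutect2, then stop the database. The sketch below only prints that sequence so the order is easy to see; the helper cromwell_cmd is hypothetical, and your.conf is the placeholder config name used in the Jumpstart.

```shell
cromwell_jar=/n/shared_db/singularity/hmsrc-gatk/cromwell-43.jar

# Print a Cromwell "run" command. $1 = WDL workflow file, $2 = input JSON.
cromwell_cmd() {
    printf 'java -XX:+UseSerialGC -Dconfig.file=your.conf -jar %s run %s -i %s\n' \
        "$cromwell_jar" "$1" "$2"
}

echo "runDB.sh &"    # start the database in the background
for group in group1 group2; do
    # Pre-process each group and append the output to a per-group log.
    echo "$(cromwell_cmd processing-for-variant-discovery-gatk4.wdl "unmappedBams/$group/in.json") 2>&1 | tee -a $group.log"
done
echo "setupJson.sh"
echo "$(cromwell_cmd mutect2.wdl unmappedBams/exon.json) && findVCF.sh"
echo "killall runDB.sh"    # stop the database
```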
To run the workflow on your own data
To run the workflow on your own data instead, transfer the sample sheet to your local machine following this wiki page and modify it. Then transfer it back to O2 under your account and continue from the step that builds the folder structure.
Let us know if you have any questions. Please include your working folder and commands used in your email. Any comments and suggestions are welcome!