Batch small jobs together as a big job

 

Many bioinformatics workflows run the same command on multiple files. When the files are small, the command may take only a few seconds to finish. Submitting the processing of each file as a separate job often causes the job scheduler to complain about short-running jobs.

Here we will show you how to batch multiple small jobs into a large job with an example.

You can cut and paste the commands below onto the O2 command line to get an idea of how this works.

Note: When copying/pasting commands, you can include any text starting with #. Such text is a comment and will be ignored by the shell.
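For example, the shell treats everything from a # to the end of the line as a comment, so pasting the explanatory text along with a command is harmless:

```shell
# This entire line is a comment; pasting it does nothing.
echo "hello"  # everything after the # is ignored, only the echo runs
```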

Log on to O2

If you need help connecting to O2, please review the How to login to O2 wiki page.

From Windows, use MobaXterm or PuTTY to connect to o2.hms.harvard.edu and make sure the port is set to the default value of 22.

From a Mac Terminal, use the ssh command, inserting your eCommons ID instead of user123:

ssh user123@o2.hms.harvard.edu

Start an interactive job, and create a working folder

Create a working directory on scratch3 and change into the newly-created directory. For example, for user abc123, the working directory will be

srun --pty -p interactive -t 0-12:0:0 --mem 2000M -n 1 /bin/bash
mkdir /n/scratch3/users/a/abc123/testBatchJob
cd /n/scratch3/users/a/abc123/testBatchJob

Copy some test data to the current folder

cp /n/groups/shared_databases/rcbio/rsem/two_group_input/group1/* .

Take a look at the files

ls -l *
-rw------- 1 ld32 ld32 172M Feb 11 13:26 t1_s1_1.fq
-rw------- 1 ld32 ld32 172M Feb 11 13:26 t1_s1_2.fq
-rw------- 1 ld32 ld32 178M Feb 11 13:26 t1_s2_1.fq
-rw------- 1 ld32 ld32 178M Feb 11 13:26 t1_s2_2.fq
# There are 4 fastq files.

Let us work on them one by one:

# We want to convert each of them into fasta format. The command is available in module fastx.
# You can run the command on the files one by one as shown below:
module load fastx/0.0.13
for fq in *fq; do
    echo submitting job for $fq
    sbatch -p short -t 0-0:10:0 --mem 20M --mail-type=END --wrap "fastq_to_fasta -Q33 -i $fq -o ${fq%.fq}.fa"
done

# You should see output like:
submitting job for t1_s1_1.fq
Submitted batch job 34710674
submitting job for t1_s1_2.fq
Submitted batch job 34710675
submitting job for t1_s2_1.fq
Submitted batch job 34710676
submitting job for t1_s2_2.fq
Submitted batch job 34710677

# After a few minutes, check the job reports: each job took only a few seconds to finish.
O2sacct 34710674

# Output
JobID         Partition  State      NodeList          Start                Timelimit  Elapsed   CPUTime   TotalCPU   AllocTRES               MaxRSS
------------  ---------  ---------  ----------------  -------------------  ---------  --------  --------  ---------  ----------------------  ------
34710674                 COMPLETED  compute-a-16-166  2019-02-22T07:36:51             00:00:00  00:00:00  00:02.829
34710674.ba+             COMPLETED  compute-a-16-166  2019-02-22T07:36:44             00:00:07  00:00:07  00:02.829  cpu=1,mem=0.02G,node=1  0.00G

# The jobs ran for only a few seconds each. That is not efficient for the scheduler,
# so it is better to process all 4 files in the same job.

# Or, if you prefer to use a Slurm script to submit the jobs:
module load fastx/0.0.13
for fq in *fq; do
    echo submitting job for $fq
    sbatch job.sh $fq
done

# In job.sh:
#!/bin/bash
#SBATCH -p short
#SBATCH -t 0-00:10:00
#SBATCH --mem=20M
#SBATCH --mail-type=END
fastq_to_fasta -Q33 -i $1 -o ${1%.fq}.fa
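The output file names above are built with Bash suffix removal: ${fq%.fq} strips a trailing ".fq" from the value of fq. You can check this locally with one of the test file names:

```shell
fq=t1_s1_1.fq
# ${fq%.fq} removes the shortest match of the suffix ".fq"
echo "${fq%.fq}.fa"
# prints: t1_s1_1.fa
```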

Batch them together:

# We can process all the files in a single job, running the whole loop as one command:
sbatch -p short -t 0-0:10:0 --mem 20M --mail-type=END --wrap "for fq in *fq; do fastq_to_fasta -Q33 -i \$fq -o \${fq%.fq}.fa; done"
# Notice that the variable $fq itself needs to be passed to the job, so the '$' needs to be escaped.

# If you check the job report, the job now takes about half a minute to finish.

# But what if you have 4800 files? Submitted together as a single job, they would run for about 5 hours.
# That is a little long to wait. We can divide them into 5 jobs, each processing 1000 files and running about 1 hour:
counter=0
batch=""
for file in *fq; do
    # add the file to the current batch
    batch="$batch $file"
    counter=$((counter + 1))
    # when counter is a multiple of 1000 (1000, 2000, 3000, and so on), submit the batch of files as a new job
    if (( counter % 1000 == 0 )); then
        echo submitting: $counter files: $batch
        sbatch -p short -t 0-1:0:0 --mem 20M --mail-type=END --wrap "for fq in $batch; do fastq_to_fasta -Q33 -i \$fq -o \${fq%.fq}.fa; done"
        # get ready for the next batch
        batch=""
    fi
done
# if the total number of files is not a multiple of 1000 (here it is 4800), then after the first 4000
# are submitted, 800 files still need to be processed:
[ -z "$batch" ] || { echo submitting: $counter files: $batch; sbatch -p short -t 0-1:0:0 --mem 20M --mail-type=END --wrap "for fq in $batch; do fastq_to_fasta -Q33 -i \$fq -o \${fq%.fq}.fa; done"; }

# Or, if you prefer to use a Slurm script to submit the jobs:
module load fastx/0.0.13
counter=0
batch=""
for file in *fq; do
    # add the file to the current batch
    batch="$batch $file"
    counter=$((counter + 1))
    # when counter is a multiple of 1000, submit the batch of files as a new job
    if (( counter % 1000 == 0 )); then
        echo submitting: $counter files: $batch
        sbatch job.sh "$batch"
        # get ready for the next batch
        batch=""
    fi
done
# submit any leftover files
[ -z "$batch" ] || { echo submitting: $counter files: $batch; sbatch job.sh "$batch"; }

# In job.sh:
#!/bin/bash
#SBATCH -p short
#SBATCH -t 0-01:00:00
#SBATCH --mem=20M
#SBATCH --mail-type=END
for fq in $1; do
    fastq_to_fasta -Q33 -i $fq -o ${fq%.fq}.fa
done
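To see why the '$' must be escaped inside the --wrap string, you can compare the two forms locally with echo, without submitting anything (here "process" is just a stand-in for the real command):

```shell
fq=submit_time_value
# Unescaped: the current shell expands $fq while building the string
echo "for f in *fq; do process $fq; done"
# prints: for f in *fq; do process submit_time_value; done

# Escaped: \$fq stays as the literal text $fq, so the job's own shell expands it later
echo "for f in *fq; do process \$fq; done"
# prints: for f in *fq; do process $fq; done
```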

Let us know if you have any questions, and please include your working folder and the commands used in your email. Comments and suggestions are welcome!