Batch small jobs together as a big job

Many bioinformatics workflows run the same command on multiple files. When the files are small, the command may take only a few seconds to finish. Submitting each file as a separate job often causes the job scheduler to complain about short-running jobs.

Here we use an example to show how to batch multiple small jobs into one larger job.

You can copy and paste the commands below onto the O2 command line to see how it works.

Note: When copying and pasting commands, you can include any text starting with #. The shell treats it as a comment and ignores it.

Log on to O2

If you need help connecting to O2, please review the How to login to O2 wiki page.

From Windows, use MobaXterm or PuTTY to connect to o2.hms.harvard.edu and make sure the port is set to the default value of 22.

From a Mac Terminal, use the ssh command, inserting your HMS ID instead of user123:

ssh user123@o2.hms.harvard.edu

Start an interactive job and create a working folder

Start an interactive job, then create a working directory on scratch and change into the newly created directory. For example, for user abc123:

srun --pty -p interactive -t 0-12:0:0 --mem 2000M -n 1 /bin/bash
mkdir /n/scratch/users/a/abc123/testBatchJob  
cd /n/scratch/users/a/abc123/testBatchJob

Copy some test data to the current folder

cp /n/groups/shared_databases/rcbio/rsem/two_group_input/group1/* .

Take a look at the files

ls -lh *
-rw------- 1 ld32 ld32 172M Feb 11 13:26 t1_s1_1.fq
-rw------- 1 ld32 ld32 172M Feb 11 13:26 t1_s1_2.fq
-rw------- 1 ld32 ld32 178M Feb 11 13:26 t1_s2_1.fq
-rw------- 1 ld32 ld32 178M Feb 11 13:26 t1_s2_2.fq


# There are 4 fastq files.

Let us work on them one by one:

# We want to convert each of them to FASTA format. The command is available in the fastx module.
# You can submit one job per file, as shown below:
module load fastx/0.0.13

for fq in *fq; do
    echo submitting job for $fq
    sbatch -p short -t 0-0:10:0 --mem 20M --mail-type=END --wrap "fastq_to_fasta  -Q33 -i $fq -o ${fq%.fq}.fa"
done


# you should see output:
submitting job for t1_s1_1.fq
Submitted batch job 34710674

submitting job for t1_s1_2.fq
Submitted batch job 34710675

submitting job for t1_s2_1.fq
Submitted batch job 34710676

submitting job for t1_s2_2.fq
Submitted batch job 34710677

# After a few minutes, if you check the job reports, you will see that each job took only a few seconds to finish.
O2sacct 34710674

# Output
       JobID  Partition          State               NodeList                Start      Timelimit        Elapsed    CPUTime   TotalCPU                 AllocTRES     MaxRSS 
------------ ---------- -------------- ---------------------- -------------------- -------------- -------------- ---------- ---------- ------------------------- ---------- 
34710674                     COMPLETED       compute-a-16-166  2019-02-22T07:36:51                      00:00:00   00:00:00  00:02.829                                      
34710674.ba+                 COMPLETED       compute-a-16-166  2019-02-22T07:36:44                      00:00:07   00:00:07  00:02.829    cpu=1,mem=0.02G,node=1      0.00G 

# Each job runs for only a few seconds, which is inefficient for the scheduler, so it is better to process all 4 files in the same job.


# Or, if you prefer to use a Slurm script to submit the jobs:
module load fastx/0.0.13
for fq in *fq; do
    echo submitting job for $fq
    sbatch job.sh $fq 
done

# In job.sh: 
#!/bin/bash
#SBATCH -p short
#SBATCH -t 0-00:10:00
#SBATCH --mem=20M
#SBATCH --mail-type=END

# $1 is the fastq file name given on the sbatch command line
fastq_to_fasta  -Q33 -i "$1" -o "${1%.fq}.fa"
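
The output names in the commands above come from the shell parameter expansion ${fq%.fq}, which removes a trailing .fq from the value of fq before .fa is appended. The sketch below only prints names and does not run fastq_to_fasta. The function name fa_name is made up for this illustration and is not part of fastx or O2.

```shell
# fa_name is a hypothetical helper, used here only to illustrate the expansion.
# It prints the output name that ${fq%.fq}.fa produces for a given input name.
fa_name() {
    fq=$1
    # ${fq%.fq} strips a trailing ".fq" if present; ".fa" is then appended.
    echo "${fq%.fq}.fa"
}

# Try it on two of the test file names.
for f in t1_s1_1.fq t1_s2_2.fq; do
    echo "$f -> $(fa_name "$f")"
done
```

If a name does not end in .fq, nothing is stripped and .fa is simply appended, so it is worth checking that the pattern matches your own file names.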

Batch them together:

# We can process all the files in a single job by submitting the whole loop as one command:
sbatch -p short -t 0-0:10:0 --mem 20M --mail-type=END --wrap "for fq in *fq; do fastq_to_fasta  -Q33 -i \$fq -o \${fq%.fq}.fa; done"

# Note that the variable $fq must be expanded inside the job, not by the shell that submits it, so each '$' is escaped with a backslash.

# If you check the job report, the job now takes about half a minute to finish.
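
To see why the backslash matters, the sketch below uses sh -c as a stand-in for the shell that runs the wrapped command inside the job; no job is submitted. Inside double quotes, an unescaped $fq is replaced by the submitting shell before the command string is handed over, while \$fq arrives intact and is expanded later, inside the loop. The file names a.fq, b.fq and leftover.fq are made up for this illustration.

```shell
# Value of fq in the submitting shell (for example, left over from an earlier loop).
fq=leftover.fq

# Unescaped: the outer shell substitutes $fq first, so the inner loop echoes the stale value.
unescaped=$(sh -c "for fq in a.fq b.fq; do echo $fq; done")

# Escaped: \$fq reaches the inner shell as a literal $fq and takes the loop value there.
escaped=$(sh -c "for fq in a.fq b.fq; do echo \$fq; done")

echo "unescaped:"
echo "$unescaped"
echo "escaped:"
echo "$escaped"
```

The unescaped version repeats the stale name once per loop iteration, while the escaped version lists the loop's own file names.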

# But what if you have 4800 files? Submitted together as a single job, they would run for about 5 hours.
# That is a little long to wait. We can divide them into 5 jobs of up to 1000 files each, so that each job runs for about 1 hour:
# start with an empty batch and a zero counter
batch=""
counter=0
for file in *fq; do

    # put the file into the batch
    batch="$batch $file"
    counter=$((counter + 1))

    # when the counter is a multiple of 1000 (1000, 2000, 3000, and so on), submit the batch of files as a new job
    if (( counter % 1000 == 0 )); then
        echo "submitting: $counter files: $batch"
        sbatch -p short -t 0-1:0:0 --mem 20M --mail-type=END --wrap "for fq in $batch; do fastq_to_fasta  -Q33 -i \$fq -o \${fq%.fq}.fa; done"

        # get ready for the next batch
        batch=""
    fi
done

# If the total number of files is not a multiple of 1000 (4800 in this example), 800 files still need to be processed after the first 4000 are submitted.
# The leftover batch can be almost as large as a full one, so it keeps the same 1-hour time limit.
[ -z "$batch" ] || { echo "submitting: $counter files: $batch"; sbatch -p short -t 0-1:0:0 --mem 20M --mail-type=END --wrap "for fq in $batch; do fastq_to_fasta  -Q33 -i \$fq -o \${fq%.fq}.fa; done"; }
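
To check the grouping before submitting anything, the same counter logic can be tried as a dry run. The sketch below assumes a hypothetical helper name, print_batches, and uses a batch size of 2 with five made-up file names. It only prints what each job would receive and never calls sbatch.

```shell
# print_batches is a hypothetical helper for illustration; it prints one line per batch
# instead of submitting a job. Usage: print_batches SIZE FILE...
print_batches() {
    size=$1
    shift
    batch=""
    counter=0
    for file in "$@"; do
        batch="$batch $file"
        counter=$((counter + 1))
        # A full batch is ready whenever the counter reaches a multiple of SIZE.
        if [ $((counter % size)) -eq 0 ]; then
            echo "job:$batch"
            batch=""
        fi
    done
    # Files left over when the total is not a multiple of SIZE form a final, smaller batch.
    [ -z "$batch" ] || echo "job:$batch"
}

# Five files in batches of two: two full batches and one leftover batch.
print_batches 2 a.fq b.fq c.fq d.fq e.fq
```

With 4800 files and a batch size of 1000, the same logic yields five batches: four of 1000 files and one of 800.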


# Or, if you prefer to use a Slurm script to submit the jobs:
module load fastx/0.0.13
# start with an empty batch and a zero counter
batch=""
counter=0
for file in *fq; do

    # put the file into the batch
    batch="$batch $file"
    counter=$((counter + 1))

    # when the counter is a multiple of 1000 (1000, 2000, 3000, and so on), submit the batch of files as a new job
    if (( counter % 1000 == 0 )); then
        echo "submitting: $counter files: $batch"
        sbatch job.sh "$batch"

        # get ready for the next batch
        batch=""
    fi
done

# If the total number of files is not a multiple of 1000 (4800 in this example), 800 files still need to be processed after the first 4000 are submitted.
[ -z "$batch" ] || { echo "submitting: $counter files: $batch"; sbatch job.sh "$batch"; }


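# In job.sh:
#!/bin/bash
#SBATCH -p short
#SBATCH -t 0-01:00:00
#SBATCH --mem=20M
#SBATCH --mail-type=END

# $1 holds the whole batch as one space-separated string; it is left unquoted so the shell splits it into file names
for fq in $1; do
    fastq_to_fasta  -Q33 -i "$fq" -o "${fq%.fq}.fa"
done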
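
job.sh receives the whole batch as one quoted argument, and the unquoted $1 in the for loop is what splits it back into file names on whitespace. The sketch below imitates that with a hypothetical function, show_batch, which prints the command for each file instead of running it. It assumes, as the batching loop does, that file names contain no spaces.

```shell
# show_batch is a hypothetical stand-in for job.sh: it prints the command for each file
# in the batch instead of running fastq_to_fasta.
show_batch() {
    # $1 is deliberately unquoted so the shell splits it on whitespace.
    for fq in $1; do
        echo "fastq_to_fasta -Q33 -i $fq -o ${fq%.fq}.fa"
    done
}

# The batch is passed as ONE argument, the same way the submit loop passes "$batch" to job.sh.
show_batch " t1_s1_1.fq t1_s1_2.fq"
```

Running this before submitting is a quick way to confirm that each file in a batch gets its own command line.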

Let us know if you have any questions. Please include your working folder and the commands you used in your email. Any comments and suggestions are welcome!