Welcome to the fourth day of the FAS Informatics Bioinformatics Tips and Tricks Workshop!
If you’re viewing this file on the website, you are viewing the final, formatted version of the workshop. The workshop itself will take place in the RStudio program and you will edit the file while executing code in the terminal. Please download the raw file here.
Today you’ll learn more about how to write scripts, control the behavior of your scripts using loops and conditional statements, and more!
Setting up our link to the data files
As with yesterday, we’ll create a symbolic link (analogous to a Windows shortcut) in your current working directory called “data4” that points to the workshop data directory.
Run the following commands in the terminal window to create links to the files in our data directory in your own data directory:
mkdir -p data4
ln -s -f /n/holylfs05/LABS/informatics/Everyone/workshop-data/biotips-2024/day4/* data4
# ln: The Unix link command, which can create shortcuts to folders and files at the provided path to the second provided path
# -s: This option tells ln to create a symbolic link rather than a hard link (original files are not changed)
# -f: This option forces ln to create the link
ls -l data4
# Show the details of the files in the new linked directory
Part 1: The For loop
We introduced scripting yesterday. If you need a refresher on how we
developed our first script, the data4
folder contains different
iterations of “snp-density-*.sh” that step through the process.
Why we might want loops
Loops are useful if you want to repeat the same command multiple times,
while changing something specific about that command each time. For
example, on day 3 we counted the number of genes in the macaque
annotation .gff
file, but what if we wanted to know the number of
genes in all the gff files across our data folders? Or what if we wanted
to extract the different regions of the genome, like introns or exons
from a gff
file? Today we’ll learn how to do those things using the
“For” loop, which executes a sequence of commands once for each item in
a list of items.
Run the following code block by copy-pasting it to your terminal to see a for loop in action. You can also write out the code in the terminal, and note how the prompt changes as you type a multi-line command.
# a simple loop
for i in 1 2 3
do
echo $i
done
# for: loops over a list of values
# i: this is the variable name of the current value in the list and can be used within the loop
# in: a keyword that precedes the list of values to loop over
Note: bash also has
while
loops that execute the loop body while a condition evaluates to true.
How to build a loop
The components of a loop are:
- The
for
keyword - A variable name (can be any valid shell variable name, but try to make it descriptive and not misleading)
- The
in
keyword - A list of elements to iterate over (there are many ways to generate this list, which we’ll cover later)
- The
do
keyword - The commands to run in the loop (indenting this section is optional, but makes it easier to read)
- The
done
keyword
The for
loop will iterate over the elements in the list, assigning
each element to the variable name in turn. The commands in the loop will
be run once for each element in the list. The “\$” sign is used to
access the value of the variable. After the for
loop, the variable
will subsequently retain its value from the last iteration of the loop.
What is the thought process that goes into writing a loop? First, we start with a command that we want to repeat multiple times. Then, we think about what we want to change each time we run the command. This is the variable. Then, we think about what values we want to assign to the variable. This is the list of elements the loop will iterate over. Finally, we put it all together.
When first testing out a loop, it’s a good idea to make use of the
echo
command to see what your loop will do before it actually does it.
Run the below code to see how echo can be used to do a trial run of a loop.
for i in 1 2 3
do
echo "some_command $i | another_command > $i-result.txt"
done
# for: loops over a list of values
# i: this is the variable name of the current value in the list and can be used within the loop
# in: a keyword that precedes the list of values to loop over
Let’s find out how many exercises there are in each RMarkdown document (student version).
Exercise: Using the steps above, write a loop that will print out how many exercises there are in each Rmarkdown document
data4/Biotips-workshop-2023-Day1-student.Rmd
,data4/Biotips-workshop-2023-Day2-student.Rmd
, anddata4/Biotips-workshop-2023-Day3-student.Rmd
. You can use thegrep -c
command to count the number of matches to “Exercise” in a file. When iterating over file name, it is a good idea to encase the variable name in double quotes so that any weird characters such as spaces in the file name are parsed correctly. Write your solution in the code block below.
# The command we want to repeat is: grep -c 'Exercise'
# The variable we want to change each time we run the command is: filename
# The values we want to assign to the variable are:
# data3/Biotips-workshop-2024-Day1-student.Rmd
# data3/Biotips-workshop-2024-Day2-student.Rmd
# data3/Biotips-workshop-2024-Day3-student.Rmd
# If we're unsure how it'll go, we can put echo in front of the command
# Put your solution here:
# The command we want to repeat is: grep -c 'Exercise'
# The variable we want to change each time we run the command is: filename
# The values we want to assign to the variable are:
# data3/Biotips-workshop-2024-Day1-student.Rmd
# data3/Biotips-workshop-2024-Day2-student.Rmd
# data3/Biotips-workshop-2024-Day3-student.Rmd
# If we're unsure how it'll go, we can put echo in front of the command
# Put your solution here:
for filename in data3/Biotips-workshop-2024-Day1-student.Rmd data3/Biotips-workshop-2024-Day2-student.Rmd data3/Biotips-workshop-2024-Day3-student.Rmd
do
grep -c 'Exercise' "$filename"
done
# for: loops over a list of values
# i: this is the variable name of the current value in the list and can be used within the loop
# in: a keyword that precedes the list of values to loop over
Note: A note on using echo to test your code. Run the following two code chunks and compare the result:
for filename in data3/Biotips-workshop-2024-Day2-student.Rmd
do
echo "grep -c 'Exercise' \"$filename\" > $filename-count.txt"
done
for filename in data3/Biotips-workshop-2024-Day2-student.Rmd
do
echo grep -c 'Exercise' "$filename" > $filename-count.txt
done
What happened here? In the first version, the full command was enclosed
in double quotes, and treated as a string to print to the terminal (note
the \"
to escape the inner double quotes, and treat them as literal
"
characters within the string instead of terminating / starting a
string). In the second version, the echo
was only applied to the
grep
, the output of which was then redirected (with >
) into the file
data3/Biotips-workshop-2024-Day2-student.Rmd-count.txt
. You can go to
your data4/
folder in the Files pane and view that text file. When
using echo to test your code, be mindful of which part of your code you
are echoing.
Now let’s figure out a command to get the exercises themselves. Sometimes it’s easier to build out a command outside the loop, working on a single file at a time, so let’s start with just finding the exercises for the second day’s file.
Exercise: Awk can be used to capture multiple lines based on a pattern. The syntax for that is
awk /start/,/end/ file-to-match
. Write a command that will print out the full exercise prompts indata3/Biotips-workshop-2024-Day2-student.Rmd
. What should be your “start” and what should be your “end” in the awk command? Hint: there is an empty line after every exercise prompt. Write your solution in the code block below.
## An awk command that will print out all the exercises in the file data4/Biotips-workshop-2023-Day2-student.Rmd
awk '/Exercise/,/^`/' data3/Biotips-workshop-2024-Day2-student.Rmd
# OR
awk '/Exercise/,/^$/' data3/Biotips-workshop-2024-Day2-student.Rmd
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
Exercise:
Incorporate the working
awk
command into a for loop, iterating over the same.Rmd
files as before.Use
>>
to append the output to a file calledexercises.txt
. Where should you add this code? Write your solution in the code block below.
## A for loop that extracts all the exercises from the RMarkdown files and appends them to a file called exercises.txt
for filename in data3/Biotips-workshop-2024-Day1-student.Rmd data3/Biotips-workshop-2024-Day2-student.Rmd data3/Biotips-workshop-2024-Day3-student.Rmd
do
awk '/^> \*\*Exercise/,/^$/' "$filename" >> exercises.txt
done
# for: loops over a list of values
# filename: this is the variable name of the current value in the list and can be used within the loop
# in: a keyword that precedes the list of values to loop over
for filename in data3/*.Rmd
do
awk '/^> \*\*Exercise/,/^$/' "$filename" >> exercises.txt
done
Now you can see how useful loops can be to process multiple files at once!
Adding wildcards to your loop specifications
Enumerating all the files in your for loop can get tedious and messy. Since some of our files all share the same extension, we can use wildcards to specify all the files that match that extension. Wildcards get processed before the loop is run so it essentially generates the list of files the loop iterates over.
Run the following code to see how wildcards work in a for loop:
for filename in data2/*.vcf
do
echo $filename
done
# for: loops over a list of values
# filename: this is the variable name of the current value in the list and can be used within the loop
# in: a keyword that precedes the list of values to loop over
Note: if the wildcard pattern does not match any existing path (file or directory) name, the pattern will not expand into a list, resulting in a string containing a literal
*
. E.g., the following results in “data2/*.bcf” being echoed:
for filename in data2/*.bcf
do
echo $filename
done
# for: loops over a list of values
# filename: this is the variable name of the current value in the list and can be used within the loop
# in: a keyword that precedes the list of values to loop over
Exercise: Create a for loop that prints out the path for all the
.Rmd
files we processed earlier. Write your solution in the code block below.
## A for loop that prints out the path for all the .Rmd files we processed earlier
for filename in data3/*student.Rmd
do
echo $filename
done
# for: loops over a list of values
# filename: this is the variable name of the current value in the list and can be used within the loop
# in: a keyword that precedes the list of values to loop over
Part 2: Using a loop in a script
Last time, we practiced using a for loop in the command line to iterate over three .Rmd files and find the number of exercises in each. Today, we will start writing loops within scripts for added reproducibility and portability. In this section we’ll walk you through how a script comes together conceptually. It starts with an idea for a single command that gradually gets too complicated to keep track of in your head.
We have previously used the bedtools makewindows
function to generate
a bed file with the coordinates of the 100 kilobase windows in the
scaffold for the Amazon molly genome. We can also generate bed files
for different subsets of the annotation we have for the Amazon molly:
coding exons, introns, intergenic regions, UTRs. For example, to extract
the exons from a gff
file, we can use the following awk
command:
awk 'BEGIN{OFS="\t"}; $3=="exon"{print $1, $4-1, $5}' data3/poeFor_NW_006799939.gff > exon.bed
# awk: A command line scripting language command
# '' : Within the single quotes is the user defined script for awk to run on the provided file
# > : The Unix redirect operator to write the output of the command to the following file
We can then repeat the command and substitute “exon” in both the awk and
output bed for “CDS”, “mRNA”, or “gene”, for a total of 4 different bed
files. We could just copy and paste the command and change the file
name, but that’s not very efficient. This is a good candidate for a
loop! In other bash commands, we’ve been able to use $
to denote a
shell variable that we wanted to switch. However, this won’t work inside
an awk command. This is because awk
by default doesn’t have access to
shell variables, but we can pass them to an awk
script with the -v
option. The syntax below…
SHELL_VALUE="query"
awk -v variable_inside_awk=$SHELL_VALUE 'variable_inside_awk{print}' file-to-awk
… will become the following when it’s parsed and executed:
awk 'query{print}' file-to-awk
Exercise: Write a loop to create the 4 different bed files, but
echo
the command since we won’t be running it just yet. Write your solution in the code block below.NOTE: When using
echo
to display anawk
command, some things may not display exactly how you think they should. This is becauseawk
andbash
(the terminal program) share some of the same syntax. For example, bothawk
andbash
use the$
to indicate variable names. When evaluating anawk
command, the terminal knows to use the variable withinawk
. However, when usingecho
to display theawk
command, the$
is interpreted bybash
instead. This can lead to unexpected behavior if, for instance, yourawk
command uses$3
. If$3
does not exist in yourbash
environment an empty string will be displayed byecho
, even though the command would evaluate correctly inawk
.We can get around this by escaping these shared syntactic characters using the
\
symbol. For example,$3
will try to be evaluated bybash
, but\$3
will tell bash to instead literally use the “\$” character.Keep this in mind when you display your commands with
echo
.
for TYPE in exon CDS mRNA gene
do
echo "awk -v type=$TYPE 'BEGIN{OFS=\"\t\"}; \$3==type {print \$1, \$4-1, \$5}' data3/poeFor_NW_006799939.gff > $TYPE.bed"
done
# for: loops over a list of values
# TYPE: this is the variable name of the current value in the list and can be used within the loop
# in: a keyword that precedes the list of values to loop over
Since we’re creating four bed files, let’s organize them by putting
them all in a folder called poeFor_beds
. So we want to add a command
for making a folder before running this for loop using
mkdir -p poeFor_beds
. Now that the command is more than just one line,
we should probably put it in a script. In RStudio create a new script
from File –> New File –> Shell script. Name it make-beds.sh
. Now
copy and paste what we have so far into the script. Also add on
| bedtools sort -i - | bedtools merge
before the redirect to the
.bed
file. This will sort the bed file and merge overlapping
intervals. You new script should look like the following:
Exercise: Write a script called
make-beds.sh
that redirects the output of the for loop into a new directory calledpoeFor_beds
. Also add in the sorting and merging of overlapping intervals. Write your solution in the code block below (and also modify it in your bash script).
#!/bin/bash
mkdir -p poeFor_beds
for TYPE in exon CDS mRNA gene
do
awk -v type=$TYPE 'BEGIN{OFS="\t"}; $3==type {print $1, $4-1, $5}' data3/poeFor_NW_006799939.gff | bedtools sort -i - | bedtools merge > poeFor_beds/$TYPE.bed
done
# for: loops over a list of values
# TYPE: this is the variable name of the current value in the list and can be used within the loop
# in: a keyword that precedes the list of values to loop over
chmod +x make-beds.sh
./make-beds.sh
## Run the make-beds.sh script
Now we can run the script by typing bash make-beds.sh
in the terminal.
You should see 4 new bed files in the poeFor_beds
folder. Great!
But what if we want to run this script on a different gff file? We can
add a variable for the gff file name and change the script to use that
variable instead of the hard-coded file name.
Exercise: Recall what we learned about running bash scripts with variables in the previous session. Add a variable for the gff file name and change the script to use that variable instead of the hard-coded file name. Modify your bash script with your code (and also write it in the code block below if you want).
#!/bin/bash
GFF=$1
mkdir -p poeFor_beds
for TYPE in exon CDS mRNA gene
do
awk -v type=$TYPE 'BEGIN{OFS="\t"}; $3==type {print $1, $4-1, $5}' $GFF | bedtools sort -i - | bedtools merge > poeFor_beds/$TYPE.bed
done
# for: loops over a list of values
# TYPE: this is the variable name of the current value in the list and can be used within the loop
# in: a keyword that precedes the list of values to loop over
Now we can run the script by typing
bash make-beds.sh data3/poeFor_NW_006799939.gff
in the terminal. Use
ls poeFor_beds
to confirm that there are 4 bed files in that folder.
Note that the previous bed files were overwritten! Always use caution
when running your scripts. In this case, we are fine with the previous
files being overwritten.
We are almost done; let’s make a few more using bedtools complement
,
which reports all regions of the genome NOT in an interval, and
bedtools subtract
, which removes overlaps. For example, we don’t have
an intergenic
feature in our GFF, but we can create one by getting
everything that is not a gene. And we don’t have a UTR
feature or an
intron
feature, but we can get them by taking our exons and removing
coding sequences (CDS) or by taking genes and removing exons.
Exercise: Add the following lines to the end of your
make-bed.sh
script and re-run the script.
bedtools complement -i poeFor_beds/gene.bed -g data3/poeFor_NW_006799939.fasta.fai > poeFor_beds/intergenic.bed
# bedtools: A suite of programs to process bed files
# complement: The sub-program of bedtools to execute
# -i: The input bed file
# -g: A genome size file, which needs two columns, the scaffold/chromosome name and the size of that feature
# > : The Unix redirect operator to write the output of the command to the following file
bedtools subtract -a poeFor_beds/gene.bed -b poeFor_beds/exon.bed | bedtools sort -i - | bedtools merge > poeFor_beds/intron.bed
bedtools subtract -a poeFor_beds/exon.bed -b poeFor_beds/CDS.bed | bedtools sort -i - | bedtools merge > poeFor_beds/UTR.bed
# Two commands that subtract bed intervals in one file (-b) from another (-a) and pipe both with | and - into other bedtools commands and
# finally redirect to an output bed file.
Run wc -l poeFor_beds/*.bed
to see how many lines are in each bed
file. Note that here we have run a command on multiple files without an
explicit loop, thanks to regex!
Looping over lists of files
In this section we’ll work on another script to continue our analysis of
the Amazon molly genome. We’ll write a script that will calculate the
SNP density for each of the different bed files we created in the
previous section. We already have code for this from the previous
session, which is reproduced below. (If you don’t have the file already,
create a file named snp-density.sh
and copy the below code into it.):
#!/bin/bash
BED=$1
VCF=$2
bedtools intersect -c -a $BED -b $VCF | awk 'BEGIN{snps=0; lens=0} {snps+=$4; lens+=$3-$2} END{print snps/lens}'
In order to run this, we have to specify a bed file and a VCF file as command line arguments.
Try running this script in your terminal using the command:
It’s not bad, but it still requires a lot of typing if we want to run
this on the all the .bed
files. Plus, we want to run this and
essentially generate a report summarizing the results all in one place.
To better wrap my brain around what kind of code I need, I like to write
what I want in plain English first, then in pseudocode, and then in
real code. Pseudocode is a programming word meaning a sketch of code
that. It should be structured like the code you intend the write, but
without worrying about grammar or syntax. Pseudocode is useful for
creating a template that gets filled in with real code. It can also help
identify logical flow errors and prevent laborious bug-fixing.
Here’s my plain English version:
“I want to loop over the an arbitrary list of bed files and run the snp-density-4.sh script on each one, and then print out the results in a file called snp-densities.txt. The script should be run like bash script-name.sh list-of-bed-files.txt vcf-file.vcf > snp-densities.txt”
Here’s my pseudocode (I didn’t necessarily write this up linearly, but it’s presented here in a linear fashion):
bed_list=a list of bed files
vcf_file=a vcf file
for each bed_file in the bed_list:
print out "the snp density for <bed_file> is:"
run bedtools intersect on the bed_file and the vcf_file then pipe it to the awk statement to get the snp density
print out "----" as a spacer
Now let’s start translating the pseudocode into real code. We’ll start with the for loop. We already know how to write a for loop, but we need to figure out how to get the list of bed files. How would you get a list (in the form of a text file) of the bed files we need process?
Exercise: Write a shell command (not script, this is too simple for a script) to get a list of the bed files we need to process and write it to
bed_files.txt
. Write your solution in the code block below. Usewc -l bed_files.txt
to check the number of.bed
files. Put your command in the code block below.
We’ve got the list of bed files we want. Let’s write out the skeleton of
the loop. Before when we were assembling the list of things to loop
over, we simply typed them out (e.g. for TYPE in exon CDS mRNA gene;
).
However, now we want to get the output from a command, in this case
cat
to get the lines of the file and then pipe that to the for loop.
We can do this by storing the output of that command in a shell variable
and looping over that with this syntax:
for VAR in $(cat FILENAME)
This says that we’re going to run the cat
command on our file, but
instead of displaying the contents of the file to the screen, we’re
going to substitute its output in our script using the $()
(command
substitution) syntax. In this case, we’re not storing the output of
cat
in a named variable (e.g. x=$(cat FILENAME)
), but instead using
it directly in our for
loop. Now, in our loop, each line in FILENAME
will be iterated over and stored as VAR
.
Exercise: Write a for loop that iterates over each bed file in the
bed_files.txt
file and prints out the name of the bed file. Write your solution in the code block below.
## A for loop that prints out the name of the files in bed_files.txt
for BED in $(cat bed_files.txt)
do
echo $BED
done
# for: loops over a list of values
# BED: this is the variable name of the current value in the list and can be used within the loop
# in: a keyword that precedes the list of values to loop over
# $(cat bed_files.txt): this runs the cat command on the bed_files.txt file and uses it as input to the loop
Now we have all the parts we need to create our script.
Exercise: Write a script called
snp-density-5.sh
that takes as input a file of bed files and a vcf file and finds the snp density of each bed file within the vcf. Write your solution in the code block below (and also modify it in your bash script). I’ve copied over the pseudocode to help you out.
# bed_list=a list of bed files
# vcf_file=a vcf file
# for each bed_file in the bed_list:
# print out "the snp density for <bed_file> is:"
# run bedtools intersect on the bed_file and the vcf_file then pipe it to the awk statement to get the snp density
# print out "----" as a spacer
#!/bin/bash
BEDFILES=$1
VCF=$2
for BEDFILE in $(cat $BEDFILES)
do
echo "SNP density for $BEDFILE:"
bedtools intersect -c -a $BEDFILE -b $VCF | awk 'BEGIN{snps=0; lens=0} {snps+=$4; lens+=$3-$2} END{if(lens > 0){print snps/lens}}'
echo "---"
done
# for: loops over a list of values
# BEDFILE: this is the variable name of the current value in the list and can be used within the loop
# in: a keyword that precedes the list of values to loop over
# $(cat bed_files.txt): this runs the cat command on the bed_files.txt file and uses it as input to the loop
Use
chmod +x snp-density-5.sh
to make the script executable. Then let’s run the script in the terminal:
You should see a file called snp-densities.txt
in your data3 folder.
Let’s take a look at it using cat snp-densities.txt
. Congratulations!
You’ve made a working script that can be reused or just kept as a record
of how you processed your data. Can you think of ways in which the
script might be improved?
Part 3: Conditional statements
if statements
In this section, we’ll get into some more advanced bash scripting concepts. We’ll learn how to use conditional statements to control the flow of our scripts. Conditional statements can allow your script to make decisions based on the results of a command or the value of a variable, making it more responsive. Here is a simple example of a conditional statement:
FILE=bed_files.txt
if [[ -f "$FILE" ]]
then
echo "File $FILE exists."
fi
# if: a conditional statement which evaluates the expression contained in [[]] as true or false and executes the subsequent lines only if true
This conditional statement is made up of the following components:
- The
if
keyword - A command to execute (in this case, a conditional expression
enclosed in
[[
]]
). - The
then
keyword - The commands to run if the
if
command’s exit status is zero - The
fi
keyword
The if
keyword starts the conditional. The conditional expression
you’re checking is enclosed in the double square brackets. In this case,
the -f
checks if a file exists and the expression evaluates to true if
it does. The then
keyword starts the commands that need to be run if
the statement is true. Here we’re just printing to the terminal a string
that confirms that the file named $FILE
exists. The fi
keyword ends
the conditional. If the file doesn’t exist, nothing happens.
Note: you may see scripts that use
[
instead of[[
for conditional expressions. This older syntax is standardized and more portable to other (non-bash) shells, though it supports only a subset of conditional expressions supported by[[
.
An exit status is a small integer value returned returned by a command
to its caller upon completion. The if
statement interprets an exit
status of 0 for its command to mean “true”, and non-zero to mean
“false”. The [[
command exits with status 0 if its conditional
expression evaluates to true; otherwise, it exits with status 1.
The shell variable
$?
contains the exit status of the last command executed. Try copying/pasting each of the following lines individually, noting the exit status echoed to the screen for each (note the use of semicolons instead of newlines to separate statements for brevity):
true; echo $?
false; echo $?
[[ -f bed_files.txt ]]; echo $?
[[ -f bed_files.txt.nonexistent ]]; echo $?
As observed in the above code block, the true
and false
shell
commands silently exit with status 0 and non-zero, respectively. For
other commands, a zero exit status indicates “success”, and a non-zero
exit status indicates an error occurred.
if ls
then
echo "ls succeeded (exit status 0)!"
fi
if ls some-nonexistent-file
then
echo "should not see this!"
fi
# Note the use of the ! ("not") operator to negate the exit status of the command:
if ! ls some-nonexistent-file
then
echo "uh oh, ls failed!"
fi
Some commands assign different exit statuses to different error
conditions; see the program’s documentation for details (e.g., issue
man ls
and check the “Exit status” section near the bottom).
Exercise: Write a conditional statement that checks if the file
data3/poeFor_NW_006799939.vcf
exists and prints out the header. Write your solution in the code block below.
## Write a conditional statement that checks if a VCF file exists and prints out the header
FILE=data3/poeFor_NW_006799939.vcf
if [[ -f "$FILE" ]]
then
bcftools view -h "$FILE"
fi
# if: a conditional statement which evaluates the expression contained in [[]] as true or false and executes the subsequent lines only if true
There are some particularly useful conditional tests built in to bash that let you check file properties (see the complete list here):
-e filename
returns True if filename
exists as a file or directory or symlink
-f filename
returns True if filename
exists and is a regular file, not a
directory or symlink
-s filename
returns True if filename
exists and has a size > 0
-d filename
returns True if filename
is a directory
You can also check string properties:
-z string
returns True if string
is empty or null (0-length)
-n string
returns True if string
is not empty/null
Why we might want conditionals
You might want to process a file only if it exists, or you might want to
run different commands depending on the size or number of files. In a
larger workflow, you can imagine that sometimes, you’re cleaning a FASTQ
file and all the reads get thrown out leaving you without an output
cleaned FASTQ. You might want to be able to check for that condition and
have the script report it to you and skip the subsquent steps. For the
purposes of this workshops, we’re going to be using conditionals for
basic error checking in our snp-density-5.sh
script. To do that, we
will need to introduce if-else statements.
if-else statements
The if-else statement is similar to the if statement, but it has an additional set of commands to run if the boolean value is false. Here’s an example:
FILE="data3/poeFor_NW_006799939.vcf"
if [[ -f "$FILE" ]]; then
echo "$FILE exists."
else
echo "$FILE does not exist. Skipping subsequent steps."
fi
# if: a conditional statement which evaluates the expression contained in [[]] as true or false and executes the subsequent lines only if true
# else: a conditional statement which follows an if statement and executes the subsequent lines only if that statement was false
Exercise: Modify your
snp-density-5.sh
script to include an if-else statment so that it checks if the.bed
file exists. If the file exists, the script should continue. If the file does not exist, the script should print a message saying it doesn’t exist. Test your script by modifyingbed_files.txt
to include a file that doesn’t exist. Write your solution in the code block below (and also modify it in your bash script).
#!/bin/bash
BEDFILES=$1
VCF=$2
for BEDFILE in $(cat $BEDFILES)
do
if [[ -f "$BEDFILE" ]]
then
echo "SNP density for $BEDFILE:"
bedtools intersect -c -a $BEDFILE -b $VCF | awk 'BEGIN{snps=0; lens=0} {snps+=$4; lens+=$3-$2} END{if(lens > 0){print snps/lens}}'
echo "---"
else
echo "$BEDFILE does not exist. Skipping this file."
fi
done
This is fine, but we aren’t checking if the VCF file exists as well. We
want to modify the script to check if the VCF file exists and then exit
the script if it doesn’t. We can do this using the exit
command. The
exit
command exits the script and returns a status code. By
convention, a status code of 0 means the script exited successfully and
any other number means there was an error. Here is an example of using
an exit command:
Copy the below code into a file named
exit.sh
, make it executable withchmod +x exit.sh
and run it withbash exit.sh
to see how the exit command works. (Do not copy and paste it directly into terminal or you will exit the current terminal session and respawn a new session):
#!/bin/bash
for i in 1 2 3 4 5
do
if [[ $i -eq 3 ]]
then
echo "Exiting because i is $i"
exit 1
else
echo "i is $i"
fi
done
Exercise: Modify your
snp-density-5.sh
script to check if the VCF file exists and exit the script if it doesn’t. Test your script by giving running it likebash snp-density-5.sh bed_files.txt fake_vcf.vcf
Write your solution in the code block below (and also modify it in your bash script). Hint: you may want to check if the VCF does NOT exist, which will save you an else statement. Also think about where you want to place this if statement to reduce the number of times the script performs the check.
#!/bin/bash
BEDFILES=$1
VCF=$2
if [[ ! -f "$VCF" ]]
then
echo "$VCF does not exist. Exiting script."
exit 1
fi
for BEDFILE in $(cat $BEDFILES)
do
if [[ -f "$BEDFILE" ]]
then
echo "SNP density for $BEDFILE:"
bedtools intersect -c -a $BEDFILE -b $VCF | awk 'BEGIN{snps=0; lens=0} {snps+=$4; lens+=$3-$2} END{if(lens > 0){print snps/lens}}'
echo "---"
else
echo "$BEDFILE does not exist. Skipping this file."
fi
done
Putting it all together for a robust script
We’ve come a long way since the our first one-liner script
snp-density-1.sh
! Let’s copy snp-density-5.sh
to a new version
called snp-density-final.sh
where we’ll put on the finishing touches
that make the script easier to use, read, and monitor.
Using if statements to print helpful usage messages
Imagine you come back to this script 6 months later, vaguely recalling
that by running it, you can generate a report on SNP densities. However,
you don’t remember exactly how it works so you just naively run
bash snp-density-final.sh
. You just get the uninformative error below:
does not exist. Exiting script.
To save future you some confusion, we can use an if statement to print
out how the script is to be used if no arguments are given to it. The
syntax for checking if a variable is empty is -z $VARIABLE
.
Exercise: Use and
if
statement to check whether bothBEDFILE
andVCF
are given as arguments. Modify your script to print out a usage message if they are not and exit the script. If they are, then proceed with the rest of the script. Write your solution in the code block below (and also modify it in your bash script).
#!/bin/bash
BEDFILES=$1
VCF=$2
if [[ -z $BEDFILES || -z $VCF ]]
then
echo "Usage: ./snp-density-final.sh <file of bed files to process> <vcf file>"
exit 1
fi
if [[ ! -f "$VCF" ]]
then
echo "$VCF does not exist. Exiting script."
exit 1
else
for BEDFILE in $(cat $BEDFILES)
do
if [[ -e $BEDFILE ]]
then
echo "SNP density for $BEDFILE:"
bedtools intersect -c -a $BEDFILE -b $VCF | awk 'BEGIN{snps=0; lens=0} {snps+=$4; lens+=$3-$2} END{if(lens > 0){print snps/lens}}'
echo "---"
else
echo "File $BEDFILE does not exist. Skipping!"
fi
done
fi
Bonus note on elif
statements. Sometimes, you may want to check
for multiple exclusive conditions consecutively. In those cases, using
only if else can get complicated because each if
statement needs to be
ended with a fi
and those are hard to keep track of. Imagine a script
that extracts certain metadata from different file types. If the file is
a .vcf
, you might want to use bcftools
to process it. If the file is
a .bam
, you might want to use samtools
, etc. If the file is neither,
you might want to print an error message. Here is an example of how
elif
followed by an else
to catch everything else can be used to
manage that scenario:
You do not need to run the following code, just observe how it is structured.
#!/bin/bash
FILE=$1
if [[ $FILE == *.vcf ]]
then
echo "Processing VCF file with bcftools..."
# bcftools commands here
elif [[ $FILE == *.bam ]]
then
echo "Processing BAM file with samtools..."
# samtools commands here
else
echo "Unknown file type. Please provide a .vcf or .bam file."
exit 1
fi
Documenting your script
In the last section we made our script more user friendly by adding a usage message if not all of the variables were given as arguments. We can also make our script more readable by adding a detailed description of how the script works in the file itself using comments. We highly encourage everyong to annotate their code with comments. It makes it easier for others (and future you) to understand. In the case of a script, documentation typically means adding a description of what the script does, what the inputs are, what the outputs are, and the author of the script to the top of the file. Here’s an example of what that might look like:
Exercise: Add a description of what the script does, what the inputs are, what the outputs are, and the author of the script to the top of your
snp-density-final.sh
script. Write your solution in the code block below (and also modify it in your bash script).