Manipulating files with grep
What is grep?
grep is a powerful command-line search tool that is included as part of most Unix-like systems. It is one of the most useful tools in bioinformatics! At the most basic level, grep searches each line of its input for a string of characters that matches a pattern and prints the lines containing a match. Basic syntax is:
grep 'pattern' file_to_search
By default, grep will match any part of the string, so for example:
echo 'my dog is brown' > sample.txt
grep 'dog' sample.txt
grep 'do' sample.txt
grep 'd' sample.txt
The above three grep commands will all match the text in the example file and will print the line. However, there are a huge number of arguments that can modify how grep behaves. Here are a few useful examples!
grep -w
matches entire words.
- So in the above example:
grep -w 'dog' sample.txt
would match the string, but grep -w 'do' sample.txt
would not.
grep -i
allows case-insensitive matches
- In the above example, grep -i 'DOG' sample.txt
would still match the line
grep -v
inverts, returning lines that do not match the pattern.
grep -o
returns only the part of each line that matches the pattern, not the entire line.
grep -c
counts the number of lines that match the pattern.
- Equivalent to grep 'pattern' file | wc -l
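To see these flags side by side, here is a small sketch on a made-up three-line file (pets.txt is a hypothetical name, not one of the course files). It assumes a grep that supports -w and -o, which GNU and BSD grep both do.

```shell
# Work in a scratch directory so nothing else is touched
cd "$(mktemp -d)"

# Hypothetical three-line sample file
printf '%s\n' 'my dog is brown' 'my cat is black' 'Dogs chase cats' > pets.txt

grep -w 'cat' pets.txt   # whole word: prints line 2 only ("cats" is not the word "cat")
grep -i 'dog' pets.txt   # case-insensitive: prints lines 1 and 3
grep -v 'dog' pets.txt   # inverted: prints lines 2 and 3 ("Dogs" has a capital D)
grep -o 'cat' pets.txt   # only the matched text: prints "cat" twice
grep -c 'my' pets.txt    # count of matching lines: prints 2
```

Note that -c counts matching lines, not individual matches, so a line with the pattern in it twice still counts once.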
Print lines before/after a match:
- grep -A [n]
returns matching line and n lines after match
- grep -B [n]
returns matching line and n lines before match
- grep -C [n]
returns matching line and n lines before and after match
Note: grep -A
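is very useful for pulling out specific sequences from a FASTA file...just make sure each sequence in your FASTA file sits on a single line and is not wrapped across multiple lines! Otherwise -A 1 will only return the first chunk of the sequence.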
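A quick way to get a feel for the context flags is to try them on a tiny numbered file. The file name lines.txt and its contents are made up for illustration.

```shell
# Work in a scratch directory
cd "$(mktemp -d)"

# Hypothetical five-line file with one line we want to find
printf '%s\n' 'line 1' 'line 2' 'line 3 TARGET' 'line 4' 'line 5' > lines.txt

grep -A 1 'TARGET' lines.txt   # prints lines 3 and 4
grep -B 1 'TARGET' lines.txt   # prints lines 2 and 3
grep -C 1 'TARGET' lines.txt   # prints lines 2, 3 and 4
```

With GNU grep, when several matches are far apart, each group of context lines is separated by a line containing "--", which is worth remembering before piping the output into another tool.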
grep -f patterns.txt
matches a list of patterns contained in patterns.txt against the target file.
- I.e. the command grep -f patterns.txt file.txt
will match all patterns in patterns.txt against file.txt
- Pattern file has one pattern per line
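As a minimal sketch of how a pattern file works, the example below builds both files from scratch. The file contents are invented for illustration.

```shell
# Work in a scratch directory
cd "$(mktemp -d)"

# Hypothetical target file and pattern file (one pattern per line)
printf '%s\n' 'apple' 'banana' 'cherry' 'date' > file.txt
printf '%s\n' 'apple' 'cherry' > patterns.txt

# A line is printed if it matches ANY of the patterns
grep -f patterns.txt file.txt   # prints "apple" and "cherry"
```

One thing to watch for: with GNU grep, a blank line in the pattern file acts as an empty pattern and matches every line, so it is worth checking the pattern file for stray empty lines.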
Remember that we can combine different grep arguments with each other! E.g.:
Say we want to pull out a specific sequence from a FASTA file, like the X chromosome. We use the -w
flag to match whole words, combined with -A
to get the line after a match:
grep -w -A 1 '>X' dmel-all-chromosome-r6.20.fasta
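To see why -w matters here, the sketch below uses a toy single-line FASTA file. The file toy.fasta, the header names and the sequences are all made up for illustration.

```shell
# Work in a scratch directory
cd "$(mktemp -d)"

# Hypothetical toy FASTA, one sequence per line
printf '%s\n' '>X' 'ACGTACGT' '>XHet' 'GGGGCCCC' '>2L' 'TTTTAAAA' > toy.fasta

# Without -w, '>X' also matches the '>XHet' header (4 lines printed)
grep -A 1 '>X' toy.fasta

# With -w, only the '>X' header and its sequence are printed (2 lines)
grep -w -A 1 '>X' toy.fasta
```

The quotes around '>X' are important: an unquoted > would be read by the shell as output redirection rather than as part of the pattern.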
There are many other functions of grep! When in doubt, remember you can check the help page using man grep
...or just by using Google :)
Pattern matching with regular expressions
Regular expressions, aka "regex", are patterns that describe sets of strings. In other words, they allow you to match complex patterns with grep, not just exact matches. Regex is extremely powerful, but can also get (very) complicated, so we'll just stick to a few basic uses.
Regex has certain meta-characters that are reserved for special uses:
^: matches pattern at start of line
$: matches pattern at end of line
.: matches any single character except new lines
[]: matches any one of the enclosed characters
[^]: matches any one character *except* the ones enclosed (note: this is different from ^ outside the brackets)
\: "escapes" meta-characters, allows literal matching
Note that if we want to match any of these special characters literally (e.g. matching a period character "."), we would need to use a "\" to escape it first:
grep '\.' dmel-all-no-analysis-r6.20.gff
will literally match a period.
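The meta-characters above can be tried out on a small made-up list of names. This is just a sketch; names.txt and its contents are hypothetical.

```shell
# Work in a scratch directory
cd "$(mktemp -d)"

# Hypothetical list of names, one per line
printf '%s\n' 'chr1' 'chr2' 'chrX' 'scaffold_chr1' 'v1.0' > names.txt

grep '^chr' names.txt      # starts with "chr": chr1, chr2, chrX
grep '1$' names.txt        # ends with "1": chr1, scaffold_chr1
grep 'chr[12]' names.txt   # "chr" then 1 or 2: chr1, chr2, scaffold_chr1
grep 'chr[^12]' names.txt  # "chr" then anything but 1 or 2: chrX
grep '\.' names.txt        # a literal period: v1.0
grep '.' names.txt         # unescaped . matches any character: all five lines
```

Comparing the last two commands shows why the escape matters: without the backslash, the pattern matches every non-empty line.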
One example of how regex can come in handy is using the ^ special character to quickly count how many sequences are in a FASTA file, which we would do as follows:
grep -c '^>' dmel-all-chromosome-r6.20.fasta
This command matches lines in the FASTA file that start with a ">" character, i.e. the header lines, and uses the -c argument to count how many matches!
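The same count can be checked on a toy file where the answer is known in advance. The file toy.fasta and its three sequences are made up for illustration.

```shell
# Work in a scratch directory
cd "$(mktemp -d)"

# Hypothetical FASTA with three sequences
printf '%s\n' '>seq1' 'ACGT' '>seq2' 'GGCC' '>seq3' 'TTAA' > toy.fasta

# Count lines that start with ">", i.e. the number of sequences
grep -c '^>' toy.fasta   # prints 3
```

Always keep the pattern in quotes. If the quotes are left off, the shell treats > as a redirect and overwrites the FASTA file with an empty one.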
Here are a few more examples to show you how grep can help you wrangle your files!
- Use the
-v
flag to invert grep and filter out lines matching unassigned contigs (chrUn) in the file hg19.genome and direct the output to a file:
grep -v 'chrUn' hg19.genome > hg19_noUn.genome
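Here is the same filter as a sketch on a tiny stand-in file. The name toy.genome, the contig name and the sizes are invented; the two-column layout (name, tab, size) is an assumption about how such a file is organized.

```shell
# Work in a scratch directory
cd "$(mktemp -d)"

# Hypothetical genome file: sequence name, tab, size (made-up values)
printf '%s\t%s\n' chr1 1000 chrUn_gl000211 200 chr2 900 > toy.genome

# Keep every line that does NOT contain "chrUn"
grep -v 'chrUn' toy.genome > toy_noUn.genome

cat toy_noUn.genome   # prints the chr1 and chr2 lines only
```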
- Use grep to pull out the header line of only the 2R chromosome arm from dmel-all-chromosome-r6.20.fasta, using
-w
to avoid partial matches:
grep -w '^>2R' dmel-all-chromosome-r6.20.fasta > 2R_header.txt
- Use grep from a list of patterns with
-f
to extract the lines of only the major chromosome arms (2L, 2R, 3L, 3R, and X) from the file dmel-all-no-analysis-r6.20.gff:
printf '^%s\n' 2L 2R 3L 3R X > major_arms.txt
grep -w -f major_arms.txt \
dmel-all-no-analysis-r6.20.gff | less
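GFF feature lines begin with the sequence name rather than a ">" character, so the patterns for a GFF file are anchored with ^ followed directly by the arm name. The sketch below tries this on a toy file: toy.gff is a hypothetical three-column stand-in (a real GFF has nine tab-separated columns), and the 2RHet name is only there to show what -w filters out.

```shell
# Work in a scratch directory
cd "$(mktemp -d)"

# Hypothetical, simplified GFF-like file: sequence name, source, feature type
printf '%s\t%s\t%s\n' \
  2L FlyBase gene \
  2R FlyBase gene \
  2RHet FlyBase gene \
  4 FlyBase gene \
  X FlyBase gene > toy.gff

# One anchored pattern per line: ^2L, ^2R, ^3L, ^3R, ^X
printf '^%s\n' 2L 2R 3L 3R X > major_arms.txt

# -w stops ^2R from also matching the 2RHet line
grep -w -f major_arms.txt toy.gff   # prints the 2L, 2R and X lines
```

Patterns that find nothing (3L and 3R here) are simply ignored, so the same pattern file can be reused across files that contain different subsets of the arms.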