1. Introduction and prerequisites

This tutorial is intended for people who are familiar with the basics of unix but want to learn more about manipulating files and running commands.

2. Summary of Commands Covered

  • scp
  • redirection (> and 2>&1)
  • sort
  • wc
  • pipes |
  • grep
  • find
  • awk
  • bash for loops

Functionality Covered

We call this the intermediate tutorial; it is intended for people who are comfortable creating files and moving around the filesystem but want to move on to manipulating data files.

3. Transferring Files To/From the Cluster

Linux/OS X

There are graphical tools to transfer files to and from the cluster but it is very handy to know the command line versions. For this we use the scp (secure copy) command.

The basic syntax is (and you need a terminal open on your local machine):

scp username@server:remotefilename localfilename

The remote filename can be the full path or just the filename if it's in your home directory. The local filename can just be . for your current directory.

For example, for a single file we can use:

scp mclamp@login.rc.fas.harvard.edu:/n/mypath/seq/pog.fa .

For a directory and its contents we can use the -r option:

scp -r mclamp@login.rc.fas.harvard.edu:/n/mypath/seq/ .
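
Copying in the other direction (from your local machine up to the cluster) uses the same syntax with the arguments reversed; the filename and path below are illustrative only:

scp myresults.tar mclamp@login.rc.fas.harvard.edu:/n/mypath/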

Common mistakes include:

  • Forgetting the colon : separating the servername from the path
  • Mistyping the path
  • Forgetting the -r when copying directories
  • Not including the destination filename/directory
  • Running scp from the remote machine and not your local machine - check your command prompt people! Everyone does it at least once.

Windows

The RC documentation has information on using FileZilla here: https://rc.fas.harvard.edu/resources/documentation/transferring-data/copying-data-to-and-fromodyssey-using-scp

Exercises
  • Transfer the /n/regal/informatics/workshops/Intermediate_Unix/Data/seq directory used above to your local machine. Try to get the command right the first time, and use pwd plus cut-and-paste to minimize typing.

4. Redirecting Output and Backgrounding processes

4.1 Redirecting Output

We introduced redirecting output briefly in the Basic Unix workshop. We said that we could put the output from any command into a file using the > redirect operator. For instance:

ls -l > mydir.dat

There are actually two types of output from a typical unix command: stdout (standard out), which is what we were using above, and stderr (standard error), which is used for, not surprisingly, error messages. If we don't specify anything the shell assumes we're referring to stdout. To differentiate them we refer to stdout as 1> and stderr as 2>. So if we want to put output and errors into separate files we'd do:

mycommand 1> mycmd.out 2> mycmd.err

If we have both types of output and want to put them into the same file (the most common case) we can do

mycommand > mycmd.out 2>&1

The order matters here: redirections are processed left to right, so this first sends stdout to the file and then sends stderr to wherever stdout is now pointing (the file). Writing mycommand 2>&1 > mycmd.out instead would copy stderr to the terminal and put only stdout in the file.
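
To see the two streams in action, try listing the mydir.dat file we made above alongside a name that doesn't exist (the filenames are just placeholders):

ls mydir.dat no_such_file > out.txt 2> err.txt
cat out.txt    # contains the listing for mydir.dat (stdout)
cat err.txt    # contains the "No such file or directory" message (stderr)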

4.2 Backgrounding Processes

Most of the commands we've run in the previous workshop have finished almost immediately but with real analysis this is often not the case. To keep the command running but reclaim our command prompt we can end a command with & to tell the shell to run it in the background.

For instance

bash mylongcmd.sh > myfile.out &

If myfile.out exists this command will overwrite it. If you want to append to a file do:

bash mylongcmd.sh >> myfile.out &

We can also push something into the background once it's running by using the ctrl-z command to suspend it and give us our command prompt back. Once it's suspended we can type bg to push it into the background and it'll start running again.
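
For example, a typical session might look like this (mylongcmd.sh stands in for any long-running script):

bash mylongcmd.sh > myfile.out    # running in the foreground
# press ctrl-z here - the shell reports the job as Stopped
bg                                # resume the suspended job in the background
jobs                              # now shows the job as Running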

Other useful commands when you have background jobs are:

  • Ctrl-z bg Suspend the job and then run it in the background
  • jobs Lists all running jobs
  • kill %1 Kill job 1
  • kill %% Kill the current (most recent) job
  • fg Put the last job back into the foreground (or use fg %n for the nth job)
Exercises
  • Run the command `bash /n/regal/informatics/workshops/Intermediate_Unix/Data/mylongscript.sh` and put the output into a file
  • Suspend the job
  • Run the jobs command - what is it telling you?
  • Push the job into the background by typing bg
  • Rerun the jobs command - does the output make sense?
  • Kill your job

5. Slurm submission

We are now going to go over the basics of a slurm submission. You’re currently all logged into a login node where you can run short commands but anything that requires more than about 30 seconds of run time should be submitted to the slurm queue.

You can submit jobs completely on the command line but I recommend creating a script file and submitting that. This gives you a record of what you have run and the parameters you used. Here is a template for a typical slurm submit (sbatch) script.

#!/bin/bash
#
# These are comment lines starting with #
#
# These lines are interpreted by slurm.
#SBATCH -J <jobname>        # The name of the job (can be any string – make it something readable)
#SBATCH -N 1                # Ensure that all cores are on one machine
#SBATCH -n <n>              # Use n cores for the job
#SBATCH -t <n-nn:nn>        # Runtime in D-HH:MM
#SBATCH -p <queuename>      # Partition to submit to (serial_requeue/general)
#SBATCH --mem=<n>           # Memory pool for all cores in MB (see also --mem-per-cpu)
#SBATCH -o <outfile>.%A.out # File to which STDOUT will be written (%A is replaced by the jobid)
#SBATCH -e <outfile>.%A.err # File to which STDERR will be written (%A is replaced by the jobid)
#SBATCH --mail-type=<type>  # Type of email notification- BEGIN,END,FAIL,ALL
#SBATCH --mail-user=<myemail@harvard.edu> # Email to which notifications will be sent
#SBATCH --constraint=holyib               # If you want to use the /n/holylfs storage

# The commands below can use command-line parameters such as $1 here

source new-modules.sh

module load <mymodule>

some_command_here $1 > $1.out
another_command < $1.out > $1.new

In the sections below we'll go through the various pieces.

In practice it's useful to keep a template sbatch file handy so you can copy it when you want to run a new set of commands on the cluster.
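
For reference, here is a minimal filled-in sketch; the job name, runtime, memory and commands are example values only:

#!/bin/bash
#SBATCH -J testjob          # example job name
#SBATCH -N 1                # all cores on one machine
#SBATCH -n 1                # 1 core
#SBATCH -t 0-01:00          # 1 hour of runtime
#SBATCH -p serial_requeue   # partition
#SBATCH --mem=400           # 400 MB of memory
#SBATCH -o testjob.%A.out   # STDOUT
#SBATCH -e testjob.%A.err   # STDERR

hostname
date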

5.1 The Header Line

#!/bin/bash

This tells unix to use the /bin/bash command to execute the file.

5.2 The #SBATCH comment lines

In a bash script comment lines start with a #. The shell ignores these but when you submit a script to slurm all lines starting

#SBATCH

are treated differently and tell slurm how to schedule your job.

5.3 The commands

Underneath the #SBATCH lines we can start running commands. If you’re using centrally installed software these can include module load commands.

5.4 Testing

Before submitting, test the script on the command line by running it for a short while using

bash myscript.sh

If there is a typo or something wrong with the script it will fail almost immediately. Fix the problem and test until it runs.
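
If your script uses $1 as in the template above, pass the argument on the command line when testing (myinput.dat here is a hypothetical input file):

bash myscript.sh myinput.dat

The same works at submission time: sbatch myscript.sh myinput.dat passes myinput.dat through to the script as $1.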

5.5 Submitting

Submit your script to the cluster using

sbatch myscript.sh

When it’s submitted you’ll get a message containing the job id of the job.

5.6 Checking job status

If you’ve set the SBATCH parameters correctly you’ll get an email when your job runs/fails/etc. You can also check on the status of the job using the squeue command.

squeue -u <username>

For an individual job you can see the status using

sacct -j <jobid>
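
sacct can also report specific fields via its --format option; for example, to see the state, elapsed time and peak memory of a job:

sacct -j <jobid> --format=JobID,JobName,State,Elapsed,MaxRSS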

5.7 Killing jobs

If you have submitted something and you want to remove it from the queue use the scancel command

scancel <jobid>
Exercises
  • Copy a template slurm submission script from /n/regal/informatics/workshops/Intermediate_Unix/Slurm/template.sh to your home directory.
  • Edit this file and enter suitable values for all the parameters in the #SBATCH lines. We recommend
    • Use 1 machine
    • Run on 1 core
    • Use 1 MB of memory
    • Allow for 5 minutes of run time
    • Submit to the serial_requeue queue
    Inside the script run two commands `ls -l /tmp/` and `hostname` and append the output to a file.
  • Test your script on the command line (ctrl-c to get out of it)
  • Submit the script to slurm
  • Check the status of your job.

Summary of SLURM commands

The table below shows a summary of the SLURM commands used in this section, along with an example of each.

TASK                         SLURM COMMAND   SLURM EXAMPLE
Submit a batch serial job    sbatch          sbatch runscript.sh
Run a script interactively   srun            srun --pty -p interact -t 10 --mem 1000 /bin/hostname
Kill a job                   scancel         scancel 999999
View status of queues        squeue          squeue -u akitzmiller
Check current job by id      sacct           sacct -j 999999

6. Searching for files – find

Find is a sophisticated recursive directory search that locates files matching a pattern and can also execute commands on the files it finds.

Basic syntax is:

find <dir> -name <name pattern>

This will print all found files matching the pattern underneath the specified directory.

Example:

find . -name "*.fastq"

The power comes when you add on the -exec option and append a command.

Example:

find . -name "*.fastq" -exec ls -l {} \;

(You put {} where the filename should go and the command should always end with \;)

This command finds all fastq files and lists them.
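
The -exec option works with any command. For example, to count the lines in each fastq file found:

find . -name "*.fastq" -exec wc -l {} \;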

Another use for find is locating files that are newer or older than a certain time.

find . -mtime +7

will find all files below the current directory modified more than 7 days ago. Use -mtime -7 for files modified less than 7 days ago.

We can also use find for finding large files (or indeed small files)

find . -size +1M

And if we want to know the size of the files we can add an -exec option

find . -size +1G -exec ls -ls {} \;
Exercises:
  • Find all files in your home directory older than 1 day
  • Find all files in /tmp/ newer than 7 days
  • Find all files under the /n/regal/informatics/workshops/Intermediate_Unix directory that end in `.fa` and print their contents to the screen
  • Find all files under the /n/regal/informatics/workshops/Intermediate_Unix directory that are greater than 10G. How big are they?

Extra: A good set of find examples is here http://alvinalexander.com/unix/edu/examples/find.shtml

6.1 Basic file searching - grep

The grep command looks for strings within files. The basic use is:

grep 'mystring' myfile

For example

grep 'chr' Data/AF1.bed

finds all lines in Data/AF1.bed that contain the string chr

(We'll come back to more advanced grep later)

7. Pipes – joining commands together using |

You can feed the output of one command into the input of another using the | character.

For example, you can run three commands one after the other:

<cmd1> | <cmd2> | <cmd3>

This is best illustrated with some examples:

Example 7.1

ls -l | less

This is useful if you’re listing a big directory and the output won’t fit on one screen.

Example 7.2

find . -name "*.bed" | wc -l

This will find how many *.bed files there are below the current directory (wc -l returns the number of lines)

Example 7.3

find . -name "*.bed" -exec cat {} \; | wc -l

This counts the total number of lines across all the bed files found

Example 7.4

find . -name "*.bed" -exec cat {} \; | grep chr20 | wc -l

This finds how many chr20 lines there are in all bed files

Example 7.5

Here's a more complicated but common and useful example

cat myfile.dat | grep chr20 | awk '{print $1}' | sort | uniq -c > myfile.out

Here we have 5 different commands all chained together. Breaking this down, the command:

  • lists the contents of myfile.dat (cat)
  • for all lines that contain chr20 (grep)
  • prints the string in the first column (awk '{print $1}')
  • sorts the output (sort)
  • prints only the unique strings and how many occurrences there are (uniq -c)
  • puts the results into a file myfile.out (>)
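
Incidentally, the initial cat isn't strictly necessary since grep can read the file directly; this shorter form is equivalent:

grep chr20 myfile.dat | awk '{print $1}' | sort | uniq -c > myfile.out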

Sometimes we have commands that absolutely need a filename rather than reading from standard input. In these cases we can still use pipes by substituting the - character for the filename. For instance the samtools command on a file looks like:

samtools view -b -S myfile.sam

If we want to feed input into samtools directly without using the myfile.sam file we need to do

bowtie2 -x hg19 -U myfile.fq -p 32 | samtools view -b -S -
Exercises:
  • Find how many chr14 lines there are in /n/regal/informatics/workshops/Intermediate_Unix/Data/AF2.bed and /n/regal/informatics/workshops/Intermediate_Unix/Data/AF1.bed
  • Find and concatenate all .bed files under /n/regal/informatics/workshops/Intermediate_Unix/ and use a pipe and another command to find the 10th row containing chr20

8. The sort command

When we create data output files we often want to manipulate the contents by sorting. The unix sort command can be used to sort files very easily and in many different ways.

When we have columns of data we often want to sort on a column to find the highest or lowest entry. A typical command looks like:

sort -nk4 <file> | less

There are a lot of useful options to sort. The most useful are:

  • -n – sort numerically
  • -k4 – sort starting at the 4th column
  • -k4,5 – sort using the 4th and 5th columns only
  • -r – reverse the sort order
  • -u – sort and report unique lines
  • -t"," – set the field delimiter to a comma

Example 8.1

sort -nk2 AF1.bed # sorts the file by the 2nd column

We can have multiple column options

sort -k1,1 -k2,2n AF1.bed # sorts the file by chromosome first and then start coord
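
The -t option matters for delimited files. As a sketch, assuming a hypothetical comma-separated file scores.csv with a numeric score in the 3rd column, this sorts it highest score first:

sort -t"," -k3,3nr scores.csv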
Exercises:
  • Find the highest scoring 10 entries in the AF1.bed file (score is the 5th column)
  • Concatenate the AF1.bed and AF2.bed files and sort the results by chromosome and then score.
  • What is the 3rd lowest score in chr9 in the resulting file?

Extra: More sort examples are at http://www.theunixschool.com/2012/08/linux-sort-command-examples.html

Sorting big files

The sort command by default uses the /tmp/ directory to store intermediate files as it’s sorting. For very large files this can fill up the /tmp/ directory and your sort will fail. You can change where sort keeps its temporary files by using the -T option to specify a different directory. For instance if you want to use your current directory to store temp files use:

sort -T . mybigfile.dat > mybigfile.sort

Of course you still have to have enough space in the new directory (use the df -h command to check).

9. Searching for strings – grep

grep is a fantastic command for searching through files and directories. The basic syntax is:

grep <pattern> <file>

So to find all entries for chr20 in our AF1.bed file we’d do:

grep 'chr20' /n/regal/informatics/workshops/Intermediate_Unix/Data/AF1.bed

Useful options:

  • -v search for everything but the pattern
  • -n show the line number of the line found
  • -c show the count of the number of matched lines
  • -C n Show n lines context
  • -r search recursively down the directory tree (the search target is a directory here)
  • -i ignore case
  • -H print the filename along with the file found
  • -f <file> read the search patterns from a file
  • -d <action> control how directories are handled (read, recurse or skip)
  • -l only print the filename and not the line found (useful when there are multiple matches per file)
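
A couple of these options in action:

grep -c 'chr20' Data/AF1.bed    # count of lines containing chr20
grep -rl 'chr20' Data/          # list the files under Data/ that contain chr20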

Extra:

Information about more complicated searching using regular expressions and egrep can be found here: http://ryanstutorials.net/linuxtutorial/grep.php

Exercises:
  • Find how many bed peaks (rows) there are in the AF2.bed file for chr10
  • Find which files contain the string AAAAAAAA in the Data directory

10. Manipulating file contents – awk

This is where things really get powerful. Awk is a ‘pattern scanning and processing’ utility. It lets you search and filter files based on column and pattern. The basic syntax is:

awk 'pattern { action }' filename

Again this is best shown by example

awk '$1 == "chr1" { print $1,$2,$3 }' AF1.bed

Here the pattern is $1 == "chr1" (column 1 equals chr1) and the action is to print columns 1, 2 and 3.

We can do more interesting things with the pattern e.g.

awk '$2 > 1000000 && $3 < 2000000 { print $0 }' AF1.bed

This only prints lines where the region is between 1Mb and 2Mb. The $0 represents the whole line.

We could also do this by omitting the action entirely (the default action is to print the whole line):

awk '$2 > 1000000 && $3 < 2000000' AF1.bed

We can also search for substrings using the ~ character

awk '$1 ~ /1/ { print $1 }' AF1.bed

only prints out lines where the first column contains a 1

Similarly

awk '$1 !~ /1/ { print $1 }' AF1.bed

This prints the first column of lines where the first column doesn't contain a 1.

There are other things you can include

awk '$1 ~ /^1/ { print $2 }' AF1.bed

This prints the 2nd column of lines where the 1st column starts with a 1.

You can reference the line number and number of fields using NR and NF.

awk 'NR > 1' AF1.bed       # only prints rows 2 to the end

awk 'NF == 12' AF1.bed     # only prints rows with exactly 12 fields

awk 'NR % 4 == 1' AF1.bed  # only prints lines where the line number divided by 4 leaves a remainder of 1
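
As an aside, awk also has an END block that runs after the last input line, which is handy for totals; a small sketch that sums the score column (column 5) of a bed file:

awk '{ sum += $5 } END { print sum }' AF1.bed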

Combining awk with sort and uniq

The uniq command omits repeated adjacent lines. It is often used to count repeats using the -c option. Because only adjacent duplicates are collapsed, the input should be sorted first.

For example:

awk '{print $1}' Data/AF1.bed | sort | uniq -c

This counts how often each chromosome appears in the bed file.

There’s much more to awk but these commands will get you a long way. Let’s now do something useful:

Exercises:
  • Find how many entries there are in the AF1.bed file for each chromosome
  • Find how many entries scoring > 300 there are in the AF1.bed file for each chromosome
  • Find the read lengths in the fastq file (Hint: each fastq entry has 4 lines and the read length is on the 1st line of the entry)