Introduction and setup

The Pairwise Sequentially Markovian Coalescent (PSMC) model uses information in the complete diploid sequence of a single individual to infer the history of population size changes. Originally published in 2011 (Li and Durbin 2011), it has become a very popular tool in the world of genomics. In this tutorial, we walk through the steps to generate the necessary input data for PSMC and run it on recently published data from woolly mammoths (Palkopoulou et al., 2015).

The bam files and reference genome necessary to run the following scripts can be found at: /n/regal/informatics/workshops/PSMC_20150911

These data were originally downloaded from the Broad Institute (elephant reference genome) and the European Nucleotide Archive (bam file). If you download the data yourself, you will need to index both the fasta file and the BAM file with samtools before beginning.

Prior to running the scripts below, make a new directory in your lab's folder on regal, or in /n/regal/informatics/workshops/PSMC_20150911/Users if you do not have a lab directory on regal. This is where you will run all of your analysis.

Note that for this analysis, we are starting from a BAM file that contains reads already mapped to the reference genome, in this case the elephant. To run PSMC on your own data, you will need to first map your reads to a reference genome before adapting these scripts.

Finally, the slides (PDF) from the journal club presentation of this demo are here.

Calling consensus sequence

Starting from mapped reads, the first step is to produce a consensus sequence in FASTQ format. To do this, we will use the samtools/bcftools suite, following the methods described in the paper (Palkopoulou et al., 2015).

The basic idea behind generating a consensus sequence is to first use samtools mpileup to take the mapped reads and produce a VCF file. The consensus sequence is then generated by bcftools with the original consensus calling model, and converted to fastq (with some additional filtering) by A few notes:

  • Since Palkopoulou et al only analyzed autosomes, we will do the same, relying on the fact that the 27 autosomes are named chr1 - chr27 in the reference. This means we can easily use a slurm job array to separately process each chromosome.
  • We'll use Odyssey's module system to control software versions. For reference, we use samtools version 1.2 and bcftools version 1.2. Note that these are newer versions than used in the paper.
  • We'll take advantage of Unix pipes and the ability of samtools to work with streaming input and output to run the whole pipeline (samtools -> bcftools -> as one command.

First, let's set up some slurm parameters for each chromosome:

#SBATCH -n 1 #each chromosome will be processed on a single core (-n 1)  
#SBATCH -N 1 #on one machine (-N 1)  
#SBATCH -p serial_requeue #we'll use the serial_requeue queue 
#SBATCH -t 0-12:00 #most runs should finish in 3-4 hours but we'll a bit extra
#SBATCH --mem 2000 #most runs should be under 1 GB but we'll add a buffer
#SBATCH -J mpileup #a basic name for this job

We'll just use the defaults for the error logs, that is slurm_<jobid>.txt. However, we are going to run this as a job array. In this case, we know in advance that we want to run this on the 27 autosomes, so we'll specify the job array size in our script, like so:

#SBATCH --array = 1-27

For more on job arrays, peruse the research computing documentation.

Now, on to the actual code.

First, we'll load the necessary software using the Odyssey module system:

module load samtools/1.2-fasrc01  
module load bcftools/1.2-fasrc01

Now, since we've launched 27 jobs with the job array, we need to tell each separate job which chromosome to work on. We'll do that using a bash variable. We'll also set a bash variable to store the location of the directory with the bam files and reference genome:


Note: the variable $SLURM_ARRAY_TASK_ID stores the array id (1,2,3,4, ... ,27) for each of the separate array jobs.

Finally, we run our consensus calling pipeline, consisting of a linked set of samtools, bcftools, and commands:

samtools mpileup -Q 30 -q 30 -u -v \
-f $DATAPATH/loxAfr4.fa -r $CHR $DATAPATH/P964.bam |  
bcftools call -c | vcf2fq -d 5 -D 34 -Q 30 > P964.$CHR.fq

This takes as input an aligned bam file and a reference genome, generates an mpileup using samtools, calls the consensus sequence with bcftools, and then filters and converts the consensus to fastq format, writing the results for each chromosome to a separate fastq file. Some parameter explanations:

  1. samtools:
    • -Q and -q in mpileup determine the cutoffs for baseQ and mapQ, respectively
    • -v tells mpileup to produce vcf output, and -u says that should be uncompressed
    • -f is the reference fasta used (needs to be indexed)
    • -r is the region to call the mpileup for (in this case, a particular chromosome based on the array task id)
    • P964.bam is the bam file to use
  2. bcftools:
    • call -c calls a consensus sequence from the mpileup using the original calling method
    • -d 5 and -d 34 determine the minimum and maximum coverage to allow for vcf2fq, anything outside that range is filtered
    • -Q 30 sets the root mean squared mapping quality minimum to 30

Note that this script can be found at /n/regal/informatics/workshops/PSMC_20150911/mpileup.sbatch.

Running PSMC

PSMC takes the consensus fastq file, and infers the history of population sizes. Although it takes a variety of parameters to control the details of the model fitting, we will follow Palkopoulou et al and use the defaults. First, let's set up some slurm parameters for the PSMC run:

#SBATCH -n 1 
#SBATCH -p serial_requeue 
#SBATCH -t 0-16:00   
#SBATCH --mem 6000 
#SBATCH -J psmc

We'll again just use the defaults for the error logs, that is slurm_<jobid>.txt. Now on to the code. First we need to load the software we will need. In addition to PSMC, we'll need to load gnuplot in order to make the final PSMC figure.

module load psmc/0.6.5-fasrc01
module load gnuplot/4.6.4-fasrc01

The first thing we need to do is merge all our single-chromosome fastq files into a single consensus sequence, which we'll do using the unix tool cat.

cat P964.chr*.fq > P964.consensus.fq

Now we need to convert this fastq file to the input format for PSMC:

$PSMC_HOME/utils/fq2psmcfa P964.consensus.fq > P964.psmcfa

Then we can run PSMC, using the default options -- but note that we specify the -p parameter as the defaults reported in the paper differ from the current defaults.

psmc -p "4+25*2+4+6" -o P964.psmc P964.psmcfa

Finally, we make the PSMC plot, using the per-generation mutation rate -u and the generation time in years -g reported in the paper. Because the paper does not give exact parameters for how they produced the plot, this likely will look a bit different than the figure, but hopefully it will be pretty close.

$PSMC_HOME/utils/ -u 3.83e-08 -g 31 -p P964_plot P964.psmc

Note that this script can be found at /n/regal/informatics/workshops/PSMC_20150911/psmc.sbatch.