Table of Contents
ATAC-seq (Assay for Transposase-Accessible Chromatin with high-throughput sequencing) is a method for determining chromatin accessibility across the genome. It utilizes a hyperactive Tn5 transposase to insert sequencing adapters into open chromatin regions (Fig. 1). High-throughput sequencing then yields reads that indicate these regions of increased accessibility.
The developers of the ATAC-seq method have published a detailed protocol for the laboratory procedure (Buenrostro et al., 2015).
Here are a few additional things to consider when planning an ATAC-seq experiment:
Like most high-throughput sequencing applications, ATAC-seq requires that biological replicates be run. This ensures that any signals observed are due to biological effects and not idiosyncracies of one particular sample or its processing. To begin with, two replicates per experimental group are sufficient.
With ATAC-seq, control groups are not typically run, presumably due to the expense and the limited value obtained. A control for a given sample would be genomic DNA from the sample that, instead of transposase treatment, is fragmented (e.g. by sonication), has adapters ligated, and is sequenced along with the ATAC sample. Such a control could be useful to help define regions of the genome that are more challenging to sequence or to align reads unambiguously.
3. PCR amplification
In preparing libraries for sequencing, the samples should be amplified using as few PCR cycles as possible. This will help to reduce PCR duplicates, which are exact copies of DNA fragments that can interfere with the biological signal of interest.
4. Sequencing depth
Sequencing depth will vary based on the size of the reference genome and the degree of open chromatin expected. For studies of human samples, Buenrostro et al. (2015) recommend more than 50 million mapped reads per sample.
5. Sequencing mode
For ATAC-seq, we recommend paired-end sequencing, for several reasons.
More sequence data leads to better alignment results. Many genomes contain numerous repetitive elements, and failing to align reads to certain genomic regions unambiguously renders those regions inaccessible to the assay. Additional sequence data, such as with paired-end sequencing, helps to reduce these alignment ambiguities.
With ATAC-seq, we are interested in knowing the full span of the DNA fragments generated by the assay. A DNA fragment generated by the ATAC is typically longer than a sequence read, so a read will define only one end of the fragment. Therefore, with single-end sequencing, we would have to guess where the other end of the fragment is. Since paired-end sequencing generates reads from both ends, the full span of the DNA fragment is known precisely.
PCR duplicates are identified more accurately. As noted above, PCR duplicates are artifacts of the procedure, and they should be removed as part of the analysis pipeline (see below for more details). However, computational programs that remove PCR duplicates (e.g. Picard's MarkDuplicates) typically identify duplicates based on comparing ends of aligned reads. With single-end reads, there is only one position to compare, and so any reads whose 5' ends match will be considered duplicates. Thus, many false positives may result, and perfectly good reads will be removed from further analysis. On the other hand, after paired-end sequencing, both ends of the original DNA fragments are defined. To be declared a duplicate, both ends of one fragment would need to match both ends of another fragment, which is far less likely to occur by chance. Therefore, paired-end sequencing leads to fewer false positives.
Programs, like those listed below (e.g. FastQC, Bowtie2, MACS2), are run on Odyssey by submitting jobs via the SLURM management system. The jobs take the form of shell scripts, which are submitted with the sbatch command. The shell scripts request computational resources (time, memory, and number of cores) for a job; it is better to request more resources than expected, rather than risk having a job terminated prematurely for exceeding its limits.
The raw sequence files generated by the sequencing core are in FASTQ format. They are gzip-compressed, with '.gz' file extensions. It is unnecessary, not to mention wasteful of time and disk space, to decompress the sequence files; all common bioinformatics tools can analyze compressed files.
For paired-end sequencing, there are two files per sample: <sample>.R1.fastq.gz and <sample>.R2.fastq.gz.
Samples that were sequenced on multiple lanes may have separate files for each lane; these should be concatenated using the
cat <sample>.lane1.R1.fastq.gz <sample>.lane2.R1.fastq.gz > <sample>.R1.fastq.gz cat <sample>.lane1.R2.fastq.gz <sample>.lane2.R2.fastq.gz > <sample>.R2.fastq.gz
However, different replicates should not be concatenated, but instead should be processed separately.
It is generally a good idea to generate some quality metrics for your raw sequence data. One tool that is commonly used for this purpose is FastQC.
On Odyssey, each sequence file would be analyzed like this:
module load fastqc fastqc <sample>.R1.fastq.gz
FastQC is efficient; it can process a file of 20 million reads in about 5 minutes with less than 250MB memory used. The output from FastQC is an HTML file, which can be examined via a web browser. The report lists various statistics about your reads, such as their count and lengths. It also provides graphical representations of your data based on a number of categories, including quality scores, GC levels, PCR duplicates, and adapter content.
The FastQC report is there to alert you to potential issues with your data, but it is not the final determinant of the outcome of your ATAC-seq experiment. Do not be overly concerned if your FastQC reports contain one or more red 'X' marks; this is not a reason to delete your data and start all over again.
For reads derived from short DNA fragments, the 3' ends may contain portions of the Illumina sequencing adapter. This adapter contamination may prevent the reads from aligning to the reference genome and adversely affect the downstream analysis. If you suspect that your reads may be contaminated with adapters (either from the FastQC report ["Overrepresented sequences" or "Adapter content" sections], or from the size distribution of your sequencing libraries), you should run an adapter removal tool. Here are two options:
One of the most widely used adapter removal programs is cutadapt. Cutadapt searches input reads for a given adapter sequence. When it finds the adapter, it removes the adapter and everything that follows it. Reads that do not match the adapter remain unaltered.
Some things to note when using cutadapt:
- The adapter sequences need to be provided via the
-a argument. If you do not know which adapters were used for your samples, consult the sequencing core.
- Cutadapt will attempt to match a minimal length of the provided adapter sequence. The default value for this argument (
-O) is 3bp. The downside of using such a small value is the possibility of false positives (trimming reads' good sequences that happen to match part of the adapter). On the other hand, increasing this parameter will result in more false negatives, since reads with adapter contamination may contain sequencing errors that prevent a match.
An alternative approach to adapter removal is provided by NGmerge, which was developed in the Informatics Group. Unlike cutadapt, NGmerge does not require that the adapter sequences be provided, nor does it require a parameter for the minimum length of adapter to match (in fact, it does not perform adapter matching at all). However, it works only with paired-end sequencing, so those with single-end sequencing should stick with cutadapt.
NGmerge is based on the principle that, with paired-end sequencing, adapter contamination occurs only when both reads fully overlap. The program tests each pair of reads for overlap, and in cases where they do, it clips the 3' overhangs of both reads (Fig. 2). Reads that do not overlap remain unaltered.
NGmerge is available on Odyssey in the ATAC-seq module:
module load ATAC-seq NGmerge -a -1 <sample>.R1.fastq.gz -2 <sample>.R2.fastq.gz -o <sample>
The output files will be <sample>_1.fastq.gz and <sample>_2.fastq.gz. Of the many arguments available with NGmerge, here are the most important ones for this application:
||Adapter-removal mode (must be specified)|
||Minimum length of overlap, i.e. the minimum DNA fragment length (default 50bp)|
||Number of cores on which to run|
For input files of 20 million paired reads, NGmerge should run in less than one hour on a single core, with minimal memory usage. Of course, the run-time will decrease with more cores (
Other than adapter removal, we do not recommend any trimming of the reads. Such adjustments can complicate later steps, such as the identification of PCR duplicates.
In order to align reads to a genome, the reference sequence must be indexed. This is a time- and memory-intense procedure, but it needs to be done only once for a given genome.
For many model organisms, the genome and pre-built reference indexes are available from iGenomes. Otherwise, Bowtie2 indexes are made from a FASTA genome file using the program
bowtie2-build <genome.fa> <genomeIndexName>
Once the indexes are built, the reads can be aligned using Bowtie2. A brief look at the manual reveals the large number of parameters and options available with Bowtie2. Here are a few that may benefit alignment of an ATAC-seq dataset on Odyssey:
||Maximum DNA fragment length (default 500bp). If you anticipate that you may have DNA fragments longer than the default value, you should increase this parameter accordingly; otherwise, alignments from such fragments will be considered not properly paired (see Fig. 3B below).|
||Bowtie2 has a number of alignment and effort parameters that interact in complex (and sometimes unexpected) ways. Preset collections of these parameters are provided for convenience; the default is
||Number of cores on which to run|
The output is a SAM file, which contains alignment information for each input read. The SAM can be compressed to a binary format (BAM) with SAMtools. This is best accomplished by piping the output from Bowtie2 directly to
samtools view, e.g.:
module load bowtie2 module load samtools bowtie2 --very-sensitive -x <genomeIndexName> -1 <sample>_1.fastq.gz -2 <sample>_2.fastq.gz | samtools view -b - > <BAM>
For input files of 20 million paired reads, this command takes around five hours. This can be decreased by increasing the number of cores in the Bowtie2 command. For example, we could specify eight cores for Bowtie2 with
-p 8 and adjust the request in the SLURM script to
#SBATCH -n 9 (that is, eight cores for Bowtie2 and one for SAMtools). The memory usage of Bowtie2 depends primarily on the genome length; enough must be requested to load the genome indexes.
Bowtie2 also provides (via stderr) a summary of the mapping results, including counts of reads analyzed, properly paired alignments, and reads that aligned to multiple genomic locations. By default, Bowtie2 will randomly assign one of multiple equivalent mapping locations for a read.
It is a well-known problem that ATAC-seq datasets usually contain a large percentage of reads that is derived from mitochondrial DNA (for example, see this discussion). Some have gone as far as using CRISPR to reduce mitochondrial contamination.
Assuming you have not gone the CRISPR route, you will have some mitochondrial reads in your sequence data. Since we are not interested in ATAC-seq peaks in the mitochondrial genome, these reads will only complicate the subsequent steps. Therefore, we recommend that they be removed from further analysis, via one of the following methods:
Remove the mitochondrial genome from the reference genome before aligning the reads. In human/mouse genome builds, the mitochondrial genome is labeled 'chrM'. That sequence can be deleted from the reference prior to building the genome index. The downside of this approach is that the alignment numbers will look much worse; all of the mitochondrial reads will count as unaligned.
Remove the mitochondrial reads after alignment. A python script, creatively named removeChrom, is available in the ATAC-seq module to accomplish this. For example, to remove all 'chrM' reads from a BAM file, we would run this:
module load samtools module load ATAC-seq samtools view -h <inBAM> | removeChrom - - chrM | samtools view -b - > <outBAM>
PCR duplicates are exact copies of DNA fragments that arise during PCR. Since they are artifacts of the library preparation procedure, they may interfere with the biological signal of interest. Therefore, they should be removed as part of the analysis pipeline.
One commonly used program for removing PCR duplicates is Picard's MarkDuplicates. It is run as follows:
module load picard java -jar $PICARD_TOOLS_HOME/picard.jar MarkDuplicates I=<inBAM> O=<outBAM> M=dups.txt REMOVE_DUPLICATES=true
The output file specified by
M= lists counts of alignments analyzed and duplicates identified.
Model-based Analysis of ChIP-Seq (MACS2) is a program for detecting regions of genomic enrichment. Though designed for ChIP-seq, it works just as well on ATAC-seq and other genome-wide enrichment assays that have narrow peaks. The main program in MACS2 is
callpeak, and its options are described below.
As input, MACS2 takes the alignment files produced in the previous steps. However, it is important to remember that the read alignments indicate only a portion of the DNA fragments generated by the ATAC. Therefore, we must consider how we want MACS2 to interpret the alignments.
Alignments to analyze
With paired-end sequencing, the types of alignments that are produced fall into two basic categories: "properly paired" and "singletons" (Fig. 3).
An important consideration when using MACS2 is deciding which types of alignments should be analyzed and how those alignments should be interpreted. The analysis mode is set by the
-f argument. Here are the options with MACS2:
Analyze only properly paired alignments, but ignore R2 reads and treat R1 reads as singletons. This is the default option (
-f AUTO). MACS2 creates a model of the fragment lengths and extends the 3' ends of the R1 reads to the calculated average length. An alternative is to skip this model building and instead extend each read to a specified length (e.g.,
--nomodel --extsize 300for 300bp fragments). The value of the length parameter is usually determined from the average size during library preparation. However, neither of these approaches utilizes the value of paired-end sequencing, which defines both fragment ends.
Analyze only properly paired alignments with
-f BAMPE. Here, the fragments are defined by the paired alignments' ends, and there is no modeling or artificial extension. Singleton alignments are ignored. This is the preferred option for using only properly paired alignments.
Analyze all alignments. For this approach, a python script, SAMtoBED, is available in the ATAC-seq module. This script converts the read alignments to BED intervals, treating the properly paired alignments as such and extending the singleton alignments as specified. There are four options for the singletons: ignore them, keep them as is, extend them to an arbitrary length (similar to the
--extsizeoption of MACS2), or extend them to the average length calculated from the properly paired alignments. Here is an example command, using the "extend to average length" option (
module load ATAC-seq samtools view -h <BAM> | SAMtoBED -i - -o <BED> -x -v
The output from SAMtoBED is a BED file that should be analyzed by MACS2 with
(Note that the BEDTools program bamtobed program cannot be used here, since its output is in a nonstandard BED format that MACS2 cannot analyze.)
In deciding among these analysis options, it may help to consider the counts produced by Bowtie2, which indicate how many alignments fall into each category. For example, if most of the reads were aligned in proper pairs, it may be sufficient to use option #2. On the other hand, option #3 is preferred if a substantial fraction of the reads are singletons.
In addition to the analysis mode explained above, MACS2 has a number of parameters and options to set. Here are a few to consider:
||Name of the sample. The output files will be named using the specified string as the prefix.|
||Effective genome size, i.e. the size of the organism’s genome that can be analyzed (not including Ns, repetitive sequences, etc.). This will be less than the actual genome size. Parameters are provided for some model organisms, and the default value is
||Minimum q-value (adjusted p-value, or false discovery rate [FDR]) for peak calling (default 0.05). Reducing this threshold will decrease the number of peaks identified by MACS2 but increase the confidence in the called peaks.|
||How to handle PCR duplicates (default:
Note that MACS2 is not multithreaded, so it runs on a single core only.
The full MACS2 command for an ATAC-seq dataset from C. elegans, using all alignments (converted to BED intervals), might look like this:
module load macs2 macs2 callpeak -t <BED> -f BEDPE -n NAME -g ce --keep-dup all
Calling peaks for 20 million fragments should require less than ten minutes and 1GB of memory.
There are three output files from a standard
macs2 callpeak run. For a run with
-n NAME, the output files are NAME_peaks.xls, NAME_peaks.narrowPeak, and NAME_summits.bed. The most useful file is NAME_peaks.narrowPeak, a plain-text BED file that lists the genomic coordinates of each peak called, along with various statistics (fold-change, p- and q-values, etc.).
Once the peaks have been identified by MACS2 for a set of samples, there are several follow-up steps that can be taken, depending on the experimental design.
Some researchers find it useful to generate visualizations of the peaks in a genomic context.
For ATAC-seq in model organisms, the peak file (NAME_peaks.narrowPeak) can be uploaded directly to the UCSC genome browser. Note that a peak file without a header line should have the following added to the beginning of the file:
An alternative visualization tool is the Integrative Genomics Viewer (IGV). Peak files can be loaded directly (File → Load from File). Viewing BAM files with IGV requires that they be sorted (by coordinate) and indexed using SAMtools. Note that the BAMs show the read alignments, but not the full fragment lengths as generated by the ATAC and analyzed by MACS2.
Comparing peak files
Determining genomic regions that are common or different to a set of peak files is best accomplished with BEDTools, a suite of software tools that enables "genome arithmetic."
For example, bedtools intersect determines regions that are common to two peak files, such as replicates of the same experimental group.
Finding differences between two peak files, such as control vs. experimental groups, is accomplished via bedtools subtract.
It is helpful to know what genomic features are near the peaks called by MACS2. One program that is commonly used to annotate peaks is ChIPseeker. Like MACS2, ChIPseeker was originally designed to be used in the analysis of ChIP-seq, but it works just as well with ATAC-seq.
ChIPseeker requires that the genome of interest be annotated with locations of genes and other features. The ChIPseeker user guide is extremely helpful in using this R/Bioconductor package.
Andrews S. (2010). FastQC: a quality control tool for high throughput sequence data. Available online at: http://www.bioinformatics.babraham.ac.uk/projects/fastqc
Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods. 2013 Dec;10(12):1213-8.
Buenrostro JD, Wu B, Chang HY, Greenleaf WJ. ATAC-seq: A Method for Assaying Chromatin Accessibility Genome-Wide. Curr Protoc Mol Biol. 2015 Jan 5;109:21.29.1-9.
Heinz S, Benner C, Spann N, Bertolino E, Lin YC, Laslo P, Cheng JX, Murre C, Singh H, Glass CK. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol Cell. 2010 May 28;38(4):576-89.
Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012 Mar 4;9(4):357-9.
Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9.
Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal. 2011;17:10-2.
Montefiori L, Hernandez L, Zhang Z, Gilad Y, Ober C, Crawford G, Nobrega M, Jo Sakabe N. Reducing mitochondrial reads in ATAC-seq using CRISPR/Cas9. Sci Rep. 2017 May 26;7(1):2451.
Quinlan AR. BEDTools: The Swiss-Army Tool for Genome Feature Analysis. Curr Protoc Bioinformatics. 2014 Sep 8;47:11.12.1-34.
Yu G, Wang LG, He QY. ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics. 2015 Jul 15;31(14):2382-3.