Note: See https://github.com/harvardinformatics/GenomeAnnotation for an updated comparison of software tools and approaches for genome annotation.
Introduction
MAKER [1,2] is a popular genome annotation pipeline for both prokaryotic and eukaryotic genomes.
This guide describes best practices for running MAKER on the FASRC cluster to maximize performance.
For additional MAKER background and examples, see this tutorial.
Most tutorial examples can be run on a compute node in an interactive job by prefixing any MAKER commands with singularity exec --cleanenv ${MAKER_IMAGE}.
For general MAKER support, see the maker-devel Google Group.
MAKER on the FASRC Cluster
MAKER is run on the FASRC cluster using a provided Singularity image. This image was created from the MAKER Biocontainers image (which was automatically generated from the corresponding Bioconda package), and bundles both GeneMark-ES and the RepBase RepeatMasker edition (including the default Dfam 3.2 [3] library).
Prerequisites
- Create the empty MAKER control files by running the following interactive job from a FASRC login node (as Singularity is not installed on the FASRC login nodes):
srun -p test,serial_requeue,shared sh -c 'singularity exec --cleanenv /n/singularity_images/informatics/maker/maker:3.01.03-repbase.sif maker -CTL'
This results in 4 files:
- maker_opts.ctl (required: modify this file)
- maker_exe.ctl (do not modify this file)
- maker_evm.ctl (optionally modify this file)
- maker_bopts.ctl (optionally modify this file)
- In maker_opts.ctl:
- (Required) If not using RepeatMasker, change:
model_org=all
to
model_org=
If using RepeatMasker, change
model_org=all
to an appropriate family/genus/species (or other taxonomic rank). The famdb.py utility can be used to query the combined Dfam/Repbase repeat library:
$ srun --pty -p test,serial_requeue,shared singularity shell --cleanenv /n/singularity_images/informatics/maker/maker:3.01.03-repbase.sif
...
Singularity> /usr/local/share/RepeatMasker/famdb.py -i /usr/local/share/RepeatMasker/Libraries/RepeatMaskerLib.h5 names Heliconius
Exact Matches
=============
33416 Heliconius (scientific name)

Non-exact Matches
=================
33418 Heliconius ethilla (scientific name), Heliconius ethilla (Godart, 1819) (authority)
33419 Heliconius numata (scientific name), Heliconius numata (Cramer, 1780) (authority)
...
Singularity> /usr/local/share/RepeatMasker/famdb.py -i /usr/local/share/RepeatMasker/Libraries/RepeatMaskerLib.h5 lineage --ancestors --descendants 'Heliconius melpomene'
...
...
└─33416 Heliconius [538]
  └─34740 Heliconius melpomene [75]
    ├─171916 Heliconius melpomene rosina [0]
    ├─171917 Heliconius melpomene melpomene [88]
    ...
Singularity> exit
In this case, to use the species "Heliconius melpomene", specify
model_org=heliconius_melpomene
(i.e., case insensitive, with spaces replaced by underscores).
- (Recommended) Change
max_dna_len=100000
to
max_dna_len=300000
to increase the length of the segments that the reference sequence is partitioned into for sequence alignment. This reduces the number of files created during MAKER execution, lessening the file metadata load on the FASRC scratch file system, which is one of the main constraints on MAKER scalability.
- Other options may need to be adjusted (e.g.,
split_hit=10000
corresponds to the longest expected intron length). See this table [2] for a description of additional relevant MAKER options in maker_opts.ctl and maker_bopts.ctl.
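The control-file edits described above can also be scripted. The following is a minimal sketch, assuming each option appears on its own line in the form option=value #comment (as in the files generated by maker -CTL). The helper names to_model_org and set_ctl_option are hypothetical and not part of MAKER, and values containing | or & would need escaping before being passed to sed.

```shell
#!/bin/sh
# Hypothetical helpers (not part of MAKER) for editing maker_opts.ctl.

# Convert a species name to the model_org form:
# lower case, with spaces replaced by underscores.
to_model_org() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr ' ' '_'
}

# Set OPTION=VALUE in a control file, keeping any trailing "#comment".
# Usage: set_ctl_option FILE OPTION VALUE
set_ctl_option() {
    ctl_file=$1
    key=$2
    value=$3
    tmp_file=$(mktemp) || return 1
    # Replace everything between "key=" and the first "#" (or end of line)
    if sed "s|^${key}=[^#]*|${key}=${value} |" "$ctl_file" > "$tmp_file"; then
        cat "$tmp_file" > "$ctl_file"
    fi
    rm -f "$tmp_file"
}

# Example: apply the settings discussed above (only if the file exists)
if [ -f maker_opts.ctl ]; then
    set_ctl_option maker_opts.ctl model_org "$(to_model_org 'Heliconius melpomene')"
    set_ctl_option maker_opts.ctl max_dna_len 300000
    set_ctl_option maker_opts.ctl split_hit 10000
fi
```

Reviewing the result (e.g., with grep -E '^(model_org|max_dna_len|split_hit)=' maker_opts.ctl) before submitting a job is a cheap way to confirm the substitutions took effect.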
Example Job Script
In the following job script, MAKER can scale across multiple nodes in the FASRC cluster by increasing the sbatch --ntasks value (which indicates the total number of processor cores, or "CPUs", to allocate across any number of compute nodes).
Increasing --ntasks may cause the job to take longer to schedule and start.
See FAS RC Slurm Partitions for a description of limits on jobs submitted to available Slurm partitions.
This example job script must be submitted (via sbatch) from a directory on a file system that is mounted on all compute nodes (e.g., directories prefixed with /n/, such as /n/scratchlfs). The MAKER datastore directory will be created in the directory this job script is submitted from (named using the reference sequence file name prefix, and ending in .maker.output).
#!/bin/sh
# Customize --time, --ntasks, and --partition as appropriate
#SBATCH --time=0:30:00
#SBATCH --ntasks=8
#SBATCH --mem-per-cpu=4g
#SBATCH --partition=shared
MAKER_IMAGE=/n/singularity_images/informatics/maker/maker:3.01.03-repbase.sif
# Submit this job script from the directory with the MAKER control files
# Remove any environment modules
module purge
# RepeatMasker setup (if not using RepeatMasker, optionally comment-out these three lines)
mkdir -p LIBDIR
singularity exec ${MAKER_IMAGE} sh -c 'ln -sf /usr/local/share/RepeatMasker/Libraries/* LIBDIR'
export LIBDIR=$PWD/LIBDIR
# singularity options:
# * --no-home : don't mount home directory (if not current working directory) to avoid any application/language startup files
# * --home /root : use /root as HOME (location of GeneMark license (.gm_key) in container image)
# Add any MAKER options after the "maker" command
# * the -mpi option is needed to use the host MPI for MAKER in a Singularity container
# * -nolock reduces file creation overhead (lock files not needed when using MPI)
# * -nodatastore is suggested for Lustre, as it reduces the number of directories created
# * -fix_nucleotides needed for hsap_contig.fasta example data
srun --mpi=pmi2 singularity exec --no-home --home /root ${MAKER_IMAGE} maker -mpi -fix_nucleotides -nolock -nodatastore
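Because the job script above expects to find the MAKER control files in the submission directory, a quick check before calling sbatch can avoid a wasted queue wait. The following is a minimal sketch, assuming all four control files generated by maker -CTL should be present and that the genome= option in maker_opts.ctl must be non-empty. The function name check_maker_ctl is hypothetical and not part of MAKER or Slurm.

```shell
#!/bin/sh
# Hypothetical pre-submission check (not part of MAKER or Slurm).
# Usage: check_maker_ctl [DIRECTORY]   (defaults to the current directory)
check_maker_ctl() {
    dir=${1:-.}
    rc=0
    for ctl in maker_opts.ctl maker_bopts.ctl maker_exe.ctl maker_evm.ctl; do
        if [ ! -f "$dir/$ctl" ]; then
            echo "missing control file: $dir/$ctl" >&2
            rc=1
        fi
    done
    # genome= should name the reference FASTA; "genome=" followed by
    # whitespace or a comment means it was never set
    if [ -f "$dir/maker_opts.ctl" ] &&
        ! grep -Eq '^genome=[^#[:space:]]' "$dir/maker_opts.ctl"; then
        echo "genome= is not set in $dir/maker_opts.ctl" >&2
        rc=1
    fi
    return "$rc"
}
```

The function returns a non-zero status if anything is missing, so it can gate submission, e.g., check_maker_ctl && sbatch followed by the name of the job script.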
Troubleshooting
Warnings during execution
MAKER will emit the following warnings during execution; they can be ignored:
Possible precedence issue with control flow operator at /usr/local/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.
df: Warning: cannot read table of mounted file systems: No such file or directory
Memory allocation errors
If messages like the following appear in the Slurm job output file (slurm-<jobid>.out by default):
open3: fork failed: Cannot allocate memory at /usr/local/bin/../lib/Widget/blastx.pm line 40.
--> rank=17, hostname=holy2a09203.rc.fas.harvard.edu
ERROR: Failed while doing blastx of proteins
ERROR: Chunk failed at level:8, tier_type:3
FAILED CONTIG:Chr4
MAKER has been observed to continue execution, but not make progress.
It is recommended to cancel the job (scancel <jobid>) and increase the amount of memory per process (i.e., the #SBATCH --mem-per-cpu= option-argument) before resubmitting.
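One way to judge whether a running job is still making progress is to tally contig states in the master datastore index log and to count memory errors in the Slurm output. The following is a minimal sketch, assuming the index log consists of tab-separated lines of the form sequence name, datastore path, status (e.g., STARTED, FINISHED, FAILED). The function names are hypothetical and not part of MAKER.

```shell
#!/bin/sh
# Hypothetical helpers (not part of MAKER).

# Count log entries per status in a MAKER master datastore index log.
# Assumes tab-separated lines: <sequence name> <datastore path> <status>
datastore_status_summary() {
    awk -F '\t' 'NF >= 3 { count[$3]++ }
                 END { for (s in count) print s, count[s] }' "$1" | sort
}

# Count lines containing "Cannot allocate memory" in a Slurm job output file.
# Note: grep -c prints 0 (and exits non-zero) when there are no matches.
count_memory_errors() {
    grep -c 'Cannot allocate memory' "$1"
}
```

If the summary is unchanged between two invocations several minutes apart while the memory error count is non-zero, that is consistent with the stalled state described above.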
Visualizing in JBrowse
JBrowse can be used to visualize MAKER-generated annotation, RNA/protein evidence sequence alignments, and RepeatMasker-masked regions in the context of the reference genome.
JBrowse 1 provides a script, maker2jbrowse, that automatically exports a MAKER datastore to a JBrowse 1 data directory that can be directly visualized in JBrowse 1.
However, this script executes very slowly on a parallel file system (e.g., FASRC scratchlfs and holylfs file systems), and the resulting JBrowse data directory is completely unsuitable for visualization when located on a parallel file system due to a large number of small files created.
An in-house customization of this script (ifxmaker2jbrowse) has been developed and tuned for parallel file systems.
ifxmaker2jbrowse runs well over an order of magnitude faster than maker2jbrowse, and the resulting JBrowse data directory contains tens of files in standard formats usable by other tools (e.g., bgzip-compressed and tabix-indexed GFF3, and bgzip-compressed and samtools-faidx-indexed FASTA) instead of tens or hundreds of thousands of JBrowse-specific files.
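To see how many files a JBrowse data directory (or a MAKER datastore) actually contains, a simple count is usually enough. The following is a minimal sketch; the function name count_files is hypothetical. Note that find must itself visit every directory entry, so running it on a very large directory tree on a parallel file system can be slow.

```shell
#!/bin/sh
# Hypothetical helper: count regular files beneath a directory
# (defaults to the current directory). File names containing
# newlines would be over-counted.
count_files() {
    find "${1:-.}" -type f | wc -l | tr -d ' '
}
```

For example, count_files data, run from the MAKER datastore directory, reports the size of the JBrowse data/ directory.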
ifxmaker2jbrowse
A Singularity image containing the ifxmaker2jbrowse script and all dependencies is provided on the FASRC cluster. The following example job script (submitted from the MAKER datastore directory) demonstrates its use:
#!/bin/sh
# Customize --time and --partition as appropriate
#SBATCH --time=0:60:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --partition=shared
# set to the pathname of reference FASTA file specified for the maker_opts.ctl "genome=" option.
REFERENCE_FASTA=../MY_REF.fa
# options to bgzip and sort (assuming GNU coreutils sort); these are used for optimizing performance and disk usage
export SINGULARITYENV_MAKER2JBROWSE_BGZIP_OPTIONS="--threads=${SLURM_CPUS_PER_TASK}"
export SINGULARITYENV_MAKER2JBROWSE_SORT_OPTIONS="--parallel=${SLURM_CPUS_PER_TASK} --stable --buffer-size=1G --compress-program=gzip"
# if you would like to omit the creation of a compressed/indexed reference FASTA file, and store just the
# reference sequence lengths for use in JBrowse, add the `--noseq` option to the following command:
singularity run --cleanenv /n/singularity_images/informatics/maker/ifxmaker2jbrowse:20210108.sif --bgzip_fasta=${REFERENCE_FASTA} --no_names_index --ds_index *_master_datastore_index.log
The recommended ifxmaker2jbrowse --no_names_index option disables the creation of a searchable index of all feature names in JBrowse.
If name-based indexing is desired for select tracks, this can subsequently be done more efficiently (resulting in fewer files) using the following options to the JBrowse generate-names.pl script (e.g., for the protein2genome and est2genome tracks):
# execute from the JBrowse data/ directory
singularity exec --cleanenv /n/singularity_images/informatics/maker/ifxmaker2jbrowse:20210108.sif generate-names.pl --tracks protein2genome,est2genome --hashBits 4 --compress --out .
Note that this protocol generates a JBrowse 1 compatible tracks.conf.
For guidance on using the jbrowse CLI to generate a JBrowse 2 compatible config.json, see the JBrowse on the FASRC Cluster guide.
Running JBrowse on the FASRC Cluster using Open OnDemand
A JBrowse instance can be launched on the FASRC cluster using Open OnDemand (https://vdi.rc.fas.harvard.edu/).
From the menu, select Interactive Apps > JBrowse.
For "JBrowse Version", choose "JBrowse 1 (v1.x.x)" (unless a JBrowse 2 config.json has been generated).
In the "path of a JBrowse data directory" textbox, enter the absolute path to the JBrowse data/ directory that was created by the ifxmaker2jbrowse script (in the MAKER datastore directory), then click "Launch".
References
1. Holt C, Yandell M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics. 2011 Dec 22;12:491. doi: 10.1186/1471-2105-12-491. PMID: 22192575; PMCID: PMC3280279.
2. Campbell MS, Holt C, Moore B, Yandell M. Genome Annotation and Curation Using MAKER and MAKER-P. Curr Protoc Bioinformatics. 2014 Dec 12;48:4.11.1-39. doi: 10.1002/0471250953.bi0411s48. PMID: 25501943; PMCID: PMC4286374.
3. Hubley R, Finn RD, Clements J, Eddy SR, Jones TA, Bao W, Smit AF, Wheeler TJ. The Dfam database of repetitive DNA families. Nucleic Acids Res. 2016 Jan 4;44(D1):D81-9. doi: 10.1093/nar/gkv1272. Epub 2015 Nov 26. PMID: 26612867; PMCID: PMC4702899.