Introduction

MAKER [1,2] is a popular genome annotation pipeline for both prokaryotic and eukaryotic genomes. This guide describes best practices for running MAKER on the FASRC cluster to maximize performance. For additional MAKER background and examples, see this tutorial. Most tutorial examples can be run on a compute node in an interactive job by prefixing any MAKER commands with singularity exec --cleanenv ${MAKER_IMAGE}. For general MAKER support, see the maker-devel Google Group.

MAKER on the FASRC Cluster

MAKER is run on the FASRC cluster using a provided Singularity image. This image was created from the MAKER Biocontainers image (which was automatically generated from the corresponding Bioconda package), and bundles both GeneMark-ES and the RepBase RepeatMasker edition (including the default Dfam 3.2 library [3]).

Prerequisites

  1. Create the empty MAKER control files by running the following command from a FASRC login node (srun runs it in an interactive job on a compute node, as Singularity is not installed on the login nodes):

    srun -p test,serial_requeue,shared sh -c 'singularity exec --cleanenv /n/singularity_images/informatics/maker/maker:3.01.03-repbase.sif maker -CTL'
    

    This results in 4 files:

    • maker_opts.ctl (required: modify this file)
    • maker_exe.ctl (do not modify this file)
    • maker_evm.ctl (optionally modify this file)
    • maker_bopts.ctl (optionally modify this file)
  2. In maker_opts.ctl:

    • (Required) If not using RepeatMasker, change:

      model_org=all

      to

      model_org=

      If using RepeatMasker, change model_org=all to an appropriate family/genus/species (or other taxonomic rank). The famdb.py utility can be used to query the combined Dfam/Repbase repeat library:

      $ srun --pty -p test,serial_requeue,shared singularity shell --cleanenv /n/singularity_images/informatics/maker/maker:3.01.03-repbase.sif
      ...
      Singularity> /usr/local/share/RepeatMasker/famdb.py -i /usr/local/share/RepeatMasker/Libraries/RepeatMaskerLib.h5 names Heliconius
      Exact Matches
      =============
      33416 Heliconius (scientific name)
      
      Non-exact Matches
      =================
      33418 Heliconius ethilla (scientific name), Heliconius ethilla (Godart, 1819) (authority)
      33419 Heliconius numata (scientific name), Heliconius numata (Cramer, 1780) (authority)
      ...
      Singularity> /usr/local/share/RepeatMasker/famdb.py -i /usr/local/share/RepeatMasker/Libraries/RepeatMaskerLib.h5 lineage --ancestors --descendants 'Heliconius melpomene'
      ...
      ...
      └─33416 Heliconius [538]
        └─34740 Heliconius melpomene [75]
          ├─171916 Heliconius melpomene rosina [0]
          ├─171917 Heliconius melpomene melpomene [88]
      ...
      Singularity> exit
      

      In this case, to use the species "Heliconius melpomene", specify model_org=heliconius_melpomene (i.e., case-insensitive; replace spaces with underscores).

    • (Recommended) Change max_dna_len=100000 to max_dna_len=300000 to increase the length of the segments that the reference sequence is partitioned into for sequence alignment. This reduces the number of files created during MAKER execution, lessening the file metadata load on the FASRC scratch file system, which is one of the main constraints for MAKER scalability.

    • Other options may need to be adjusted (e.g., split_hit=10000 corresponds to the longest expected intron length). See this table [2] for a description of additional relevant MAKER options in maker_opts.ctl and maker_bopts.ctl.
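If you maintain many MAKER runs, the control-file edits above can be scripted rather than made by hand. A minimal sketch, assuming GNU sed; the printf line fabricates a throwaway stand-in for the real maker_opts.ctl generated by maker -CTL:

```shell
# Create a mock maker_opts.ctl (illustration only; use the real file from "maker -CTL").
printf 'model_org=all\nmax_dna_len=100000\nsplit_hit=10000\n' > maker_opts.ctl

# Select the Heliconius repeat library and larger partitions (GNU sed -i assumed):
sed -i -e 's/^model_org=.*/model_org=heliconius/' \
       -e 's/^max_dna_len=.*/max_dna_len=300000/' maker_opts.ctl

grep -E '^(model_org|max_dna_len)=' maker_opts.ctl
# model_org=heliconius
# max_dna_len=300000
```

The same pattern works for any key=value option in the control files.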

Example Job Script

There are two approaches to running MAKER on the FASRC cluster: (1) entirely within a container on a single compute node (more reliable, but slower) and (2) on multiple compute nodes (launched using MPI from outside of the container; susceptible to conflicts with the user environment). Either example job script must be submitted (via sbatch) from a directory on a file system that is mounted on all compute nodes (e.g., directories prefixed with /n/, such as /n/scratchlfs). The MAKER datastore directory will be created in the directory the job script is submitted from (named using the reference sequence file name prefix, and ending in *-output).
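A job script can guard against accidentally being submitted from a node-local directory by checking the submit-directory prefix before starting MAKER. A sketch; the check_shared_fs helper is hypothetical, and /n/ is the shared-file-system prefix described above:

```shell
# Hypothetical guard (not part of MAKER): abort early if the given directory
# is not under /n/, i.e., not on a file system mounted on all compute nodes.
check_shared_fs() {
  case "$1" in
    /n/*) echo "OK: $1 is on a cluster-wide file system" ;;
    *)    echo "ERROR: $1 is not under /n/; output would not be visible to all nodes" >&2
          return 1 ;;
  esac
}

# In a job script you would call: check_shared_fs "${SLURM_SUBMIT_DIR:-$PWD}"
check_shared_fs /n/scratchlfs/example_lab/maker_run
```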

Example Single-Compute-Node MAKER Job Script

The single-node approach is considered more robust (though less scalable), and is recommended if you have environment variables set in bash startup files (e.g., LD_LIBRARY_PATH or Perl-related environment variables) that may interfere with the operation of software in the MAKER container.

#!/bin/sh
# Customize --time and --partition as appropriate.
# --exclusive --mem=0 allocates all CPUs and memory on the node.
#SBATCH --partition=shared
#SBATCH --nodes=1
#SBATCH --mem=0
#SBATCH --exclusive
#SBATCH --time=0:30:00

MAKER_IMAGE=/n/singularity_images/informatics/maker/maker:3.01.03-repbase.sif

# Submit this job script from the directory with the MAKER control files

# RepeatMasker setup (if not using RepeatMasker, optionally comment-out these three lines)
export SINGULARITYENV_LIBDIR=${PWD}/LIBDIR
mkdir -p LIBDIR
singularity exec ${MAKER_IMAGE} sh -c 'ln -sf /usr/local/share/RepeatMasker/Libraries/* LIBDIR'

# singularity options:
# * --cleanenv : don't pass environment variables to container (except those specified in --env option-arguments)
# * --no-home : don't mount home directory (if not current working directory) to avoid any application/language startup files
# * --home /root : use /root as HOME (location of GeneMark license (.gm_key) in container image)
# Add any MAKER options after the "maker" command
# * -nolock reduces file creation overhead (lock files not needed when using MPI)
# * -nodatastore is suggested for Lustre, as it reduces the number of directories created
# * -fix_nucleotides needed for hsap_contig.fasta example data
singularity exec --no-home --home /root --cleanenv ${MAKER_IMAGE} mpiexec -n ${SLURM_JOB_CPUS_PER_NODE} maker -fix_nucleotides -nolock -nodatastore

Example Multi-Compute-Node MAKER Job Script

In the following job script, MAKER can scale across multiple nodes on the FASRC cluster by increasing the sbatch --ntasks value (the total number of processor cores, or "CPUs", to allocate across any number of compute nodes). Increasing --ntasks may cause the job to take longer to schedule and start. See FASRC Slurm Partitions for a description of limits on jobs submitted to available Slurm partitions.
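Because --mem-per-cpu applies to each task, the total memory a request implies scales with --ntasks; a quick shell-arithmetic sanity check, using the same values as this guide's multi-node example script:

```shell
# Total memory implied by a multi-node request: --ntasks tasks x --mem-per-cpu each.
ntasks=8
mem_per_cpu_gb=4
total_gb=$((ntasks * mem_per_cpu_gb))
echo "${ntasks} tasks x ${mem_per_cpu_gb}G/CPU = ${total_gb}G total"
# 8 tasks x 4G/CPU = 32G total
```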

#!/bin/sh
# Customize --time, --ntasks, and --partition as appropriate
#SBATCH --time=0:30:00
#SBATCH --ntasks=8
#SBATCH --mem-per-cpu=4g
#SBATCH --partition=shared

MAKER_IMAGE=/n/singularity_images/informatics/maker/maker:3.01.03-repbase.sif

# Submit this job script from the directory with the MAKER control files

# Remove any environment modules
module purge

# Use Intel MPI for the "mpiexec" command
module load intel/21.2.0-fasrc01 impi/2021.2.0-fasrc01

# RepeatMasker setup (if not using RepeatMasker, optionally comment-out these three lines)
mkdir -p LIBDIR
singularity exec ${MAKER_IMAGE} sh -c 'ln -sf /usr/local/share/RepeatMasker/Libraries/* LIBDIR'
export LIBDIR=$PWD/LIBDIR

# singularity options:
# * --no-home : don't mount home directory (if not current working directory) to avoid any application/language startup files
# * --home /root : use /root as HOME (location of GeneMark license (.gm_key) in container image)
# Add any MAKER options after the "maker" command
# * the -mpi option is needed to use the host MPI for MAKER in a Singularity container
# * -nolock reduces file creation overhead (lock files not needed when using MPI)
# * -nodatastore is suggested for Lustre, as it reduces the number of directories created
# * -fix_nucleotides needed for hsap_contig.fasta example data
mpiexec -n ${SLURM_NTASKS} singularity exec --no-home --home /root ${MAKER_IMAGE} maker -mpi -fix_nucleotides -nolock -nodatastore
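After the job completes, per-contig progress can be checked in the MAKER master datastore index log (named <prefix>_master_datastore_index.log inside the *-output directory), which records STARTED/FINISHED/FAILED entries per contig. A sketch; the heredoc content is a fabricated sample for illustration only:

```shell
# Fabricated sample of a master datastore index log (illustration only).
cat > demo_master_datastore_index.log <<'EOF'
Chr1	Chr1/	STARTED
Chr1	Chr1/	FINISHED
Chr4	Chr4/	STARTED
Chr4	Chr4/	FAILED
EOF

# Count per-contig statuses as a quick completeness check.
echo "finished contigs: $(grep -c FINISHED demo_master_datastore_index.log)"
echo "failed contigs:   $(grep -c FAILED demo_master_datastore_index.log)"
```

Any FAILED entries warrant a look at the corresponding Slurm output file (see Troubleshooting below).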

Troubleshooting

Warnings during execution

MAKER will emit the following warnings during execution; they can be ignored:

Possible precedence issue with control flow operator at /usr/local/lib/site_perl/5.26.2/Bio/DB/IndexedBase.pm line 805.

df: Warning: cannot read table of mounted file systems: No such file or directory

Memory allocation errors

If messages like the following appear in the Slurm job output file (slurm-<jobid>.out by default):

open3: fork failed: Cannot allocate memory at /usr/local/bin/../lib/Widget/blastx.pm line 40.
--> rank=17, hostname=holy2a09203.rc.fas.harvard.edu
ERROR: Failed while doing blastx of proteins
ERROR: Chunk failed at level:8, tier_type:3
FAILED CONTIG:Chr4

MAKER has been observed to continue execution after such errors without making progress. It is recommended to cancel the job (scancel <jobid>) and increase the amount of memory per process before resubmitting: for the single-node job script, decrease the number of processes (e.g., ...mpiexec -n $((SLURM_JOB_CPUS_PER_NODE*3/4))...); for the multi-node job script, increase the #SBATCH --mem-per-cpu= option-argument.
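This failure pattern can be checked for automatically before deciding whether to cancel. A sketch; slurm-demo.out is a fabricated stand-in for the real slurm-<jobid>.out, seeded with the message quoted above:

```shell
# Fabricated sample Slurm output file (stands in for the real slurm-<jobid>.out).
cat > slurm-demo.out <<'EOF'
open3: fork failed: Cannot allocate memory at /usr/local/bin/../lib/Widget/blastx.pm line 40.
ERROR: Failed while doing blastx of proteins
EOF

# Scan for the memory-allocation failure pattern.
if grep -q 'Cannot allocate memory' slurm-demo.out; then
  echo "memory allocation failures detected: cancel the job and rerun with more memory per process"
fi
```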


Visualizing in JBrowse

JBrowse can be used to visualize MAKER-generated annotation, RNA/protein evidence sequence alignments, and RepeatMasker-masked regions in the context of the reference genome.

JBrowse 1 provides a script, maker2jbrowse, that automatically exports a MAKER datastore to a JBrowse 1 data directory that can be directly visualized in JBrowse 1. However, this script executes very slowly on a parallel file system (e.g., the FASRC scratchlfs and holylfs file systems), and the resulting JBrowse data directory is unsuitable for visualization when located on a parallel file system due to the large number of small files it contains. An in-house customization of this script (ifxmaker2jbrowse) has been developed and tuned for parallel file systems. ifxmaker2jbrowse runs well over an order of magnitude faster than maker2jbrowse, and the resulting JBrowse data directory contains tens of files in standard formats usable by other tools (e.g., bgzip-compressed and tabix-indexed GFF3, and bgzip-compressed and samtools-faidx-indexed FASTA) instead of tens or hundreds of thousands of JBrowse-specific files.

ifxmaker2jbrowse

A Singularity image containing the ifxmaker2jbrowse script and all of its dependencies is provided on the FASRC cluster. The following example job script (submitted from the MAKER datastore directory) demonstrates its use:

#!/bin/sh
# Customize --time and --partition as appropriate
#SBATCH --time=0:60:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=32G
#SBATCH --partition=shared

# Set to the pathname of the reference FASTA file specified for the maker_opts.ctl "genome=" option.
REFERENCE_FASTA=../MY_REF.fa

# options to bgzip and sort (assuming GNU coreutils sort); these are used for optimizing performance and disk usage
export SINGULARITYENV_MAKER2JBROWSE_BGZIP_OPTIONS="--threads=${SLURM_CPUS_PER_TASK}"
export SINGULARITYENV_MAKER2JBROWSE_SORT_OPTIONS="--parallel=${SLURM_CPUS_PER_TASK} --stable --buffer-size=1G --compress-program=gzip"

# If you would like to omit the creation of a compressed/indexed reference FASTA file, and store just the
# reference sequence lengths for use in JBrowse, add the `--noseq` option to the following command:
singularity run --cleanenv /n/singularity_images/informatics/maker/ifxmaker2jbrowse:20210108.sif --bgzip_fasta=${REFERENCE_FASTA} --no_names_index --ds_index *_master_datastore_index.log

The recommended ifxmaker2jbrowse --no_names_index option disables the creation of a searchable index of all feature names in JBrowse. If name-based indexing is desired for select tracks, this can subsequently be done more efficiently (resulting in fewer files) using the following options to the JBrowse generate-names.pl script (e.g., for the protein2genome and est2genome tracks):

# execute from the JBrowse data/ directory
singularity exec --cleanenv /n/singularity_images/informatics/maker/ifxmaker2jbrowse:20210108.sif generate-names.pl --tracks protein2genome,est2genome --hashBits 4 --compress --out .

Note that this protocol generates a JBrowse 1 compatible tracks.conf. For guidance on using the jbrowse CLI to generate a JBrowse 2 compatible config.json, see the JBrowse on the FASRC Cluster guide.

Running JBrowse on the FASRC Cluster using Open OnDemand

A JBrowse instance can be launched on the FASRC cluster using Open OnDemand (https://vdi.rc.fas.harvard.edu/). From the menu, select Interactive Apps > JBrowse. Choose JBrowse Version "JBrowse 1 (v1.x.x)" (unless a JBrowse 2 config.json has been generated). In the "path of a JBrowse data directory" textbox, enter the absolute path to the JBrowse data/ directory created by the ifxmaker2jbrowse script (in the MAKER datastore directory), then click "Launch".

References


  1. Holt C, Yandell M. MAKER2: an annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics. 2011 Dec 22;12:491. doi: 10.1186/1471-2105-12-491. PMID: 22192575; PMCID: PMC3280279

  2. Campbell MS, Holt C, Moore B, Yandell M. Genome Annotation and Curation Using MAKER and MAKER-P. Curr Protoc Bioinformatics. 2014 Dec 12;48:4.11.1-39. doi: 10.1002/0471250953.bi0411s48. PMID: 25501943; PMCID: PMC4286374

  3. Hubley R, Finn RD, Clements J, Eddy SR, Jones TA, Bao W, Smit AF, Wheeler TJ. The Dfam database of repetitive DNA families. Nucleic Acids Res. 2016 Jan 4;44(D1):D81-9. doi: 10.1093/nar/gkv1272. Epub 2015 Nov 26. PMID: 26612867; PMCID: PMC4702899