Whole genome alignment with Cactus
Comparative genomics requires alignments between sequences from different populations or species. While alignment of small chunks of sequence (e.g. genes) between many species is relatively straightforward, whole genome alignment has been challenging. The Cactus genome alignment software and its associated tools have made this task feasible for up to hundreds of genomes. However, Cactus can still be technically difficult to run. Here we have developed a Snakemake pipeline to facilitate running Cactus on a computing cluster. This is done by first running the cactus-prepare command to generate the Cactus labeled phylogeny, which is used to guide the submission of jobs to the cluster. For more details on how Snakemake breaks up Cactus's steps, see the rulegraph description below.
The cactus-snakemake pipeline's rulegraph
Here is the rulegraph for the pipeline. It works in rounds based on the shape of the input phylogeny (hence the cycle). First, genomes at the tips are masked and then all internal nodes are aligned.

This pipeline is suitable for aligning genomes from different species
For pangenome inference and whole genome alignment between samples of the same species, see our Cactus-minigraph tutorial
Getting started
You will need several things to be able to run this pipeline:
- A computing cluster that uses SLURM, though it should be possible to extend it to any job scheduler that Snakemake supports.
- conda or mamba to install software. See Installing command line software if you don't have conda/mamba installed.
- Snakemake and the Snakemake SLURM plugin
- Singularity - The pipeline itself will automatically download the latest version of the Cactus singularity container for you.
- The Harvard Informatics cactus-snakemake pipeline
Below we walk you through our recommended way to get this all set up.
Creating an environment for the cactus pipeline
We recommend creating a conda environment to install software:
mamba create -n cactus-env
mamba activate cactus-env
mamba install bioconda::snakemake-minimal
mamba install bioconda::snakemake-executor-plugin-slurm # For SLURM clusters
Some clusters (such as the Harvard cluster) already have Singularity installed. You should check by running the command:
singularity --help
If the help menu displays, you already have Singularity installed. If not, you will need to install it yourself into your cactus-env environment.
Downloading the cactus-snakemake pipeline
The pipeline is currently available on GitHub. You can install it on the Harvard cluster or any computer that has git installed by navigating to the directory in which you want to download it and doing one of the following:
Using git with HTTPS
git clone https://github.com/harvardinformatics/cactus-snakemake.git
Using git with SSH
Alternatively, if you have SSH set up on GitHub, you would type:
git clone git@github.com:harvardinformatics/cactus-snakemake.git
Using wget (without git)
If you don't have or don't wish to use git, you can directly download a ZIP archive of the repository:
wget https://github.com/harvardinformatics/cactus-snakemake/archive/refs/heads/main.zip
unzip main.zip
With that, you should be ready to set up your data for the pipeline!
Inputs you need to prepare
To run this pipeline, you will need:
- A rooted phylogenetic tree of all species to align, with or without branch lengths, in Newick format.
- The softmasked genome FASTA files for each species.
- A reference genome to project the alignment to MAF format.
You will use these to create the input file for Cactus.
Preparing the Cactus input file
The various Cactus commands depend on a single input file with information about the genomes to align. This file is a simple tab delimited file.
The first line of the file contains the rooted input species tree in Newick format and nothing else (be sure to remember the semi-colon at the end of the Newick tree!).
Each subsequent line contains in the first column one tip label from the tree and in the second column the path to the genome FASTA file for that species.
The FASTA files must be softmasked!
The genomes you provide in FASTA format must be softmasked before running Cactus, otherwise the program will likely not complete. You can tell if a genome FASTA file is masked by the presence of lower-case nucleotides: a, t, c, or g. If your FASTA file has these lower-case characters, it has likely been softmasked. If not, you will have to mask the genome with a tool like RepeatMasker. Just as importantly, the FASTA files should not be hard masked, meaning the masked bases are replaced with Ns.
For example, if one were running this pipeline on 5 species named A, B, C, D, and E, the input file may look something like this:
For more information about the Cactus input file, see their official documentation. There is also an example input file for a small test dataset here or at tests/evolverMammals/evolverMammals.txt. For more info, see section: Test dataset.
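To illustrate the input file layout, here is a minimal Python sketch that builds a hypothetical input for five species A to E and checks it against the rules described in this section: a Newick tree ending in a semicolon on the first line, then one tab-separated `label<TAB>path` line per tip. The tree topology, the FASTA paths, and the function names `tip_labels` and `check_cactus_input` are illustrative assumptions, not part of the pipeline. The tip-label extraction is a simple regular expression that does not handle quoted Newick labels.

```python
import re


def tip_labels(newick):
    """Return tip labels: names that directly follow '(' or ','."""
    return re.findall(r"[(,]\s*([^():,;\s]+)", newick)


def check_cactus_input(text):
    """Return a list of problems found in Cactus input file text (empty if none)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return ["file is empty"]
    problems = []
    tree = lines[0].strip()
    if not tree.endswith(";"):
        problems.append("first line (Newick tree) does not end with ';'")
    tips = set(tip_labels(tree))
    seen = set()
    for ln in lines[1:]:
        fields = ln.split("\t")
        if len(fields) != 2:
            problems.append(f"expected 2 tab-separated columns: {ln!r}")
            continue
        label = fields[0].strip()
        seen.add(label)
        if label not in tips:
            problems.append(f"label {label!r} is not a tip in the tree")
    # Every tip in the tree needs a genome line of its own.
    problems.extend(f"tip {t!r} has no genome line" for t in sorted(tips - seen))
    return problems


# Hypothetical example: five species A-E with placeholder FASTA paths.
example = "((A,B),((C,D),E));\n" + "".join(
    f"{sp}\t/path/to/genomes/{sp}.fa\n" for sp in "ABCDE"
)
print(example)
print(check_cactus_input(example))
```

To check a real file, pass its contents instead, for example `check_cactus_input(open("your-input-file.txt").read())`. An empty list means none of these basic checks failed; it does not guarantee that Cactus will accept the file.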
Reference sample
In order to run the last step of the workflow that converts the HAL format to a readable MAF format (see pipeline outputs for more info), you will need to select one assembly as a reference assembly. The reference assembly's coordinate system will be used for projection to MAF format. You should indicate the reference assembly in the Snakemake config file (outlined below). For instance, if I wanted my reference sample in the above file to be C, I would put the string C in the maf_reference: line of the Snakemake config file.
Preparing the Snakemake config file
Be sure to start with the example config file as a template!
The config for the Cactus test data can be found here or at tests/evolverMammals/evolverMammals-cfg.yaml in your downloaded cactus-snakemake repo. Be sure to use this as the template for your project since it has all the options needed! Note: the partitions set in this config file are specific to the Harvard cluster. Be sure to update them if you are running this pipeline elsewhere.
Additionally, a blank template file is located here or at config-template.yaml in your downloaded cactus-snakemake repo.
Besides the sequence input, the pipeline needs some extra configuration to know where to look for files and write output. That is done in the Snakemake configuration file for a given run. It contains 2 sections, one for specifying the input and output options, and one for specifying resources for the various rules (see below). The first part should look something like this:
cactus_path: <path/to/cactus-singularity-image OR download>
input_file: <path/to/cactus-input-file>
output_dir: <path/to/desired/output-directory>
overwrite_output_dir: <True/False>
final_prefix: <desired name of final .hal and .maf files with all genomes appended>
maf_reference: <Genome ID from input_file>
tmp_dir: <path/to/tmp-dir/>
use_gpu: <True/False>
Simply replace the string surrounded by <> with the path or option desired. Below is a table summarizing these options:
Option | Description |
---|---|
cactus_path | Path to the Cactus Singularity image. If blank or 'download', the image of the latest Cactus version will be downloaded and used. |
input_file | Path to the input file containing the species tree and genome paths (described above). |
output_dir | Directory where all the output will be written. |
overwrite_output_dir | Whether to overwrite the output directory if it already exists (True/False). |
final_prefix | The name of the final .hal and .maf files with all aligned genomes appended. The final files will be <final_prefix>.hal and <final_prefix>.maf. These files will be placed within output_dir. |
tmp_dir | A temporary directory for Snakemake and Cactus to use. Should have lots of space. |
use_gpu | Whether to use the GPU version of Cactus for the alignment (True/False). |
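Before launching anything, it can help to confirm that a config file defines every option from the first section. The Python sketch below does this with a deliberately simple `key: value` line reader, so it needs no YAML library. It is not a full YAML parser, and the function names, the example values, and the strict True/False check are assumptions for illustration only.

```python
REQUIRED_KEYS = [
    "cactus_path", "input_file", "output_dir", "overwrite_output_dir",
    "final_prefix", "maf_reference", "tmp_dir", "use_gpu",
]


def read_top_level_keys(text):
    """Parse unindented 'key: value' lines (a sketch, not a full YAML parser)."""
    config = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].rstrip()  # drop comments and trailing space
        if not line or line[0].isspace() or ":" not in line:
            continue
        key, value = line.split(":", 1)
        config[key.strip()] = value.strip().strip("'\"")
    return config


def check_config(text):
    """Return a list of problems with the input/output options (empty if none)."""
    config = read_top_level_keys(text)
    problems = [f"missing option: {k}" for k in REQUIRED_KEYS if k not in config]
    for key in ("overwrite_output_dir", "use_gpu"):
        if key in config and config[key] not in ("True", "False"):
            problems.append(f"{key} should be True or False, got {config[key]!r}")
    return problems


# Hypothetical config values for illustration.
example_config = """\
cactus_path: download
input_file: /path/to/cactus-input.txt
output_dir: /path/to/output/
overwrite_output_dir: False
final_prefix: my-alignment
maf_reference: C
tmp_dir: /path/to/tmp/
use_gpu: False
"""
print(check_config(example_config))
```

A blank `cactus_path:` line still counts as present here, which matches the option table: a blank value means the latest image is downloaded.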
Specifying resources for each rule
Below these options in the config file are further options for specifying resource usage for each rule that the pipeline will run. For example:
preprocess_partition: "gpu_test"
preprocess_gpu: 1
preprocess_cpu: 8
preprocess_mem: 25000
preprocess_time: 30
Notes on resource allocation
- Be sure to use partition names appropriate for your cluster. Several examples in this tutorial have partition names that are specific to the Harvard cluster, so be sure to change them.
- Allocate the proper partitions based on use_gpu. If you want to use the GPU version of Cactus (i.e. you have set use_gpu: True in the config file), the partition for the rules preprocess, blast, and align must be GPU enabled. If not, the pipeline will fail to run.
- The gpu: options will be ignored if use_gpu: False is set.
- mem is in MB and time is in minutes.
You will have to determine the proper resource usage for your dataset. Generally, the larger the genomes, the more time and memory each job will need, and the more you will benefit from providing more CPUs and GPUs.
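Because the config expects memory in MB and time in minutes, values written as "450g" or "12h" need converting first. The Python sketch below shows one way to do that and to print a resource block for a rule. It assumes 1 GB = 1024 MB and that every rule uses the same `<rule>_partition`, `<rule>_gpu`, `<rule>_cpu`, `<rule>_mem`, `<rule>_time` key pattern shown above for preprocess; check the template config to confirm the exact key names for the other rules.

```python
import re

MINUTES_PER_UNIT = {"m": 1, "h": 60, "d": 1440}


def mem_to_mb(value):
    """Convert '450g' or '25000m' to MB (assumes 1 GB = 1024 MB)."""
    match = re.fullmatch(r"(\d+)([mg])", value.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized memory value: {value!r}")
    amount, unit = int(match.group(1)), match.group(2)
    return amount * 1024 if unit == "g" else amount


def time_to_minutes(value):
    """Convert '30m', '12h', or '2d' to minutes."""
    match = re.fullmatch(r"(\d+)([mhd])", value.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized time value: {value!r}")
    return int(match.group(1)) * MINUTES_PER_UNIT[match.group(2)]


def rule_resources(rule, partition, mem, cpus, gpus, time):
    """Build config keys for one rule (key pattern assumed from preprocess)."""
    return {
        f"{rule}_partition": partition,
        f"{rule}_gpu": gpus,
        f"{rule}_cpu": cpus,
        f"{rule}_mem": mem_to_mb(mem),
        f"{rule}_time": time_to_minutes(time),
    }


# Hypothetical values for the align rule on a GPU partition named "gpu".
for key, value in rule_resources("align", "gpu", "450g", 64, 2, "12h").items():
    print(f"{key}: {value}")
```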
The example below gives a sense of how many resources you will need to allot, adjusting for the genome size of your organisms.
Example: Resource usage on a dataset of 22 turtle genomes
We have run the pipeline on 22 turtle genomes. The average genome size is 2210 Mb (2.2 Gb):

We allocated the following resources for the Cactus rules:
Step | Partition | Memory | CPUs | GPUs | Time |
---|---|---|---|---|---|
Preprocess | gpu | 100g | 8 | 2 | 1h |
Blast | gpu | 400g | 64 | 1 | 12h |
Align | gpu | 450g | 64 | 2 | 12h |
This resulted in the following real run times:

In general, increasing the number of CPUs or GPUs available will decrease runtime.
And max memory usages:

Increasing available memory may also decrease runtime.
Running the pipeline
With everything installed, and the Cactus input file and the Snakemake configuration file set up, you are now ready to run the pipeline.
Do a --dryrun first
First, we want to make sure everything is set up properly by using the --dryrun option. This tells Snakemake to display what jobs it is going to run without actually submitting them. This is important to do before actually submitting the jobs so we can catch any setup errors beforehand.
This is done with the following command, changing the snakefile -s and --configfile paths to the ones you have created for your project:
snakemake -p -j <# of jobs to submit simultaneously> -e slurm -s </path/to/cactus.smk> --configfile <path/to/your/snakemake-config.yml> --dryrun
Command breakdown
Command-line option | Description |
---|---|
snakemake | The call to the snakemake workflow program to execute the workflow. |
-p | Print out the commands that will be executed. |
-j <# of jobs to submit simultaneously> | The maximum number of jobs that will be submitted to your SLURM cluster at one time. |
-e slurm | Specify to use the SLURM executor plugin. See: Getting started. |
-s </path/to/cactus.smk> | The path to the workflow file. |
--configfile <path/to/your/snakemake-config.yml> | The path to your config file. See: Preparing the Snakemake config file. |
--dryrun | Do not execute anything, just display what would be done. |
This command won't actually submit the pipeline jobs!
However, even during a --dryrun some pre-processing steps will be run, including creation of the output directory if it doesn't exist, downloading the Cactus Singularity image if cactus_path: download is set in the config file, and running cactus-prepare. These should all be relatively fast and not resource intensive tasks.
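If you prefer to assemble the command programmatically, so that the dry run and the real run are guaranteed to differ only by the --dryrun flag, a minimal Python sketch could look like the following. It uses only the options from the command breakdown above; the function name `snakemake_command` and the example paths are placeholders.

```python
import shlex


def snakemake_command(snakefile, configfile, jobs, dryrun=True):
    """Build the snakemake call as an argument list, using only the options above."""
    cmd = [
        "snakemake", "-p",
        "-j", str(jobs),
        "-e", "slurm",
        "-s", snakefile,
        "--configfile", configfile,
    ]
    if dryrun:
        cmd.append("--dryrun")
    return cmd


# Placeholder paths; replace with the ones for your project.
dry = snakemake_command("/path/to/cactus.smk", "/path/to/config.yml", jobs=10)
real = snakemake_command("/path/to/cactus.smk", "/path/to/config.yml", jobs=10, dryrun=False)
print(shlex.join(dry))
print(shlex.join(real))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` from the standard library to execute it from Python instead of pasting the printed command into a shell.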
If this completes successfully, you should see a bunch of blue, yellow, and green text on the screen, ending with something like this (the number of jobs and Reasons: may differ for your project):
Job stats:
job count
-------- -------
align 4
all 1
append 1
blast 4
convert 4
copy_hal 1
preprocess 5
total 20
Reasons:
(check individual jobs above for details)
input files updated by another job:
align, all, append, blast, convert, copy_hal
output files have to be generated:
align, append, blast, convert, copy_hal, preprocess
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.
If you see any red text, that likely means an error has occurred that must be addressed before you run the pipeline.
Submitting the jobs
If you're satisfied that the --dryrun has completed successfully and you are ready to start submitting Cactus jobs to the cluster, you can do so by simply removing the --dryrun option from the command above:
snakemake -p -j <# of jobs to submit simultaneously> -e slurm -s </path/to/cactus.smk> --configfile <path/to/your/snakemake-config.yml>
This will start submitting jobs to SLURM. On your screen, you will see continuous updates regarding job status in blue text. In another terminal, you can also check on the status of your jobs by running squeue -u <your user id>.
Be sure you have a way to keep the main Snakemake process running.
Remember that some of these steps can take a long time, or your jobs may be stuck in the queue for a while. That means the Snakemake job must have a persistent connection in order to keep running and submitting jobs. There are a few ways this can be accomplished, some better than others:
- Keep your computer on and connected to the cluster. This may be infeasible and you may still suffer from connection issues.
- Run the Snakemake job in the background by using nohup <snakemake command> &. This will run the command in the background and persist even if you disconnect. However, it makes it difficult to check on the status of your command.
- Submit the Snakemake command itself as a SLURM job. This will require preparing and submitting a job script. This is a good solution.
- Use a terminal multiplexer like GNU Screen or tmux. These programs allow you to open a sub-terminal within your currently connected terminal that remains even after you disconnect from the server. This is also a good solution.
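For the option of submitting the Snakemake command itself as a SLURM job, the job script only needs modest resources, since the main Snakemake process mostly waits on the jobs it submits. The Python sketch below writes out such a script as text. The partition name, time limit, memory, and log file name are placeholder assumptions that you should adapt to your cluster, and the function name `sbatch_script` is hypothetical.

```python
def sbatch_script(snakemake_cmd, partition, hours=72, mem_mb=4000):
    """Return the text of a SLURM job script that runs the main Snakemake process."""
    lines = [
        "#!/bin/bash",
        "#SBATCH --job-name=cactus-snakemake",
        f"#SBATCH --partition={partition}",
        "#SBATCH --cpus-per-task=1",
        f"#SBATCH --mem={mem_mb}",               # in MB
        f"#SBATCH --time={hours}:00:00",         # hours:minutes:seconds
        "#SBATCH --output=cactus-snakemake-%j.log",  # %j is the SLURM job ID
        "",
        "# Activate your cactus-env environment here if it is not",
        "# inherited from the shell you submit from.",
        snakemake_cmd,
    ]
    return "\n".join(lines) + "\n"


# Placeholder command and partition name for illustration.
cmd = "snakemake -p -j 10 -e slurm -s /path/to/cactus.smk --configfile /path/to/config.yml"
script = sbatch_script(cmd, partition="shared")
print(script)
```

Saving the printed text to a file and submitting it with `sbatch <script name>` keeps the main process alive on the cluster independent of your connection. Make sure the time limit is long enough to cover the whole run, including time spent waiting in the queue.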
Depending on the number of genomes, their sizes, and your wait in the queue, you will hopefully have your whole genome alignment within a few days!
Test dataset
Cactus provides a test dataset which we have set up to run in the tests/evolverMammals/ folder.
Here is a breakdown of the files so you can investigate them and prepare similar ones for your project:
File/directory | Description |
---|---|
evolverMammals-seq/ | This directory contains the input sequence files for the test dataset in FASTA format. |
evolverMammals-cfg.yaml | This is the config file for Snakemake and has all of the options you would need to set up for your own project. |
evolverMammals.txt | This is the input file as required by Cactus. It has the rooted Newick tree on the first line, followed by lines containing the location of the sequence files for each tip in the tree. |
We recommend running this test dataset before setting up your own project.
First, open the config file, tests/evolverMammals/evolverMammals-cfg.yaml, and make sure the partitions are set appropriately for your cluster. For this small test dataset, it is appropriate to use any "test" partitions you may have. Then, update the path to tmp_dir to point to a location where you have a lot of temporary space. Even this small dataset will fail if this directory does not have enough space.
After that, run a dry run of the test dataset by changing into the tests/evolverMammals/ directory and running:
cd tests/evolverMammals/
snakemake -p -j 10 -e slurm -s ../cactus.smk --configfile evolverMammals-cfg.yaml --dryrun
If this completes without error, run the pipeline by removing the --dryrun option:
snakemake -p -j 10 -e slurm -s ../cactus.smk --configfile evolverMammals-cfg.yaml
Pipeline outputs
The pipeline will output a .paf, a .hal, and a .fa file for every node in the input tree (including ancestral sequences). The final alignment file will be <final_prefix>.hal, where <final_prefix> is whatever you specified in the Snakemake config file.
The final alignment will also be presented in MAF format as <final_prefix>.<maf_reference>.maf, again where <maf_reference> is whatever you set in the Snakemake config. This file will include all sequences. Another MAF file, <final_prefix>.<maf_reference>.nodupes.maf, will also be generated, which is the alignment in MAF format with no duplicate sequences. The de-duplicated MAF file is generated with --dupeMode single. See the Cactus documentation regarding MAF export for more info.
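As a quick way to confirm a run finished, the following Python sketch builds the three final file names described above from final_prefix and maf_reference and reports which ones are not yet present in output_dir. The function names and the example values are illustrative assumptions; the naming pattern is taken from the description in this section.

```python
from pathlib import Path


def final_outputs(output_dir, final_prefix, maf_reference):
    """Return the expected final HAL and MAF paths for a run."""
    out = Path(output_dir)
    return [
        out / f"{final_prefix}.hal",
        out / f"{final_prefix}.{maf_reference}.maf",
        out / f"{final_prefix}.{maf_reference}.nodupes.maf",
    ]


def missing_outputs(output_dir, final_prefix, maf_reference):
    """Return the expected final files that do not exist yet."""
    expected = final_outputs(output_dir, final_prefix, maf_reference)
    return [path for path in expected if not path.exists()]


# Hypothetical values; use the ones from your own config file.
for path in final_outputs("/path/to/output", "my-alignment", "C"):
    print(path)
```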
A suite of tools called HAL tools is included with the Cactus singularity image if you need to manipulate or analyze .hal files. There are many tools for manipulating MAF files, though they are not always easy to use. The makers of Cactus also develop taffy, which can manipulate MAF files by converting them to TAF files.
Questions/troubleshooting
1. I want to use a specific version of the Cactus singularity image. How can I do so?
1. Using a specific Cactus version
If you want to use a specific Cactus version, search the available versions in the repository and run the following command, substituting <desired version> for the string of the version you want, e.g. "v2.9.3":
singularity pull --disable-cache docker://quay.io/comparative-genomics-toolkit/cactus:<desired version>
Then, in the Snakemake config file, set cactus_path: to be the path to the .sif file that was downloaded.
2. My jobs were running but my Snakemake process crashed because of connection issues/server maintenance! What do I do?
2. Snakemake crashes
As long as there wasn't an error with one of the jobs, Snakemake is designed to be able to resume and resubmit jobs pretty seamlessly. You just need to run the same command you ran to begin with and it should pick up submitting jobs where it left off. You could also run a --dryrun first and it should tell you which jobs are left to be done.
3. How can I tell if my input Newick tree is rooted? If it isn't rooted, how can I root it? Or if it is rooted, how can I re-root it?
3. How can I tell if my input Newick tree is rooted?
The easiest way to check if your tree is rooted is probably to load the tree into R with the ape package. This can be done with the following commands:
install.packages("ape") # Only if you don't have it installed already
library(ape)
tree <- read.tree("your-tree.nwk")
is.rooted(tree)
If the result of this test is TRUE, then your tree is rooted. If it is FALSE, your tree is unrooted.
If the tree is unrooted, or you want to re-root it, you can also do this in R with the root() function. Make sure ape is installed and loaded as above, and then:
Root by a tip label:
tree <- root(tree, outgroup = "<tip label>", resolve.root = TRUE)
Or root by an internal node number:
tree <- root(tree, node = <node number>, resolve.root = TRUE)
In the second case, using the internal node number, if you need to know how R has labeled the nodes, you can view the tree with the node labels by doing:
plot(tree)
nodelabels()
After the tree has been rooted/re-rooted, you can write it to a file:
write.tree(tree, file = "your-rooted-tree.nwk")
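If R is not at hand, a rough check can also be sketched in Python by counting how many subtrees attach directly to the root of the Newick string: two suggests a rooted tree, three or more an unrooted one. This is only an approximation of what ape's is.rooted() does. It ignores any root branch length and does not handle quoted labels containing commas or parentheses, and the function names are hypothetical.

```python
def root_children(newick):
    """Count the subtrees attached directly to the root of a Newick string."""
    depth = 0
    children = 0
    for ch in newick.strip():
        if ch == "(":
            depth += 1
            if depth == 1:
                children = 1  # the outermost parenthesis opens the first child
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 1:
            children += 1  # a comma at depth 1 separates children of the root
    return children


def looks_rooted(newick):
    """Heuristic: treat a tree with exactly two root children as rooted."""
    return root_children(newick) == 2


print(looks_rooted("((A,B),((C,D),E));"))
print(looks_rooted("(A,B,(C,D));"))
```

The first tree has two subtrees at the root and the second has three, so the sketch reports the first as rooted and the second as unrooted.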
4. How can I tell if my genome FASTA files are softmasked? How can I mask them if they aren't already?
4. How can I tell if my genome FASTA files are softmasked?
Cactus requires the input genomes to be softmasked. This means that masked bases appear as lower case letters: a, t, c, g. Hopefully the source of your genome FASTA file has given you some information about how it was prepared, including how it was masked. If not, a very quick method to check for the occurrence of any lower case letter in the sequence is:
if grep -q '^[^>]*[a-z]' your-genome-file.fa; then echo "The FASTA file is soft-masked."; else echo "The FASTA file is NOT soft-masked."; fi
This is fast, but only detects the occurrence of a single lower case character. To count all the lower case characters at the cost of taking a couple of minutes, you can run:
awk 'BEGIN{count=0} !/^>/{for(i=1;i<=length($0);i++) if (substr($0,i,1) ~ /[acgt]/) count++} END{print count}' your-genome-file.fa
Importantly, your genomes should not be hard masked, which means that masked bases are replaced by Ns. Unfortunately, there are many reasons for Ns to appear in a genome fasta file, so it is difficult to tell if it is because it is hardmasked based on the presence of Ns alone. Hopefully the source of the file has left some documentation describing how it was prepared...
If your genomes are not softmasked and you wish to mask them, you will have to run a program like RepeatMasker or RepeatModeler on them. Please consult the documentation for these tools.
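For a slightly fuller picture than a single count, the Python sketch below tallies the total bases, the soft-masked bases (lower-case a, c, g, t), and the N bases in a FASTA file in one pass. A high N count alone does not prove hard masking, for the reasons given above. The function name `masking_summary` and the tiny in-memory example are illustrative assumptions.

```python
import io


def masking_summary(handle):
    """Count total, soft-masked (lower-case acgt), and N bases in a FASTA stream."""
    counts = {"total": 0, "softmasked": 0, "n": 0}
    for line in handle:
        if line.startswith(">"):
            continue  # skip header lines
        seq = line.strip()
        counts["total"] += len(seq)
        counts["softmasked"] += sum(seq.count(base) for base in "acgt")
        counts["n"] += seq.count("N") + seq.count("n")
    return counts


# Tiny made-up FASTA for illustration.
example = io.StringIO(">seq1\nACGTacgtNN\n>seq2\nacgTT\n")
summary = masking_summary(example)
print(summary)
if summary["total"]:
    print(f"soft-masked fraction: {summary['softmasked'] / summary['total']:.2f}")
```

For a real genome, pass an open file instead, for example `with open("your-genome-file.fa") as fh: masking_summary(fh)`.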
5. I want to run this on a cluster with a job scheduler other than SLURM! What do I do?
5. Clusters other than SLURM?
Generally, it should be relatively easy to update the cluster profile (profiles/slurm_profile/config.yaml) and use the appropriate Snakemake cluster executor.
If you need help or run into problems, please create an issue on the pipeline's GitHub and we'll try to help, though it will likely be up to you to test on your own cluster, since we only have easy access to a cluster running SLURM.
6. I tried to run the pipeline and I ran into an error that I don't understand or can't resolve. What do I do?
6. Encountering errors
Please search for or create an issue on the pipeline's github that includes information about your input files, the command you ran, and the error that you are getting. The text of any log files would also be appreciated.
Additionally, if you are at Harvard, there are several ways to contact us to help you through your errors.
7. I have an idea to improve or add to the pipeline. What do I do?
7. Pipeline improvements
Great! Please create an issue on the pipeline's github describing your idea so we can discuss its implementation!