
Adding a genome to a whole genome alignment

Once you have your HAL file with a whole genome alignment, you may find that you need to add samples to it as your research questions expand. Re-running the whole alignment pipeline may be too computationally expensive. Luckily, Cactus provides a method to add genomes to a HAL file with minimal computational effort. Here, we replicate this method as a Snakemake pipeline by emulating the steps laid out by cactus-update-prepare. Specifically, we have implemented the "add to a branch" method, as that is the recommended way to add a genome to an alignment.

The cactus_update pipeline's rulegraph

Here is the rulegraph for the pipeline. It performs two rounds of alignment, one for the new node in the tree, and one to re-do the parent of the new node in the tree.

A directed acyclic graph showing the rules of the pipeline and their dependencies.

Getting started

If you generated your HAL file with the Cactus Snakemake pipeline

The setup for this pipeline is identical to that of the whole genome alignment pipeline and the Minigraph-Cactus pipeline, so if you generated your HAL file with one of those, you likely already have everything in place. Just activate your environment that has Snakemake installed:

mamba activate cactus-env

Or replace cactus-env with whatever name you gave that environment. Also, be sure you know the location of the cactus-snakemake folder. Then you can skip ahead.

If you are using a HAL file generated by some other method

You will need several things to be able to run this pipeline:

  1. A computing cluster that uses SLURM, though it should be possible to extend it to any job scheduler that Snakemake supports.
  2. conda or mamba to install software. See Installing command line software if you don't have conda/mamba installed.
  3. Snakemake and the Snakemake SLURM plugin
  4. Singularity - The pipeline itself will automatically download the latest version of the Cactus singularity container for you.
  5. The Harvard Informatics cactus-snakemake pipeline

Below we walk you through our recommended way for getting this all set up.

Creating an environment for the Cactus pipeline

We recommend creating a conda environment to install software:

mamba create -n cactus-env
mamba activate cactus-env
mamba install bioconda::snakemake-minimal
mamba install bioconda::snakemake-executor-plugin-slurm # For SLURM clusters

Some clusters (such as the Harvard cluster) already have Singularity installed. You should check by running the command:

singularity

If the help menu displays, you already have Singularity installed. If not, you will need to install it yourself into your cactus-env environment:

mamba install conda-forge::singularity

Downloading the cactus-snakemake pipeline

The pipeline is currently available on github. You can download it to the Harvard cluster or any other computer by navigating to the directory in which you want to put it and doing one of the following:

Using git with HTTPS

git clone https://github.com/harvardinformatics/cactus-snakemake.git

Using git with SSH

Alternatively, if you have SSH setup on github, you would type:

git clone git@github.com:harvardinformatics/cactus-snakemake.git

Using wget (without git)

If you don't have or don't wish to use git, you can directly download a ZIP archive of the repository:

wget https://github.com/harvardinformatics/cactus-snakemake/archive/refs/heads/main.zip
unzip main.zip   # extracts to a folder named cactus-snakemake-main/

With that, you should be ready to set up your data for the pipeline!

Inputs you need to prepare

To run this pipeline, you will need (corresponding Snakemake config option given in parentheses):

  1. A HAL file with a whole genome alignment generated by Cactus (input_hal).
  2. The location in the tree at which to add your new genome (see below).
  3. The softmasked genome FASTA file for the genome you want to add to the alignment (new_genome_fasta).
  4. A reference genome to project the alignment to MAF format (maf_reference).

The FASTA file must be softmasked!

The genome you provide in FASTA format must be softmasked before running Cactus, otherwise the program will likely not complete. You can tell if a genome FASTA file is softmasked by the presence of lower-case nucleotides: a, t, c, or g. If your FASTA file has these lower-case characters, it has likely been softmasked. If not, you will have to mask the genome with a tool like RepeatMasker. Also importantly, the FASTA files should not be hard masked, meaning that masked bases are replaced with Ns.

The location in the tree to add your alignment

Generating a HAL file requires a phylogenetic tree, so when adding a genome to an alignment we must also specify where in that tree it belongs. The easiest way to see this tree and how Cactus has labeled its nodes is to run the command:

halStats --tree example.hal

replacing example.hal with your HAL's file name. This will print the tree to the screen. For example, in the context of the evolverMammals test data and using the singularity image, we would run the following:

singularity exec --cleanenv <cactus-image-name>.sif halStats --tree evolverMammals.hal

which would result in:

((simHuman_chr6:0.144018,(simMouse_chr6:0.084509,simRat_chr6:0.091589)mr:0.271974)Anc1:0.020593,(simCow_chr6:0.18908,simDog_chr6:0.16303)Anc2:0.032898)Anc0;
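
If you also want a plain list of the genome and ancestor names in the HAL file (handy later when filling in parent_node, child_node, and maf_reference), halStats can print that too. A minimal sketch using the same Singularity image (the exact order of names in the output may differ):

singularity exec --cleanenv <cactus-image-name>.sif halStats --genomes evolverMammals.hal

This prints the names of all genomes in the file, including ancestral (internal) nodes such as Anc0, Anc1, and Anc2 in the example tree above.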

Now that we have the tree, we need to figure out where to put our new genome. We will need to come up with the following information:

  1. A tip label or name for our new genome (new_genome_name).
  2. The branch length of the new branch connecting the new genome to an existing branch (new_branch_length).
  3. A label or name for the new node in our tree, connecting the new branch to an existing branch (new_anc_node).
  4. The branch on which to add the new node, defined by a parent (parent_node) and a child (child_node) node.
  5. The branch on which we add that node will be split into two separate branches. We must provide the length of the top-most of these two new branches, i.e. the one with our new node as the child (top_branch_length).

We borrow and slightly modify an image from the cactus documentation to visualize these pieces of information on an example tree:

Two panels, the first showing a phylogenetic tree with 3 tips and internal nodes labeled, the second showing a 4th tip being added to the tree.

In this context, we are adding the genome with the name "6" to our HAL. We are adding it such that it branches off from the branch defined by node 4 as the child and node 5 as the parent. To do so, we create a new node, for which we come up with a name (let's say RC for red circle), and a new branch 6-RC. This new RC node splits the 4-5 branch into two new branches: 4-RC and RC-5. For the pipeline, you will need to provide the branch length of the new 6-RC branch and the length of the new RC-5 (top-most) branch.
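
To make the branch-length bookkeeping concrete, here is a small worked example using the node names from the figure above; all of the numbers are hypothetical:

length of the existing 4-5 branch        = 0.20
top_branch_length (the new RC-5 branch)  = 0.15              (you provide this)
length of the new 4-RC branch            = 0.20 - 0.15 = 0.05 (the remainder of the split; you do not provide this)
new_branch_length (the new 6-RC branch)  = 0.10              (you provide this)

In other words, top_branch_length should not be larger than the length of the existing branch being split, since whatever is left over becomes the lower branch.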

If you're very good at parsing Newick tree strings by eye, you may be able to get this information just by looking at the output of halStats --tree. However, in most cases, you'll want to look at an image of the tree. Consider using tree-viewing software like SeaView or the ape library in R. EMBL also has an online, interactive tree viewer where you can just paste the tree string to see an image of it.

Once you have the 5 pieces of information from the tree listed above, you're ready to move on to prepare the Cactus update input file.

No branch lengths on Cactus input tree

Cactus doesn't strictly require branch lengths on the input tree. In that case, it arbitrarily gives every branch a length of 1. If your input tree to cactus didn't have branch lengths or the output of halStats --tree shows only branch lengths of 1.0, then you don't need to specify the top-most branch length or the new branch length. These will automatically be set to 1.0.

Reference sample

In order to run the last step of the workflow, which converts the HAL format to a readable MAF format (see pipeline outputs for more info), you will need to select one assembly as a reference. The reference assembly's coordinate system will be used for projection to MAF format. You should indicate the reference assembly in the Snakemake config file (outlined below). For instance, to use the genome labeled 1 in the tree above as the reference, you would put the string 1 in the maf_reference line of the Snakemake config file.

Preparing the Snakemake config file

The Cactus update input file

The various Cactus commands depend on a single input file with information about the genome to add. This file is automatically generated by the pipeline at [output_dir]/cactus-update-input.txt. It is a simple tab-delimited file that should contain one line:

[tip label to add to tree]   [path/to/genome/fasta.file]    [new branch length to add to the tree]
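
For illustration, using the hypothetical values from the tree example above (new genome 6, new branch length 0.10) and a placeholder path, the generated line would look something like:

6	/path/to/genome6.softmasked.fa	0.10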

For more information about the Cactus update input file, see their official documentation.

Be sure to start with the example config file as a template!

The config for the Cactus test data can be found here or at tests/evolverMammals/evolverMammals-update-cfg.yaml in your downloaded cactus-snakemake repo. Be sure to use this as the template for your project since it has all the options needed! Note: the partitions set in this config file are specific to the Harvard cluster. Be sure to update them if you are running this pipeline elsewhere.

Additionally, a blank template file is located here or at config-templates/update-config-template.yaml in your downloaded cactus-snakemake repo.

Once you have all the information listed above, you can enter it into the Snakemake configuration file along with some other information to know where to look for files and write output. The config file contains 2 sections, one for specifying the input and output options, and one for specifying resources for the various rules (see below). The first part should look something like this:

cactus_path: <path/to/cactus-singularity-image OR download OR a version string (e.g. 2.9.5)>

cactus_gpu_path: <path/to/cactus-GPU-singularity-image OR download OR a version string (e.g. 2.9.5)>

input_hal: <path/to/hal-file>

new_genome_name: <tip label of new genome>

new_genome_fasta: <path/to/new/genome.fasta>

new_anc_node: <label for new ancestral node connected to new genome>

new_branch_length: <new branch length to connect the new genome to the tree with>

parent_node: <parent node of existing branch>

child_node: <child node of existing branch>

top_branch_length: <length of branch created by parent_node and new_anc_node>

output_dir: <path/to/desired/output-directory>

overwrite_original_hal: <True/False>

final_prefix: <desired name of final .hal and .maf files with all genomes appended>

maf_reference: <Genome ID from input_file>

tmp_dir: <path/to/tmp-dir/>

use_gpu: <True/False>

Simply replace the strings surrounded by <> with the desired paths or options. Below is a table summarizing these options:

Option Description
cactus_path Path to the Cactus Singularity image. If blank or 'download', the image of the latest Cactus version will be downloaded and used. If a version string is provided (e.g. 2.9.5), then that version will be downloaded and used. This will be used whether use_gpu is True or False.
cactus_gpu_path Path to the Cactus GPU Singularity image. If blank or 'download', the image of the latest Cactus version will be downloaded and used. If a version string is provided (e.g. 2.9.5), then that version will be downloaded and used. This will only be used if use_gpu is True.
input_hal Path to the previously generated HAL file to which you want to add a new genome.
new_genome_name The label to give your new genome in the tree and HAL file.
new_genome_fasta The path to the FASTA file containing the new genome to add to the alignment.
new_anc_node The name to give the new ancestral node that connects the new genome to an existing branch in the tree.
new_branch_length The length of the branch to create that will connect your new genome to the tree.
parent_node The name of the ancestral node of the existing branch to which the new branch is connected.
child_node The name of the descendant node of the existing branch to which the new branch is connected.
top_branch_length The existing branch defined by parent_node and child_node will be split by new_anc_node. This is the length of the new, top-most branch created by the split (defined by parent_node and new_anc_node).
output_dir Directory where all output will be written.
overwrite_original_hal Whether to overwrite the original HAL file (True) or to make a copy and add the new genome to it (False). Overwriting will save storage space, but risks corruption of the original file.
final_prefix The name of the final .hal (if overwrite_original_hal is False) and .maf files with all aligned genomes appended. The final files will be <final_prefix>.hal and <final_prefix>.maf. These files will be placed within output_dir.
maf_reference The genome in the alignment whose coordinate system will be used to project the alignment to MAF format. See Reference sample above.
tmp_dir A temporary directory for Snakemake and Cactus to use. Should have lots of space.
use_gpu Whether to use the GPU version of Cactus for the alignment (True/False).

Specifying resources for each rule

Below these options in the config file are further options for specifying resource usage for each rule that the pipeline will run. For example:

rule_resources:
  preprocess:
    partition: shared
    mem_mb: 25000
    cpus: 8
    time: 30

  blast:
    partition: gpu    # If use_gpu is True, this must be a partition with GPUs
    mem_mb: 50000
    cpus: 48
    gpus: 1           # If use_gpu is False, this will be ignored
    time: 30

The rule blast is the only one that uses GPUs if use_gpu is True.

Notes on resource allocation

  • Be sure to use partition names appropriate for your cluster. Several examples in this tutorial have partition names that are specific to the Harvard cluster, so be sure to change them.
  • Allocate the proper partitions based on use_gpu. If you want to use the GPU version of cactus (i.e. you have set use_gpu: True in the config file), the partition for the rule blast must be GPU enabled. If not, the pipeline will fail to run.
  • The blast: gpus: option will be ignored if use_gpu: False is set.
  • mem is in MB and time is in minutes.

You will have to determine the proper resource usage for your dataset. Generally, the larger the genomes, the more time and memory each job will need, and the more you will benefit from providing more CPUs and GPUs.

We have an example of resource allocation and usage for creation of a new HAL file. Since updating a HAL file uses the same steps, just with fewer rounds, you can gauge resource usage with this example.

Running the pipeline

With everything installed and the Snakemake configuration file set up, you are now ready to run the pipeline.

Do a --dryrun first

First, we want to make sure everything is set up properly by using the --dryrun option. This tells Snakemake to display what jobs it is going to run without actually submitting them. This is important to do before actually submitting the jobs so we can catch any setup errors beforehand.

This is done with the following command, changing the snakefile (-s) and --configfile paths to the ones you have created for your project:

snakemake -j <# of jobs to submit simultaneously> -e slurm -s </path/to/cactus_update.smk> --configfile <path/to/your/snakemake-config.yml> --dryrun

Command breakdown
Command-line option Description
snakemake The call to the snakemake workflow program to execute the workflow.
-j <# of jobs to submit simultaneously> The maximum number of jobs that will be submitted to your SLURM cluster at one time.
-e slurm Specify to use the SLURM executor plugin. See: Getting started.
-s </path/to/cactus_update.smk> The path to the workflow file.
--configfile <path/to/your/snakemake-config.yml> The path to your config file. See: Preparing the Snakemake config file.
--dryrun Do not execute anything, just display what would be done.

This command won't actually submit the pipeline jobs!

However, even during a --dryrun, some pre-processing steps will be run, including creation of the output directory if it doesn't exist, downloading the Cactus Singularity image if cactus_path: download is set in the config file, and running cactus-update-prepare. These should all be relatively fast and not resource-intensive tasks.

If this completes successfully, you should see a bunch of blue, yellow, and green text on the screen, ending with something like this (the number of jobs and Reasons: may differ for your project):

Job stats:
job                count
---------------  -------
add_to_branch          1
align                  2
all                    1
blast                  2
convert                2
copy_or_get_hal        1
maf                    1
preprocess             1
total                 11

Reasons:
    (check individual jobs above for details)
    input files updated by another job:
        add_to_branch, align, all, blast, convert, maf
    output files have to be generated:
        add_to_branch, align, blast, convert, copy_or_get_hal, maf, preprocess
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

If you see any red text, that likely means an error has occurred that must be addressed before you run the pipeline.

Submitting the jobs

If you're satisfied that the --dryrun has completed successfully and you are ready to start submitting Cactus jobs to the cluster, you can do so by simply removing the --dryrun option from the command above:

snakemake -j <# of jobs to submit simultaneously> -e slurm -s </path/to/cactus_update.smk> --configfile <path/to/your/snakemake-config.yml>

This will start submitting jobs to SLURM. On your screen, you will see continuous updates regarding job status in blue text. In another terminal, you can also check on the status of your jobs by running squeue -u <your user id>.

Be sure you have a way to keep the main Snakemake process running.

Remember that some of these steps can take a long time, or your jobs may be stuck in the queue for a while. That means the Snakemake process must have a persistent connection in order to keep running and submitting jobs. There are a few ways this can be accomplished, some better than others:

  1. Keep your computer on and connected to the cluster. This may be infeasible and you may still suffer from connection issues.

  2. Run the Snakemake job in the background by using nohup <snakemake command> &. This will run the command in the background and persist even if you disconnect. However, it makes it difficult to check on the status of your command.

  3. Submit the Snakemake command itself as a SLURM job. This will require preparing and submitting a job script (see the sketch after this list). This is a good solution.

  4. Use a terminal multiplexer like GNU Screen or tmux. These programs allow you to open a sub-terminal within your currently connected terminal that remains even after you disconnect from the server. This is also a good solution.
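
For option 3, a minimal sketch of such a job script is shown below. The partition name, time limit, and paths are placeholders that you will need to adapt to your cluster and project, and the environment activation assumes the cactus-env environment created earlier:

#!/bin/bash
#SBATCH --job-name=cactus-update
#SBATCH --partition=shared            # Replace with a partition on your cluster
#SBATCH --time=2-00:00:00             # Long enough to outlast all the jobs Snakemake submits
#SBATCH --mem=8G
#SBATCH --cpus-per-task=1
#SBATCH --output=cactus-update-%j.log

# Activate the environment that has Snakemake installed; depending on how
# conda/mamba is initialized on your cluster, you may need to source your
# ~/.bashrc first or use 'source activate' instead
mamba activate cactus-env

snakemake -j <# of jobs to submit simultaneously> -e slurm \
    -s </path/to/cactus_update.smk> \
    --configfile <path/to/your/snakemake-config.yml>

Submit it with sbatch and the main Snakemake process will keep running on the cluster even after you disconnect.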

Depending on the number of genomes, their sizes, and your wait in the queue, you will hopefully have your updated whole genome alignment within a few hours!

Test dataset

Cactus provides a test dataset which we have set up to run in the tests/evolverMammals/ folder.

Here is a breakdown of the files so you can investigate them and prepare similar ones for your project:

File/directory Description
evolverMammals-seq/ This directory contains the input sequence files for the test dataset in FASTA format.
evolverMammals-update-cfg.yaml This is the config file for Snakemake and has all of the options you would need to setup for your own project.

You will first need to run the test for the whole genome alignment pipeline to generate the HAL file. Then, you can add the Gorilla sequence to it using this pipeline. We recommend running this test dataset before setting up your own project.

After you've generated the HAL file, to add a genome, open the config file, tests/evolverMammals/evolverMammals-update-cfg.yaml, and make sure the partitions are set appropriately for your cluster. For this small test dataset, it is appropriate to use any "test" partitions you may have. Then, update tmp_dir to point to a location where you have a lot of temporary space. Even this small dataset will fail if this directory does not have enough space.

After that, run a dryrun of the test dataset by changing into the tests/evolverMammals directory and running:

cd tests/evolverMammals/
snakemake -j 10 -e slurm -s ../../cactus_update.smk --configfile evolverMammals-update-cfg.yaml --dryrun

If this completes without error, run the pipeline by removing the --dryrun option:

snakemake -j 10 -e slurm -s ../../cactus_update.smk --configfile evolverMammals-update-cfg.yaml
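
Once the pipeline finishes, one quick sanity check is to print the tree from the updated HAL file; the added genome and the new ancestral node should now appear in it. The file name and location depend on the final_prefix, output_dir, and overwrite_original_hal settings in your config:

singularity exec --cleanenv <cactus-image-name>.sif halStats --tree <output_dir>/<final_prefix>.hal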

Pipeline outputs

The pipeline will output a .paf, a .hal, and a .fa file for the new ancestral node as well as for the parent node. If you specified overwrite_original_hal: False, the final alignment file will be <final_prefix>.hal, where <final_prefix> is whatever you specified in the Snakemake config file. Otherwise, the original HAL file will be modified in place.

The final alignment will also be presented in MAF format as <final_prefix>.<maf_reference>.maf, again where <maf_reference> is whatever you set in the Snakemake config. This file will include all sequences. Another MAF file, <final_prefix>.<maf_reference>.nodupes.maf will also be generated, which is the alignment in MAF format with no duplicate sequences. The de-duplicated MAF file is generated with --dupeMode single. See the Cactus documentation regarding MAF export for more info.

A suite of tools called HAL tools is included with the Cactus singularity image if you need to manipulate or analyze .hal files. There are many tools for manipulating MAF files, though they are not always easy to use. The makers of Cactus also develop taffy, which can manipulate MAF files by converting them to TAF files.

Questions/troubleshooting

1. My jobs were running but my Snakemake process crashed because of connection issues/server maintenance! What do I do?

As long as there wasn't an error with one of the jobs, Snakemake is designed to be able to resume and resubmit jobs pretty seamlessly. You just need to run the same command you ran to begin with and it should pick up submitting jobs where it left off. You could also run a --dryrun first and it should tell you which jobs are left to be done.
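
One related situation to be aware of: if the Snakemake process was killed abruptly, it may leave the working directory locked and refuse to restart. If you see a message about a locked directory, run the same command once with --unlock added to release the lock, then re-run it normally:

snakemake -j <# of jobs to submit simultaneously> -e slurm -s </path/to/cactus_update.smk> --configfile <path/to/your/snakemake-config.yml> --unlock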

2. How can I tell if my genome FASTA files are softmasked? How can I mask them if they aren't already?

Cactus requires the input genomes to be softmasked. This means that masked bases appear as lower case letters: a, t, c, g. Hopefully the source of your genome FASTA file has given you some information about how it was prepared, including how it was masked. If not, a very quick method to check for the occurrence of any lower case letter in the sequence is:

if grep -q '^[^>]*[a-z]' your-genome-file.fa; then echo "The FASTA file is soft-masked."; else echo "The FASTA file is NOT soft-masked."; fi

This is fast, but it only tells you whether at least one lower-case character is present. To count all of the lower-case characters, at the cost of taking a couple of minutes, you can run:

awk 'BEGIN{count=0} !/^>/{for(i=1;i<=length($0);i++) if (substr($0,i,1) ~ /[acgt]/) count++} END{print count}' your-genome-file.fa

Importantly, your genomes should not be hard masked, which means that masked bases are replaced by Ns. Unfortunately, there are many reasons for Ns to appear in a genome FASTA file, so it is difficult to tell from the presence of Ns alone whether a file has been hard masked. Hopefully the source of the file has left some documentation describing how it was prepared...
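
If you want to see how many Ns a file contains while you investigate, a quick count (in the same spirit as the lower-case count above) is:

awk '!/^>/{count += gsub(/[Nn]/, "")} END{print count+0}' your-genome-file.fa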

If your genomes are not softmasked and you wish to softmask them, you will have to run a program like RepeatMasker or RepeatModeler. Please consult the documentation for these tools.

3. I want to run this on a cluster with a job scheduler other than SLURM! What do I do?

Generally, it should be relatively easy to install and use the appropriate Snakemake cluster executor.

If you need help or run into problems, please create an issue on the pipeline's github and we'll try to help - though it will likely be up to you to test on your own cluster, since we only have easy access to a cluster running SLURM.

4. I got an error related to InsufficientSystemResources regarding GPUs during run_lastz in the blast rule. What do I do?

If the text of the error is somewhat similar to:

toil.batchSystems.abstractBatchSystem.InsufficientSystemResources: The job 'run_lastz' kind-run_lastz/instance-tqdjs4tj v1 is requesting [{'count': 4, 'kind': 'gpu', 'api': 'cuda', 'brand': 'nvidia'}] accelerators, more than the maximum of [{'kind': 'gpu', 'brand': 'nvidia', 'api': 'cuda', 'count': 1}, {'kind': 'gpu', 'brand': 'nvidia', 'api': 'cuda', 'count': 1}] accelerators that SingleMachineBatchSystem was configured with. The accelerator {'count': 4, 'kind': 'gpu', 'api': 'cuda', 'brand': 'nvidia'} could not be provided. Scale is set to 1.

it may be because your GPU partition is set up to use MIGs (Multi-Instance GPUs). These are essentially like the cores of a multi-core CPU, except for GPUs. The problem is that Cactus isn't set up to recognize these as of the last update of this tutorial: it only recognizes physical GPU devices. For instance, let's say you've requested 4 GPUs for your blast rule by setting gpus: 4 under blast: in your config file. The blast job is submitted to your GPU node, which allocates 2 physical GPUs set up as MIGs, each with 2 instances, which is the equivalent of 4 GPUs. However, Cactus only sees the 2 physical GPUs and thinks there aren't enough to accommodate your request for 4, resulting in the error.
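
If you can get an interactive session on one of the GPU nodes, one way to check whether your allocation is MIG-backed is to list the visible devices (this assumes the standard NVIDIA tools are available on the node):

nvidia-smi -L

Physical devices appear as GPU 0, GPU 1, and so on; a MIG-enabled setup will additionally list MIG device entries under each physical GPU.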

To resolve this, check your cluster documentation to see if there is a way to request only physical GPUs instead of MIGs. If not, you will have to set gpus: 1 under the blast rule in your config. That is the only way to guarantee that the number of physical GPUs aligns with your request.

Read more about this here and here.

5. I tried to run the pipeline and I ran into an error that I don't understand or can't resolve. What do I do?

Please search for or create an issue on the pipeline's github that includes information about your input files, the command you ran, and the error that you are getting. The text of any log files would also be appreciated.

Additionally, if you are at Harvard, there are several ways to contact us to help you through your errors.

6. I have an idea to improve or add to the pipeline. What do I do?

Great! Please create an issue on the pipeline's github describing your idea so we can discuss its implementation!