Adding an outgroup to a whole genome alignment

Once you have your HAL file with a whole genome alignment, you may find that you need to add samples to it as your research questions expand. Re-running the whole alignment pipeline may be too computationally expensive. While we provide a workflow to add genomes to branches of the tree in your HAL file, the process for adding an outgroup to the alignement is slightly different. We provide the following pipeline to implement those steps.

The cactus_add_outgroup pipeline's rulegraph

Here is the rulegraph for the pipeline. This one is fairly straightforward: the new outgroup genome is aligned to the root sequence from the original HAL file, creating a small sub-alignment with a new root. The old alignment is then merged into this one by the old root node.

A directed cyclic graph showing the rules for the pipeline.'

Getting started

If you generated your HAL file with the Cactus Snakemake pipeline

The setup for this pipeline is identical to that of the whole genome alignment pipeline and the Minigraph-cactus pipeline, so if you generated your HAL file with one of those you likely already have everything in place. Just activate your environment that has Snakemake installed:

mamba activate cactus-env

Or replace cactus-env with whatever name you gave that environment. Also, be sure you know the location of the cactus-snakemake folder. Then you can skip ahead.

If you are using a HAL file generated by some other method

You will need several things to be able to run this pipeline:

A computing cluster that uses SLURM , though it should be possible to extend it to any job scheduler that Snakemake supports .
conda or mamba to install software. See Installing command line software if you don't have conda/mamba installed.
Snakemake and the Snakemake SLURM plugin
Singularity - The pipeline itself will automatically download the latest version of the Cactus singularity container for you.
The Harvard Informatics cactus-snakemake pipeline

Below we walk you through our recommended way for getting this all set up.

Creating an enviornment for the cactus pipeline

We recommend creating a conda environment to install software:

mamba create -n cactus-env
mamba activate cactus-env
mamba install bioconda::snakemake-minimal
mamba install bioconda::snakemake-executor-plugin-slurm # For SLURM clusters

Some clusters (such as the Harvard cluster) already have Singularity installed. You should check by running the command:

singularity

If the help menu displays, you already have Singularity installed. If not, you will need to install it yourself into your cactus-env environment:

mamba install conda-forge::singularity

Cannon cluster Snakemake plugin

If you are on the Harvard Cannon cluster, instead of the generic snakemake-executor-plugin-slurm, you can use our specific plugin for the Cannon cluster: snakemake-executor-plugin-cannon . This facilitates automatic partition selection based on requested resources. Install this in your environment with:

mamba install bioconda::snakemake-executor-plugin-cannon

Then, when running the workflow, specify the cannon executor with -e cannon instead of -e slurm.

If you are not on the Harvard Cannon cluster, stick with the generic SLURM plugin. You will just need to directly specify the partitions for each rule in the config file (see below).

Downloading the cactus-snakemake pipeline

The pipeline is currently available on github. You can install it on the Harvard cluster or any computer that has git installed by navigating to the directory in which you want to download it and doing one of the following:

Using git with HTTPS

git clone https://github.com/harvardinformatics/cactus-snakemake.git

Using git with SSH

Alternatively, if you have SSH setup on github, you would type:

git clone git@github.com:harvardinformatics/cactus-snakemake.git

Using wget (without git)

If you don't have or don't wish to use git, you can directly download a ZIP archive of the repository:

wget https://github.com/harvardinformatics/cactus-snakemake/archive/refs/heads/main.zip
unzip cactus-snakemake

With that, you should be ready to set-up your data for the pipeline!

Important! cactus-snakemake v3.0.0 and later requires Cactus v2.9.9 or later.

Due to bug fixes in Cactus, v3.0.0+ of cactus-snakemake is only compatibile with Cactus v2.9.9 or later. Don't worry, cactus-snakemake will always use the latest version of Cactus available unless you specify otherwise in your config file. However, if you do wish to use an older version of Cactus, you can use cactus-snakemake v2.1.0 .

Inputs you need to prepare

To run this pipeline, you will need (corresponding Snakemake config option given in parentheses):

A HAL file with a whole genome alignment generated by Cactus (input_hal).
The location in the tree to add your alignment (see below).
The softmasked genome FASTA file for the genome you want to add to the alignment (new_genome_fasta).
A reference genome to project the alignment to MAF format (maf_reference).

The FASTA file must softmasked!

The genome you provide in FASTA format must be softmasked before running Cactus, otherwise the the program will likely not complete. You can tell if a genome FASTA file is masked by the presence of lower-case nucleotides: a, t, c, or g. If your FASTA file has these lower-case characters, it has likely been softmasked. If not, you will have to mask the genome with a tool like RepeatMasker . Also importantly, the FASTA files should not be hard masked, meaning the masked bases are replaced with Ns.

The location in the tree to add your alignment

Each HAL file requires a phylogenetic tree to generate. When adding a genome to an alignment, we also must add its location in the tree. The easiest way to know this tree and how Cactus has labeled the nodes in it is to run the command:

halStats --tree example.hal

replacing example.hal with your HAL's file name. This will print the tree to the screen. For example, in the context of the evolverMammals test data and using the singularity image, we would run the following:

singularity exec --cleanenv <cactus-image-name>.sif halStats --tree evolverMammals.hal

which would result in:

((simHuman_chr6:0.144018,(simMouse_chr6:0.084509,simRat_chr6:0.091589)mr:0.271974)Anc1:0.020593,(simCow_chr6:0.18908,simDog_chr6:0.16303)Anc2:0.032898)Anc0;

Now that we have the tree, we need to figure out where to put our new genome. We will need to come up with the following information:

The label or name of the root node in the original HAL file (old_root_name).
A tip label or name for our new genome (new_genome_name).
A label or name for the new root node in our tree (new_root_name).
The branch length of the new branch connecting the new genome to the new root (branch_length_from_new_genome).
The branch length of the new branch connectiong the old root to the new root (branch_length_from_old_root).

We borrow and slightly modify an image from the cactus documentation to visualize these pieces of information on an example tree:

Two panels, the first showing a phylogenetic tree with 3 tips and internal nodes labeled, the second showing a 4th tip being added to the tree as an outgroup.

In this context, we are adding the genome with the name "newGenome" as an outgroup to our HAL. To do so, we are creating a new root in our tree called "newRoot" and connecting it to the original root node, "5". We must supply lengths for the two new branches descending from newRoot.

No branch lengths on Cactus input tree

Cactus doesn't strictly require branch lengths on the input tree. In that case, it arbitrarily gives every branch a length of 1. If your input tree to cactus didn't have branch lengths or the output of halStats --tree shows only branch lengths of 1.0, then you don't need to specify the new branch lengths. These will automatically be set to 1.0.

Original root node

For the name of the original root node, this will usually be called either Anc0 or Anc00. To find out for certain, you can use the tree displayed above from halStats.

If you're very good at parsing Newick tree strings by eye, you may be able to see the name of the root node from that text string (here it is Anc0). However this is a relatively small tree. In most cases, you'll want to look at an image of the tree. Consider using some sort of tree viewing software like SeaView or the ape library in R . EMBL also has an online, interactive tree viewer where you can just paste the tree string to see an image of it.

The other 4 pieces of information you'll have to come up with on your own, but once you have all 5 pieces of information from the tree listed above, you're ready to move on to prepare the Cactus update input file.

Reference sample

In order to run the last step of the workflow that converts the HAL format to a readable MAF format (See pipeline outputs for more info), you will need to select one assembly as a reference assembly. The reference assembly's coordinate system will be used for projection to MAF format. You should indicate the reference assembly in the Snakemake config file (outlined below). For instance, if I wanted my reference sample in the above tree to be the genome labeled 1 in the tree, I would put the string 1 in the maf_reference line of the Snakemake config file.

Preparing the Snakemake config file

Be sure to start with the example config file as a template!

The config for the Cactus test data can be found at here or at tests/evolverMammals/evolverMammals-add-outgroup-cfg.yaml in your downloaded cactus-snakemake repo. Be sure to use this as the template for your project since it has all the options needed! Note: the partitions set in this config file are specific to the Harvard cluster. Be sure to update them if you are running this pipeline elsewhere.

Additionally, a blank template file is located here or at config-templates/add-outgroup-config-template.yaml in your downloaded cactus-snakemake repo.

Once you have all the information listed above, you can enter it into the Snakemake configuration file along with some other information to know where to look for files and write output. The config file contains 2 sections, one for specifying the input and output options, and one for specifying resources for the various rules (see below). The first part should look something like this:

cactus_path: <path/to/cactus-singularity-image OR download OR a version string (e.g. 2.9.5)>

cactus_gpu_path: <path/to/cactus-GPU-singularity-image OR download OR a version string (e.g. 2.9.5)>

input_hal: <path/to/hal-file>

old_root_name: <label of the root in the input_hal>

new_genome_name: <tip label of new genome>

new_genome_fasta: <path/to/new/genome.fasta>

new_root_name: <label for new root node connected to the new genome and the old root>

branch_length_from_new_genome: <new branch length to connect the new genome to the new root>

branch_length_from_old_root: <new branch length to connect the old root to the new root>

output_dir: <path/to/desired/output-directory>

final_prefix: <desired name of final .hal and .maf files with all genomes appended>

maf_reference: <Genome ID from input_file>

tmp_dir: <path/to/tmp-dir/>

use_gpu: <True/False>

Simply replace the string surrounded by <> with the path or option desired. Below is a table summarizing these options:

Option	Description
`cactus_path`	Path to the Cactus Singularity image. If blank or 'download', the image of the latest Cactus version will be downloaded and used. If a version string is provided (e.g. 2.9.5), then that version will be downloaded and used. This will be used whether `use_gpu` is True or False.
`cactus_gpu_path`	Path to the Cactus GPU Singularity image. If blank or 'download', the image of the latest Cactus version will be downloaded and used. If a version string is provided (e.g. 2.9.5), then that version will be downloaded and used. This will only be used if `use_gpu` is True.
`input_hal`	Path to the previously generated HAL file to which you want to add a new genome.
`old_root_name`	The label of the root in the input HAL, can be retrieved with `halStats --tree`.
`new_genome_name`	The label to give your new genome in the tree and HAL file.
`new_genome_fasta`	The path to the FASTA file containing the new genome to add to the alignment.
`new_root_name`	The name to give the new root node connected to the new genome and the old root.
`branch_length_from_new_genome`	The length of the branch to create that will connect your new genome to the new root (default: 1.0).
`branch_length_from_old_root`	The length of the branch to create that will connect the old root to the new root (default: 1.0).
`output_dir`	Directory where the all output will be written.
`final_prefix`	The name of the final .hal (if `overwrite_original_hal` is `False`) and .maf files with all aligned genomes appended. The final files will be `<final_prefix>.hal` and `<final_prefix>.maf`. These files will be placed within `output_dir`.
`tmp_dir`	A temporary directory for Snakemake and Cactus to use. Should have lots of space.
`use_gpu`	Whether to use the GPU version of Cactus for the alignment (True/False).

Specifying resources for each rule

Below these options in the config file are further options for specifying resource usage for each rule that the pipeline will run. For example:

rule_resources:
  preprocess:
    partition: shared
    mem_mb: 25000
    cpus: 8
    time: 30

  blast:
    partition: gpu    # If use_gpu is True, this must be a partition with GPUs
    mem_mb: 50000
    cpus: 48
    gpus: 1           # If use_gpu is False, this will be ignored
    time: 30

The rule blast is the only one that uses GPUs if use_gpu is True.

Notes on resource allocation

Be sure to use partition names appropriate your cluster. Several examples in this tutorial have partition names that are specific to the Harvard cluster, so be sure to change them.
Allocate the proper partitions based on use_gpu. If you want to use the GPU version of cactus (i.e. you have set use_gpu: True in the config file), the partition for the rule blast must be GPU enabled. If not, the pipeline will fail to run.
The blast: gpus: option will be ignored if use_gpu: False is set.
mem is in MB and time is in minutes.
If using the snakemake-executor-plugin-cannon specifically for the Harvard Cannon cluster, you can leave the partition: fields blank and one will be selected automatically based on the other resources requested!

You will have to determine the proper resource usage for your dataset. Generally, the larger the genomes, the more time and memory each job will need, and the more you will benefit from providing more CPUs and GPUs.

We have an example of resource allocation and usage for creation of a new HAL file. Since updating a HAL file uses the same steps, just with fewer rounds, you can gauge resource usage with this example.

Running the pipeline

With everything installed and the Snakemake configuration file setup, you are now ready to run the pipeline.

Do a `--dryrun` first

First, we want to make sure everything is setup properly by using the --dryrun option. This tells Snakemake to display what jobs it is going to run without actually submitting them. This is important to do before actually submitting the jobs so we can catch any setup errors beforehand.

This is done with the following command, changing the snakefile -s and --configfile paths to the one you have created for your project:

snakemake -j <# of jobs to submit simultaneously> -e slurm -s </path/to/cactus_add_outgroup.smk> --configfile <path/to/your/snakmake-config.yml> --dryrun

Command breakdown

Command-line option	Description
`snakemake`	The call to the snakemake workflow program to execute the workflow.
`-j <# of jobs to submit simultaneously>`	The maximum number of jobs that will be submitted to your SLURM cluster at one time.
`-e slurm`	Specify to use the SLURM executor plugin, or use `-e cannon` if using the Cannon specific plugin. See: Getting started
`-s </path/to/cactus_add_outgroup.smk>`	The path to the workflow file.
`--configfile <path/to/your/snakmake-config.yml>`	The path to your config file. See: Preparing the Snakemake config file.
`--dryrun`	Do not execute anything, just display what would be done.

This command won't actually submit the pipeline jobs!

However even during a --dryrun some pre-processing steps will be run, including creation of the output directory if it doesn't exist, downloading the Cactus Singularity image if cactus_path: download is set in the config file, and creation of some small input files. These should all be relatively fast and not resource intensive tasks.

If this completes successfully, you should see a bunch of blue, yellow, and green text on the screen, ending with something like this (the number of jobs and Reasons: may differ for your project):

Job stats:
job               count
--------------  -------
align                 1
all                   1
blast                 1
combine_hals          1
get_root_fasta        1
maf                   1
preprocess            1
total                 7

Reasons:
    (check individual jobs above for details)
    input files updated by another job:
        align, all, blast, combine_hals, maf, preprocess
    output files have to be generated:
        align, blast, combine_hals, get_root_fasta, maf, preprocess
This was a dry-run (flag -n). The order of jobs does not reflect the order of execution.

If you see any red text, that likely means an error has occurred that must be addressed before you run the pipeline.

Submitting the jobs

If you're satisfied that the --dryrun has completed successfully and you are ready to start submitting Cactus jobs to the cluster, you can do so by simply removing the --dryrun option from the command above:

snakemake -j <# of jobs to submit simultaneously> -e slurm -s </path/to/cactus_add_outgroup.smk> --configfile <path/to/your/snakmake-config.yml>

This will start submitting jobs to SLURM. On your screen, you will see continuous updates regarding job status in blue text. In another terminal, you can also check on the status of your jobs by running squeue -u <your user id>.

Be sure you have a way to keep the main Snakemake process running.

Remember, that some of these steps can take a long time, or your jobs may be stuck in the queue for a while. That means the Snakemake job must have a persistent connection in order to keep running an submitting jobs. There are a few ways this can be accomplished, some better than ~~others~~:

~~Keep your computer on and connected to the cluster. This may be infeasible and you may still suffer from connection issues.~~
Run the Snakemake job in the background by using nohup <snakemake command> &. This will run the command in the background and persist even if you disconnect. However, it makes it difficult to check on the status of your command.
Submit the Snakemake command itself as a SLURM job. This will require preparing and submitting a job script . This is a good solution.
Use a terminal multiplexer like GNU Screen or tmux . These programs allow you to open a sub-terminal within your currently connected terminal that remains even after you disconnect from the server. This is also a good solution.

Depending on the number of genomes, their sizes, and your wait in the queue, you will hopefully have your updated whole genome alignment within a few hours!

Test dataset

Cactus provides a test dataset which we have setup to run in the tests/evolverMammals/ folder.

Here is a breakdown of the files so you can investigate them and prepare similar ones for your project:

File/directory	Description
`evolverMammals-seq/`	This directory contains the input sequence files for the test dataset in FASTA format.
`evolverMammals-update-cfg.yaml`	This is the config file for Snakemake and has all of the options you would need to setup for your own project.

You will first need to run the test to generate the HAL file. Then, you can add the Gorilla sequence to it using this pipeline. We recommend running this test dataset before setting up your own project.

After you've generated the HAL file, to add a genome as an outgroup, open the config file, tests/evolverMammals/evolverMammals-add-outgroup-cfg.yaml and make sure the partitions are set appropriately for your cluster. For this small test dataset, it is appropriate to use any "test" partitions you may have. Then, update the path to tmp_dir to point to a location where you have a lot of temporary space. Even this small dataset will fail if this directory does not have enough space.

After that, run a dryrun of the test dataset by changing into the tests/evolverMammals directory and running:

cd tests/evolverMammals/
snakemake -j 10 -e slurm -s ../../cactus_add_outgroup.smk --configfile evolverMammals-cfg-add-outgroup.yaml --dryurun

If this completes without error, run the pipeline by removing the --dryrun option:

snakemake -j 10 -e slurm -s ../../cactus_add_outgroup.smk --configfile evolverMammals-cfg-add-outgroup.yaml

Pipeline outputs

The pipeline will output a .fa file for the old root node, a .paf for the new root node, and a .hal that includes the original HAL alignment along with the added genome and new root node. The final alignment file will be <final_prefix>.hal, where <final_prefix> is whatever you specified in the Snakemake config file. Otherwise, the original HAL will be modified in place.

The original HAL file will essentially be copied

In order to add the new genome as an outgroup, the old HAL file is appended to the new one, leaving the old HAL file in place. While this preserves the old HAL file, it does create a copy of a potentially large file. Be aware of your data storage capacity as you run this pipeline.

The final alignment will also be presented in MAF format as <final_prefix>.<maf_reference>.maf, again where <maf_reference> is whatever you set in the Snakemake config. This file will include all sequences. Another MAF file, <final_prefix>.<maf_reference>.nodupes.maf will also be generated, which is the alignment in MAF format with no duplicate sequences. The de-duplicated MAF file is generated with --dupeMode single. See the Cactus documentation regarding MAF export for more info.

A suite of tools called HAL tools is included with the Cactus singularity image if you need to manipulate or analyze .hal files. There are many tools for manipulating MAF files, though they are not always easy to use. The makers of Cactus also develop taffy , which can manipulate MAF files by converting them to TAF files.

Questions/troubleshooting

1. My jobs were running but my Snakemake process crashed because of connection issues/server maintenance! What do I do?

1. Snakemake crashes

As long as there wasn't an error with one of the jobs, Snakemake is designed to be able to resume and resubmit jobs pretty seamlessly. You just need to run the same command you ran to begin with and it should pickup submitting jobs where it left off. You could also run a --dryrun first and it should tell you which jobs are left to be done.

2. How can I tell if my genome FASTA files are softmasked? How can I mask them if they aren't already?

2. How can I tell if my genome FASTA files are softmasked?

Cactus requires the input genomes to be softmasked. This means that masked bases appear as lower case letters: a, t, c, g. Hopefully the source of your genome FASTA file has given you some information about how it was prepared, including how it was masked. If not, a very quick method to check for the occurrence of any lower case letter in the sequence is:

if grep -q '^[^>]*[a-z]' your-genome-file.fa; then echo "The FASTA file is soft-masked."; else echo "The FASTA file is NOT soft-masked."; fi

This is fast, but only detects the occurrence of a single lower case character. To count all the lower case characters at the cost of taking a couple of minutes, you can run:

awk 'BEGIN{count=0} !/^>/{for(i=1;i<=length($0);i++) if (substr($0,i,1) ~ /[acgt]/) count++} END{print count}' your-genome-file.fa

Importantly, your genomes should not be hard masked , which means that masked bases are replaced by Ns. Unfortunately, there are many reasons for Ns to appear in a genome fasta file, so it is difficult to tell if it is because it is hardmasked based on the presence of Ns alone. Hopefully the source of the file has left some documentation describing how it was prepared...

If your genomes are not softmasked and you wish to do so, you will have to run a program like RepeatMasker or RepeatModeler on it. Please consult the documentation for these tools.

3. I want to run this on a cluster with a job scheduler other than SLURM! What do I do?

3. Clusters other than SLURM?

Generally, it should be relatively easy install and use the appropriate Snakemake cluster executor .

If you need help or run into problems, please create an issue on the pipeline's github and we'll try and help - though it will likely be up to you to test on your own cluster, since we only have easy access to a cluster running SLURM.

4. I got an error related to InsufficientSystemResources regarding GPUs during run_lastz in the blast rule. What do I do?

4. `blast` GPU error

If the text of the error is somewhat similar to:

toil.batchSystems.abstractBatchSystem.InsufficientSystemResources: The job 'run_lastz' kind-run_lastz/instance-tqdjs4tj v1 is requesting [{'count': 4, 'kind': 'gpu', 'api': 'cuda', 'brand': 'nvidia'}] accelerators, more than the maximum of [{'kind': 'gpu', 'brand': 'nvidia', 'api': 'cuda', 'count': 1}, {'kind': 'gpu', 'brand': 'nvidia', 'api': 'cuda', 'count': 1}] accelerators that SingleMachineBatchSystem was configured with. The accelerator {'count': 4, 'kind': 'gpu', 'api': 'cuda', 'brand': 'nvidia'} could not be provided. Scale is set to 1.

it may be because your GPU partition is set up to use MIGs (Multi-instance GPUs) . This is essentially like a multi-core CPU, except for GPUs. The problem is that Cactus isn't setup to recognize these as of the last update of this tutorial. It only recognizes physical GPU devices. For instance, let's say you've requested 4 GPUs for your blast rule by setting blast_gpu: 4 in your config file. The BLAST job is submitted to your GPU node, which allocates 2 physical GPUs set up as MIGs, each with 2 instances, which is the equivalent of 4 GPUs. However, cactus only sees the 2 physical GPUs and thinks there aren't enough to accomodate your request for 4, resulting in the error.

To resolve this, check your cluster documentation to see if there is a way to use only physical GPUs instead of MIGs. If not, you will have to set blast_gpu: 1 in your config. That is the only way to guarantee that the number of physical GPUs aligns with your request.

5. Encountering errors

Please search for or create an issue on the pipeline's github that includes information about your input files, the command you ran, and the error that you are getting. The text of any log files would also be appreciated.

Additionally, if you are at Harvard, there are several ways to contact us to help you through your errors.

6. I have an idea to improve or add to the pipeline. What do I do?

6. Pipeline improvements

Great! Please create an issue on the pipeline's github describing your idea so we can discuss its implementation!