Inferring Recombinants Using the RIVET Backend

Infer recombinant ancestry in your own SARS-CoV-2 sequences using RIVET's backend.

Installation

Make sure RIVET is installed on your local machine before proceeding, otherwise install RIVET first by following these steps: Install RIVET On Your Machine

RIVET Backend

The RIVET backend uses RIPPLES for SARS-CoV-2 recombination detection. For more information on the RIPPLES algorithm please see: Pandemic-Scale Phylogenomics Reveals The SARS-CoV-2 Recombination Landscape

RIVET Architecture

A. RIPPLES Job Orchestrator When running a RIVET job on Google Cloud Platform (GCP), RIVET calculates the number of long branches in the input mutation-annotated tree and partitions them across n GCP instances, which is a parameter specified by the user. This stage of the pipeline is responsible for setting up and launching these parallel jobs, as well as monitoring their progress as they run. This stage of the pipeline also initiates a Chronumental job, to run concurrently as a subprocess on the local machine, which is explained in the following part B.

B. Infer MAT nodes ancestral dates In order to infer the emergence of detected ancestral recombinant nodes of interest for ranking and epidemiological prioritization, RIVET builds a time-tree using the Chronumental method. This method uses the sample dates provided in the sequence metadata file to build a probabilistic model for length of time across branches in the tree and is able to infer the dates of all internal nodes in the tree. RIVET uses these dates for internal nodes that we label as recombinants.

C. Mult-node GCP Workflow When running a RIVET job on GCP, the RIPPLES recombinant search and subsequent filtration pipeline utilizes multi-node parallelism. The degree of speedup depends on how many GCP instances the user decides to allocate towards the job, since the MAT long branches to search will be automatically partitioned across the given n machines. On each instance, once a putative list of recombinant nodes is obtained, the pipeline on that machine begins quality control and filtration checks to flag false-positive recombinants.

D. Post-filtration Aggegrator and Ranking This is the last stage of the pipeline and it occurs on your local machine, for both on-premise and GCP RIVET workflows. Once the recombination search and filtration steps of the pipeline have concluded on all instances and the local Chronumental job has finished, the filtered recombinant results for each partition of long branches are aggregated locally and the post-filtration stage of the pipeline can begin. During this last step, the final list of recombinants is ranked according to a growth metric and also additional information on each recombinant is gathered, such as clade/lineage information, descendant samples, parsimony scores, quality control/filtration information, and more. For a full list of all information reported about each putative recombinant, please see our documentation about the RIVET Results Table.

RIVET Backend Input

Warning

All input files should be placed in the current directory where you will launch your RIVET workflow.

If using GCP: The following input files with the same naming as you specify in the config.yaml file below need to be placed in your GCP Storage Bucket (bucket_id) before launching the remote RIVET job.

UShER Mutation-Annotated Tree (MAT): Updated daily and can be obtained here: SARS-CoV-2 global MAT
- RIVET performs recombination search using RIPPLES over an UShER mutation-annotated tree (MAT). Any samples you wish to search for recombinant ancestry must first be added to the MAT using UShER.
Sequence Metadata: Also updated daily to match the sequences in the corresponding MAT and can be obtained here: metadata
- The sequence metadata is a TSV file containing information about each sample in the MAT, including its name, date sequenced, country sequenced, and clade/lineage information. This information is used throughout the RIVET backend pipeline, for inferring the recombinant ancestor emergence date for example.
Sequence Files (FASTA): Downloadable at the following links, for a given $TREE_DATE (eg. 2022-07-04)
- https://hgwdev.gi.ucsc.edu/~angie/sarscov2phylo/ncbi.$TREE_DATE/genbank.fa.xz
- https://hgwdev.gi.ucsc.edu/~angie/sarscov2phylo/cogUk.$TREE_DATE/cog_all.fasta.xz

Info

During the RIVET backend quality control and filtration pipeline these sample sequence files are aligned to the SARS-CoV-2 reference and the RIPPLES inferred recombinantion-informative sites are inspected for bioinformatic and sequencing error quality issues to flag false-positive recombinants.

Example

To download the SARS-CoV-2 Genbank sequences for 2022-07-04:

wget https://hgwdev.gi.ucsc.edu/~angie/sarscov2phylo/ncbi.2022-07-04/genbank.fa.xz

Launch RIVET Job

The RIVET backend is setup to be run locally on your own machine or on Google Cloud Platform (GCP), and for ease-of-use is entirely configured through the use of the config.yaml file.

Setup

If you would like to run your RIVET backend job on Google Cloud Platform, please see the following documentation for setting up an account: GCP Setup Docs

Copy the config file from template/config.yaml into the current directory and fill out the fields. More information on each field can be found below.

cp template/config.yaml .

# GCP Credentials [LEAVE EMPTY FOR LOCAL JOB]
bucket_id:
project_id:
key_file: /tmp/keys/

# GCP Machine and Storage Bucket Config [LEAVE EMPTY FOR LOCAL JOB]
instances:
boot_disk_size: 50
machine_type:

# Ripples Parameters Config [REQUIRED]
version: ripples-fast
mat:
newick:
metadata:
date:
# Local results output directory, or name of folder on GCP storage bucket
results:
reference: reference.fa

# Additional Parameters
num_descendants: 5
public_tree: True
verbose: False
# Default to all available threads if left empty
threads:
docker_image: mrkylesmith/ripples_pipeline:latest
generate_taxonium: False

Fill out the configuration file with the settings for your RIVET job. If the field is already filled in, you will likely not want to change that parameter value.

Info

For more information on each field in the config.yaml file please see the following page: RIVET Backend Configuration

RIVET Backend Outputs

The pipeline will create a local results directory, based on the name given for the results field in config.yaml

The pipeline will automatically output the following four files within your local results directory (and in GCP bucket if running remote job):

final_recombinants_<DATE>.txt: a TSV file containing the detected recombinants, with the recombinant node id, donor node id and acceptor node id as the first three columns in the file. The rest of the columns contain information about each detected recombinant, including clade/lineage assignments, 3SEQ M,N,K and p-values, a representative descendant (containing the fewest additional mutations with respect to the recombinant node), recombinant ranking scores, and other information to be displayed by the RIVET frontend. For more information on this file, please see the RIVET Results Table page.

trios.vcf: VCF file containing the SNVs of each trio (recombinant and its parents) node.

sample_descedants.txt.xz: a TSV file where each row contains a mapping from each trio node id (one node id per row), to a set of descendant samples corresponding to that internal node id.

<DATE>.taxonium.jsonl.gz: a jsonl file used by RIVET frontend to display the recombinant node trios within the context of the global phylogeny, powered by Taxonium and Treenome.

Note

Currently the Taxonium view is only provided using public trees provided at: https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/