RIVET Backend Configuration File

Below you will find explanations for each field in the following config.yaml file.

# GCP Credentials [LEAVE EMPTY FOR LOCAL JOB]
bucket_id:
project_id:
key_file: /tmp/keys/

# GCP Machine and Storage Bucket Config [LEAVE EMPTY FOR LOCAL JOB]
instances:
boot_disk_size: 50
machine_type:

# Ripples Parameters Config [REQUIRED]
version: ripples-fast
mat:
newick:
metadata:
date:
# Local results output directory, or name of folder on GCP storage bucket
results:
reference: reference.fa

# Additional Parameters
num_descendants: 5
public_tree: True
verbose: False
# Default to all available threads if left empty
threads:
docker_image: mrkylesmith/ripples_pipeline:latest
generate_taxonium: False

RIVET GCP Job Parameters

Warning

If you are running your RIVET backend job on GCP, you must fill out every field in this subsection. If you are running your RIVET job locally on your machine, leave these fields blank.

bucket_id: The name of the GCP Storage Bucket where RIVET will find your pipeline inputs and write the pipeline outputs.
project_id: The name of your GCP project, where your Storage Bucket can be found.
key_file: The path to a GCP authentication key JSON file that gives RIVET the permissions it needs to access your GCP account and storage bucket.
instances: The number of GCP instances (machines) to parallelize your RIVET job across. RIVET automatically partitions the long branches in the given MAT across the n instances given by this field, then searches for recombination events and performs filtration checks in parallel on those n machines.
boot_disk_size: 50 Leave this field set to 50. It applies only to GCP machines.
machine_type: n2d-highcpu-32 The type of GCP machine to use for the RIVET job. We recommend leaving this field as n2d-highcpu-32, since RIVET is optimized to take advantage of GCP compute-optimized instances, but this field can be changed if desired. The list of available machines can be found at the following page: Machine families resource and comparison guide
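To make the role of instances concrete, the sketch below splits a set of long branches into near-equal contiguous ranges, one per machine. It is only an illustration: this page does not describe the partitioning scheme RIVET actually uses, and the function name partition_ranges and the half-open index ranges are assumptions made for the example.

```python
def partition_ranges(num_long_branches: int, instances: int) -> list[tuple[int, int]]:
    """Split indices 0..num_long_branches-1 into `instances` half-open ranges.

    Hypothetical helper: an even, contiguous split is assumed here and is
    not taken from the RIVET source.
    """
    if instances < 1:
        raise ValueError("instances must be at least 1")
    # Every instance gets `base` branches; the first `extra` get one more.
    base, extra = divmod(num_long_branches, instances)
    ranges = []
    start = 0
    for i in range(instances):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    # 10 long branches over 4 instances
    print(partition_ranges(10, 4))
```

With 10 long branches and instances set to 4, the first two machines would each receive three branches and the last two would each receive two.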

Info

For more information on GCP account setup, including obtaining the necessary key_file, please see the GCP Setup Docs.
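Because a GCP job needs every field in this subsection, a quick pre-flight check can list the ones still blank. The following is a minimal sketch that assumes the config has already been loaded into a plain dictionary; the helper names is_blank and missing_gcp_fields are hypothetical and are not part of RIVET.

```python
# Field names are taken from the GCP subsection of config.yaml.
GCP_FIELDS = (
    "bucket_id",
    "project_id",
    "key_file",
    "instances",
    "boot_disk_size",
    "machine_type",
)


def is_blank(value) -> bool:
    """Treat a missing value or a whitespace-only string as blank."""
    return value is None or (isinstance(value, str) and not value.strip())


def missing_gcp_fields(config: dict) -> list[str]:
    """Return the GCP fields that are still blank (hypothetical helper)."""
    return [name for name in GCP_FIELDS if is_blank(config.get(name))]


if __name__ == "__main__":
    # Values as they appear in the template at the top of this page.
    template = {
        "bucket_id": None,
        "project_id": None,
        "key_file": "/tmp/keys/",
        "instances": None,
        "boot_disk_size": 50,
        "machine_type": None,
    }
    print(missing_gcp_fields(template))
```

An empty list means the subsection is complete for a GCP job. For the unmodified template, the check reports bucket_id, project_id, instances, and machine_type as still to be filled in.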

RIVET Specific Parameters

version: ripples-fast Do not change this field. We recommend using ripples-fast, which is a new implementation of the RIPPLES algorithm that produces identical results with considerable speedup.
mat: The mutation-annotated tree (MAT) input phylogeny, generated by UShER, in which to search for recombination. A daily-updated database of SARS-CoV-2 mutation-annotated trees has been made available through matUtils and can be found here: https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/.
newick: The name of the Newick tree file that the RIVET backend pipeline will use, for example <DATE>_tree.nwk. No input file is required for this field: provide only the file name, and RIVET will perform the conversion to the Newick file format internally.
metadata: The name of the sequence metadata file you obtained here: metadata. This is a TSV file containing information about each sample in the MAT, including its name, the date and country in which it was sequenced, and clade/lineage information. This information is used throughout the RIVET backend pipeline, for example to infer the emergence date of a recombinant ancestor.
date: The date corresponding to the input MAT and metadata files, in year-month-day format, for example 2023-06-01.
results: The name of the directory to which RIVET writes all output files, both locally and in the GCP storage bucket when running a remote job.
reference: reference.fa The name of the SARS-CoV-2 reference file, which the RIVET pipeline downloads automatically. For SARS-CoV-2 recombination inference, we recommend not changing this field.
num_descendants: 5 The minimum number of leaves that a node should have to be considered for recombination.
public_tree: True This field should be set to True if the MAT was obtained at the following link: https://hgdownload.soe.ucsc.edu/goldenPath/wuhCor1/UShER_SARS-CoV-2/.
verbose: False If set to False, most standard output is written to log files instead of being printed to the console during pipeline execution.
threads: As many stages of RIVET and RIPPLES are multithreaded, this field sets the number of threads to use when running RIVET locally. If this field is left blank, the number of threads will automatically equal the number of available cores on the machine.
docker_image: mrkylesmith/ripples_pipeline:latest The public Docker image for RIVET that will be used when executing the pipeline on GCP. Do not change this field.
generate_taxonium: False When set to True, RIVET will generate a Taxonium jsonl file that can be loaded into the Taxonium web interface or desktop app to view the global phylogeny for the given input MAT.
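Two of the fields above have rules that are easy to check before launching a job: date must follow the year-month-day format, and a blank threads value falls back to the number of available cores. The sketch below illustrates both under stated assumptions. The helpers parse_config_date and resolve_threads are hypothetical, the date is assumed to arrive as a string, and os.cpu_count() stands in for the number of available cores, which may not match how RIVET counts them.

```python
import os
from datetime import date, datetime


def parse_config_date(value: str) -> date:
    """Parse a year-month-day string such as 2023-06-01.

    Raises ValueError for other layouts, for example 06/01/2023.
    """
    return datetime.strptime(value, "%Y-%m-%d").date()


def resolve_threads(value) -> int:
    """Return the thread count to use; a blank value means all logical CPUs."""
    if value is None or value == "":
        # os.cpu_count() may return None on some platforms, so fall back to 1.
        return os.cpu_count() or 1
    threads = int(value)
    if threads < 1:
        raise ValueError("threads must be a positive integer")
    return threads


if __name__ == "__main__":
    print(parse_config_date("2023-06-01"))
    print(resolve_threads(None))
```

Running such a check up front surfaces a malformed date or an invalid thread count before any long-running pipeline stage starts.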