User Guide
Organizing Data
We assume that all wastewater samples are organized in the data directory, each within its own subdirectory named by the DIR argument (see Run Command). For each sample, WEPP generates intermediate and output files in the corresponding subdirectories under intermediate and results, respectively.
Each created DIR inside data is expected to contain the following files:
- Sequencing Reads: Ending with *_R{1/2}.fastq.gz for paired-ended reads and *.fastq.gz for single-ended.
- Reference Genome in FASTA format
- Mutation-Annotated Tree (MAT)
- [OPTIONAL] Genome Masking File: mask.bed, whose third column specifies sites to be excluded from analysis.
- [OPTIONAL] Taxonium .jsonl file to be used for visualizing results in the WEPP dashboard.
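As a rough illustration of the masking file described above, the sketch below collects the third-column values of a mask.bed file into a set of sites. It assumes a whitespace-separated BED file and skips blank lines and lines starting with `#`, `track`, or `browser`. The function name `read_masked_sites` is hypothetical, and this is not WEPP's own parser.

```python
from pathlib import Path


def read_masked_sites(bed_path):
    """Return the set of sites listed in the third column of a BED file."""
    sites = set()
    for line in Path(bed_path).read_text().splitlines():
        line = line.strip()
        # Skip blank lines and common BED header/comment lines (assumption).
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"expected at least 3 columns, got: {line!r}")
        sites.add(int(fields[2]))
    return sites
```

Listing the sites this way before a run can help confirm that the mask covers the positions you intended to exclude.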
Visualization of WEPP's workflow directories
📁 WEPP
├───📁data                                    # [User Created] Contains data to analyze
│    └───📁SARS_COV_2_real                    # SARS-CoV-2 run wastewater samples - 1
│         ├───sars_cov_2_reads_R1.fastq.gz    # Paired-ended reads
│         ├───sars_cov_2_reads_R2.fastq.gz
│         ├───sars_cov_2_reference.fa
│         ├───mask.bed                        # OPTIONAL
│         ├───sars_cov_2_taxonium.jsonl.gz    # OPTIONAL
│         └───sars_cov_2_mat.pb.gz
├───📁intermediate                            # [WEPP Generated] Contains intermediate stage files
│    └───📁SARS_COV_2_real
│         ├───file_1
│         └───file_2
└───📁results                                 # [WEPP Generated] Contains final WEPP results
     └───📁SARS_COV_2_real
          ├───file_1
          └───file_2
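The layout above can be checked before launching a run. The following is a minimal sketch, not part of WEPP: the function name `check_sample_dir` is hypothetical, and the accepted extensions for the reference (`.fa`, `.fasta`) and the MAT (`.pb`, `.pb.gz`) are assumptions taken from the example tree.

```python
from pathlib import Path


def check_sample_dir(sample_dir):
    """Return a list of problems found in one sample directory (empty if none)."""
    names = [p.name for p in Path(sample_dir).iterdir() if p.is_file()]
    problems = []

    reads = [n for n in names if n.endswith(".fastq.gz")]
    # Prefixes of the R1 and R2 files; both sets are empty for single-ended data.
    r1 = {n[: -len("_R1.fastq.gz")] for n in reads if n.endswith("_R1.fastq.gz")}
    r2 = {n[: -len("_R2.fastq.gz")] for n in reads if n.endswith("_R2.fastq.gz")}
    if not reads:
        problems.append("no *.fastq.gz reads found")
    elif r1 != r2:
        problems.append("paired-ended reads need matching *_R1 and *_R2 files")

    # Extensions below are assumptions based on the example tree.
    if not any(n.endswith((".fa", ".fasta")) for n in names):
        problems.append("no reference FASTA found")
    if not any(n.endswith((".pb", ".pb.gz")) for n in names):
        problems.append("no MAT (.pb or .pb.gz) found")
    return problems
```

For example, `check_sample_dir("data/SARS_COV_2_real")` would return an empty list for the directory shown in the tree.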
WEPP Arguments
The WEPP Snakemake pipeline requires the following arguments, which can be provided either via the configuration file (config/config.yaml) or passed directly on the command line using the --config argument. The command line arguments take precedence over the config file.
- DIR - Folder name containing the wastewater reads.
- FILE_PREFIX - File prefix for all intermediate files.
- REF - Reference genome in FASTA format.
- TREE - Mutation-Annotated Tree.
- SEQUENCING_TYPE - Sequencing read type (s: Illumina single-ended, d: Illumina double-ended, or n: ONT long reads).
- PRIMER_BED - BED file of primers; a few primer files are provided in the primers folder. Requires the path to the file.
- MIN_AF - Alleles with an allele frequency below this threshold in the reads will be masked (Illumina: 0.5%, Ion Torrent: 1.5%, ONT: 2%).
- MIN_DEPTH - Sites with read depth below this threshold will be masked.
- MIN_Q - Alleles with a Phred score below this threshold in the reads will be masked.
- MIN_PROP - Minimum proportion of haplotypes (Wastewater Samples: 0.5%, Clinical Samples: 5%).
- MIN_LEN - Minimum read length to be considered after ivar trim (Default: 80).
- MAX_READS - Maximum number of reads considered by WEPP from the sample. Helpful for reducing runtime.
- CLADE_LIST - List of the clade annotation schemes used in the MAT. The SARS-CoV-2 MAT uses both the nextstrain and pango lineage naming systems, so use "nextstrain,pango" for it.
- CLADE_IDX - Index used for assigning clades to selected haplotypes from the MAT. Use '1' for Pango naming and '0' for Nextstrain naming for SARS-CoV-2. Other pathogens usually follow a single lineage annotation system, so use '0'. If there are NO lineage annotations, use '-1'. Lineage annotations can be checked by running: "matUtils summary -i {TREE} -C {FILENAME}" -> Use '0' for annotation_1 and '1' for annotation_2.
- DASHBOARD_ENABLED - Set to True to enable the interactive dashboard for viewing WEPP results, or False to disable it.
- TAXONIUM_FILE [Optional] - Name of the user-provided Taxonium .jsonl file for visualization. If specified, this file will be used instead of generating a new one from the given MAT. Ensure that the provided Taxonium file corresponds to the same MAT used for WEPP.
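To make the precedence rule and the relationship between CLADE_LIST and CLADE_IDX concrete, here is a small sketch. It does not reproduce how Snakemake or WEPP handle configuration internally. The function names `effective_config` and `validate` and the example values are hypothetical.

```python
def effective_config(file_config, cli_overrides):
    """Merge settings so that command-line values win over config-file values."""
    merged = dict(file_config)
    merged.update(cli_overrides)
    return merged


def validate(config):
    """Return a list of problems with a few of the arguments described above."""
    problems = []
    if config.get("SEQUENCING_TYPE") not in {"s", "d", "n"}:
        problems.append("SEQUENCING_TYPE must be one of s, d, n")
    schemes = [s for s in str(config.get("CLADE_LIST", "")).split(",") if s]
    idx = int(config.get("CLADE_IDX", -1))
    # -1 means "no lineage annotations"; otherwise idx selects a scheme.
    if not -1 <= idx < len(schemes):
        problems.append("CLADE_IDX must be -1 or an index into CLADE_LIST")
    return problems


# Hypothetical values, for illustration only.
file_config = {"SEQUENCING_TYPE": "d", "CLADE_LIST": "nextstrain,pango",
               "CLADE_IDX": 1, "MIN_Q": 20}
cli_overrides = {"MIN_Q": 25, "CLADE_IDX": 0}

config = effective_config(file_config, cli_overrides)
print(config["MIN_Q"], config["CLADE_IDX"], validate(config))
```

With "nextstrain,pango" as CLADE_LIST, index 0 selects the Nextstrain scheme and index 1 the Pango scheme, which matches the guidance for SARS-CoV-2 above.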
Run Command
WEPP's snakemake workflow requires DIR and FILE_PREFIX as config arguments through the command line, while the remaining ones can be taken from the config file. It also requires --cores from the command line, which specifies the number of threads used by the workflow.
Examples:
- Using all the parameters from the config file.
- Overriding MIN_Q and CLADE_IDX through the command line.
- To visualize results from a previous WEPP analysis that was run without the dashboard, set DASHBOARD_ENABLED to True and re-run only the dashboard components, without reanalyzing the dataset.
Note
⚠️ Use the same configuration parameters (DIR, FILE_PREFIX, etc.) as were used for the specific project. This ensures the dashboard serves the correct results for your chosen dataset.
⚠️ Make sure port forwarding is enabled when running on external servers to view results on your personal machine.
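The run command described in this section can be assembled programmatically. The sketch below only builds the argument list from the pieces named above (`--config` with KEY=VALUE pairs, and `--cores`). The helper name `snakemake_command`, the prefix `test_run`, and the override values are hypothetical.

```python
import shlex


def snakemake_command(directory, file_prefix, cores, **overrides):
    """Build a snakemake argv list with DIR, FILE_PREFIX, and optional overrides."""
    config = {"DIR": directory, "FILE_PREFIX": file_prefix, **overrides}
    argv = ["snakemake", "--config"]
    argv += [f"{key}={value}" for key, value in config.items()]
    argv += ["--cores", str(cores)]
    return argv


# Hypothetical prefix and values; remaining settings come from the config file.
cmd = snakemake_command("SARS_COV_2_real", "test_run", 32, MIN_Q=25, CLADE_IDX=1)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains the workflow. Passing `DASHBOARD_ENABLED=True` as an extra override follows the same pattern.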
Getting Mutation-Annotated Trees
The UShER team maintains mutation-annotated trees (MATs) for different pathogens, which can be found here. You can also create your own MAT for any pathogen from its consensus genome assemblies using viral_usher.
Analyzing WEPP's Results
WEPP generates output files for each sample in its corresponding subdirectory under results. Some of the key files are described below:
- lineage_abundance.csv - Reports the estimated abundance of different lineages detected in the wastewater sample.
- haplotype_abundance.csv - Provides the abundance and lineage information for each selected haplotype (internal nodes or clinical sequences) inferred from the wastewater sample.
- haplotype_uncertainty.csv - Lists the maximum single-nucleotide distance and all haplotypes that could not be distinguished from one another for each selected haplotype.
  - A non-zero nucleotide distance indicates that sequencing reads did not cover the distinguishing sites between haplotypes.
  - A zero distance indicates that the haplotypes are identical.
- haplotype_coverage.csv - Contains the fraction of each selected haplotype that is supported by parsimonious read-to-haplotype assignments.
- unaccounted_alleles.txt - Reports the residue, allele frequency, and sequencing depth for each unaccounted allele detected in the sample. Alleles with higher residue values are more likely to originate from a novel variant, as they were not adequately represented by the selected haplotypes.
Note
⚠️ All of these results can be easily explored and visualized through the dashboard.
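Two of the interpretation rules above can be written down as a short sketch: ranking unaccounted alleles by residue, and reading the maximum distance in the uncertainty report. The exact column layout of the output files is not described in this guide, so the code works on already-parsed records. The field names (`name`, `residue`) and the function names are hypothetical.

```python
def rank_by_residue(alleles, top=5):
    """Return the `top` alleles with the highest residue, highest first."""
    return sorted(alleles, key=lambda a: a["residue"], reverse=True)[:top]


def describe_uncertainty(max_distance):
    """Interpret the maximum single-nucleotide distance as described above."""
    if max_distance == 0:
        return "haplotypes are identical"
    return "reads did not cover the distinguishing sites"


# Made-up records, for illustration only (not real WEPP output).
alleles = [
    {"name": "allele_a", "residue": 0.10},
    {"name": "allele_b", "residue": 0.70},
    {"name": "allele_c", "residue": 0.30},
]
for allele in rank_by_residue(alleles, top=2):
    print(allele["name"], allele["residue"])
```

Alleles at the top of this ranking are the ones worth inspecting first when looking for a possible novel variant.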
Running WEPP Dashboard
The WEPP dashboard provides an interactive interface for exploring inferred haplotypes, lineage abundances, and unaccounted alleles. You can either run it as part of a WEPP analysis or launch it locally after completing a WEPP run on your server by copying the results directory.
Option 1: Run the dashboard during a WEPP run
Step 1: Enable the dashboard in your WEPP command.
Step 2: If you are running WEPP on a remote machine, use SSH port forwarding to access the dashboard in your local browser.
Step 3: Open the dashboard in your browser at:
Note
⚠️ Replace 8080 with any available local port.
Option 2: Visualize results on your local machine (requires Docker)
You can also analyze your samples with WEPP on a server and run the dashboard locally.
Step 1: Enable the dashboard in your WEPP command.
Step 2: Copy the results directory to your local machine.
Run the following command on your local computer.
Step 3: Pull the WEPP dashboard Docker image.
Step 4: Launch the dashboard by running the container and mounting the results directory.
docker run -it \
-v "/path_to_local_directory/results:/app/taxonium_backend/results" \
-p 8080:80 \
pratikkatte7/wepp-dashboard
Step 5: Open the dashboard in your browser at:
Note
⚠️ Replace 8080 with any available local port.
For additional details and advanced usage, see the WEPP Dashboard repository.
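If port 8080 is already taken on your machine, a free local port can be found with a short script. This is a sketch using only the Python standard library. The function names and the default search range are assumptions, and a port reported as free could still be claimed by another program before the dashboard starts.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if host:port can be bound right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(start=8080, stop=8100):
    """Return the first free port in [start, stop), or None if all are taken."""
    for port in range(start, stop):
        if port_is_free(port):
            return port
    return None


print(first_free_port())
```

The returned number goes in place of 8080 in the `-p 8080:80` mapping and in the address opened in the browser.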
Debugging Tips
In case of a failure or unexpected output, below are some common causes and possible solutions.
- Run Failure - Check whether reads were successfully aligned by minimap2 by inspecting the alignment.sam file in the intermediate directory. If enough reads are present but the filter rule crashes immediately, the sample may contain more reads than WEPP can efficiently handle. Use the MAX_READS parameter to downsample the input. For typical short-read datasets, setting this to ~3 million reads generally works well.
- Missing Lineages - If expected lineages are absent from the lineage_abundance.csv in the results directory, either the MAT does not contain any lineage annotations, or an incorrect CLADE_IDX argument was provided.
- Uncertainty in Lineages and Haplotypes - Uncertain lineage or haplotype assignments usually occur when:
  - Sequencing depth is low
  - Entire genome is not sufficiently covered
  - Read quality is poor and iVar trims and discards a large number of reads.
  Check lineage_abundance.csv, haplotype_abundance.csv, and haplotype_uncertainty.csv in the results directory. You can also review alignment.sam in the intermediate directory to see how many reads were used by WEPP and compare them with the reads provided as input.
- Long Runtimes - You can increase the number of threads using the --cores argument, or reduce the number of reads with MAX_READS (may affect results).
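The comparison between input reads and aligned reads suggested above can be done with a few lines of Python. This is a sketch under stated assumptions: the FASTQ file uses four lines per record, and the SAM file follows the standard layout (header lines start with `@`, the second column is the FLAG, and bit 0x4 marks an unmapped read). For paired-ended data both mates share a read name, so the SAM count is in fragments. The function names are hypothetical.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a gzipped FASTQ file (assumes 4 lines per record)."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for _ in handle) // 4


def count_mapped_reads(sam_path):
    """Count distinct read names with at least one mapped alignment in a SAM file."""
    mapped = set()
    with open(sam_path) as handle:
        for line in handle:
            if line.startswith("@"):
                continue  # header line
            fields = line.split("\t")
            flag = int(fields[1])
            if not flag & 0x4:  # 0x4 is set when the read is unmapped
                mapped.add(fields[0])
    return len(mapped)
```

A large gap between `count_fastq_reads(...)` on the input and `count_mapped_reads(...)` on alignment.sam points to alignment or trimming losses rather than a problem in the later stages.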