Welcome to WEPP Wiki
Introduction
Overview
WEPP (Wastewater-Based Epidemiology using Phylogenetic Placements) is a phylogeny-based pipeline that estimates haplotype proportions from wastewater sequencing reads using a mutation-annotated tree (MAT) (Figure 1A). By improving the resolution of pathogen variant detection, WEPP enables critical epidemiological applications previously feasible only through clinical sequencing. It also flags potential novel variants via Unaccounted Alleles, which can be examined at the read level using the interactive dashboard (Figure 1B).
WEPP begins by placing reads on the MAT and identifying an initial set of candidate haplotypes. It expands this set with the neighbors of each selected haplotype to form a candidate pool, which is passed to a deconvolution algorithm that estimates haplotype abundances. Haplotypes above a frequency threshold are retained, and their neighbors are again added to form a new candidate pool. This process repeats until the haplotype set stabilizes or the maximum number of iterations is reached (Figure 1C).
Key Features
Haplotype Proportion Estimation
WEPP's Phylogenetic Placement of reads enables accurate estimation of haplotype proportions from wastewater samples. These estimates can be interactively explored using the integrated dashboard (Figure 1B(i)), which displays each haplotype's abundance, associated lineage, and phylogenetic uncertainty via Uncertain Haplotypes - neighboring haplotypes that cannot be confidently disambiguated.
Lineage Proportion Estimation
WEPP infers lineage proportions by combining abundances of haplotypes belonging to each lineage. This approach accounts for intra-lineage diversity, resulting in more accurate and robust estimates.
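The aggregation described above can be sketched with a few lines of shell and awk. This is only an illustration of summing haplotype abundances per lineage, not WEPP's implementation: the tab-separated layout (haplotype, lineage, proportion), the function name lineage_proportions, and the sample values are all assumptions made for the example.

```shell
# Sum haplotype proportions per lineage.
# Assumed input layout (hypothetical, not WEPP's actual output format):
#   haplotype<TAB>lineage<TAB>proportion
lineage_proportions() {
    awk -F'\t' '
        { total[$2] += $3 }    # accumulate abundance for each lineage
        END { for (l in total) printf "%s\t%.2f\n", l, total[l] }
    ' | sort
}

# Hypothetical haplotype abundances, for illustration only.
printf 'hap_1\tlineage_A\t0.40\nhap_2\tlineage_A\t0.25\nhap_3\tlineage_B\t0.35\n' |
    lineage_proportions
```

Here two haplotypes of the same hypothetical lineage contribute jointly to that lineage's total, which is how intra-lineage diversity ends up reflected in the lineage-level estimate.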
Unaccounted Alleles
WEPP reports a list of Unaccounted Alleles - alleles observed in wastewater that are not explained by the selected haplotypes, along with the inferred haplotype(s) they are most likely associated with (Figure 1B(i)). These Unaccounted Alleles can serve as early indicators of novel variants and often resemble the 'cryptic' mutations described in previous studies.
Read-Level Analysis
WEPP supports detailed analysis of sequencing reads in the context of selected haplotypes (Figure 1B(ii)). It also facilitates interpretation of Unaccounted Alleles by examining their presence in reads relative to the haplotypes they are mapped to. Additional information about individual reads or haplotypes can be accessed by selecting them within the interactive panel (Figure 1B(iii) and Figure 1B(iv)).
Installation
WEPP offers multiple installation methods. Using Docker is recommended to avoid conflicts with existing packages.
- Docker image from DockerHub
- Dockerfile
- Shell Commands
Note
⚠️ The Docker image is currently built for the linux/amd64 platform. While it can run on arm64 systems (e.g., Apple Silicon or Linux aarch64) via emulation, this may lead to reduced performance.
Option-1: Install via DockerHub
The Docker image includes all dependencies required to run WEPP.
Step 1: Get the image from DockerHub
Step 2: Start and run Docker container
# Use this command if your datasets can be downloaded from the Web
docker run -it pranavgangwar/wepp:latest
# Use this command if your datasets are present in your current directory
docker run -it -v "$PWD":/WEPP -w /WEPP pranavgangwar/wepp:latest
Option-2: Install via Dockerfile
The Dockerfile contains all dependencies required to run WEPP.
Step 1: Clone the repository
Step 2: Build a Docker Image
Step 3: Start and run Docker container
# Use this command if your datasets can be downloaded from the Web
docker run -it wepp
# Use this command if your datasets are present in your current directory
docker run -it -v "$PWD":/workspace -w /workspace wepp
Option-3: Install via Shell Commands (requires sudo access)
Users without sudo access are advised to install WEPP via the Docker image.
Step 1: Clone the repository
Step 2: Install dependencies (might require sudo access)
WEPP depends on the following common system libraries, which are typically pre-installed on most development environments:
- wget
- curl
- pip
- build-essential
- python3-pandas
- pkg-config
- zip
- cmake
- libtbb-dev
- libprotobuf-dev
- protobuf-compiler
- snakemake
- conda
For Ubuntu users with sudo access, if any of the required libraries are missing, you can install them with:
sudo apt-get install -y wget pip curl python3-pip build-essential python3-pandas pkg-config zip cmake libtbb-dev libprotobuf-dev protobuf-compiler snakemake
If your system doesn't have Conda, you can install it with:
wget -O Miniforge3.sh "https://github.com/conda-forge/miniforge/releases/download/24.11.3-2/Miniforge3-24.11.3-2-Linux-x86_64.sh"
bash Miniforge3.sh -b -p "${HOME}/conda"
source "${HOME}/conda/etc/profile.d/conda.sh"
source "${HOME}/conda/etc/profile.d/mamba.sh"
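Before moving on, it can help to confirm that the command-line tools from the dependency list are actually on PATH. The helper below is a minimal sketch; the function name missing_tools is made up for this example, and it only checks executables (protoc is the binary shipped by protobuf-compiler), not development libraries such as libtbb-dev or libprotobuf-dev.

```shell
# Print every command from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Command-line tools taken from the dependency list above.
missing=$(missing_tools wget curl pip cmake zip pkg-config protoc snakemake conda)
if [ -n "$missing" ]; then
    # Unquoted on purpose so the names print on one line.
    echo "Missing tools:" $missing
else
    echo "All command-line dependencies found."
fi
```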
Quick Start
The following steps will download a real wastewater RSVA dataset and analyze it with WEPP.
Step 1: Download the test dataset
mkdir -p data/RSVA_real
cd data/RSVA_real
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR147/011/ERR14763711/ERR14763711_*.fastq.gz https://hgdownload.gi.ucsc.edu/hubs/GCF/002/815/475/GCF_002815475.1/UShER_RSV-A/2025/04/25/rsvA.2025-04-25.pb.gz https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/002/815/475/GCF_002815475.1_ASM281547v1/GCF_002815475.1_ASM281547v1_genomic.fna.gz
gunzip GCF_002815475.1_ASM281547v1_genomic.fna.gz
mv ERR14763711_1.fastq.gz ERR14763711_R1.fastq.gz
mv ERR14763711_2.fastq.gz ERR14763711_R2.fastq.gz
cd ../../
Step 2: Run the pipeline
snakemake --config DIR=RSVA_real FILE_PREFIX=test_run PRIMER_BED=RSVA_all_primers_best_hits.bed TREE=rsvA.2025-04-25.pb.gz REF=GCF_002815475.1_ASM281547v1_genomic.fna CLADE_IDX=0 --cores 32 --use-conda
Step 3: Analyze Results
All results generated by WEPP can be found in the results/RSVA_real directory.
User Guide
Organizing Data
We assume that all wastewater samples are organized in the data directory, each within its own subdirectory named by the DIR argument (see Run Command). For each sample, WEPP generates intermediate and output files in corresponding subdirectories under intermediate and results, respectively.
Each DIR created inside data is expected to contain the following files:
- Sequencing Reads: Ending with *R{1/2}.fastq.gz for paired-end reads and *.fastq.gz for single-end reads.
- Reference Genome fasta
- Mutation-Annotated Tree (MAT)
- [OPTIONAL] Genome Masking File: mask.bed, whose third column specifies sites to be excluded from analysis.
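The file requirements above can be checked before launching a run. The following is a rough sketch rather than part of WEPP: the function names are hypothetical, and the accepted extensions (.fa, .fna, .fasta for the reference and .pb.gz for the MAT) are assumptions taken from the examples in this guide.

```shell
# Count files in directory $1 that match glob pattern $2.
count_matches() {
    n=0
    for f in "$1"/$2; do
        [ -e "$f" ] && n=$((n + 1))
    done
    echo "$n"
}

# Report whether a sample directory holds the expected inputs.
# Returns 0 when all required files are present, 1 otherwise.
check_sample_dir() {
    dir=$1
    status=0
    if [ "$(count_matches "$dir" '*.fastq.gz')" -eq 0 ]; then
        echo "missing: sequencing reads (*.fastq.gz)"; status=1
    elif [ "$(count_matches "$dir" '*R1.fastq.gz')" -gt 0 ] &&
         [ "$(count_matches "$dir" '*R2.fastq.gz')" -gt 0 ]; then
        echo "reads: paired-end"
    else
        echo "reads: single-end"
    fi
    # Assumed reference extensions; adjust to match your files.
    fasta=$(( $(count_matches "$dir" '*.fa') + $(count_matches "$dir" '*.fna') + $(count_matches "$dir" '*.fasta') ))
    [ "$fasta" -gt 0 ] || { echo "missing: reference genome FASTA"; status=1; }
    [ "$(count_matches "$dir" '*.pb.gz')" -gt 0 ] || { echo "missing: MAT (*.pb.gz)"; status=1; }
    [ -f "$dir/mask.bed" ] && echo "optional mask.bed found"
    return "$status"
}

# Usage (directory name is an example): check_sample_dir data/RSVA_real
```

The check only looks at file names, so it cannot distinguish Illumina single-end reads from ONT long reads; SEQUENCING_TYPE still has to be set by the user.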
Visualization of WEPP's workflow directories
📁 WEPP
├───📁data                                # [User Created] Contains data to analyze
│    ├───📁SARS-CoV-2_test_1              # SARS-CoV-2 run wastewater samples
│    │    ├───sars_cov_2_reads.fastq.gz   # Single-ended reads
│    │    ├───sars_cov_2_reference.fa
│    │    ├───mask.bed                    # OPTIONAL
│    │    └───sars_cov_2_mat.pb.gz
│    └───📁RSVA_test_1                    # RSVA run wastewater samples
│         ├───rsva_reads_R1.fastq.gz      # Paired-ended reads
│         ├───rsva_reads_R2.fastq.gz      # Paired-ended reads
│         ├───rsva_reference.fa
│         └───rsva_mat.pb.gz
├───📁intermediate                        # [WEPP Generated] Contains intermediate stage files
│    ├───📁SARS-CoV-2_test_1
│    │    ├───file_1
│    │    └───file_2
│    └───📁RSVA_test_1
│         ├───file_1
│         └───file_2
└───📁results                             # [WEPP Generated] Contains final WEPP results
     ├───📁SARS-CoV-2_test_1
     │    ├───file_1
     │    └───file_2
     └───📁RSVA_test_1
          ├───file_1
          └───file_2
WEPP Arguments
The WEPP Snakemake pipeline requires the following arguments, which can be provided either via the configuration file (config/config.yaml) or passed directly on the command line using the --config argument. Command-line arguments take precedence over the config file.
- DIR: Folder name containing the wastewater reads
- FILE_PREFIX: File prefix for all intermediate files
- REF: Reference genome in FASTA format
- TREE: Mutation-annotated tree (MAT)
- SEQUENCING_TYPE: Sequencing read type (s: Illumina single-ended, d: Illumina double-ended, n: ONT long reads)
- PRIMER_BED: BED file for primers from the primers folder
- MIN_AF: Alleles with an allele frequency below this threshold in the reads are masked
- MIN_Q: Alleles with a Phred score below this threshold in the reads are masked
- MAX_READS: Maximum number of reads from the sample that WEPP considers; lowering it reduces runtime
- CLADE_IDX: Index used for assigning clades to selected haplotypes from the MAT. Generally '1' for SARS-CoV-2 MATs and '0' for others. To check, run "matUtils summary -i {TREE} -C {FILENAME}" and use '0' for annotation_1 and '1' for annotation_2
Run Command
WEPP's Snakemake workflow requires DIR and FILE_PREFIX as config arguments on the command line, while the remaining ones can be taken from the config file. It also requires --cores on the command line, which specifies the number of threads used by the workflow.
Examples:
- Using all the parameters from the config file
- Overriding MIN_Q and PRIMER_BED through the command line
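As a rough sketch of what these two invocations might look like, the helper below assembles and prints a snakemake command line in the same shape as the Quick Start command. The function name wepp_cmd, the sample directory RSVA_test_1, the prefix test_run, the core count, and the MIN_Q value are all illustrative assumptions, not values prescribed by WEPP.

```shell
# Print a WEPP snakemake command line.
# DIR and FILE_PREFIX are always passed via --config; any further
# KEY=VALUE pairs override the corresponding entries in config/config.yaml.
wepp_cmd() {
    dir=$1; prefix=$2; cores=$3
    shift 3
    echo snakemake --config "DIR=$dir" "FILE_PREFIX=$prefix" "$@" --cores "$cores" --use-conda
}

# 1) Take every remaining parameter from the config file.
wepp_cmd RSVA_test_1 test_run 32

# 2) Override MIN_Q and PRIMER_BED on the command line (example values).
wepp_cmd RSVA_test_1 test_run 32 MIN_Q=25 PRIMER_BED=RSVA_all_primers_best_hits.bed
```

The helper only prints the commands, so they can be reviewed first and then pasted into a terminal from the WEPP directory.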
Contributions
We welcome contributions from the community to enhance the capabilities of WEPP. If you encounter any issues or have suggestions for improvement, please open an issue on the WEPP GitHub page. For general inquiries and support, reach out to our team.
Citing WEPP
TBA.