User Guide

Organizing Data

We assume that all wastewater samples are organized in the data directory, each within its own subdirectory named by the DIR argument (see Run Command). For each sample, WEPP generates intermediate and output files in the corresponding subdirectories under intermediate and results, respectively.

Each created DIR inside data is expected to contain the following files:

  1. Sequencing Reads: File names ending with *_R{1/2}.fastq.gz for paired-end reads and *.fastq.gz for single-end reads.
  2. Reference Genome in FASTA format
  3. Mutation-Annotated Tree (MAT)
  4. [OPTIONAL] Genome Masking File: mask.bed, whose third column specifies sites to be excluded from analysis.
  5. [OPTIONAL] Taxonium .jsonl file to be used for visualizing results in the WEPP dashboard.
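
As a rough illustration of how the optional mask.bed could be read, the sketch below collects the third-column value of each record. The helper name `masked_sites`, the contig name `ref`, and the coordinates are made up for the example; whether WEPP treats these values as 0-based or 1-based positions is not stated here, so treat the interpretation as an assumption.

```python
import io


def masked_sites(bed_lines):
    """Collect the third-column value of every record in a mask.bed-style file."""
    sites = set()
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # The third column is the one described as listing the excluded sites.
        sites.add(int(fields[2]))
    return sites


# Hypothetical two-record mask file, held in memory for the example.
example = io.StringIO("ref\t99\t100\nref\t240\t241\n")
sites = masked_sites(example)
print(sorted(sites))
```

The same function also accepts an open file handle, for example `open("data/SARS_COV_2_real/mask.bed")`.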

Visualization of WEPP's workflow directories

📁 WEPP
└───📁data                                   # [User Created] Contains data to analyze
    ├───📁SARS_COV_2_real                    # SARS-CoV-2 run wastewater samples - 1
         ├───sars_cov_2_reads_R1.fastq.gz    # Paired-ended reads
         ├───sars_cov_2_reads_R2.fastq.gz
         ├───sars_cov_2_reference.fa
         ├───mask.bed                        # OPTIONAL
         ├───sars_cov_2_taxonium.jsonl.gz    # OPTIONAL
         └───sars_cov_2_mat.pb.gz

└───📁intermediate                           # [WEPP Generated] Contains intermediate stage files
    ├───📁SARS_COV_2_real
         ├───file_1
         └───file_2

└───📁results                                # [WEPP Generated] Contains final WEPP results
    ├───📁SARS_COV_2_real
         ├───file_1
         └───file_2
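
Before starting a run, it can help to confirm that a sample folder matches the layout above. The following is a minimal sketch, not part of WEPP: `check_sample_dir` is a hypothetical helper, and it only checks file names against the patterns listed in this section, using a temporary directory in place of a real data folder.

```python
import tempfile
from pathlib import Path


def check_sample_dir(sample_dir, ref, tree, paired=True):
    """Return a list of problems found in one data/<DIR> folder (empty if none)."""
    sample_dir = Path(sample_dir)
    problems = []

    # Reference genome and MAT, named as in the REF and TREE arguments.
    for name in (ref, tree):
        if not (sample_dir / name).is_file():
            problems.append(f"missing required file: {name}")

    if paired:
        # Paired reads are expected as *_R1.fastq.gz and *_R2.fastq.gz.
        for mate in ("R1", "R2"):
            if not list(sample_dir.glob(f"*_{mate}.fastq.gz")):
                problems.append(f"missing paired-end reads: *_{mate}.fastq.gz")
    elif not list(sample_dir.glob("*.fastq.gz")):
        problems.append("missing single-end reads: *.fastq.gz")

    return problems


# Demonstration on a throwaway copy of the example layout (empty placeholder files).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "data" / "SARS_COV_2_real"
    sample.mkdir(parents=True)
    for name in (
        "sars_cov_2_reads_R1.fastq.gz",
        "sars_cov_2_reads_R2.fastq.gz",
        "sars_cov_2_reference.fa",
        "sars_cov_2_mat.pb.gz",
    ):
        (sample / name).touch()

    complete = check_sample_dir(sample, "sars_cov_2_reference.fa", "sars_cov_2_mat.pb.gz")

    # Remove one mate file to show how a gap is reported.
    (sample / "sars_cov_2_reads_R2.fastq.gz").unlink()
    incomplete = check_sample_dir(sample, "sars_cov_2_reference.fa", "sars_cov_2_mat.pb.gz")

print(complete)
print(incomplete)
```

The optional mask.bed and Taxonium files are deliberately not checked, since a run is valid without them.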

WEPP Arguments

The WEPP Snakemake pipeline requires the following arguments, which can be provided either via the configuration file (config/config.yaml) or passed directly on the command line using the --config argument. The command line arguments take precedence over the config file.

  1. DIR - Folder name containing the wastewater reads.
  2. FILE_PREFIX - File Prefix for all intermediate files.
  3. REF - Reference Genome in FASTA format.
  4. TREE - Mutation-Annotated Tree.
  5. SEQUENCING_TYPE - Sequencing read type (s: Illumina single-end, d: Illumina paired-end, or n: ONT long reads).
  6. PRIMER_BED - Path to the BED file of primers. A few primer files are provided in the primers folder.
  7. MIN_AF - Alleles with an allele frequency below this threshold in the reads will be masked (Illumina: 0.5%, Ion Torrent: 1.5%, ONT: 2%).
  8. MIN_DEPTH - Sites with read depth below this threshold will be masked.
  9. MIN_Q - Alleles with a Phred score below this threshold in the reads will be masked.
  10. MIN_PROP - Minimum Proportion of haplotypes (Wastewater Samples: 0.5%, Clinical Samples: 5%).
  11. MIN_LEN - Minimum read length to be considered after ivar trim (Default: 80).
  12. MAX_READS - Maximum number of reads considered by WEPP from the sample. Helpful for reducing runtime.
  13. CLADE_LIST - List the clade annotation schemes used in the MAT. SARS-CoV-2 MAT uses both nextstrain and pango lineage naming systems, so use "nextstrain,pango" for it.
  14. CLADE_IDX - Index used for assigning clades to selected haplotypes from the MAT. For SARS-CoV-2, use '1' for Pango naming and '0' for Nextstrain naming. Other pathogens usually follow a single lineage annotation system, so use '0'. If the MAT has no lineage annotations, use '-1'. Lineage annotations can be checked by running "matUtils summary -i {TREE} -C {FILENAME}": use '0' for annotation_1 and '1' for annotation_2.
  15. DASHBOARD_ENABLED - Set to True to enable the interactive dashboard for viewing WEPP results, or False to disable it.
  16. TAXONIUM_FILE [Optional] - Name of the user-provided Taxonium .jsonl file for visualization. If specified, this file will be used instead of generating a new one from the given MAT. Ensure that the provided Taxonium file corresponds to the same MAT used for WEPP.
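
The relationship between CLADE_LIST and CLADE_IDX can be made concrete with a small sketch. It assumes that CLADE_IDX is a zero-based position in the comma-separated CLADE_LIST, which is what the SARS-CoV-2 values above suggest. `selected_clade_scheme` is a hypothetical helper written for this illustration and does not exist in WEPP.

```python
def selected_clade_scheme(clade_list, clade_idx):
    """Return the annotation scheme that CLADE_IDX points to, or None for -1."""
    # CLADE_LIST is written as a comma-separated string, e.g. "nextstrain,pango".
    schemes = [name.strip() for name in clade_list.split(",") if name.strip()]

    # -1 is the documented value for a MAT without lineage annotations.
    if clade_idx == -1:
        return None

    if not 0 <= clade_idx < len(schemes):
        raise ValueError(
            f"CLADE_IDX={clade_idx} is out of range for CLADE_LIST={clade_list!r}"
        )
    return schemes[clade_idx]


print(selected_clade_scheme("nextstrain,pango", 0))
print(selected_clade_scheme("nextstrain,pango", 1))
print(selected_clade_scheme("nextstrain,pango", -1))
```

Under that assumption, a CLADE_IDX that does not fit the length of CLADE_LIST is reported as an error instead of being silently accepted.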

Run Command

WEPP's snakemake workflow requires DIR and FILE_PREFIX as config arguments through the command line, while the remaining ones can be taken from the config file. It also requires --cores from the command line, which specifies the number of threads used by the workflow.
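
For scripted runs, the command structure described above can be assembled programmatically. This is a minimal sketch under the assumption that only the flags shown in this guide (--config, --cores, --use-conda, --forcerun) are needed. `build_wepp_command` is a hypothetical helper, not something WEPP provides.

```python
import shlex


def build_wepp_command(dir_name, file_prefix, cores, overrides=None, forcerun=None):
    """Assemble the snakemake argument list for one WEPP run."""
    # DIR and FILE_PREFIX always go on the command line.
    config = {"DIR": dir_name, "FILE_PREFIX": file_prefix}
    # Any extra KEY=VALUE pairs passed here take precedence over config/config.yaml.
    config.update(overrides or {})

    cmd = ["snakemake", "--config"]
    cmd += [f"{key}={value}" for key, value in config.items()]
    cmd += ["--cores", str(cores), "--use-conda"]
    if forcerun:
        # Optionally force one rule to re-run, e.g. the dashboard rule.
        cmd += ["--forcerun", forcerun]
    return cmd


cmd = build_wepp_command(
    "SARS_COV_2_real", "test_run", 32, overrides={"MIN_Q": 25, "CLADE_IDX": 1}
)
command_line = shlex.join(cmd)
print(command_line)
```

The returned list can be handed to `subprocess.run(cmd, check=True)`, presumably from the directory that contains the workflow; that working-directory requirement is an assumption, not something this guide states.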

Examples:

  1. Using all the parameters from the config file.

    snakemake --config DIR=SARS_COV_2_real FILE_PREFIX=test_run TREE=sars_cov_2_mat.pb.gz REF=sars_cov_2_reference.fa --cores 32 --use-conda
    

  2. Overriding MIN_Q and CLADE_IDX through command line.

    snakemake --config DIR=SARS_COV_2_real FILE_PREFIX=test_run TREE=sars_cov_2_mat.pb.gz REF=sars_cov_2_reference.fa MIN_Q=25 CLADE_IDX=1 --cores 32 --use-conda
    

  3. To visualize results from a previous WEPP analysis that was run without the dashboard, set DASHBOARD_ENABLED to True and re-run only the dashboard components, without reanalyzing the dataset.

    snakemake --config DIR=SARS_COV_2_real FILE_PREFIX=test_run TREE=sars_cov_2_mat.pb.gz REF=sars_cov_2_reference.fa MIN_Q=25 CLADE_IDX=1 DASHBOARD_ENABLED=True --cores 32 --use-conda --forcerun dashboard_serve
    

Note

⚠️ Use the same configuration parameters (DIR, FILE_PREFIX, etc.) as were used for the specific project. This ensures the dashboard serves the correct results for your chosen dataset.

⚠️ Make sure port forwarding is enabled when running on external servers to view results on your personal machine.

MAT Download

The UShER team maintains mutation-annotated trees (MATs) for different pathogens, which can be found here. You can also create your own MAT for any pathogen from consensus genome assemblies using viral_usher.