Projects

Welcome to the Turakhia Lab research page. We are an interdisciplinary group of researchers working on a broad range of problems at the intersection of computer engineering and bioinformatics. We like to work on problems that have enormous potential in biological and medical applications but where computational cost and speed impose a barrier. Specifically, our research mission is to develop algorithms and domain-specific hardware accelerators that enable faster and cheaper progress in biology and medicine.

Some of the ongoing research areas and projects in the lab are highlighted below. Please also see the publications and news pages for the latest developments, and the apply page to apply to join our group.

Hardware acceleration of computational genomics

Genomic data is one of the fastest-growing data types on the planet and is far outpacing Moore’s law. From personalized medicine to species conservation, genomic data has far-reaching applications, but computational costs make it increasingly difficult to exploit the full potential of this data. We are constantly exploring novel algorithms and hardware (GPU/FPGA/ASIC) acceleration approaches to speed up a wide range of computational genomics tasks, such as genome assembly, read alignment, and whole-genome alignment, by orders of magnitude.

Tiling Long Genome Sequence Alignment using Convergence of Traceback Pointers (TALCO): Pairwise sequence alignment is one of the most fundamental and computationally intensive steps in genome analysis. With the falling costs and rising throughput of third-generation sequencing technologies and the growing availability of whole-genome datasets, longer alignments are becoming more common in bioinformatics. However, the high memory demands of long alignments create significant obstacles to hardware acceleration. Tiling-based hardware accelerators have made remarkable strides in accelerating sequence alignment, achieving three to four orders of magnitude improvement in alignment throughput over software tools without any restrictions on alignment length. However, existing tiling heuristics can degrade alignment quality, which is a critical concern for the wider adoption of accelerators in bioinformatics. To address this issue, we devised TALCO, a novel method for tiling long sequence alignments that, like prior tiling techniques, maintains a constant memory footprint during the acceleration step, independent of alignment length. Unlike previous techniques, however, TALCO also ensures optimal alignments under banding constraints. TALCO does this by leveraging the convergence of traceback paths beyond a tile to a single point on the boundary of that tile, a strategy that generalizes well to a broad set of sequence alignment algorithms. To demonstrate this generality, we applied TALCO to two widely used banded sequence alignment algorithms, X-Drop and WFA-Adapt.
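
To make the notion of a banding constraint concrete, here is a minimal sketch of a fixed-width banded edit distance computed with two rows of memory. It is not TALCO, X-Drop, or WFA-Adapt: the function name `banded_edit_distance`, the `band` parameter, and the unit-cost scoring are assumptions chosen for illustration only. Cells farther than `band` from the main diagonal are treated as unreachable, so the result is optimal only among alignments that stay inside the band.

```python
def banded_edit_distance(a: str, b: str, band: int) -> float:
    """Edit distance restricted to cells with |i - j| <= band.

    Illustrative sketch only (hypothetical helper, unit costs).
    Returns float('inf') if no alignment fits inside the band.
    """
    INF = float("inf")
    n, m = len(a), len(b)
    # Row 0: aligning the empty prefix of `a` to b[:j] costs j insertions,
    # but only cells inside the band are reachable.
    prev = [j if j <= band else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        lo = max(0, i - band)
        hi = min(m, i + band)
        for j in range(lo, hi + 1):
            if j == 0:
                cur[j] = i  # i deletions from `a`
                continue
            substitute = prev[j - 1] + (a[i - 1] != b[j - 1])
            delete = prev[j] + 1      # INF if the cell above is outside the band
            insert = cur[j - 1] + 1   # INF if the cell to the left is outside the band
            cur[j] = min(substitute, delete, insert)
        prev = cur  # only two rows are kept at any time
    return prev[m]


if __name__ == "__main__":
    print(banded_edit_distance("kitten", "sitting", band=2))
```

The band bounds the work per row, but a traceback over a long alignment would still need pointers for every row. That memory cost is the one that tiling approaches aim to keep constant.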

Pangenomics

Low-cost and high-throughput sequencing have empowered the biomedical research community to sequence vast numbers of genomes. This is catalyzing the growth of an emerging area of research called pangenomics, which aims to study the entire genetic variation within a species using a large collection of reference genomes. This research will impact many fields (human genetics, cancer research, microbiology, epidemiology, crop genomics, etc.) but is also a radical shift from traditional approaches, which relied on a single reference genome. As a result, pangenomics requires a fundamental redesign of our existing sequence analysis tools. Our team is focused on developing data structures and algorithms for comprehensive pangenomic analysis, as well as hardware acceleration to enable faster processing of the data.

Phylogenetics

Phylogenetics is the study of the evolutionary relationships between a set of biological sequences, organisms, or species. It has many downstream biological and medical applications, from evolutionary biology to outbreak analysis to understanding the progression of cancer. Large phylogenies are computationally challenging to estimate, and with recent advances in genomic sequencing technologies generating massive amounts of genomic data, there is an urgent need to automate this process. We are exploring new phylogenetic methods and acceleration techniques to cater to a wide range of applications.

Accurate, Fast, and Fully-automated Inference of Species Trees: Extensive genome sequencing efforts are currently in progress across a broad spectrum of life forms, holding the potential to unravel the intricate branching patterns within the tree of life. However, constructing phylogenetic trees is computationally expensive, with the possible number of tree topologies for a mere 50+ species surpassing the number of atoms in the universe. Current state-of-the-art methods apply heuristics to explore a tiny fraction of this huge search space but soon reach their computational limits on larger datasets, taking several CPU weeks for thousands of species. Additionally, they require a series of manual steps that are neither entirely automated nor standardized. We, along with Prof. Siavash Mirarab’s lab at UCSD, are building a fully automated, scalable, and robust hardware-software co-designed solution for accelerating phylogenetic tree construction for all kinds of genomes without requiring any hand-curated prior information. It will provide an end-to-end pipeline from raw genomic sequences to their phylogenetic tree, enabling researchers to explore thousands of genomes with minimal processing time.
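
To give a sense of how quickly this search space grows, here is a minimal sketch that counts binary tree topologies using the standard double-factorial formulas: (2n-5)!! unrooted and (2n-3)!! rooted topologies for n labeled taxa. The function names are our own and are purely illustrative.

```python
def odd_double_factorial(k: int) -> int:
    """Product 1 * 3 * 5 * ... * k over odd integers (1 if k <= 1)."""
    result = 1
    for x in range(3, k + 1, 2):
        result *= x
    return result


def num_unrooted_trees(n: int) -> int:
    """Number of unrooted binary topologies on n >= 3 labeled taxa."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return odd_double_factorial(2 * n - 5)


def num_rooted_trees(n: int) -> int:
    """Number of rooted binary topologies on n >= 2 labeled taxa."""
    if n < 2:
        raise ValueError("need at least 2 taxa")
    return odd_double_factorial(2 * n - 3)


if __name__ == "__main__":
    for n in (5, 10, 20, 50, 60):
        print(n, f"{num_unrooted_trees(n):.2e}", f"{num_rooted_trees(n):.2e}")
```

Each added taxon multiplies the count by a factor that itself grows linearly, so exhaustive enumeration is infeasible well before datasets reach even a hundred species.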

Real-time Phylogenetics for Pandemic and Outbreak Analysis

One specific application of phylogenetics that we, with Prof. Corbett-Detig’s lab at UC Santa Cruz, have pioneered is pandemic and outbreak analysis. Much of this research began with the analysis of SARS-CoV-2 during the COVID-19 pandemic. Our work helped maintain the comprehensive SARS-CoV-2 phylogenetic tree using global sequencing data and forms the basis for naming new SARS-CoV-2 variants. We are now generalizing our methods to a wide range of pathogens and other microbes.

Ultrafast Sample Placement on Existing TRees (UShER): With over 16 million (and counting) whole-genome SARS-CoV-2 sequences already available, the COVID-19 pathogen is the most sequenced pathogen in history. Phylogenetic analysis using these genomes has played a vital role in tracking the virus’s evolution and transmission, but poses major computational challenges. Our lab is working closely with Prof. Corbett-Detig’s lab at UC Santa Cruz to maintain and refine a comprehensive phylogenetic tree consisting of all available SARS-CoV-2 genomes. Building and updating phylogenetic trees of such massive scale was made possible by UShER, which builds an efficient tree-based data structure encoding the inferred evolutionary history of the virus. [Paper] [GitHub] [Wiki]
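
As a rough illustration of what a tree-based structure annotated with mutations makes possible, here is a toy sketch that places a new sample at the node whose accumulated mutations differ least from the sample's. This is a deliberately simplified assumption on our part, not UShER's actual algorithm or data format: the `Node` class, the `best_placement` function, and the mutation labels are all hypothetical, and real placement also considers branches, back-mutations, and ambiguous bases.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    mutations: frozenset  # mutations on the branch leading to this node
    children: list = field(default_factory=list)


def best_placement(root: Node, sample: frozenset):
    """Return (node name, cost) minimizing the mutations separating the
    sample from a node's accumulated genotype (toy parsimony criterion)."""
    best_node, best_cost = None, None
    stack = [(root, frozenset())]
    while stack:
        node, inherited = stack.pop()
        genotype = inherited | node.mutations
        # Mutations the sample has but the node lacks, plus the reverse.
        cost = len(genotype ^ sample)
        if best_cost is None or cost < best_cost:
            best_node, best_cost = node, cost
        for child in node.children:
            stack.append((child, genotype))
    return best_node.name, best_cost


# Hypothetical four-node tree with made-up mutation labels.
a1 = Node("A1", frozenset({"C200G"}))
a = Node("A", frozenset({"A100T"}), [a1])
b = Node("B", frozenset({"G300A"}))
root = Node("root", frozenset(), [a, b])

placement = best_placement(root, frozenset({"A100T", "C200G", "T400C"}))
print(placement)
```

Because each node stores only the mutations on its own branch, a single traversal can evaluate every candidate position without storing or re-aligning full genomes.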

matUtils: The vast scale of SARS-CoV-2 sequencing data also made it difficult to analyze the available data using existing tools and file formats. matUtils is a command-line utility for rapidly querying, interpreting, and manipulating UShER-generated mutation-annotated trees (MATs), drastically lowering the barrier to entry for SARS-CoV-2 data analysis and research. [Paper]

matOptimize: To efficiently maintain and optimize the massive phylogenies enabled by UShER in real time, we developed matOptimize, a fast and memory-efficient phylogenetic tree optimization tool for parsimony-based MATs. matOptimize achieves orders of magnitude improvement in runtime and peak memory usage compared to existing state-of-the-art methods. Online phylogenetics with UShER and matOptimize produces trees similar to those from de novo maximum likelihood-based approaches, while requiring only a fraction of the computational cost to construct. [Paper]
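
For readers unfamiliar with parsimony, the score that a parsimony-based optimizer tries to reduce can be computed for a single site with the textbook Fitch algorithm. The sketch below is not matOptimize's implementation. It assumes a rooted binary tree written as nested tuples, with leaves given as single-character states, and the function name `fitch_score` is our own.

```python
def fitch_score(tree):
    """Return (candidate states, minimum number of changes) for one site.

    `tree` is either a leaf state such as "A" or a pair (left, right).
    Textbook Fitch algorithm; illustrative only.
    """
    if isinstance(tree, str):
        return {tree}, 0
    left, right = tree
    left_states, left_cost = fitch_score(left)
    right_states, right_cost = fitch_score(right)
    shared = left_states & right_states
    if shared:
        # Children agree on at least one state: no extra change needed.
        return shared, left_cost + right_cost
    # Children disagree: take the union and count one change.
    return left_states | right_states, left_cost + right_cost + 1


if __name__ == "__main__":
    states, changes = fitch_score((("A", "A"), ("C", "A")))
    print(states, changes)
```

Summing this score over all sites gives the parsimony score of a tree. An optimizer then searches for rearrangements that lower it.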

SARS-CoV-2 Recombination Inference: Accurate and timely detection of recombinant lineages is crucial for interpreting genetic variation, reconstructing epidemic spread, identifying selection and variants of interest, and accurately performing phylogenetic analyses. Previous recombination detection software failed to scale to SARS-CoV-2 datasets, so we developed RIPPLES, a new phylogenomic method capable of searching comprehensive SARS-CoV-2 phylogenies for recombinant lineages within hours. In July 2021, we carried out the largest SARS-CoV-2 recombination study, which revealed that more than 2.7% of sequenced SARS-CoV-2 genomes contain recombinant ancestry, with a disproportionate number of breakpoint intervals inferred in the spike protein region of the genome. Furthermore, building on the RIPPLES algorithm, we developed RIVET, a fully automated pipeline for the inference, ranking, and visualization of the latest putative SARS-CoV-2 recombinants. RIVET was designed to greatly accelerate the laborious, manual process of Pangolin recombinant lineage designation. It is updated on a weekly basis and is being generalized to other pathogens. [RIPPLES] [RIVET]

Wastewater-based Epidemiology

The SARS-CoV-2 pandemic was the most devastating infectious disease outbreak of the last century, causing 700 million reported infections and 7 million deaths. In the early days of the pandemic, clinical genomic surveillance helped discover new variants and track their transmission across regions. However, tracking pathogen evolution and transmission through clinical genomic surveillance suffers from sampling bias, is costly and difficult to scale, and raises data privacy concerns. Wastewater-based epidemiology (WBE) is an emerging alternative in which sewage is analyzed for the presence of genomic sequences of disease-causing pathogens and their specific variants. A UCSD team showed earlier that WBE not only mitigates the problems associated with sampling bias, cost, scalability, and privacy but can even detect emerging variants up to two weeks in advance. Today, WBE is being rapidly deployed across the world to monitor a wide range of pathogens (SARS-CoV-2, RSV, Monkeypox, Parainfluenza, Polio, etc.). With so much WBE data come computational and algorithmic challenges. We are currently working on new algorithms and software tools that aim to improve the accuracy, resolution, and timeliness of existing WBE solutions. By doing so, we aim to contribute to a more effective and comprehensive strategy for monitoring and managing outbreaks of a wide range of pathogens.