Genomics and Healthcare:


Machine Learning for Genomics:

We propose ENBED, a novel foundation model that analyzes DNA sequences at byte-level precision with an encoder-decoder Transformer architecture. The Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) uses a sub-quadratic implementation of attention to build an efficient model capable of sequence-to-sequence transformations, generalizing previous genomic models that use encoder-only or decoder-only architectures. We pre-train the foundation model on reference genome sequences and find that it outperforms the existing state of the art on 22 of 25 genomic benchmark datasets. Leveraging this strength in sequence-level classification tasks, we show that the model can identify biological function annotations of genomic sequences. Additionally, we show that ENBED can identify sequences containing base call mismatches and insertion/deletion errors, an advantage over tokenization schemes that group multiple base pairs into one token and so lose byte-level precision. The encoder-decoder architecture also allows us to perform sequence-to-sequence transformations. We use this ability to study the prediction of pathogen mutations in 16S sequences from E. coli and to accurately generate child sequences with known mutations validated in the real-world population.
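
To illustrate why tokenizing one nucleotide at a time helps with insertion/deletion errors, here is a minimal Python sketch. It is not ENBED's tokenizer; the function names `byte_tokens` and `kmer_tokens`, the choice of non-overlapping 3-mers, and the example sequences are all assumptions made for illustration.

```python
def byte_tokens(seq: str) -> list[int]:
    """One token per nucleotide: the ASCII byte value of each base."""
    return list(seq.encode("ascii"))


def kmer_tokens(seq: str, k: int = 3) -> list[str]:
    """Non-overlapping k-mers; a trailing partial k-mer is kept as-is."""
    return [seq[i:i + k] for i in range(0, len(seq), k)]


# Hypothetical example: the second sequence has one extra G inserted.
reference = "ACGTACGTAC"
with_insertion = "ACGGTACGTAC"

# Byte-level view: everything after the inserted base is the same
# token stream as the reference, shifted by one position.
ref_bytes = byte_tokens(reference)
ins_bytes = byte_tokens(with_insertion)
shifted_match = ins_bytes[4:] == ref_bytes[3:]

# k-mer view: the single insertion changes the frame, so the tokens
# after the insertion point no longer line up with the reference.
ref_kmers = kmer_tokens(reference)
ins_kmers = kmer_tokens(with_insertion)

print(ref_bytes)
print(ins_bytes)
print(ref_kmers)
print(ins_kmers)
```

In the byte-level view the inserted base appears as exactly one extra token, whereas in the 3-mer view every token downstream of the insertion differs from the reference.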


Machine Learning for Sepsis Early Detection and Treatment:

Sepsis is a life-threatening medical emergency caused by the body's overwhelming response to an infection. Without urgent treatment, it can lead to tissue damage, organ failure, and death. Machine learning (ML) has been used to address the challenge of managing sepsis through sequential decision-making; however, these methods perform poorly in data-limited offline settings, with survival rates falling below 50 percent. We propose a transformer-based decision maker and integrate a mortality classifier as a reinforcement component to improve the overall survival rate of patients.


Machine Learning for Health Risk Detection:

Efficient learning-based techniques are essential for predicting health risk in individuals. In many cases, passive face videos can be used to predict health risks. We have used learning-based techniques for early prediction of sepsis, prediction of lifting load risk and force exertions, and health monitoring.


DNA Based Data Storage:

DNA-based data storage systems have emerged as a solution to accommodate the data explosion. In this work, properties of DNA codewords that are essential for archival DNA storage are considered in the design of codes. Constraint-based DNA codes, which avoid runs of nucleotides, have fixed GC-weight, and have a specified minimum distance, are presented. Further, we have provided a review of natural storage. We note that insertions and deletions are common errors in DNA storage, and efficient approaches to deal with such errors are also studied.

