Research Interests
Broadly, I'm interested in machine learning as a tool to uncover novel signals in biology, and I am equally interested in designing and probing the tools themselves. Some of my work explores the interpretability of genome language models to learn what signals are captured during unsupervised pre-training. Building on these findings, I'm also interested in designing biologically motivated, downstream-task-aligned unsupervised pre-training regimes that more robustly recover underlying biological signal.
Papers
-
2025 - Transformers and genome language models
Large language models based on the transformer deep learning architecture have revolutionized natural language processing. Motivated by the analogy between human language and the genome’s biological code, researchers have begun to develop genome language models (gLMs) based on transformers and related architectures. This Review explores the use of transformers and language models in genomics. We survey open questions in genomics amenable to gLMs, and motivate the use of gLMs and the transformer architecture for these problems. We discuss the potential of gLMs for modelling the genome using unsupervised pretraining tasks, specifically focusing on the power of zero- and few-shot learning. We explore the strengths and limitations of the transformer architecture, and of current gLMs more broadly. Additionally, we contemplate the future of genomic modelling beyond the transformer architecture, based on current trends in research. This Review serves as a guide for computational biologists and computer scientists interested in transformers and language models for genomic data.
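As a concrete illustration of the zero-shot setting the Review discusses, the sketch below scores a single-nucleotide variant with a masked genome language model by comparing the model's log-probabilities for the reference and alternate bases at a masked position. The checkpoint name, the assumption of single-nucleotide tokenization, and the helper function are illustrative only, not code from the Review (many gLMs tokenize k-mers instead, which would require adapting the masking step).

    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    # "example/genome-lm" is a placeholder; substitute any masked-language gLM
    # whose tokenizer maps single nucleotides to single tokens.
    MODEL_NAME = "example/genome-lm"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME).eval()

    def zero_shot_variant_score(sequence: str, pos: int, ref: str, alt: str) -> float:
        """Log-likelihood ratio of the alternate vs. reference base at `pos`,
        computed with the site masked; positive values favour the variant."""
        masked = sequence[:pos] + tokenizer.mask_token + sequence[pos + 1:]
        inputs = tokenizer(masked, return_tensors="pt")
        mask_idx = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
        with torch.no_grad():
            logits = model(**inputs).logits[0, mask_idx]
        log_probs = torch.log_softmax(logits, dim=-1)
        ref_id = tokenizer.convert_tokens_to_ids(ref)
        alt_id = tokenizer.convert_tokens_to_ids(alt)
        return (log_probs[alt_id] - log_probs[ref_id]).item()

No labels or fine-tuning are involved: the pretraining objective alone supplies the score, which is what makes the approach zero-shot.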
-
2023 - Transforming Genomic Interpretability: A DNABERT Case Study
ICML 2023 Workshop on Computational Biology
While deep learning algorithms, particularly transformers, have recently shown significant promise in making predictions from biological sequences, their interpretability in the context of biology has not been deeply explored. This paper focuses on the recently proposed DNABERT model and interprets its decisions using modified Layer-wise Relevance Propagation (LRP) methods to determine what the model is learning. The resulting relevance scores are compared to several other interpretability methods commonly applied to transformers, including the attention-score-based method proposed by the DNABERT authors. Mutagenesis experiments targeting regions identified by each method show that the modified LRP scores can outperform the others at 20 mutations, and that attention scores cannot reliably outperform random scores.
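To make the comparison concrete, here is a minimal sketch of the two families of token-attribution scores discussed above: an attention-based score in the spirit of the DNABERT authors' method, and gradient x input as a simple gradient-based stand-in for the paper's modified LRP (which is not reproduced here). The checkpoint name and both helpers are hypothetical.

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # "example/dnabert-finetuned" is a placeholder for any BERT-style
    # sequence classifier fine-tuned on a genomic task.
    MODEL_NAME = "example/dnabert-finetuned"
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

    def attention_scores(inputs):
        """Attention each token receives from [CLS], averaged over layers and heads."""
        with torch.no_grad():
            out = model(**inputs, output_attentions=True)
        att = torch.stack(out.attentions)   # (layers, batch, heads, query, key)
        return att.mean(dim=(0, 2))[0, 0]   # [CLS] query row of the averaged map

    def grad_x_input_scores(inputs, target_class=1):
        """Gradient x input on token embeddings, a simple gradient-based relevance."""
        emb = model.get_input_embeddings()(inputs["input_ids"]).detach()
        emb.requires_grad_(True)
        logits = model(inputs_embeds=emb,
                       attention_mask=inputs["attention_mask"]).logits
        logits[0, target_class].backward()
        return (emb.grad * emb).sum(dim=-1)[0]

Both functions return one score per input token, so the rankings they induce can be compared directly, for example by mutating the top-ranked positions and measuring the drop in the model's prediction, as in the paper's mutagenesis experiments.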
-
2022 - Bulk and single-nucleus transcriptomics highlight intra-telencephalic and somatostatin neurons in Alzheimer’s disease
Frontiers in Molecular Neuroscience
Cortical neuron loss is a pathological hallmark of late-onset Alzheimer’s disease (AD). However, it remains unclear which neuronal subtypes beyond broad excitatory and inhibitory classes are most vulnerable. Here, we analyzed cell subtype proportion differences in AD compared to non-AD controls using 1037 post-mortem brain samples from six neocortical regions. We identified the strongest associations of AD with fewer somatostatin (SST) inhibitory neurons and intra-telencephalic (IT) excitatory neurons. Replication in three AD case-control single-nucleus RNAseq datasets most strongly supported the bulk tissue association of fewer SST neurons in AD. In-depth analyses of cell type proportions with specific AD-related neuropathological and cognitive phenotypes revealed fewer SST neurons with greater brain-wide post-mortem tau and beta amyloid, as well as a faster rate of antemortem cognitive decline. In contrast, greater IT neuron proportions were associated with a slower rate of cognitive decline as well as greater residual cognition (a measure of cognitive resilience), but not canonical AD neuropathology. Our findings implicate somatostatin inhibitory and intra-telencephalic excitatory neuron subclasses in the pathogenesis of AD and in cognitive resilience to AD pathology, respectively.
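The bulk-tissue analysis above rests on estimating cell-subtype proportions from mixed expression profiles. The abstract does not specify the estimation method, so the toy sketch below illustrates one generic approach: non-negative least squares against a subtype signature matrix, with all data simulated.

    import numpy as np
    from scipy.optimize import nnls

    # Illustrative only: recover cell-type proportions from a bulk sample
    # by non-negative least squares against a (genes x subtypes) signature
    # matrix, e.g. mean expression per subtype from single-nucleus data.
    rng = np.random.default_rng(0)
    S = rng.random((200, 5))                    # toy signature: 200 genes, 5 subtypes
    true_p = np.array([0.4, 0.3, 0.15, 0.1, 0.05])
    b = S @ true_p + rng.normal(0, 0.01, 200)   # toy bulk expression vector

    coef, _ = nnls(S, b)
    proportions = coef / coef.sum()             # normalize to sum to 1
    print(proportions.round(2))                 # close to true_p

Given such per-sample proportion estimates, case-control differences by subtype can then be tested across the 1037 samples.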
Invited Talks
-
2024 - Genome Language Models at Harvard
-
2023 - Transforming Genomic Interpretability at T-CAIREM Trainee Rounds