Research
One of the main questions I want to answer is: can we build self-supervised genomic models that encode interpretable biological structure? Or to put it another way: can we use self-supervised learning as a method for biological discovery?
My research involves interpreting what biological signals are captured in learned genomic representations, understanding why those signals are distributed and hard to fully disentangle, and using those findings to design pre-training objectives that build in biological structure from the start.
Publications
Predicting evolutionary rate as a pretraining task improves genome language model representations
Genome language models (gLMs) have the potential to advance our understanding of regulatory genomics without requiring labeled data. Most gLMs are pretrained using sequence reconstruction tasks inspired by natural language processing, but recent studies have shown that these gLMs often fail to capture biological signal. To overcome this, we introduce pretraining tasks that predict the rate of evolution. These tasks are designed so that they can be composed with sequence reconstruction, enabling a controlled comparison of predicting sequence only, evolutionary rate only, or both. To address gaps in existing evaluations, we developed a suite of biologically grounded benchmarks. Across these tasks, and for established variant effect prediction benchmarks, models pretrained on both sequence and evolutionary rate outperform those trained on sequence alone, and training on evolutionary rate can make even the relatively small models in our work competitive with much larger existing gLMs for some tasks. These results establish evolution as a key training target for genome-scale models.
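To make the idea of composing objectives concrete, here is a minimal sketch of a shared encoder with two heads, one reconstructing masked tokens and one regressing per-position evolutionary rate. All names, sizes, and loss weights are illustrative assumptions, not the paper's actual architecture; zeroing one weight recovers the single-objective baselines used in the controlled comparison.

```python
import torch
import torch.nn as nn

class ComposedPretrainer(nn.Module):
    """Hypothetical sketch: shared encoder, two pretraining heads."""

    def __init__(self, vocab_size=6, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.seq_head = nn.Linear(d_model, vocab_size)  # masked-token logits
        self.rate_head = nn.Linear(d_model, 1)          # per-position evolutionary rate

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))
        return self.seq_head(h), self.rate_head(h).squeeze(-1)

def composed_loss(model, tokens, masked_targets, rates, w_seq=1.0, w_rate=1.0):
    # masked_targets uses -100 at unmasked positions (standard MLM convention);
    # setting w_seq=0 or w_rate=0 ablates the corresponding objective.
    logits, pred_rates = model(tokens)
    seq_loss = nn.functional.cross_entropy(
        logits.transpose(1, 2), masked_targets, ignore_index=-100
    )
    rate_loss = nn.functional.mse_loss(pred_rates, rates)
    return w_seq * seq_loss + w_rate * rate_loss
```

Because both heads read the same encoder output, the two objectives compose additively and can be compared under identical model capacity.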
Interpreting Attention Mechanisms in Genomic Transformer Models: A Framework for Biological Insights
Transformer models have shown strong performance on biological sequence prediction tasks, but the interpretability of their internal mechanisms remains underexplored. Given their application in biomedical research, understanding the mechanisms behind these models' predictions is crucial for their widespread adoption. We introduce a method to interpret attention heads in genomic transformers by correlating per-token attention scores with curated biological annotations, and we use GPT-4 to summarize each head's focus. Applying this to DNABERT, Nucleotide Transformer, and scGPT, we find that attention heads learn biologically meaningful associations during self-supervised pre-training and that these associations shift with fine-tuning. We show that interpretability varies with tokenization scheme, and that context-dependence plays a key role in head behaviour. Through ablation, we demonstrate that heads strongly associated with biological features are more important for task performance than uninformative heads in the same layers. In DNABERT trained for TATA promoter prediction, we observe heads with positive and negative associations reflecting positive and negative learning dynamics. Our results offer a framework to trace how biological features are learned from random initialization to pre-training to fine-tuning, enabling insight into how genomic foundation models represent nucleotides, genes, and cells.
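The core step of the method above can be sketched as follows: collapse each head's attention matrix to a per-token score and correlate it with a binary biological annotation. The function name, shapes, and the averaging choice are assumptions for illustration, not the paper's code.

```python
import numpy as np

def head_annotation_correlation(attn, annotation):
    """Pearson correlation between each head's attention and an annotation.

    attn: (heads, seq_len, seq_len) attention weights for one input,
          attn[h, q, k] = attention from query position q to key position k.
    annotation: (seq_len,) binary mask, e.g. 1 where a token overlaps a motif.
    """
    # Per-token score: how much attention each position *receives*,
    # averaged over query positions.
    per_token = attn.mean(axis=1)                  # (heads, seq_len)
    ann = annotation - annotation.mean()
    corrs = []
    for head_scores in per_token:
        hs = head_scores - head_scores.mean()
        denom = np.sqrt((hs ** 2).sum() * (ann ** 2).sum())
        corrs.append(float((hs * ann).sum() / denom) if denom > 0 else 0.0)
    return np.array(corrs)                         # one Pearson r per head
```

A head whose attention concentrates on annotated tokens yields r near 1; a uniform head has zero variance and is reported as 0 by convention.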
Transformers and genome language models
Large language models based on the transformer deep learning architecture have revolutionized natural language processing. Motivated by the analogy between human language and the genome's biological code, researchers have begun to develop genome language models (gLMs) based on transformers and related architectures. This Review explores the use of transformers and language models in genomics. We survey open questions in genomics amenable to the use of gLMs, and motivate the use of gLMs and the transformer architecture for these problems. We discuss the potential of gLMs for modelling the genome using unsupervised pretraining tasks, specifically focusing on the power of zero- and few-shot learning. We explore the strengths and limitations of the transformer architecture, as well as the strengths and limitations of current gLMs more broadly. Additionally, we contemplate the future of genomic modelling beyond the transformer architecture, based on current trends in research. This Review serves as a guide for computational biologists and computer scientists interested in transformers and language models for genomic data.
Transforming Genomic Interpretability: A DNABERT Case Study
While deep learning algorithms, particularly transformers, have recently shown significant promise in making predictions from biological sequences, their interpretability in the context of biology has not been deeply explored. This paper focuses on the recently proposed DNABERT model and explores interpreting its decisions using modified Layer-wise Relevance Propagation (LRP) methods to determine what the model is learning. The resulting relevance score is then compared to several other interpretability methods commonly applied to transformers, including the attention-score based method proposed by the DNABERT authors. Results of mutagenesis experiments targeting regions identified by different methods show that the modified LRP interpretability scores can outperform the alternatives at a budget of 20 mutations, and that attention cannot reliably outperform random scores.
Bulk and single-nucleus transcriptomics highlight intra-telencephalic and somatostatin neurons in Alzheimer's disease
Cortical neuron loss is a pathological hallmark of late-onset Alzheimer's disease (AD). However, it remains unclear which neuronal subtypes beyond broad excitatory and inhibitory classes are most vulnerable. Here, we analyzed cell subtype proportion differences in AD compared to non-AD controls using 1037 post-mortem brain samples from six neocortical regions. We identified the strongest associations of AD with fewer somatostatin (SST) inhibitory neurons and intra-telencephalic (IT) excitatory neurons. Replication in three AD case-control single-nucleus RNA-seq datasets most strongly supported the bulk tissue association of fewer SST neurons in AD. In-depth analyses of cell type proportions with specific AD-related neuropathological and cognitive phenotypes revealed fewer SST neurons with greater brain-wide post-mortem tau and beta amyloid, as well as a faster rate of antemortem cognitive decline. In contrast, greater IT neuron proportions were associated with a slower rate of cognitive decline as well as greater residual cognition — a measure of cognitive resilience — but not canonical AD neuropathology. Our findings implicate somatostatin inhibitory and intra-telencephalic excitatory neuron subclasses in the pathogenesis of AD and in cognitive resilience to AD pathology, respectively.