Research
One of the main questions I want to answer is: can we build self-supervised genomic models that encode interpretable biological structure? Or to put it another way: can we use self-supervised learning as a method for biological discovery?
My research involves interpreting what biological signals are captured in learned genomic representations, understanding why those signals are distributed and hard to fully disentangle, and using those findings to design pre-training objectives that build in biological structure from the start.
Publications
Predicting evolutionary rate as a pretraining task improves genome language model representations
Genome language models (gLMs) have the potential to advance our understanding of regulatory genomics without requiring labeled data. Most gLMs are pretrained using sequence reconstruction tasks inspired by natural language processing, but recent studies have shown that these gLMs often fail to capture biological signal. To overcome this, we introduce pretraining tasks that predict the rate of evolution. These tasks are designed so that they can be composed with sequence reconstruction, enabling a controlled comparison of predicting sequence only, evolutionary rate only, or both. To address gaps in existing evaluations, we developed a suite of biologically grounded benchmarks. Across these tasks, and for established variant effect prediction benchmarks, models pretrained on both sequence and evolutionary rate outperform those trained on sequence alone, and training on evolutionary rate can make even the relatively small models in our work competitive with much larger existing gLMs for some tasks. These results establish evolution as a key training target for genome-scale models.
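To make the idea of composing objectives concrete, here is a minimal sketch of a shared encoder with two heads, one reconstructing masked tokens and one regressing per-position evolutionary rate. All names, sizes, and loss weights are illustrative assumptions, not the paper's actual architecture; zeroing one weight recovers the single-objective baselines used in the controlled comparison.

```python
import torch
import torch.nn as nn

class ComposedPretrainer(nn.Module):
    """Hypothetical sketch: shared encoder, two pretraining heads."""

    def __init__(self, vocab_size=6, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.seq_head = nn.Linear(d_model, vocab_size)  # masked-token logits
        self.rate_head = nn.Linear(d_model, 1)          # per-position evolutionary rate

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens))
        return self.seq_head(h), self.rate_head(h).squeeze(-1)

def composed_loss(model, tokens, masked_targets, rates, w_seq=1.0, w_rate=1.0):
    # masked_targets uses -100 at unmasked positions (standard MLM convention);
    # setting w_seq=0 or w_rate=0 ablates the corresponding objective.
    logits, pred_rates = model(tokens)
    seq_loss = nn.functional.cross_entropy(
        logits.transpose(1, 2), masked_targets, ignore_index=-100
    )
    rate_loss = nn.functional.mse_loss(pred_rates, rates)
    return w_seq * seq_loss + w_rate * rate_loss
```

Because both heads read the same encoder output, the two objectives compose additively and can be compared under identical model capacity.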
Interpreting Attention Mechanisms in Genomic Transformer Models: A Framework for Biological Insights
Transformer models have shown strong performance on biological sequence prediction tasks, but the interpretability of their internal mechanisms remains underexplored. Given their application in biomedical research, understanding the mechanisms behind these models' predictions is crucial for their widespread adoption. We introduce a method to interpret attention heads in genomic transformers by correlating per-token attention scores with curated biological annotations, and we use GPT-4 to summarize each head's focus. Applying this to DNABERT, Nucleotide Transformer, and scGPT, we find that attention heads learn biologically meaningful associations during self-supervised pre-training and that these associations shift with fine-tuning. We show that interpretability varies with tokenization scheme, and that context-dependence plays a key role in head behaviour. Through ablation, we demonstrate that heads strongly associated with biological features are more important for task performance than uninformative heads in the same layers. In DNABERT trained for TATA promoter prediction, we observe heads with positive and negative associations reflecting positive and negative learning dynamics. Our results offer a framework to trace how biological features are learned from random initialization to pre-training to fine-tuning, enabling insight into how genomic foundation models represent nucleotides, genes, and cells.
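The core step of the method above can be sketched as follows: collapse each head's attention matrix to a per-token score and correlate it with a binary biological annotation. The function name, shapes, and the averaging choice are assumptions for illustration, not the paper's code.

```python
import numpy as np

def head_annotation_correlation(attn, annotation):
    """Pearson correlation between each head's attention and an annotation.

    attn: (heads, seq_len, seq_len) attention weights for one input,
          attn[h, q, k] = attention from query position q to key position k.
    annotation: (seq_len,) binary mask, e.g. 1 where a token overlaps a motif.
    """
    # Per-token score: how much attention each position *receives*,
    # averaged over query positions.
    per_token = attn.mean(axis=1)                  # (heads, seq_len)
    ann = annotation - annotation.mean()
    corrs = []
    for head_scores in per_token:
        hs = head_scores - head_scores.mean()
        denom = np.sqrt((hs ** 2).sum() * (ann ** 2).sum())
        corrs.append(float((hs * ann).sum() / denom) if denom > 0 else 0.0)
    return np.array(corrs)                         # one Pearson r per head
```

A head whose attention concentrates on annotated tokens yields r near 1; a uniform head has zero variance and is reported as 0 by convention.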
Transformers and genome language models
Large language models based on the transformer deep learning architecture have revolutionized natural language processing. Motivated by the analogy between human language and the genome's biological code, researchers have begun to develop genome language models (gLMs) based on transformers and related architectures. This Review explores the use of transformers and language models in genomics. We survey open questions in genomics amenable to the use of gLMs, and motivate the use of gLMs and the transformer architecture for these problems. We discuss the potential of gLMs for modelling the genome using unsupervised pretraining tasks, specifically focusing on the power of zero- and few-shot learning. We explore the strengths and limitations of the transformer architecture, as well as the strengths and limitations of current gLMs more broadly. Additionally, we contemplate the future of genomic modelling beyond the transformer architecture, based on current trends in research. This Review serves as a guide for computational biologists and computer scientists interested in transformers and language models for genomic data.
Transforming Genomic Interpretability: A DNABERT Case Study
While deep learning algorithms, particularly transformers, have recently shown significant promise in making predictions from biological sequences, their interpretability in the context of biology has not been deeply explored. This paper focuses on the recently proposed DNABERT model and explores interpreting its decisions using modified Layer-wise Relevance Propagation (LRP) methods to determine what the model is learning. The resulting relevance score is then compared to several other interpretability methods commonly applied to transformers, including the attention-score based method proposed by the DNABERT authors. Results of mutagenesis experiments targeting regions identified by different methods show that the modified LRP interpretability scores can outperform the alternatives at a budget of 20 mutations, and that attention cannot reliably outperform random scores.
Bulk and single-nucleus transcriptomics highlight intra-telencephalic and somatostatin neurons in Alzheimer's disease
Cortical neuron loss is a pathological hallmark of late-onset Alzheimer's disease (AD). However, it remains unclear which neuronal subtypes beyond broad excitatory and inhibitory classes are most vulnerable. Here, we analyzed cell subtype proportion differences in AD compared to non-AD controls using 1037 post-mortem brain samples from six neocortical regions. We identified the strongest associations of AD with fewer somatostatin (SST) inhibitory neurons and intra-telencephalic (IT) excitatory neurons. Replication in three AD case-control single-nucleus RNA-seq datasets most strongly supported the bulk tissue association of fewer SST neurons in AD. In-depth analyses of cell type proportions with specific AD-related neuropathological and cognitive phenotypes revealed fewer SST neurons with greater brain-wide post-mortem tau and beta amyloid, as well as a faster rate of antemortem cognitive decline. In contrast, greater IT neuron proportions were associated with a slower rate of cognitive decline as well as greater residual cognition — a measure of cognitive resilience — but not canonical AD neuropathology. Our findings implicate somatostatin inhibitory and intra-telencephalic excitatory neuron subclasses in the pathogenesis of AD and in cognitive resilience to AD pathology, respectively.