Abstract:
Over the past two decades, the literature on modeling complex systems has reflected a tension between the predictive power and the interpretability of machine learning models trained on vast amounts of data. Biological complex systems are no different. The past decade has seen an astonishing increase in the amount of publicly available functional genomics data. Deep learning techniques have naturally been adopted to determine the sequence patterns, syntax, and grammar of the DNA sequence elements that govern gene regulatory activity, but most of these investigations have taken a ‘black box’ approach, yielding model predictions that are hard to interpret mechanistically. Multiple attribution strategies, which seek to extract meaningful post-hoc interpretations from neural networks, have been proposed to address this problem. However, a substantial gap remains in the literature between the outputs of such post-hoc methods and fully mechanistic models, specifically in the context of gene regulation. This problem can be at least partially overcome by building some level of mechanistic detail into the internal structure of deep learning models, which makes their predictions easier to understand and to translate into mechanistic insight. Here, we model gene regulation using a cell-state-specific Massively Parallel Reporter Assay dataset from hematopoietic stem cells. We use deep learning with cell-state-specific ChIP-seq data to predict transcription factor (TF) binding on DNA sequence, and graph-based representations of Markov processes to model the effects of bound TFs on different rate-limiting steps in the transcriptional cycle. Our model assumptions are grounded in recent biophysical findings in the literature.