dc.description.abstract |
The pursuit of understanding the complex interplay between biological sequences - DNA, RNA and proteins - has driven biological research for many years. The interaction between RNA and RNA-binding proteins (RBPs) is integral to RNA function and cellular regulation. At least 1,500 of the over 20,000 annotated human proteins are predicted to bind RNA. RBPs usually recognize their RNA targets through a common local preference or sequence - referred to as a ‘motif’ - which facilitates RNA-protein interactions. Recent experimental high-throughput approaches such as iCLIP, PAR-CLIP or eCLIP make it possible to profile the binding sites of a given RBP transcriptome-wide, thus providing insight into that RBP's binding preferences. While an abundance of studies has explored the sequence binding motifs of RBPs via deep learning methods, less is known about the RNA secondary structure preferences of RBPs. This project aims to explore the secondary structure binding preferences of a large set of RBPs using CLIP-seq datasets from the publicly available ENCODE database. We develop a deep learning model that incorporates both sequence and structure information, with the aim of outperforming a sequence-only baseline. We then use model interpretation and feature attribution techniques to quantify the relative importance of sequence and secondary structure information for each RBP, thus identifying the primary binding modes of different RBPs. In the course of this work, we found that in vivo RNA structure data has low coverage of the transcriptome, limiting the amount of information we can extract about structural binding motifs. Finally, we investigate how the choice of negative samples affects downstream model performance. We find that computationally predicted RNA secondary structure provides no new information to the model, whereas experimentally derived RNA secondary structure improves model performance under certain negative sampling conditions. |
en_US |