Digital Repository

Mixture modelling to characterize diversity in DNA regions

dc.contributor.advisor NARLIKAR, LEELAVATI en_US
dc.contributor.author BISWAS, ANUSHUA en_US
dc.date.accessioned 2023-06-20T09:03:08Z
dc.date.available 2023-06-20T09:03:08Z
dc.date.issued 2023-05 en_US
dc.identifier.citation 197 en_US
dc.identifier.uri http://dr.iiserpune.ac.in:8080/xmlui/handle/123456789/8028
dc.description.abstract The expression levels of the hundreds of genes in a eukaryotic cell influence its phenotype and functioning. The binding of proteins such as transcription factors (TFs) to particular regulatory regions on the DNA plays an important role in regulating gene expression. Mutations in these regulatory regions can affect gene expression and often lead to misregulation, resulting in disorders and diseases. A variety of high-throughput experimental assays have therefore been designed to profile a wide range of regulation-related biochemical activities. Each assay gives a genome-wide map of the regions that share the characteristic being profiled. Examples include STARR-seq, which recognizes active enhancers; ATAC-seq, which detects accessible chromatin; and ChIP-seq, which is used to identify TF binding sites. These assays report regions that are 200 to 1000 bases long, although the functional elements present in these regions are ≈ 15 bases in length. Current computational algorithms look for a common characteristic within the reported regions to identify these short sequence signatures. Evidence, however, suggests that the regions reported by the experiments are considerably heterogeneous. While these methods can pick up the stronger signals, they can easily miss the weaker or less frequent ones. To characterize the heterogeneity in these regions explicitly, we framed the question as a mixture modelling problem. Our first method, DIVERSITY, was developed to cluster regions from a ChIP-seq experiment into groups while simultaneously learning sequence signatures specific to each group in a de novo manner. DIVERSITY provides novel insights into the different ways in which a protein can bind DNA, including co-operative binding with other proteins. We next looked at regions identified by exonuclease-based ChIP experiments.
These experiments measure exonuclease activity very close to the actual protein binding sites with high precision, so the resulting regions have sharp read profile distributions. Our next method, ExoDiversity, models these regions by learning a joint probability distribution over the distinct ChIP-exo read signals on the forward and reverse strands and the DNA sequences. It could resolve the binding footprints of the profiled TFs at nucleotide-level resolution. Differences among the resolved footprints correlated with distinct DNA structure properties and sequence conservation profiles, implying that they are likely to have some functional importance. Finally, we investigated the varied mechanisms of TF binding at regulatory regions. The combinatorial control exhibited by the TFs at these regions was modelled by our method cisDiversity. Without requiring any prior knowledge of TF, cell type or organism, cisDiversity could determine discrete regulatory modules with particular combinations of TF binding sites when applied to multiple datasets from diverse species. We believe that our model-based approach of explaining the data in terms of various sequence components provides a comprehensive understanding of the regulatory information encoded in the data. en_US
dc.language.iso en en_US
dc.subject mixture modelling en_US
dc.subject motif discovery en_US
dc.subject high throughput sequencing en_US
dc.subject ChIP seq en_US
dc.subject Gibbs sampling en_US
dc.subject transcription factor binding en_US
dc.subject regulatory regions en_US
dc.subject sequence conservation en_US
dc.title Mixture modelling to characterize diversity in DNA regions en_US
dc.type Thesis en_US
dc.description.embargo No Embargo en_US
dc.type.degree Ph.D en_US
dc.contributor.department Dept. of Data Science en_US
dc.contributor.registration 20213301 en_US


This item appears in the following Collection(s)

  • PhD THESES [583]
    Thesis submitted to IISER Pune in partial fulfilment of the requirements for the degree of Doctor of Philosophy
