Mixture modelling to characterize diversity in DNA regions

BISWAS, ANUSHUA

Please use this identifier to cite or link to this item: http://dr.iiserpune.ac.in:8080/xmlui/handle/123456789/8028

Title:	Mixture modelling to characterize diversity in DNA regions
Authors:	NARLIKAR, LEELAVATI; BISWAS, ANUSHUA; Dept. of Data Science; 20213301
Keywords:	mixture modelling; motif discovery; high throughput sequencing; ChIP seq; Gibbs sampling; transcription factor binding; regulatory regions; sequence conservation
Issue Date:	May-2023
Citation:	197
Abstract:	The expression levels of the hundreds of genes in a eukaryotic cell influence its phenotype and functioning. The binding of proteins such as transcription factors (TFs) to particular regulatory regions of the DNA plays an important role in regulating gene expression. Mutations in these regulatory regions can affect gene expression and often lead to misregulation, resulting in disorders and diseases. To profile a wide range of regulation-related biochemical activities, a variety of high-throughput experimental assays have therefore been designed. Each gives a genome-wide map of the regions that share the characteristic being profiled. Examples include STARR-seq, which recognizes active enhancers; ATAC-seq, which detects accessible chromatin; and ChIP-seq, which identifies TF binding sites. These assays report regions that are 200 to 1000 bases long, although the functional elements present in these regions are ≈ 15 bases in length. Current computational algorithms look for a common characteristic within the reported regions to identify these short sequence signatures. Evidence, however, suggests that the regions reported by the experiments are considerably heterogeneous. In fact, while these methods can pick up the stronger signals, they can easily miss the weaker or less frequent ones. To characterize the heterogeneity in these regions explicitly, I framed the question as a mixture modelling problem. Our first method, DIVERSITY, was developed to cluster regions from a ChIP-seq experiment into groups while simultaneously learning sequence signatures specific to each group in a de novo manner. DIVERSITY provides novel insights into the different ways in which a protein can bind DNA, including co-operative binding with other proteins. We next looked at regions identified by exonuclease-based ChIP experiments.
These experiments measure exonuclease activity very close to the actual protein binding sites with high precision, characterizing those regions by sharp read profile distributions. Our next method, ExoDiversity, models these regions by learning a joint probability distribution over the distinct ChIP-exo read signals on the forward and reverse strands and the DNA sequences. It could resolve the binding footprints of the profiled TFs at nucleotide-level resolution. The differences between footprints correlated with distinct DNA structure properties and sequence conservation profiles, implying that they are likely to have some functional importance. Finally, we investigated the varied mechanisms of TF binding at regulatory regions. The combinatorial control exhibited by the TFs at these regions was modelled by our method cisDiversity. Without requiring any prior knowledge of TF, cell type or organism, cisDiversity could determine discrete regulatory modules with particular combinations of TF binding sites when applied to multiple datasets across diverse species. We believe that our model-based approach of explaining the data in terms of various sequence components provides a comprehensive understanding of the regulatory information encoded in the data.
URI:	http://dr.iiserpune.ac.in:8080/xmlui/handle/123456789/8028
Appears in Collections:	PhD THESES

Files in This Item:

File	Description	Size	Format
20213301_Anushua_Biswas_PhD_Thesis.pdf	PhD thesis	56.4 MB	Adobe PDF
