Mixture modelling to characterize diversity in DNA regions

BISWAS, ANUSHUA

Please use this identifier to cite or link to this item: http://dr.iiserpune.ac.in:8080/xmlui/handle/123456789/8028

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	NARLIKAR, LEELAVATI	en_US
dc.contributor.author	BISWAS, ANUSHUA	en_US
dc.date.accessioned	2023-06-20T09:03:08Z	-
dc.date.available	2023-06-20T09:03:08Z	-
dc.date.issued	2023-05	en_US
dc.identifier.citation	197	en_US
dc.identifier.uri	http://dr.iiserpune.ac.in:8080/xmlui/handle/123456789/8028	-
dc.description.abstract	The degrees of expression of the hundreds of genes in a eukaryotic cell influence its phenotype and functioning. The binding of proteins such as transcription factors (TFs) to particular regulatory regions on the DNA plays an important role in the regulation of gene expression. Mutations in these regulatory regions can affect gene expression and can often lead to misregulation, resulting in disorders and diseases. Therefore, to profile a wide range of regulation-related biochemical activities, a variety of high-throughput experimental assays have been designed. They give a genome-wide map of the regions that share the characteristic being profiled. Examples include STARR-seq, which recognizes active enhancers; ATAC-seq, which detects accessible chromatin; and ChIP-seq, which is used to identify TF binding sites. These assays report regions that are 200 to 1000 bases long, although the functional elements present in these regions are ≈ 15 bases in length. Current computational algorithms look for a common characteristic within these reported regions to identify these short sequence signatures. Evidence, however, suggests that the regions reported by the experiments are considerably heterogeneous. In fact, while these methods can pick up on the stronger signals, they can easily miss the weaker or less frequent ones. In order to explicitly characterize the heterogeneity in these regions, I considered the question as a mixture modelling problem. Our first method, DIVERSITY, was developed to cluster regions from a ChIP-seq experiment into groups while simultaneously learning sequence signatures specific to each group in a de novo manner. DIVERSITY provides novel insights into the different ways in which a protein can bind DNA, including co-operative binding with other proteins. We next looked at regions identified by exonuclease-based ChIP experiments. 
These experiments measure the exonuclease activity very close to the actual protein binding sites with high precision, characterizing those regions by sharp read-profile distributions. Our next method, ExoDiversity, models these regions by learning a joint probability distribution over the distinct ChIP-exo read signals in the forward and reverse strands and the DNA sequences. It could resolve the binding footprints of the profiled TFs at nucleotide-level resolution. These differences correlated with distinct DNA structure properties and sequence conservation profiles, implying that they are likely to have some functional importance. Finally, we investigated the varied mechanisms of TF binding at the regulatory regions. The combinatorial control exhibited by the TFs at these regions was modelled by our method cisDiversity. Without requiring any prior knowledge of the TF, cell type or organism, cisDiversity could determine discrete regulatory modules with particular combinations of TF binding sites when it was applied to multiple datasets across diverse species. We believe that our model-based approach of explaining the data in terms of various sequence components provides a comprehensive understanding of the regulatory information encoded in the data.	en_US
dc.language.iso	en	en_US
dc.subject	mixture modelling	en_US
dc.subject	motif discovery	en_US
dc.subject	high throughput sequencing	en_US
dc.subject	ChIP seq	en_US
dc.subject	Gibbs sampling	en_US
dc.subject	transcription factor binding	en_US
dc.subject	regulatory regions	en_US
dc.subject	sequence conservation	en_US
dc.title	Mixture modelling to characterize diversity in DNA regions	en_US
dc.type	Thesis	en_US
dc.description.embargo	No Embargo	en_US
dc.type.degree	Ph.D	en_US
dc.contributor.department	Dept. of Data Science	en_US
dc.contributor.registration	20213301	en_US
Appears in Collections:	PhD THESES

Files in This Item:

File	Description	Size	Format
20213301_Anushua_Biswas_PhD_Thesis.pdf	PhD thesis	56.4 MB	Adobe PDF