Abstract:
Precise chromatin folding is essential for modulating and specializing regulatory function across contexts, with marked roles in cellular differentiation and disease. Topologically Associating Domains (TADs) are structural features conserved across species and cell types, typically tens of kb to a few Mb in size, that constrain regulatory interactions within each domain. The genomic regions located between consecutive TADs are highly conserved and have been shown to be vital for the stable formation of the domains. CTCF is a zinc-finger transcription factor and a critical determinant of 3D chromatin architecture and of the demarcation of TADs and their boundary regions. Here, we identify sequence variations in CTCF transcription factor binding sites (TFBSs) that are important for the formation and stabilization of these structural domains. We delineate relationships between nucleotide sequence variations in the motifs, their genomic neighbourhood, and their positional relevance in the context of TADs. We do so by probing the complex relationships between DNA sequence identity, conservation across species, and the binding efficiency of the CTCF protein specific to these variations. Furthermore, by integrating epigenomic signals, ChIP binding intensity, and evolutionary conservation data into a convolutional neural network model, we accurately predict insulation scores and extend these predictions across different cell lines. This integrative approach examines the role of CTCF in orchestrating chromatin organization in the cellular context and through differentiation.