Pangenome-based genome inference using long read sequencing data

PANI, SAMARENDRA

Please use this identifier to cite or link to this item: http://dr.iiserpune.ac.in:8080/xmlui/handle/123456789/6844

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Marschall, Tobias	en_US
dc.contributor.author	PANI, SAMARENDRA	en_US
dc.date.accessioned	2022-05-11T09:55:34Z	-
dc.date.available	2022-05-11T09:55:34Z	-
dc.date.issued	2022-05	-
dc.identifier.citation	91	en_US
dc.identifier.uri	http://dr.iiserpune.ac.in:8080/xmlui/handle/123456789/6844	-
dc.description.abstract	Long read sequence data provide a unique opportunity for genotyping by covering structural variants and providing linkage information. The linkage information is crucial for phasing diploid organisms like humans since it indicates which alleles lie on the same haplotype. Existing methods that genotype using long reads rely on a single linear reference genome. They suffer from “reference bias”, the inability to analyse samples that contain alleles not defined in the reference. Pangenome reference models rectify that by producing genome graphs that contain allele information from multiple individuals/populations. We have developed an HMM that genotypes variants using noisy long read data and a pangenome reference. This is the first method to use both long reads and a pangenome reference and has great potential for application due to the increasing availability of long-read data and high-quality genome samples which can be utilised in the pangenome. At low read coverage values, we benchmark the model against PanGenie, a tool that genotypes variants using short-read data and a pangenome reference. The model outperforms PanGenie at genotyping single nucleotide polymorphisms and small indels but performs worse for indels longer than 20 bp. We create case studies to identify possible issues with the model by visualising the data at incorrectly genotyped positions. Runtime analysis shows that the model is not tractable at high coverage, and that parallel processing is required to process the entire genome.	en_US
dc.language.iso	en	en_US
dc.subject	Pangenomics	en_US
dc.subject	Computational Biology	en_US
dc.subject	Genomics	en_US
dc.subject	Methods Development	en_US
dc.subject	Genotyping	en_US
dc.subject	Structural Variations	en_US
dc.title	Pangenome-based genome inference using long read sequencing data	en_US
dc.type	Thesis	en_US
dc.type.degree	BS-MS	en_US
dc.contributor.department	Dept. of Biology	en_US
dc.contributor.registration	20171095	en_US
Appears in Collections:	MS THESES

Files in This Item:

File	Description	Size	Format
20171095_Thesis.pdf		6.25 MB	Adobe PDF
