Abstract:
Genomic and other omics data have long been used to predict disease phenotypes in precision medicine. In recent years, many such prediction models have been built using Machine Learning (ML) algorithms. Today, genomic and other biomedical data suffer from sampling bias with respect to ethnicity, as most data come from people of European ancestry. Smaller sample sizes for other population groups lead to suboptimal performance of ML-based prediction models for those populations. Because collecting data for those populations is time-consuming and costly, we developed Deep Learning-based models for in-silico data enhancement. We propose the Offspring Generative Adversarial Network (Offspring GAN), which is trained on heavily biased real data to generate realistic synthetic data; the synthetic data are then used to augment the existing real data and alleviate its biases and disparities. In contrast to traditional conditional GANs, Offspring GAN consists of four players: one generator, one discriminator, and two novel F1 generators. We evaluated the fidelity and variation of the synthetic data using principal component analysis, correlation matrices, and a comparison of Gaussian components. Our results show that Offspring GAN can mitigate mode collapse and generate realistic, well-varied data even when trained on biased data.
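As an illustration of one of the fidelity checks named above, the following is a minimal sketch of a correlation-matrix comparison between real and synthetic samples. It is not the evaluation code used in this work: the toy data generator, the function names, and the summary score (mean absolute difference of off-diagonal correlations) are all assumptions made for the example.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_matrix(rows):
    """Feature-by-feature correlation matrix for a list of samples (rows)."""
    cols = list(zip(*rows))  # transpose: one tuple per feature
    return [[pearson(ci, cj) for cj in cols] for ci in cols]

def mean_abs_corr_diff(real_rows, synth_rows):
    """Mean absolute difference between off-diagonal correlation entries."""
    cr = correlation_matrix(real_rows)
    cs = correlation_matrix(synth_rows)
    d = len(cr)
    diffs = [abs(cr[i][j] - cs[i][j])
             for i in range(d) for j in range(d) if i != j]
    return sum(diffs) / len(diffs)

def toy_samples(n, rng, coupling):
    """Hypothetical stand-in for omics-like data: feature 1 depends on feature 0."""
    rows = []
    for _ in range(n):
        a = rng.gauss(0, 1)
        b = coupling * a + rng.gauss(0, 1)
        c = rng.gauss(0, 1)
        rows.append((a, b, c))
    return rows

rng = random.Random(0)
real = toy_samples(500, rng, coupling=0.8)
faithful = toy_samples(500, rng, coupling=0.8)  # same structure as "real"
degraded = toy_samples(500, rng, coupling=0.0)  # correlation structure lost

score_faithful = mean_abs_corr_diff(real, faithful)
score_degraded = mean_abs_corr_diff(real, degraded)
print(score_faithful, score_degraded)
```

A lower score indicates that the synthetic samples reproduce the pairwise feature correlations of the real samples more closely. In this toy setup, the set that keeps the coupling between the first two features should score lower than the set that drops it.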