Abstract:
Genomic data have long been used for trait association and disease risk prediction. In recent years, many such prediction models have been built using machine learning (ML) algorithms. Currently, human genomic data and other biomedical data suffer from sampling bias with respect to ethnicity, as most of the data come from people of European ancestry. Smaller sample sizes for other population groups can lead to suboptimal results from ML-based prediction models for those populations. In precision medicine, suboptimal predictions for a particular group can have serious consequences, limiting a model's applicability to real-world problems. Because collecting data for those populations is time-consuming and costly, we propose deep learning-based models for in-silico data enhancement. Existing Generative Adversarial Network (GAN) models for genomic data, such as Population scale Genomic conditional-GAN (PG-cGAN), can generate realistic genomic data when trained on fairly unbiased data but fail when trained on biased data, suffering severe mode collapse. Our proposed model, Offspring GAN, can resolve the mode collapse issue even when trained on strongly biased genomic datasets. Our results demonstrate that Offspring GAN can generate realistic and diverse label-aware data, which can augment limited real data to alleviate biases and disparities in genomic data. We also propose a privacy-preserving protocol that uses Offspring GAN to protect the privacy of genomic data.