Abstract:
Individual-level records from health and social services are routinely generated, collected and maintained centrally in nationwide registers. These records, when combined with a cohort study or survey, may increase the statistical power to detect associations between the outcome and risk factors. The biggest challenge in combining data from the population and the survey is missing risk factor data. Methods to handle missing data within the survey are well developed and widely used. Multiple imputation (MI) is one such method.
MI is popular because it avoids the potential bias and efficiency loss resulting from a complete-case analysis (CCA). This thesis studies how MI handles missing data in comparison with CCA for different types of covariates, namely continuous and categorical, with time-to-event and binary outcome data. It also discusses ways to include time-to-event data in the imputation model in the presence of right censoring and delayed entry. Furthermore, an empirical study has been conducted on population-level ischemic heart disease event data provided by the Finnish Institute for Health and Welfare (THL), which contain missing data in the selected risk factors. The overall results show that the MI method, with a sufficient number of imputations and iterations, is preferred in most scenarios.