
The impact of different imputation methods on estimates and model performance: an example using a risk prediction model for premature mortality

Abstract

Objective

To compare how different imputation methods affect the estimates and performance of a prediction model for premature mortality.

Study Design and Setting

Sex-specific Weibull accelerated failure time survival models were run on four separate datasets using complete case, mode, single and multiple imputation to handle missing values. Six performance measures were compared to assess predictive accuracy (Nagelkerke R2, Integrated Brier Score), discrimination (Harrell’s c-index, discrimination slope) and calibration (calibration in the large, calibration slope).

Results

The highest proportion of missingness for a single variable was 10.86% for the female model and 8.24% for the male model. Comparing the performance measures for complete case, mode, single and multiple imputation, the Nagelkerke R2 values for the female model were 0.1084, 0.1116, 0.1120 and 0.1111–0.1120, and the male model exhibited similar variation with values of 0.1050, 0.1078, 0.1078 and 0.1078–0.1081. Harrell’s c-index also demonstrated small variation, with values of 0.8666, 0.8719, 0.8719 and 0.8711–0.8719 for the female model and 0.8549, 0.8548, 0.8550 and 0.8550–0.8553 for the male model.

Conclusion

In the scenarios examined in this study, mode imputation applied to a population health survey performed well compared with single and multiple imputation when predictive performance is the main goal of the model. To generate unbiased hazard ratios, multiple imputation was superior. This study shows the need to consider the best imputation approach for predictive model development given the conditions of missing data and the goals of the analysis.


Introduction

Missing data is an inevitable challenge in health surveys and can compromise the representativeness of the sample, introduce bias, and reduce statistical power [1]. Several factors contribute to missing data, including non-response and survey administration errors. To address this issue, imputation methods have been developed, with several techniques employed in practice [2]. The choice of imputation method depends on several factors, including the type and pattern of missing data, the assumptions about the missingness mechanism, and the specific goals of the analysis [1, 2].

Prediction models are valuable tools that estimate the likelihood of future outcomes or events based on available data. These models serve diverse purposes in healthcare, clinical care and population health. Clinical risk prediction models assess individual patient risk and support treatment decisions, often relying on data from electronic patient records (e.g., blood pressure, bloodwork, genetic markers) [3]. On the other hand, population risk algorithms predict disease incidence, evaluate the impact of risk factors, and inform population health interventions directed at groups of people versus at the individual level [4]. The accuracy and reliability of a prediction model largely depends on the quality and representativeness of the data, which can be influenced by the presence of missing data and the methods used to address it [5]. Existing prediction model reporting guidelines, such as the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD), recommend reporting on missingness in development and validation datasets and how missing data were addressed [6]. Despite these recommendations, the reporting and handling of missing data in prediction models is often inadequate [7,8,9].

Although there is existing literature on imputation methods in the context of survey data, there is a notable gap in our understanding of the impact of missing data on prediction models based on population surveys. Therefore, the objective of this study is to compare four common approaches to handling missing values: complete case, mode imputation, single imputation, and multiple imputation. This comparison aims to assess the effect of each approach on model estimates and on model performance.

Methods

The Premature Mortality Population Risk Tool (PreMPoRT) [10, 11] was developed and validated to predict the five-year incidence of premature mortality among Canadian adults. Model predictors included sociodemographic characteristics, self-perceived measures, health behaviours, and chronic conditions from national survey data. PreMPoRT demonstrated strong reproducibility and transportability in different validation data and performed well among important equity-stratified subgroups. Additional details about the development and validation of PreMPoRT are found elsewhere [10, 11].

We apply four missing data approaches: complete case, mode imputation, single imputation and multiple imputation using fully conditional specification (FCS) [12]. Six performance measures were used to assess the impact of each imputation method on the prediction model. The study received ethics approval from the University of Toronto Research Ethics Board (Protocol #37499). This work was supported by the Canadian Institutes of Health Research Operating Grants (FRNs: 72056684 and 72051628). Laura Rosella is also supported by a Canada Research Chair in Population Health Analytics (FRN: 72060091).

Data sources

PreMPoRT used data from the Canadian Community Health Survey (CCHS), a cross-sectional survey containing information on self-reported sociodemographic characteristics, health status, health care utilization, and health determinants. The surveyed population represents 98% of the Canadian population aged 12 and older [13] and uses a complex-survey design, including clustering and stratification, to represent all regions in Canada. The CCHS was linked to the Canadian Vital Statistics Database (CVSD) to ascertain premature mortality during a five-year follow-up period after CCHS interview date [14]. Data were held at the Statistics Canada Research Data Centre.

Participants

The study cohort consisted of participants who contributed to any of the first six cycles, 1.1 (2000/01), 2.1 (2003/04), 3.1 (2005/06), 2007/08, 2009/10 or 2011/12 of the CCHS. Individuals were removed from the cohort if they were pregnant or living in the Territories (Nunavut, Northwest Territories and Yukon) at the time of their CCHS interview date. Since PreMPoRT was developed for the Canadian adult population, individuals under 18 years old or over 75 were excluded.

Model specification

PreMPoRT predicts premature mortality, which the Canadian Institute for Health Information (CIHI) defines as any death under the age of 75 [15]. Using death dates from the CVSD, the outcome is all-cause mortality within five years after the CCHS interview date and before the participant’s 75th birthday. PreMPoRT was developed using sex-specific Weibull accelerated failure time models. Participants were followed from the interview date until death, their 75th birthday, or five years of follow-up, whichever came first.
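The follow-up rule above can be illustrated with a minimal Python sketch. The function names (`add_years`, `follow_up`) and the handling of deaths that fall exactly on the censoring date are assumptions made for illustration, not details taken from the PreMPoRT code.

```python
from datetime import date


def add_years(d, years):
    """Shift a date by whole years (29 February falls back to 28 February)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def follow_up(interview, birth, death=None, horizon_years=5, age_limit=75):
    """Return (days of follow-up, event flag) for one participant.

    Follow-up ends at death, the 75th birthday, or five years after the
    interview, whichever comes first. A death on the censoring date is
    counted as an event here; that tie rule is an assumption.
    """
    censor_date = min(add_years(interview, horizon_years),
                      add_years(birth, age_limit))
    event = death is not None and death <= censor_date
    end = death if event else censor_date
    return (end - interview).days, event


# Hypothetical participant: interviewed 1 June 2005, died 1 June 2007.
print(follow_up(date(2005, 6, 1), date(1950, 1, 1), death=date(2007, 6, 1)))
```

A participant who turns 75 before the five-year horizon is censored at that birthday, so a later death is not counted as a premature death.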

Using 38 candidate predictors [10], PreMPoRT identified 12 predictors for the female model and 13 predictors for the male model. Both models contained age, household income quintile, education level, self-perceived general health, cigarette smoking, emphysema/COPD, heart disease, diabetes, cancer, and stroke. Body mass index (BMI) and physical activity were unique to the female model, while marital status, Alzheimer’s disease, and arthritis were unique to the male model.

To accurately represent the Canadian population, CCHS survey weights were developed by Statistics Canada to handle the complex-survey design and to represent certain demographic groups properly [16]. Since multiple cycles were used in the analysis, CCHS survey weights were pooled and divided by the number of cycles [17].
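As a minimal sketch of the weight adjustment described above, the snippet below divides each respondent's weight by the number of pooled cycles, so that the combined file represents roughly one average population rather than several stacked ones. The record layout and the name `pool_weights` are hypothetical, and the weights are made-up numbers.

```python
def pool_weights(records, n_cycles):
    """Return copies of the records with the survey weight divided by n_cycles."""
    return [{**r, "pooled_weight": r["weight"] / n_cycles} for r in records]


# Made-up respondents from two pooled cycles.
respondents = [
    {"cycle": "1.1", "weight": 1200.0},
    {"cycle": "1.1", "weight": 800.0},
    {"cycle": "2.1", "weight": 1000.0},
]
pooled = pool_weights(respondents, n_cycles=2)
print([r["pooled_weight"] for r in pooled])
```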

Imputing missing values: four approaches

We used four different methods to handle missing values. The first method was complete case, where any participant with one or more missing predictors was removed from the analysis. The second was mode imputation, where, within each sex-stratified CCHS cycle, each missing value was replaced with the most common value of that predictor.
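The first two approaches can be sketched in a few lines of Python. This is an illustration under stated assumptions: missing values are coded as `None`, the column names (`cycle`, `sex`, `smoker`) are hypothetical, and every stratum is assumed to contain at least one observed value for each predictor.

```python
from collections import Counter, defaultdict


def complete_case(rows, predictors):
    """Keep only participants with no missing predictor."""
    return [r for r in rows if all(r[p] is not None for p in predictors)]


def mode_impute(rows, predictors, strata=("cycle", "sex")):
    """Fill missing values with the most common value in the same cycle and sex."""
    counts = defaultdict(Counter)
    for r in rows:
        key = tuple(r[s] for s in strata)
        for p in predictors:
            if r[p] is not None:
                counts[key, p][r[p]] += 1

    imputed = []
    for r in rows:
        key = tuple(r[s] for s in strata)
        filled = dict(r)
        for p in predictors:
            if filled[p] is None:
                # Raises IndexError if the stratum has no observed value.
                filled[p] = counts[key, p].most_common(1)[0][0]
        imputed.append(filled)
    return imputed


rows = [
    {"cycle": "1.1", "sex": "F", "smoker": "never"},
    {"cycle": "1.1", "sex": "F", "smoker": "never"},
    {"cycle": "1.1", "sex": "F", "smoker": "current"},
    {"cycle": "1.1", "sex": "F", "smoker": None},
    {"cycle": "1.1", "sex": "M", "smoker": "current"},
    {"cycle": "1.1", "sex": "M", "smoker": None},
]
print(len(complete_case(rows, ["smoker"])))
print([r["smoker"] for r in mode_impute(rows, ["smoker"])])
```

Complete case shrinks the sample, whereas mode imputation keeps every participant but assigns the same value to all missing entries within a stratum.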

The third method was single imputation using FCS [12]. Although PreMPoRT identified 12 predictors for females and 13 for males, imputation was run using all 38 candidate variables and the outcome [18]. Imputation was run separately for each cycle and stratified by sex. However, due to convergence issues, all chronic conditions that had a low prevalence and less than 1% missingness were imputed as the absence of the condition (i.e., mode imputation for variables with less than 1% missing). These chronic conditions included emphysema/COPD, heart disease, diabetes, cancer, stroke, Alzheimer’s disease and arthritis. Afterwards, FCS was run with five burn-in iterations to allow the imputed values to converge before the imputed dataset was created. FCS used a different regression model for each variable type: logistic regression for binary variables, the discriminant function method for nominal variables and ordinal logistic regression for ordinal variables with more than two categories. Each variable was imputed within each CCHS cycle, with the exception of anxiety and mood disorder in the first cycle, as these questions were not asked in that cycle. After imputing all other variables within each cycle, anxiety and mood disorder for the first cycle were imputed using the next two CCHS cycles, as these were the other cycles within the development dataset. When building a prediction model, it is important to avoid leakage between the development and validation sets. Imputing within each CCHS cycle, and imputing anxiety and mood disorder using only the development cycles, avoids leakage from imputation.

The final method was multiple imputation (MI) which applied the same approach as single imputation to create four additional datasets for a total of five. The goal of MI is to generate multiple imputed datasets to observe how the distribution of the imputed values affects the results of the model.
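The text does not state how estimates from the five imputed datasets were combined. A common choice is Rubin's rules, sketched below as an assumption rather than as the authors' documented procedure. The function names and input values are hypothetical, and the confidence interval uses a normal approximation instead of the t reference distribution that Rubin's rules normally prescribe.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed datasets with Rubin's rules."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)       # average sampling variance
    between = variance(estimates)  # variance of the estimates across datasets
    total = within + (1 + 1 / m) * between
    return pooled, total


def pooled_hazard_ratio(log_hrs, std_errors, z=1.96):
    """Pooled HR with an approximate 95% CI (normal approximation)."""
    estimate, total_var = pool_rubin(log_hrs, [se ** 2 for se in std_errors])
    se = math.sqrt(total_var)
    return (math.exp(estimate),
            math.exp(estimate - z * se),
            math.exp(estimate + z * se))


# Made-up log hazard ratios and standard errors from five imputed datasets.
log_hrs = [0.18, 0.22, 0.15, 0.25, 0.20]
std_errors = [0.02, 0.02, 0.02, 0.02, 0.02]
print(pooled_hazard_ratio(log_hrs, std_errors))
```

Because the between-imputation variance is added to the average within-imputation variance, the pooled interval is at least as wide as the interval from a single imputed dataset with the same standard error.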

Model performance and measures

To compare the effects of the imputation methods on the prediction model, the Weibull-specific model parameters, hazard ratios (HRs), and performance measures were compared. The Weibull model parameters include the scale and shape parameters as well as the intercept. The hazard ratios express the rate of premature mortality relative to the reference group and were calculated for each predictor in the model. Finally, six performance measures were compared to assess the model’s overall predictive accuracy, discrimination, and calibration.
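The link between the Weibull parameters, predicted risk and hazard ratios can be sketched as follows. The snippet assumes the log-linear accelerated failure time parametrization, log T = intercept + xβ + scale × W with W following a standard extreme value distribution, under which shape = 1/scale and the hazard ratio for a coefficient β is exp(−β/scale). The parametrization actually used for PreMPoRT is not stated in the text, and all numeric values below are made up.

```python
import math


def weibull_aft_risk(t, intercept, scale, linear_predictor):
    """Predicted probability of an event by time t under a Weibull AFT model."""
    z = (math.log(t) - intercept - linear_predictor) / scale
    return 1.0 - math.exp(-math.exp(z))


def aft_coef_to_hr(beta, scale):
    """Convert an AFT coefficient to a hazard ratio (valid for the Weibull model)."""
    return math.exp(-beta / scale)


# Made-up values: a negative AFT coefficient shortens survival time,
# which corresponds to a hazard ratio above one.
scale, intercept, beta = 0.82, 4.0, -0.3
print(round(aft_coef_to_hr(beta, scale), 3))
print(round(weibull_aft_risk(5.0, intercept, scale, 0.0), 4))
print(round(weibull_aft_risk(5.0, intercept, scale, beta), 4))
```

The scale parameter appears in both functions, which is why a shift in scale changes every predicted risk even when the coefficients stay the same.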

The Nagelkerke R2 and Integrated Brier Score were used to assess the predictive accuracy. The Nagelkerke R2 measures the percent of variance explained by the model with a target value of one. The Integrated Brier Score measures the average squared difference between the outcome and the predicted risk (while taking censoring into account) with a target value of zero.
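For orientation, the sketch below shows the usual likelihood-based formula for the Nagelkerke R2 and a plain Brier score at a single time horizon. The Brier function ignores censoring, so it is a simplification of the Integrated Brier Score described above rather than an implementation of it. Function names and inputs are hypothetical.

```python
import math


def nagelkerke_r2(loglik_null, loglik_model, n):
    """Cox-Snell R2 rescaled so that its maximum is one."""
    cox_snell = 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)
    max_r2 = 1.0 - math.exp(2.0 * loglik_null / n)
    return cox_snell / max_r2


def brier_score(outcomes, risks):
    """Mean squared difference between 0/1 outcomes and predicted risks.

    Assumes every outcome is fully observed at the horizon (no censoring).
    """
    return sum((y - p) ** 2 for y, p in zip(outcomes, risks)) / len(outcomes)


# Made-up log-likelihoods and predictions.
print(round(nagelkerke_r2(-120.0, -100.0, 200), 4))
print(brier_score([1, 0, 0, 0], [0.5, 0.5, 0.5, 0.5]))
```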

Discrimination is how well the model can differentiate between those who experience an outcome and those who do not. This was assessed using Harrell’s concordance index (c-index), which is the number of concordant pairs divided by the total number of concordant and discordant pairs [18]. A pair compares two participants in the study. If the individual who had an event first had the higher predicted risk (concordant pair), the model properly predicted the outcome. However, if that individual had the lower predicted risk (discordant pair), the model did not properly predict the outcome. Discrimination was also assessed using the time-specific discrimination slope, which is the difference between the average predicted risk of those who had an event and the average predicted risk of those who did not.
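The two discrimination measures can be sketched directly from their definitions. The snippet is unweighted and follows the definition given in the text, so pairs with tied predicted risks are ignored rather than counted as half-concordant. Function names and data are hypothetical.

```python
def harrell_c(times, events, risks):
    """Concordant pairs divided by concordant plus discordant pairs.

    A pair is usable when the participant with the shorter follow-up time
    had an event; the other participant may have an event later or be censored.
    """
    concordant = discordant = 0
    for i in range(len(times)):
        if not events[i]:
            continue
        for j in range(len(times)):
            if times[i] < times[j]:
                if risks[i] > risks[j]:
                    concordant += 1
                elif risks[i] < risks[j]:
                    discordant += 1
    return concordant / (concordant + discordant)


def discrimination_slope(events, risks):
    """Mean predicted risk among events minus mean among non-events."""
    with_event = [p for e, p in zip(events, risks) if e]
    without_event = [p for e, p in zip(events, risks) if not e]
    return (sum(with_event) / len(with_event)
            - sum(without_event) / len(without_event))


times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
risks = [0.9, 0.1, 0.5, 0.7]
print(harrell_c(times, events, risks))
print(discrimination_slope([1, 1, 0, 0], [0.9, 0.7, 0.1, 0.3]))
```

In this made-up example three of the five usable pairs are concordant, giving a c-index of 0.6.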

Finally, calibration was measured using calibration in the large and the calibration slope. Calibration in the large is the difference between the average observed risk (normally calculated using Kaplan-Meier curves) and the average predicted risk. The calibration slope assesses whether the betas are well calibrated for the model. A slope of one indicates perfect calibration, less than one indicates the betas are overestimating the predicted risk, and more than one indicates the betas are underestimating the predicted risk. In addition, calibration plots were produced to further show the effect of imputation methods on the calibration of the prediction model.
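Calibration in the large can be illustrated with a small, unweighted Kaplan-Meier estimator. This is a sketch only: survey weights are ignored, the function names are hypothetical, and the calibration slope is not shown because it requires refitting a model on the linear predictor.

```python
def kaplan_meier_risk(times, events, horizon):
    """Observed risk by the horizon, computed as 1 - S(horizon)."""
    survival = 1.0
    event_times = sorted({t for t, e in zip(times, events) if e and t <= horizon})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if e and u == t)
        survival *= 1.0 - deaths / at_risk
    return 1.0 - survival


def calibration_in_the_large(times, events, predicted_risks, horizon):
    """Average observed risk minus average predicted risk."""
    observed = kaplan_meier_risk(times, events, horizon)
    predicted = sum(predicted_risks) / len(predicted_risks)
    return observed - predicted


# Made-up follow-up times in years; events flag deaths, zeros are censored.
times = [1, 2, 3, 5, 5]
events = [1, 0, 1, 0, 0]
print(round(kaplan_meier_risk(times, events, horizon=5), 4))
print(round(calibration_in_the_large(times, events, [0.4] * 5, horizon=5), 4))
```

A value near zero means the average predicted risk matches the observed risk; a positive value means the model underpredicts on average.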

Results

The highest proportion of missingness in any one variable was 10.86% for the female model and 8.24% for the male model. All chronic conditions, marital status, self-perceived general health and physical activity had less than 1% missingness. BMI, smoking status and individual education all had between 1% and 5% missingness, with income quintiles being the only variable with more than 5% missingness.

Baseline characteristics

Table 1 shows the weighted percent of baseline characteristics with unweighted total counts from the cohort rounded to the nearest thousand to adhere to Statistics Canada’s export requirements. All datasets have a total of 267,000 for females and 233,000 for males, except for the complete case, which had a total of 221,000 (17% removed) for females and 195,000 (16% removed) for males. Across all imputations, a total of 1.41% of females and 2.06% of males experienced premature death, except for complete case (1.27% premature deaths for females and 1.93% premature deaths for males). There were no notable differences across imputation methods, apart from household income quintiles, which had a missingness of 10.86% for females and 8.24% for males. The lowest income quintile for females had a missingness of 15.11% for complete case and 15.50–21.14% for the remaining imputation methods. For males, the biggest difference was in the highest income quintile, with 29.24% missingness for complete case and 28.48–31.92% for other imputation methods.

Table 1 Baseline characteristics

Performance measures

Table 2 shows the variation in performance measures when applying the four imputation methods. The Nagelkerke R2 for the female model was 0.1084 for complete case, with the remaining imputation methods ranging between 0.1111 and 0.1120. The Nagelkerke R2 for the male model was 0.1050 for complete case, with a range of 0.1078–0.1081 for the other imputation methods. The c-index for complete case was 0.8666 for females and 0.8549 for males, with the remaining methods giving a range of 0.8711–0.8719 and 0.8548–0.8553 for females and males, respectively.

The performance measures for calibration changed minimally across imputation methods. In addition to the performance measures, Figs. 1 and 2 show the average observed risk of premature mortality against the predicted risk of the model for females and males, respectively. Predicted risk is shown in deciles and the percent of observed cases that had a premature death in each decile was reported. Perfect calibration represents a slope of 1. The supplementary materials contain additional calibration plots from select predictors, including age groups, education level, ethnicity, immigration status and material deprivation. These show the percentage of premature deaths and compare them to the average predicted risk from each imputation method.
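The decile comparison behind the calibration plots can be sketched as below. The snippet treats the outcome as a fully observed 0/1 indicator and ignores survey weights and censoring, both of which are simplifications; the function name and the data are hypothetical.

```python
def calibration_by_decile(risks, outcomes, n_groups=10):
    """Return (group, mean predicted risk, observed proportion) per risk group."""
    order = sorted(range(len(risks)), key=lambda i: risks[i])
    n = len(order)
    table = []
    for g in range(n_groups):
        members = order[g * n // n_groups:(g + 1) * n // n_groups]
        if not members:
            continue
        mean_predicted = sum(risks[i] for i in members) / len(members)
        observed = sum(outcomes[i] for i in members) / len(members)
        table.append((g + 1, mean_predicted, observed))
    return table


# Made-up data: 100 participants, events only among the ten highest risks.
risks = [i / 100 for i in range(100)]
outcomes = [1 if i >= 90 else 0 for i in range(100)]
for group, predicted, observed in calibration_by_decile(risks, outcomes):
    print(group, round(predicted, 3), observed)
```

Plotting the observed proportion against the mean predicted risk for each group gives a calibration plot in which points on the diagonal indicate agreement between predicted and observed risk.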

Table 2 Performance measures

Hazard ratios and confidence intervals

Table 3 shows the Weibull parameters and the HRs for the female and male models by imputation method. The female scale parameter was 0.7852 for complete case, and varied from 0.8194 to 0.8200 for the remaining imputation methods. The male scale parameter was 0.8137 for complete case and ranged from 0.8468 to 0.8472 for the other imputation methods. The HRs for all chronic conditions, age, self-perceived general health, cigarette smoking, physical activity, and marital status remained relatively unchanged between the imputation methods, with the exception of complete case, which showed noticeable differences across almost all predictors. Excluding complete case, household income demonstrated the biggest difference in confidence intervals for the female model. Specifically, the lowest income quintile (Q1) ranged from 1.22 to 1.26 for mode imputation to 1.09–1.33 for multiple imputation. The second highest quintile (Q4) ranged from 1.11 to 1.15 for mode imputation to 0.95–1.16 for multiple imputation. We observed similar variation for the male model, with the lowest income quintile (Q1) ranging from 1.36 to 1.40 for mode imputation to 1.31–1.54 for multiple imputation, and the second highest quintile (Q4) ranging from 1.12 to 1.15 for mode imputation to 1.10–1.19 for multiple imputation.

Table 3 Hazard ratios
Fig. 1 Calibration plot of predicted risk deciles versus average observed premature mortality for females

Fig. 2 Calibration plot of predicted risk deciles versus average observed risk of premature mortality for males

Discussion

Although there are other imputation methods involving machine learning, this study aimed to investigate the effects of four missing data techniques on model coefficients and performance from a linked health survey. Our findings suggest that complete case imputation is not suitable for handling missing data when developing a prediction model. Interestingly, performance measures exhibited minimal changes across mode, single and multiple imputation. However, multiple imputation proved essential in obtaining accurate HRs and confidence intervals for predictors with a higher degree of missingness.

Complete case

Although complete case imputation is a commonly used technique for handling missing data, it is known to produce biased estimates and large standard errors when the missing data is not Missing Completely At Random (MCAR) [2]. In our study, the prevalence of premature deaths was reduced despite only removing a relatively small amount of the cohort. This observation suggests that individuals who experienced a premature death were more likely to have missing information, indicating a failure of the MCAR assumption. This bias is also particularly evident in the Weibull scale parameter.

While mode, single, and multiple imputation demonstrated only minor variations in the scale values, complete case imputation exhibited noticeable differences. Given that the scale parameter directly impacts the baseline survival, even slight changes can result in differences in predicted probabilities. The Nagelkerke R2, c-index and calibration-in-the-large all indicated poorer performance in the models using complete case imputation compared to mode, single, and multiple imputation, both for the female and male models. These results strongly suggest that complete case imputation is an inadequate method and should be avoided [2, 19].

Comparing performance measures

The results demonstrate similar performance when comparing mode, single, and multiple imputation techniques, with only marginal differences observed. This suggests that for risk prediction, single and multiple imputation offer minimal to no discernible benefit to model performance compared to mode imputation. Furthermore, when examining the calibration plots, all approaches tend to overpredict premature mortality in the higher-risk groups. This is due to less than 2% of the population having a five-year risk of premature mortality greater than 20%. However, the variations between the different imputation methods are relatively minor, suggesting that the choice of imputation method has limited impact on the calibration of the models.

Comparing hazard ratios and confidence intervals

When comparing the imputation methods, the differences in HRs and confidence intervals are heavily influenced by the percent of missingness in each variable. Variables with less than 1% missingness, such as marital status, self-perceived general health, physical activity, and chronic conditions, show minimal changes in HRs. Multiple imputation, however, tends to yield slightly larger confidence intervals due to the inclusion of additional variance from the HRs across the five imputed datasets. Predictors with a higher degree of missingness, but still below 5%, demonstrate larger changes in HRs and wider ranges in the confidence intervals when employing multiple imputation. These predictors include individual education level, BMI, and smoking status.

Household income surpassed 5% missingness and exhibited notable differences in the female model. Multiple imputation showed that the confidence intervals were underestimated in mode and single imputation. While all income quintiles, except the lowest income group (Q1), were found to be statistically significant in mode and single imputation, they were no longer statistically significant when using multiple imputation. For males, household income remained nearly unchanged between mode, single, and multiple imputation, just with larger confidence intervals. Consequently, for variables with higher levels of missingness, it is difficult to predict whether their estimated effects will differ across imputed datasets or remain consistent.

Limitations

This study should be interpreted considering the following limitations. First, individuals residing in the territories had to be removed given that area-based measures and household income were completely missing. Second, due to convergence issues with multiple imputation, all chronic conditions with low percent missingness were assigned the absence of the given condition (the most common occurrence in the data) and thus the effects of the different imputation methods could not be properly tested for these predictors. Third, the highest missingness of a single variable was less than 11% and thus we could not compare the difference for variables with larger missingness. The results here may not apply to data with higher levels of missingness.

Conclusions

When dealing with missing data in population-based studies, the choice of imputation method depends on the specific goals of the analysis. Researchers should consider the trade-offs between simplicity and accuracy when selecting the appropriate imputation method for their analysis. Both single imputation and multiple imputation are complex imputation methods, requiring more time and methodological knowledge to properly impute missing data. As such, when working with population-based data with similar missingness, if the reader is solely interested in the overall performance of the model and not the individual effects of the predictors, mode imputation is an option. However, if an accurate estimation of predictor effects is of interest, the selection of the imputation method should consider the percentage of missingness in the variables. When predictors have a small percentage of missing values (less than 5%), then mode imputation is satisfactory. Once predictors have a higher percentage of missingness (5% or more), imputed values will introduce greater variability. In such cases, multiple imputation becomes essential to capture the effect of the imputed values accurately.

Data availability

All data used in this study belong to Statistics Canada and cannot be shared publicly because they contain personal health information at the individual level. Data access is only permitted through Statistics Canada Research Data Centres (more information on eligibility and the data request process can be found here: https://www.statcan.gc.ca/en/microdata/data-centres).

References

  1. Kang H. The prevention and handling of the missing data. Korean J Anesthesiol. 2013;64(5):402–6. https://doi.org/10.4097/kjae.2013.64.5.402.


  2. Newman DA. Missing Data: five practical guidelines. Organizational Res Methods. 2014;17(4):372–411. https://doi.org/10.1177/1094428114548590.


  3. Moons KGM, Royston P, Vergouwe Y, Grobbee DE, Altman DG. Prognosis and prognostic research: what, why, and how? BMJ. 2009;338:b375. https://doi.org/10.1136/bmj.b375.


  4. Manuel DG, Rosella LC, Hennessy D, et al. Predictive risk algorithms in a population setting: an overview. J Epidemiol Community Health. 2012;66:859–65.


  5. Nijman SWJ, Leeuwenberg AM, Beekers I, Verkouter I, Jacobs J, Bots ML, et al. Missing data is poorly handled and reported in prediction model studies using machine learning: a literature review. J Clin Epidemiol. 2022;142:218–229.

  6. Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD). Ann Intern Med. 2015;162(10):735–6.


  7. Collins GS, et al. A systematic review finds prediction models for chronic kidney disease were poorly reported and often developed using inappropriate methods. J Clin Epidemiol. 2013;66(3):268–77.


  8. Tsvetanova A, et al. Missing data was handled inconsistently in UK prediction models: a review of method used. J Clin Epidemiol. 2021;140:149–58.


  9. Karahalios A, et al. A review of the reporting and handling of missing data in cohort studies with repeated assessment of exposure measures. BMC Med Res Methodol. 2012;12(1):1–10.


  10. O’Neill M, Hurst M, Pagalan L, Diemert L, Kornas K et al. Development and validation of a Population based risk algorithm for premature mortality: The Premature Mortality Population Risk Tool (PreMPoRT).

  11. Rosella LC, O’Neill M, Fisher S, Hurst M, Diemert L, Kornas K, et al. A study protocol for a predictive algorithm to assess population-based premature mortality risk: premature Mortality Population Risk Tool (PreMPoRT). Diagn Progn Res. 2020;4(1):18.


  12. van Buuren S. Multiple imputation of discrete and continuous data by fully conditional specification. Stat Methods Med Res. 2007;16(3):219–42.


  13. Beland Y. Canadian community health survey–methodological overview. Health Rep. 2002;13(3):9–14.


  14. Statistics Canada. Canadian Vital Death Statistics Database (CVSD) linked to Discharge Abstract Database (DAD) and National Ambulatory Care Reporting System (NACRS). [https://www.statcan.gc.ca/en/microdata/data-centres/data/cvsd-nacrs].

  15. Canadian Institute for Health Information. Health Indicators e-Publication. [https://www.cihi.ca/en/health-indicators-e-publication].

  16. Statistics Canada. Canadian Community Health Survey (CCHS) Household weights documentation. [https://www23.statcan.gc.ca/imdb-bmdi/pub/document/3226_D57_T9_V1-eng.htm].

  17. Thomas S, Wannell B. Combining cycles of the Canadian Community Health Survey. Health Rep. 2009;20(1):53–8.


  18. Harrell. Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. Springer; 2001.

  19. Daniel RM, Kenward MG, Cousens SN, De Stavola BL. Using causal diagrams to guide analysis in missing data problems. Stat Methods Med Res. 2012;21(3):243–56. https://doi.org/10.1177/0962280210394469.



Funding

This work was supported by the Canadian Institutes of Health Research Operating Grants (FRNs: 72056684 and 72051628). Laura Rosella is also supported by a Canada Research Chair in Population Health Analytics (FRN: 72060091).

Author information

Contributions

M.H. and M.O. completed all the analysis work. M.H. and L.R. wrote the initial manuscript. All authors reviewed and provided important feedback and changes to the manuscript.

Corresponding author

Correspondence to Laura C. Rosella.

Ethics declarations

Ethical approval

The study received ethics approval from the University of Toronto Research Ethics Board (Protocol #37499).

Consent to participate

All respondents in the CCHS cycles gave consent to share their survey information with provincial and federal ministries of health as well as to link their responses to their administrative data (more information available at: https://www.statcan.gc.ca/en/microdata/data-centres/data/cencchs-imdb).

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Hurst, M., O’Neill, M., Pagalan, L. et al. The impact of different imputation methods on estimates and model performance: an example using a risk prediction model for premature mortality. Popul Health Metrics 22, 13 (2024). https://doi.org/10.1186/s12963-024-00331-3
