Table 1 Summary of imputation methods

From: Addressing missing values in routine health information system data: an evaluation of imputation methods using data from the Democratic Republic of the Congo during the COVID-19 pandemic

| Single or multiple imputation | Name of imputation method | Description | R package | Level of complexity to implement |
| --- | --- | --- | --- | --- |
| SI | 1. Mean imputation | Missing values are replaced with the mean of all non-missing observations in the same month | N/A | Easy |
| SI | 2. Exclusion and interpolation | First, any facility with three or more consecutive missing monthly reports is excluded. Next, missing values in the remaining facilities are filled by interpolation | N/A | Easy |
| SI | 3. Nonparametric missing value imputation using random forest (missForest) | missForest is a relatively new random forest-based method that treats the variable with missing values as a dependent variable and regresses it against all the other variables in the dataset through a random forest model. This process is repeated iteratively, and in each step the missing values are filled with a better prediction. The iteration stops when a threshold is met, i.e. when the changes in the imputed values between steps become small enough. The method is popular because it handles both categorical and numerical data and requires very little manual parameter tuning [4] | missForest [14] | Moderate |
| SI | 4. k Nearest neighbour (k-NN) | For each observation with a missing value, the k-NN algorithm finds the k non-missing observations that are most similar to it, as measured by their distance from it. The missing value is then filled with a weighted average of these k neighbours, with weights based on their Euclidean distances to the observation being imputed. One difficulty with this method is the choice of k. In our study, we use the default of k = 10 nearest neighbours, but k can be tuned more carefully through cross-validation [17] | DMwR [18] | Difficult: users are required to specify the parameter k |
| SI | 5. Seasonal decomposition | Seasonal decomposition is tailored to missing values in time series data and can be summarized in three steps. First, the seasonal component is identified and removed from the original time series. Next, the missing values are imputed on the deseasonalized series. Finally, the seasonal component is added back to reflect seasonality [19] | imputeTS [19] | Easy |
| MI | Multiple imputation | Multiple imputation also treats the variable with missing values as a dependent variable and estimates it from the rest of the variables. The estimation is repeated M times, each time with a random component, so that the estimates differ slightly and reflect the uncertainty in the missing values. The procedure returns M datasets with slightly different estimates of the missing values, and averaging across the M estimates yields an unbiased estimate of the missing values. The multiple imputation by chained equations (mice) implementation in R, in particular, enables iterative estimation of missing values in multiple variables and provides flexibility in imputing both categorical and continuous variables [20]. The two methods listed below (i.e. methods 6 and 7) both belong to the MI family | mice [21] | Moderate to difficult |
| MI | 6. MI with predictive mean matching (PMM) | In each iteration of the mice procedure, each missing value is filled with the value of a donor, which is a complete data point whose predicted score from a fitted regression model is the closest to the predicted score of the missing data point [22] | mice [21] | Moderate |
| MI | 7. MI with 2-level Poisson | In each iteration of the mice procedure, imputation is accomplished through a mixed effects Poisson model which accounts for the longitudinal structure and/or cluster membership of the data | countimp [23] | Difficult: users need to understand the dataset structure |
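
Methods 1 and 2 are listed without an R package, so a small illustration of how they might be coded may help. The following is a minimal sketch in plain Python, not the authors' implementation. It assumes a hypothetical data layout in which each facility maps to a list of monthly counts, with `None` marking a missing report. The table does not say which kind of interpolation is used, so linear interpolation is assumed here, and gaps at the start or end of a series are assumed to copy the nearest observed value. All function and variable names are illustrative.

```python
from statistics import mean


def mean_impute(reports):
    """Method 1 (sketch): fill each gap with the mean of the values that
    other facilities reported in the same month."""
    n_months = len(next(iter(reports.values())))
    monthly_means = []
    for m in range(n_months):
        observed = [s[m] for s in reports.values() if s[m] is not None]
        # If no facility reported in a month, the gap is left as None.
        monthly_means.append(mean(observed) if observed else None)
    return {
        fac: [monthly_means[m] if v is None else v for m, v in enumerate(s)]
        for fac, s in reports.items()
    }


def longest_gap(series):
    """Length of the longest run of consecutive missing reports."""
    longest = current = 0
    for v in series:
        current = current + 1 if v is None else 0
        longest = max(longest, current)
    return longest


def interpolate(series):
    """Linear interpolation between the nearest observed neighbours.
    Assumption: leading/trailing gaps copy the nearest observed value."""
    known = [i for i, v in enumerate(series) if v is not None]
    filled = list(series)
    if not known:
        return filled
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            frac = (i - left) / (right - left)
            filled[i] = series[left] + frac * (series[right] - series[left])
    return filled


def exclude_and_interpolate(reports, max_gap=3):
    """Method 2 (sketch): drop facilities with max_gap or more consecutive
    missing reports, then interpolate the remaining gaps."""
    return {
        fac: interpolate(s)
        for fac, s in reports.items()
        if longest_gap(s) < max_gap
    }


# Hypothetical toy data: three facilities, four months.
reports = {
    "A": [10, None, 14, 16],
    "B": [20, 22, None, None],
    "C": [None, None, None, 30],
}

print(mean_impute(reports))
print(exclude_and_interpolate(reports))
```

In this toy example, facility C has three consecutive missing reports and is therefore dropped by method 2, while method 1 keeps it and fills its gaps from the other facilities' values for the same months.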