The required mice version for this practical exercise is mice > 3.1.0. We’ll install the latest version in this exercise.

This is the first exercise in the series. It will give you an introduction to the R-package mice, an open-source tool for flexible imputation of incomplete data. Over the last decade, mice has become an important piece of imputation software, offering a very flexible environment for dealing with incomplete data. Moreover, the ability to integrate mice with other packages in R, and vice versa, offers many options for applied researchers.

The aim of this introduction is to enhance your understanding of multiple imputation in general. You will learn how to multiply impute simple datasets and how to obtain the imputed data for further analysis. The main objective is to increase your knowledge and understanding of applications of multiple imputation.

All the best,

Gerko and Stef

mice uses R’s random number generator to draw values with a probabilistic nature. Therefore, each time we use mice we will get slightly different results. To avoid this, we can fix the seed value of the random number generator.

set.seed(123)

With this seed you’ll get the exact same results if you follow the steps in this document. If you obtain different results, you have changed the order of the steps either by adding or re-running a step.
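As a minimal base-R sketch of what fixing the seed does (the seed value 42 and the object names are arbitrary choices made for this illustration), two draws that start from the same seed are identical, while a draw that simply continues the random number stream is not:

```r
# Draws that start from the same seed are reproducible.
set.seed(42)            # arbitrary illustrative seed
first <- rnorm(3)

set.seed(42)            # reset to the same seed
second <- rnorm(3)      # reproduces `first` exactly

third <- rnorm(3)       # continues the stream, so it differs from `first`

identical(first, second)
identical(first, third)

set.seed(123)           # restore the seed used in this document
```

The last line resets the seed to 123, so running this side illustration should not shift the results of the steps that follow.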

Installing the latest version of `mice`

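We can install the latest release of mice from CRAN with install.packages(), as shown below. Alternatively, package devtools (which needs to be installed) can grab the latest development version directly from the mice GitHub page and compile the mice package from source.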

install.packages("mice")

You can update all dependencies with more recent versions, if asked.

Working with mice

1. Open R and load the package mice

library(mice)

The version number for your mice can be found by running

version()

## [1] "mice 3.11.0 2020-08-02 /Library/Frameworks/R.framework/Versions/4.0/Resources/library"
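If you prefer to check the version requirement programmatically, a small sketch with packageVersion() from the utils package could look like this (the object name installed is just an illustrative choice):

```r
# Compare the installed mice version with the requirement stated at the top
# of this exercise (mice > 3.1.0).
installed <- packageVersion("mice")
installed

# TRUE when the installed version is recent enough
installed > "3.1.0"
```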

2. Inspect the incomplete data

The mice package contains several datasets. Once the package is loaded, these datasets can be used. Have a look at the nhanes dataset (Schafer, 1997, Table 6.14) by typing

nhanes

##    age  bmi hyp chl
## 1    1   NA  NA  NA
## 2    2 22.7   1 187
## 3    1   NA   1 187
## 4    3   NA  NA  NA
## 5    1 20.4   1 113
## 6    3   NA  NA 184
## 7    1 22.5   1 118
## 8    1 30.1   1 187
## 9    2 22.0   1 238
## 10   2   NA  NA  NA
## 11   1   NA  NA  NA
## 12   2   NA  NA  NA
## 13   3 21.7   1 206
## 14   2 28.7   2 204
## 15   1 29.6   1  NA
## 16   1   NA  NA  NA
## 17   3 27.2   2 284
## 18   2 26.3   2 199
## 19   1 35.3   1 218
## 20   3 25.5   2  NA
## 21   1   NA  NA  NA
## 22   1 33.2   1 229
## 23   1 27.5   1 131
## 24   3 24.9   1  NA
## 25   2 27.4   1 186

The nhanes dataset is a small data set with non-monotone missing values. It contains 25 observations on four variables: age group, body mass index, hypertension and cholesterol (mg/dL).

To learn more about the data, use one of the two following help commands:

help(nhanes)
?nhanes

3. Get an overview of the data by the summary() command:

summary(nhanes)

##       age            bmi             hyp             chl       
##  Min.   :1.00   Min.   :20.40   Min.   :1.000   Min.   :113.0  
##  1st Qu.:1.00   1st Qu.:22.65   1st Qu.:1.000   1st Qu.:185.0  
##  Median :2.00   Median :26.75   Median :1.000   Median :187.0  
##  Mean   :1.76   Mean   :26.56   Mean   :1.235   Mean   :191.4  
##  3rd Qu.:2.00   3rd Qu.:28.93   3rd Qu.:1.000   3rd Qu.:212.0  
##  Max.   :3.00   Max.   :35.30   Max.   :2.000   Max.   :284.0  
##                 NA's   :9       NA's   :8       NA's   :10

Using summary() on data sets is often informative, because the distributional information (continuous variables) or the frequency distribution (factors) for every column in your data frame is printed to the R console. However, if there are too many variables, a step-by-step approach may be more useful.

4. Inspect the missing data pattern

Check the missingness pattern for the nhanes dataset

md.pattern(nhanes)

##    age hyp bmi chl   
## 13   1   1   1   1  0
## 3    1   1   1   0  1
## 1    1   1   0   1  1
## 1    1   0   0   1  2
## 7    1   0   0   0  3
##      0   8   9  10 27

The missingness pattern shows that there are 27 missing values in total: 10 for chl, 9 for bmi and 8 for hyp. Moreover, there are thirteen completely observed rows, four rows with 1 missing value, one row with 2 missing values and seven rows with 3 missing values. Looking at the missing data pattern is always useful (but may be difficult for datasets with many variables). It can give you an indication of how much information is missing and how the missingness is distributed.
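The counts read off the pattern can also be computed directly with base R. The following sketch (object names are illustrative) should reproduce the column totals, the row counts per number of missing values, and the overall total of 27:

```r
library(mice)  # provides the nhanes data

# Missing values per variable (the bottom row of the md.pattern output)
n_missing_per_column <- colSums(is.na(nhanes))
n_missing_per_column

# How many rows have 0, 1, 2 or 3 missing values
rows_by_missing_count <- table(rowSums(is.na(nhanes)))
rows_by_missing_count

# Total number of missing cells
sum(is.na(nhanes))
```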

Ad Hoc imputation methods

5. Form a regression model where age is predicted from bmi.

We can use the with() family of functions for this. The following function call

fit <- with(nhanes, lm(age ~ bmi))

evaluates with(data, expression), so it evaluates the linear model lm(age ~ bmi) on data set nhanes. The resulting object fit is equivalent to the output from lm(age ~ bmi, data = nhanes). We introduce the with() function now, because we need it later when we start evaluating analytical models on multiply imputed data sets.
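To see that both ways of fitting the model agree, we can compare the estimated coefficients. This is a small sketch with illustrative object names; the two objects record different calls, but the estimates should be the same:

```r
library(mice)  # provides the nhanes data

fit_with <- with(nhanes, lm(age ~ bmi))    # model evaluated inside the data
fit_data <- lm(age ~ bmi, data = nhanes)   # model with an explicit data argument

# Same estimates
all.equal(coef(fit_with), coef(fit_data))

# Only the recorded call differs
fit_with$call
fit_data$call
```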

If we ask the summary of the fitted regression analysis, we obtain:

summary(fit)

## 
## Call:
## lm(formula = age ~ bmi)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2660 -0.5614 -0.1225  0.4660  1.2344 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  3.76718    1.31945   2.855   0.0127 *
## bmi         -0.07359    0.04910  -1.499   0.1561  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8015 on 14 degrees of freedom
##   (9 observations deleted due to missingness)
## Multiple R-squared:  0.1383, Adjusted R-squared:  0.07672 
## F-statistic: 2.246 on 1 and 14 DF,  p-value: 0.1561

There is no significant effect for bmi when we model the age variable.

6. Impute the missing data in the nhanes dataset with mean imputation. The following function call imputes the mean (method = "mean") for every incomplete column in the nhanes data set and returns a single (m = 1) imputed data set. The algorithm - i.e. the method that generates the imputations - has been given a single iteration (maxit = 1) to reach convergence. We’ll dive into the specifics of algorithmic convergence with mice in the next exercise.

imp <- mice(nhanes, method = "mean", m = 1, maxit = 1)

## 
##  iter imp variable
##   1   1  bmi  hyp  chl

The imputations are now done. Running only a single imputation is practically efficient, as substituting each missing datum multiple times with the observed data mean would not make any sense (the inference would be equal, no matter which imputed dataset we would analyze). Likewise, more iterations would be computationally inefficient as the observed data mean does not change based on our imputations. We named the imputed object imp following the convention used in mice, but if you wish you can name it anything you’d like.

7. Explore the imputed data with the complete() function. What do you think the variable means are? What happened to the regression equation after imputation?

We use the function complete(), which by default returns the first completed data set. Since we only have a single imputation for every missing datum, this makes sense and we do not have to change the default behavior of complete().

complete(imp)

##    age     bmi      hyp   chl
## 1    1 26.5625 1.235294 191.4
## 2    2 22.7000 1.000000 187.0
## 3    1 26.5625 1.000000 187.0
## 4    3 26.5625 1.235294 191.4
## 5    1 20.4000 1.000000 113.0
## 6    3 26.5625 1.235294 184.0
## 7    1 22.5000 1.000000 118.0
## 8    1 30.1000 1.000000 187.0
## 9    2 22.0000 1.000000 238.0
## 10   2 26.5625 1.235294 191.4
## 11   1 26.5625 1.235294 191.4
## 12   2 26.5625 1.235294 191.4
## 13   3 21.7000 1.000000 206.0
## 14   2 28.7000 2.000000 204.0
## 15   1 29.6000 1.000000 191.4
## 16   1 26.5625 1.235294 191.4
## 17   3 27.2000 2.000000 284.0
## 18   2 26.3000 2.000000 199.0
## 19   1 35.3000 1.000000 218.0
## 20   3 25.5000 2.000000 191.4
## 21   1 26.5625 1.235294 191.4
## 22   1 33.2000 1.000000 229.0
## 23   1 27.5000 1.000000 131.0
## 24   3 24.9000 1.000000 191.4
## 25   2 27.4000 1.000000 186.0

We see the repeated values 26.5625 for bmi, 1.235294 for hyp, and 191.4 for chl. These can be confirmed as the means of the respective variables (columns):

colMeans(nhanes, na.rm = TRUE)

##        age        bmi        hyp        chl 
##   1.760000  26.562500   1.235294 191.400000

We’ve seen during the inspection of the missing data pattern that variable age has no missing values. Therefore nothing is imputed for age because we would not want to alter the observed (and bona fide) values.
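Mean imputation is simple enough to do by hand, which makes for a useful check. The sketch below (object names such as manual and imp_mean are illustrative) fills in the column means with a loop and compares the result with the completed data from mice; printFlag = FALSE only suppresses the iteration history:

```r
library(mice)

# Mean imputation by hand: replace every NA by the observed column mean.
manual <- nhanes
for (v in names(manual)) {
  missing <- is.na(manual[[v]])
  manual[[v]][missing] <- mean(manual[[v]], na.rm = TRUE)
}

# The same imputation through mice
imp_mean <- mice(nhanes, method = "mean", m = 1, maxit = 1, printFlag = FALSE)

# Compare the values cell by cell
all.equal(as.matrix(manual), as.matrix(complete(imp_mean)),
          check.attributes = FALSE)
```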

To inspect the regression model with the imputed data, run:

fit <- with(imp, lm(age ~ bmi))
summary(fit)

## # A tibble: 2 x 6
##   term        estimate std.error statistic p.value  nobs
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl> <int>
## 1 (Intercept)   3.71      1.33        2.80  0.0103    25
## 2 bmi          -0.0736    0.0497     -1.48  0.152     25

It is clear that our inference did not change, but then again this is not surprising as variable bmi is more-or-less normally distributed and we are just adding weight to the mean.

densityplot(nhanes$bmi)
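Adding weight to the mean has a price that the regression table does not show directly: the mean of bmi stays the same, but its spread shrinks. A short sketch (illustrative object names) makes this visible:

```r
library(mice)  # provides the nhanes data

bmi_observed <- nhanes$bmi[!is.na(nhanes$bmi)]

# Mean-imputed version of bmi, constructed by hand
bmi_mean_imputed <- ifelse(is.na(nhanes$bmi), mean(bmi_observed), nhanes$bmi)

# The mean is unchanged ...
mean(bmi_observed)
mean(bmi_mean_imputed)

# ... but the standard deviation is smaller after mean imputation
sd(bmi_observed)
sd(bmi_mean_imputed)
```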

8. Impute the missing data in the nhanes dataset with regression imputation.

We can use the same function call as under step 6, but now with method = "norm.predict", which yields predictions from the normal linear regression model.

imp <- mice(nhanes, method = "norm.predict", m = 1, maxit = 1)

## 
##  iter imp variable
##   1   1  bmi  hyp  chl

The imputations are now done. This code imputes the missing values in the data set by the regression imputation method. The argument method = "norm.predict" first fits a regression model for each incomplete variable, using the rows where that variable is observed and the corresponding values in the other variables as predictors, and then imputes the missing values with the fitted (predicted) values from the normal linear regression model.
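The idea behind regression imputation can be sketched in a few lines of base R. The toy data below are made up for illustration, and the code is a simplified stand-in for the principle, not the way mice implements norm.predict internally:

```r
# Toy data: x is complete, y has three missing values (values are made up).
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  y = c(2.1, 3.9, NA, 8.2, NA, 12.1, 13.8, NA)
)
missing <- is.na(toy$y)

# Fit the regression on the rows where y is observed
# (lm() drops rows with NA by default).
fit_obs <- lm(y ~ x, data = toy)

# Replace each missing y by its predicted value.
toy_reg <- toy
toy_reg$y[missing] <- predict(fit_obs, newdata = toy[missing, ])

toy_reg
```

All imputed values lie exactly on the fitted regression line, which is why this method adds no new variation to the data.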

9. Again, inspect the completed data and investigate the imputed data regression model.

The completed data:

complete(imp)

##    age      bmi       hyp      chl
## 1    1 28.36021 1.0474831 172.4557
## 2    2 22.70000 1.0000000 187.0000
## 3    1 28.36021 1.0000000 187.0000
## 4    3 22.80609 1.5088506 222.7836
## 5    1 20.40000 1.0000000 113.0000
## 6    3 22.68531 1.5019433 184.0000
## 7    1 22.50000 1.0000000 118.0000
## 8    1 30.10000 1.0000000 187.0000
## 9    2 22.00000 1.0000000 238.0000
## 10   2 27.04536 1.3053438 208.0862
## 11   1 29.82242 1.0746600 182.9223
## 12   2 25.46237 1.2712595 196.7785
## 13   3 21.70000 1.0000000 206.0000
## 14   2 28.70000 2.0000000 204.0000
## 15   1 29.60000 1.0000000 181.6849
## 16   1 25.58231 0.8886142 153.1107
## 17   3 27.20000 2.0000000 284.0000
## 18   2 26.30000 2.0000000 199.0000
## 19   1 35.30000 1.0000000 218.0000
## 20   3 25.50000 2.0000000 239.8485
## 21   1 28.31995 1.0451807 172.1753
## 22   1 33.20000 1.0000000 229.0000
## 23   1 27.50000 1.0000000 131.0000
## 24   3 24.90000 1.0000000 240.5268
## 25   2 27.40000 1.0000000 186.0000

The repeated values we saw under mean imputation are now gone when we impute the conditional mean - i.e. the expectation of bmi given age, hyp and chl. We have now obtained a more natural looking set of imputations: instead of filling in the same bmi for all ages, we now take age (as well as hyp and chl) into account when imputing bmi.

To inspect the regression model with the imputed data, run:

fit <- with(imp, lm(age ~ bmi))
summary(fit)

## # A tibble: 2 x 6
##   term        estimate std.error statistic  p.value  nobs
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl> <int>
## 1 (Intercept)    4.68     1.12        4.20 0.000345    25
## 2 bmi           -0.110    0.0417     -2.64 0.0146      25

It is clear that our inference has changed. In fact, we extrapolated (part of) the regression model for the observed data to missing data in bmi. In other words, the relation (read: information) gets stronger and we’ve obtained more observations that conform exactly to the relation in the observed data. From an inferential statistics viewpoint, this approach would only be valid if we have definitive proof that the unobserved values would exactly conform to the observed data. If this assumption does not hold, norm.predict creates too little variation in our data set and we cannot trust the resulting standard errors and p-values.

10. Impute the missing data in the nhanes dataset with stochastic regression imputation. With stochastic regression imputation, an error term is added to the predicted values, such that the imputations show variation around the regression line. The errors are normally distributed with mean 0 and variance equal to the residual variance.

imp <- mice(nhanes, method = "norm.nob", m = 1, maxit = 1)

## 
##  iter imp variable
##   1   1  bmi  hyp  chl

The imputations are now done. This code imputes the missing values in the data set by the stochastic regression imputation method. The function does not incorporate the variability of the regression weights, so it is not ‘proper’ in the sense of Rubin (1987). For small samples, the variability of the imputed data will be underestimated.
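The principle can again be sketched in base R on made-up toy data. This is a simplified illustration of adding residual noise to the predictions, not the internal implementation of norm.nob; the seed 2024 and all object names are arbitrary:

```r
set.seed(2024)  # arbitrary seed, only to make this sketch reproducible

# Toy data: x is complete, y has three missing values (values are made up).
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  y = c(2.1, 3.9, NA, 8.2, NA, 12.1, 13.8, NA)
)
missing <- is.na(toy$y)

# Regression on the observed rows, as in regression imputation
fit_obs <- lm(y ~ x, data = toy)
predicted <- predict(fit_obs, newdata = toy[missing, ])

# Normally distributed errors with mean 0 and the residual standard deviation
noise <- rnorm(sum(missing), mean = 0, sd = sigma(fit_obs))

# Imputations now vary around the regression line.
toy_stoch <- toy
toy_stoch$y[missing] <- predicted + noise

toy_stoch
```

Note that the regression coefficients themselves are treated as fixed here, which is exactly the limitation described above.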

11. Again, inspect the completed data and investigate the imputed data regression model.

complete(imp)

##    age      bmi       hyp      chl
## 1    1 33.61471 1.0064774 200.0025
## 2    2 22.70000 1.0000000 187.0000
## 3    1 32.36556 1.0000000 187.0000
## 4    3 29.94711 1.2555111 291.6824
## 5    1 20.40000 1.0000000 113.0000
## 6    3 20.02676 1.5286735 184.0000
## 7    1 22.50000 1.0000000 118.0000
## 8    1 30.10000 1.0000000 187.0000
## 9    2 22.00000 1.0000000 238.0000
## 10   2 20.09440 0.9498627 192.3497
## 11   1 32.65078 1.1459837 211.3078
## 12   2 20.12858 1.3255888 148.9863
## 13   3 21.70000 1.0000000 206.0000
## 14   2 28.70000 2.0000000 204.0000
## 15   1 29.60000 1.0000000 210.7834
## 16   1 26.85249 0.7870282 187.5259
## 17   3 27.20000 2.0000000 284.0000
## 18   2 26.30000 2.0000000 199.0000
## 19   1 35.30000 1.0000000 218.0000
## 20   3 25.50000 2.0000000 261.4307
## 21   1 36.35340 1.4367806 230.8058
## 22   1 33.20000 1.0000000 229.0000
## 23   1 27.50000 1.0000000 131.0000
## 24   3 24.90000 1.0000000 228.5297
## 25   2 27.40000 1.0000000 186.0000

We have once more obtained a more natural looking set of imputations, where instead of filling in the same bmi for all ages, we now take age (as well as hyp and chl) into account when imputing bmi. We also add a random error to allow for our imputations to be off the regression line.

To inspect the regression model with the imputed data, run:

fit <- with(imp, lm(age ~ bmi))
summary(fit)

## # A tibble: 2 x 6
##   term        estimate std.error statistic   p.value  nobs
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl> <int>
## 1 (Intercept)   3.91      0.826       4.73 0.0000907    25
## 2 bmi          -0.0793    0.0300     -2.64 0.0145       25

12. Re-run the stochastic imputation model with seed 123 and verify whether your results are the same as the ones below.

## # A tibble: 2 x 6
##   term        estimate std.error statistic   p.value  nobs
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl> <int>
## 1 (Intercept)   3.75      0.736       5.10 0.0000362    25
## 2 bmi          -0.0792    0.0287     -2.77 0.0110       25

The imputation procedure uses random sampling, and therefore, the results will be (perhaps slightly) different if we repeat the imputations. In order to get exactly the same result, you can use the seed argument

imp <- mice(nhanes, method = "norm.nob", m = 1, maxit = 1, seed = 123)
fit <- with(imp, lm(age ~ bmi))
summary(fit)

where 123 is some arbitrary number that you can choose yourself. Re-running this command will always yield the same imputed values. The ability to replicate one’s findings exactly is considered essential in today’s reproducible science.
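A quick way to convince yourself is to run the same call twice with the same seed and once with a different one. In this sketch the object names and the second seed (456) are arbitrary, and printFlag = FALSE only suppresses the iteration history:

```r
library(mice)

imp_a <- mice(nhanes, method = "norm.nob", m = 1, maxit = 1,
              seed = 123, printFlag = FALSE)
imp_b <- mice(nhanes, method = "norm.nob", m = 1, maxit = 1,
              seed = 123, printFlag = FALSE)
imp_c <- mice(nhanes, method = "norm.nob", m = 1, maxit = 1,
              seed = 456, printFlag = FALSE)

# Same seed: the completed data sets agree
all.equal(complete(imp_a), complete(imp_b))

# Different seed: the imputed values differ
identical(complete(imp_a), complete(imp_c))
```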

Multiple imputation

13. Let us impute the missing data in the nhanes dataset. To do multiple imputation, we can simply call mice() on our data set:

imp <- mice(nhanes)

## 
##  iter imp variable
##   1   1  bmi  hyp  chl
##   1   2  bmi  hyp  chl
##   1   3  bmi  hyp  chl
##   1   4  bmi  hyp  chl
##   1   5  bmi  hyp  chl
##   2   1  bmi  hyp  chl
##   2   2  bmi  hyp  chl
##   2   3  bmi  hyp  chl
##   2   4  bmi  hyp  chl
##   2   5  bmi  hyp  chl
##   3   1  bmi  hyp  chl
##   3   2  bmi  hyp  chl
##   3   3  bmi  hyp  chl
##   3   4  bmi  hyp  chl
##   3   5  bmi  hyp  chl
##   4   1  bmi  hyp  chl
##   4   2  bmi  hyp  chl
##   4   3  bmi  hyp  chl
##   4   4  bmi  hyp  chl
##   4   5  bmi  hyp  chl
##   5   1  bmi  hyp  chl
##   5   2  bmi  hyp  chl
##   5   3  bmi  hyp  chl
##   5   4  bmi  hyp  chl
##   5   5  bmi  hyp  chl

The imputations are now done. As you can see, the algorithm ran for 5 iterations (the default) and presented us with 5 imputations for each missing datum. For the rest of this document we will omit printing of the iteration cycle when we run mice. We do so by adding print = FALSE to the mice call.

imp

## Class: mids
## Number of multiple imputations:  5 
## Imputation methods:
##   age   bmi   hyp   chl 
##    "" "pmm" "pmm" "pmm" 
## PredictorMatrix:
##     age bmi hyp chl
## age   0   1   1   1
## bmi   1   0   1   1
## hyp   1   1   0   1
## chl   1   1   1   0

The object imp contains a multiply imputed data set (of class mids). It encapsulates all information from imputing the nhanes dataset, such as the original data, the imputed values, the number of missing values, number of iterations, and so on.
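As a small sketch of a silent call and a first look at the resulting object (the object name imp_quiet and the seed are illustrative; printFlag is the full name of the argument that suppresses the iteration history):

```r
library(mice)

# Multiple imputation without printing the iteration history
imp_quiet <- mice(nhanes, printFlag = FALSE, seed = 123)

class(imp_quiet)       # the multiply imputed data set class
imp_quiet$m            # number of imputations
imp_quiet$iteration    # number of iterations that were run
```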

To obtain an overview of the information stored in the object imp, use the attributes() function:

attributes(imp)

## $names
##  [1] "data"            "imp"             "m"               "where"          
##  [5] "blocks"          "call"            "nmis"            "method"         
##  [9] "predictorMatrix" "visitSequence"   "formulas"        "post"           
## [13] "blots"           "seed"            "iteration"       "lastSeedValue"  
## [17] "chainMean"       "chainVar"        "loggedEvents"    "version"        
## [21] "date"           
## 
## $class
## [1] "mids"

For example, the original data are stored as

imp$data

##    age  bmi hyp chl
## 1    1   NA  NA  NA
## 2    2 22.7   1 187
## 3    1   NA   1 187
## 4    3   NA  NA  NA
## 5    1 20.4   1 113
## 6    3   NA  NA 184
## 7    1 22.5   1 118
## 8    1 30.1   1 187
## 9    2 22.0   1 238
## 10   2   NA  NA  NA
## 11   1   NA  NA  NA
## 12   2   NA  NA  NA
## 13   3 21.7   1 206
## 14   2 28.7   2 204
## 15   1 29.6   1  NA
## 16   1   NA  NA  NA
## 17   3 27.2   2 284
## 18   2 26.3   2 199
## 19   1 35.3   1 218
## 20   3 25.5   2  NA
## 21   1   NA  NA  NA
## 22   1 33.2   1 229
## 23   1 27.5   1 131
## 24   3 24.9   1  NA
## 25   2 27.4   1 186

and the imputations are stored as

imp$imp

## $age
## [1] 1 2 3 4 5
## <0 rows> (or 0-length row.names)
## 
## $bmi
##       1    2    3    4    5
## 1  28.7 28.7 28.7 22.7 29.6
## 3  26.3 27.5 27.2 30.1 22.0
## 4  24.9 27.4 27.5 22.7 25.5
## 6  25.5 22.5 21.7 24.9 27.4
## 10 22.5 22.7 27.4 22.5 20.4
## 11 30.1 22.5 29.6 27.2 21.7
## 12 22.5 27.2 22.0 26.3 22.7
## 16 30.1 28.7 28.7 20.4 35.3
## 21 30.1 27.2 29.6 33.2 35.3
## 
## $hyp
##    1 2 3 4 5
## 1  1 1 1 1 1
## 4  1 2 2 1 1
## 6  1 2 1 1 1
## 10 1 1 1 1 1
## 11 2 1 2 1 1
## 12 1 1 2 1 1
## 16 2 1 1 1 1
## 21 2 1 1 1 1
## 
## $chl
##      1   2   3   4   5
## 1  187 187 187 238 206
## 4  218 218 218 218 229
## 10 238 187 184 238 118
## 11 229 118 229 187 131
## 12 131 204 187 218 118
## 15 131 187 199 199 218
## 16 187 238 187 187 204
## 20 218 218 284 218 206
## 21 186 131 199 184 199
## 24 184 184 184 218 131
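The imputations for a single variable can be accessed in the same way. The sketch below (illustrative object name and seed) shows that each element of imp$imp has one row per missing cell and one column per imputation, and that the default method pmm (predictive mean matching) only imputes values that were actually observed:

```r
library(mice)

imp_demo <- mice(nhanes, printFlag = FALSE, seed = 123)

# One row per missing bmi value, one column per imputation
dim(imp_demo$imp$bmi)

# Row names refer to the rows of nhanes in which bmi is missing
rownames(imp_demo$imp$bmi)

# Number of missing values per variable, as stored in the object
imp_demo$nmis

# Every imputed bmi is a value that occurs among the observed bmi values
all(unlist(imp_demo$imp$bmi) %in% nhanes$bmi)
```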

14. Extract the completed data

By default, mice() calculates five (m = 5) imputed data sets. In order to get the third imputed data set, use the complete() function

c3 <- complete(imp, 3) 
md.pattern(c3)

##  /\     /\
## {  `---'  }
## {  O   O  }
## ==>  V <==  No need for mice. This data set is completely observed.
##  \  \|/  /
##   `-----'

##    age bmi hyp chl  
## 25   1   1   1   1 0
##      0   0   0   0 0

The collection of the \(m\) imputed data sets can be exported by function complete() in long, broad and repeated formats. For example,

c.long <- complete(imp, "long")  
c.long

##     .imp .id age  bmi hyp chl
## 1      1   1   1 28.7   1 187
## 2      1   2   2 22.7   1 187
## 3      1   3   1 26.3   1 187
## 4      1   4   3 24.9   1 218
## 5      1   5   1 20.4   1 113
## 6      1   6   3 25.5   1 184
## 7      1   7   1 22.5   1 118
## 8      1   8   1 30.1   1 187
## 9      1   9   2 22.0   1 238
## 10     1  10   2 22.5   1 238
## 11     1  11   1 30.1   2 229
## 12     1  12   2 22.5   1 131
## 13     1  13   3 21.7   1 206
## 14     1  14   2 28.7   2 204
## 15     1  15   1 29.6   1 131
## 16     1  16   1 30.1   2 187
## 17     1  17   3 27.2   2 284
## 18     1  18   2 26.3   2 199
## 19     1  19   1 35.3   1 218
## 20     1  20   3 25.5   2 218
## 21     1  21   1 30.1   2 186
## 22     1  22   1 33.2   1 229
## 23     1  23   1 27.5   1 131
## 24     1  24   3 24.9   1 184
## 25     1  25   2 27.4   1 186
## 26     2   1   1 28.7   1 187
## 27     2   2   2 22.7   1 187
## 28     2   3   1 27.5   1 187
## 29     2   4   3 27.4   2 218
## 30     2   5   1 20.4   1 113
## 31     2   6   3 22.5   2 184
## 32     2   7   1 22.5   1 118
## 33     2   8   1 30.1   1 187
## 34     2   9   2 22.0   1 238
## 35     2  10   2 22.7   1 187
## 36     2  11   1 22.5   1 118
## 37     2  12   2 27.2   1 204
## 38     2  13   3 21.7   1 206
## 39     2  14   2 28.7   2 204
## 40     2  15   1 29.6   1 187
## 41     2  16   1 28.7   1 238
## 42     2  17   3 27.2   2 284
## 43     2  18   2 26.3   2 199
## 44     2  19   1 35.3   1 218
## 45     2  20   3 25.5   2 218
## 46     2  21   1 27.2   1 131
## 47     2  22   1 33.2   1 229
## 48     2  23   1 27.5   1 131
## 49     2  24   3 24.9   1 184
## 50     2  25   2 27.4   1 186
## 51     3   1   1 28.7   1 187
## 52     3   2   2 22.7   1 187
## 53     3   3   1 27.2   1 187
## 54     3   4   3 27.5   2 218
## 55     3   5   1 20.4   1 113
## 56     3   6   3 21.7   1 184
## 57     3   7   1 22.5   1 118
## 58     3   8   1 30.1   1 187
## 59     3   9   2 22.0   1 238
## 60     3  10   2 27.4   1 184
## 61     3  11   1 29.6   2 229
## 62     3  12   2 22.0   2 187
## 63     3  13   3 21.7   1 206
## 64     3  14   2 28.7   2 204
## 65     3  15   1 29.6   1 199
## 66     3  16   1 28.7   1 187
## 67     3  17   3 27.2   2 284
## 68     3  18   2 26.3   2 199
## 69     3  19   1 35.3   1 218
## 70     3  20   3 25.5   2 284
## 71     3  21   1 29.6   1 199
## 72     3  22   1 33.2   1 229
## 73     3  23   1 27.5   1 131
## 74     3  24   3 24.9   1 184
## 75     3  25   2 27.4   1 186
## 76     4   1   1 22.7   1 238
## 77     4   2   2 22.7   1 187
## 78     4   3   1 30.1   1 187
## 79     4   4   3 22.7   1 218
## 80     4   5   1 20.4   1 113
## 81     4   6   3 24.9   1 184
## 82     4   7   1 22.5   1 118
## 83     4   8   1 30.1   1 187
## 84     4   9   2 22.0   1 238
## 85     4  10   2 22.5   1 238
## 86     4  11   1 27.2   1 187
## 87     4  12   2 26.3   1 218
## 88     4  13   3 21.7   1 206
## 89     4  14   2 28.7   2 204
## 90     4  15   1 29.6   1 199
## 91     4  16   1 20.4   1 187
## 92     4  17   3 27.2   2 284
## 93     4  18   2 26.3   2 199
## 94     4  19   1 35.3   1 218
## 95     4  20   3 25.5   2 218
## 96     4  21   1 33.2   1 184
## 97     4  22   1 33.2   1 229
## 98     4  23   1 27.5   1 131
## 99     4  24   3 24.9   1 218
## 100    4  25   2 27.4   1 186
## 101    5   1   1 29.6   1 206
## 102    5   2   2 22.7   1 187
## 103    5   3   1 22.0   1 187
## 104    5   4   3 25.5   1 229
## 105    5   5   1 20.4   1 113
## 106    5   6   3 27.4   1 184
## 107    5   7   1 22.5   1 118
## 108    5   8   1 30.1   1 187
## 109    5   9   2 22.0   1 238
## 110    5  10   2 20.4   1 118
## 111    5  11   1 21.7   1 131
## 112    5  12   2 22.7   1 118
## 113    5  13   3 21.7   1 206
## 114    5  14   2 28.7   2 204
## 115    5  15   1 29.6   1 218
## 116    5  16   1 35.3   1 204
## 117    5  17   3 27.2   2 284
## 118    5  18   2 26.3   2 199
## 119    5  19   1 35.3   1 218
## 120    5  20   3 25.5   2 206
## 121    5  21   1 35.3   1 199
## 122    5  22   1 33.2   1 229
## 123    5  23   1 27.5   1 131
## 124    5  24   3 24.9   1 131
## 125    5  25   2 27.4   1 186

and

c.broad <- complete(imp, "broad")

## New names:
## * age -> age...1
## * bmi -> bmi...2
## * hyp -> hyp...3
## * chl -> chl...4
## * age -> age...5
## * ...

c.broad

##    age.1 bmi.1 hyp.1 chl.1 age.2 bmi.2 hyp.2 chl.2 age.3 bmi.3 hyp.3 chl.3
## 1      1  28.7     1   187     1  28.7     1   187     1  28.7     1   187
## 2      2  22.7     1   187     2  22.7     1   187     2  22.7     1   187
## 3      1  26.3     1   187     1  27.5     1   187     1  27.2     1   187
## 4      3  24.9     1   218     3  27.4     2   218     3  27.5     2   218
## 5      1  20.4     1   113     1  20.4     1   113     1  20.4     1   113
## 6      3  25.5     1   184     3  22.5     2   184     3  21.7     1   184
## 7      1  22.5     1   118     1  22.5     1   118     1  22.5     1   118
## 8      1  30.1     1   187     1  30.1     1   187     1  30.1     1   187
## 9      2  22.0     1   238     2  22.0     1   238     2  22.0     1   238
## 10     2  22.5     1   238     2  22.7     1   187     2  27.4     1   184
## 11     1  30.1     2   229     1  22.5     1   118     1  29.6     2   229
## 12     2  22.5     1   131     2  27.2     1   204     2  22.0     2   187
## 13     3  21.7     1   206     3  21.7     1   206     3  21.7     1   206
## 14     2  28.7     2   204     2  28.7     2   204     2  28.7     2   204
## 15     1  29.6     1   131     1  29.6     1   187     1  29.6     1   199
## 16     1  30.1     2   187     1  28.7     1   238     1  28.7     1   187
## 17     3  27.2     2   284     3  27.2     2   284     3  27.2     2   284
## 18     2  26.3     2   199     2  26.3     2   199     2  26.3     2   199
## 19     1  35.3     1   218     1  35.3     1   218     1  35.3     1   218
## 20     3  25.5     2   218     3  25.5     2   218     3  25.5     2   284
## 21     1  30.1     2   186     1  27.2     1   131     1  29.6     1   199
## 22     1  33.2     1   229     1  33.2     1   229     1  33.2     1   229
## 23     1  27.5     1   131     1  27.5     1   131     1  27.5     1   131
## 24     3  24.9     1   184     3  24.9     1   184     3  24.9     1   184
## 25     2  27.4     1   186     2  27.4     1   186     2  27.4     1   186
##    age.4 bmi.4 hyp.4 chl.4 age.5 bmi.5 hyp.5 chl.5
## 1      1  22.7     1   238     1  29.6     1   206
## 2      2  22.7     1   187     2  22.7     1   187
## 3      1  30.1     1   187     1  22.0     1   187
## 4      3  22.7     1   218     3  25.5     1   229
## 5      1  20.4     1   113     1  20.4     1   113
## 6      3  24.9     1   184     3  27.4     1   184
## 7      1  22.5     1   118     1  22.5     1   118
## 8      1  30.1     1   187     1  30.1     1   187
## 9      2  22.0     1   238     2  22.0     1   238
## 10     2  22.5     1   238     2  20.4     1   118
## 11     1  27.2     1   187     1  21.7     1   131
## 12     2  26.3     1   218     2  22.7     1   118
## 13     3  21.7     1   206     3  21.7     1   206
## 14     2  28.7     2   204     2  28.7     2   204
## 15     1  29.6     1   199     1  29.6     1   218
## 16     1  20.4     1   187     1  35.3     1   204
## 17     3  27.2     2   284     3  27.2     2   284
## 18     2  26.3     2   199     2  26.3     2   199
## 19     1  35.3     1   218     1  35.3     1   218
## 20     3  25.5     2   218     3  25.5     2   206
## 21     1  33.2     1   184     1  35.3     1   199
## 22     1  33.2     1   229     1  33.2     1   229
## 23     1  27.5     1   131     1  27.5     1   131
## 24     3  24.9     1   218     3  24.9     1   131
## 25     2  27.4     1   186     2  27.4     1   186

are completed data sets in long and broad format, respectively. See ?complete for more detail.
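The long format is convenient for quick comparisons across imputations, because the .imp column identifies the imputed data set each row belongs to. A minimal sketch (illustrative object names and seed) that computes the mean bmi per imputed data set:

```r
library(mice)

imp_demo <- mice(nhanes, printFlag = FALSE, seed = 123)
long <- complete(imp_demo, "long")

# 5 imputed data sets of 25 rows each, stacked on top of each other
nrow(long)

# Mean bmi within each imputed data set
bmi_means <- tapply(long$bmi, long$.imp, mean)
bmi_means
```

The differences between these means reflect the uncertainty about the missing values, which is what multiple imputation is designed to carry into the analysis.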

Conclusion

We have seen that (multiple) imputation is straightforward with mice. However, don’t let the simplicity of the software fool you into thinking that the problem itself is also straightforward. In the next exercise we will therefore explore how the mice package can flexibly provide us the tools to assess and control the imputation of missing data.

References

Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley & Sons.

Schafer, J.L. (1997). Analysis of Incomplete Multivariate Data. London: Chapman & Hall. Table 6.14.

Van Buuren, S. and Groothuis-Oudshoorn, K. (2011). mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software, 45(3), 1-67.

- End of exercise

Ad hoc methods and `mice`

Gerko Vink and Stef van Buuren

Multiple Imputation in Practice
