mice
: Passive imputation and Post-processingThis is the fourth exercise in the series.
Exact replication of the imputations and figures in this exercise depends on mice version > 3.2.1
. The latest version of mice
can be obtained from:
devtools::install_github("stefvanbuuren/mice")
If you have an earlier mice
version, the imputations and resulting plots may slightly differ. The overall train of thought for this document for all mice
versions is equivalent.
In this exercise we will walk you through the more advanced features of mice
, such as post-processing of imputations and passive imputation.
All the best,
1. Open R
and load the packages mice
and dplyr
.
library(mice) # Data imputation
library(dplyr) # Data manipulation
library(lattice) # Plotting device
set.seed(123)
We choose seed value 123
. This is an arbitrary value; any value would be an equally good seed value. Fixing the random seed enables you (and others) to exactly replicate anything that involves random number generators. If you set the seed in your R
instance to 123
, you will get the exact same results and plots as we present in this document if you follow the order and code of the exercises precisely.
There is often a need for transformed, combined or recoded versions of the data. In the case of incomplete data, one could impute the original, and transform the completed original afterwards, or transform the incomplete original and impute the transformed version. If, however, both the original and the transformed version are needed within the imputation algorithm, neither of these approaches work: One cannot be sure that the transformation holds between the imputed values of the original and transformed versions. mice
has a built-in approach, called passive imputation, to deal with situations as described above. The goal of passive imputation is to maintain the consistency among different transformations of the same data. As an example, consider the following deterministic function in the boys
data \[\text{BMI} = \frac{\text{Weight (kg)}}{\text{Height}^2 \text{(m)}}\] or the compositional relation in the mammalsleep data: \[\text{ts} = \text{ps}+\text{sws}\]
2. Use passive imputation to impute the deterministic sleep relation in the mammalsleep
data. Name the new multiply imputed dataset pas.imp
.
First, we create a method
vector:
meth <- make.method(mammalsleep)
meth
## species bw brw sws ps ts mls gt pi sei
## "" "" "" "pmm" "pmm" "pmm" "pmm" "pmm" "" ""
## odi
## ""
and a predictorMatrix
:
pred <- make.predictorMatrix(mammalsleep)
pred
## species bw brw sws ps ts mls gt pi sei odi
## species 0 1 1 1 1 1 1 1 1 1 1
## bw 1 0 1 1 1 1 1 1 1 1 1
## brw 1 1 0 1 1 1 1 1 1 1 1
## sws 1 1 1 0 1 1 1 1 1 1 1
## ps 1 1 1 1 0 1 1 1 1 1 1
## ts 1 1 1 1 1 0 1 1 1 1 1
## mls 1 1 1 1 1 1 0 1 1 1 1
## gt 1 1 1 1 1 1 1 0 1 1 1
## pi 1 1 1 1 1 1 1 1 0 1 1
## sei 1 1 1 1 1 1 1 1 1 0 1
## odi 1 1 1 1 1 1 1 1 1 1 0
We add the call for passive imputation to the ts
element in the meth
object
meth["ts"]<- "~ I(sws + ps)"
meth
## species bw brw sws ps
## "" "" "" "pmm" "pmm"
## ts mls gt pi sei
## "~ I(sws + ps)" "pmm" "pmm" "" ""
## odi
## ""
and set the predictor relations for ts
with sws
and ps
to 0
. Also, we have to exclude Species
as a predictor
pred[c("sws", "ps"), "ts"] <- 0
pred[, "species"] <- 0
pred
## species bw brw sws ps ts mls gt pi sei odi
## species 0 1 1 1 1 1 1 1 1 1 1
## bw 0 0 1 1 1 1 1 1 1 1 1
## brw 0 1 0 1 1 1 1 1 1 1 1
## sws 0 1 1 0 1 0 1 1 1 1 1
## ps 0 1 1 1 0 0 1 1 1 1 1
## ts 0 1 1 1 1 0 1 1 1 1 1
## mls 0 1 1 1 1 1 0 1 1 1 1
## gt 0 1 1 1 1 1 1 0 1 1 1
## pi 0 1 1 1 1 1 1 1 0 1 1
## sei 0 1 1 1 1 1 1 1 1 0 1
## odi 0 1 1 1 1 1 1 1 1 1 0
This avoids circularity problems where ts
would feed back into sws
and ps
, from which it is calculated:
We can then run the imputations as
pas.imp <- mice(mammalsleep,
meth = meth,
pred = pred,
maxit = 10,
seed = 123,
print = F)
We used a custom predictor matrix and method vector to tailor our imputation approach to the passive imputation problem. We made sure to exclude ts
as a predictor for the imputation of sws
and ps
to avoid circularity.
We also gave the imputation algorithm 10 iterations to converge and fixed the seed to 123
for this mice
instance. This means that even when people do not fix the overall R
seed for a session, exact replication of results can be obtained by simply fixing the seed
for the random number generator within mice
. Naturally, the same input (data) is each time required to yield the same output (mids
-object).
3. Inspect the trace lines for pas.imp
.
plot(pas.imp)
We can see that the pathological nonconvergence we experienced before has been properly dealt with. The trace lines for the sleep variable look okay now and convergence can be inferred by studying the trace lines.
Remember that we imputed the boys
data previously with pmm
and with norm
, where we saw that pmm
preserves the observed data range and norm
allows us to extend outside of the range of the observed data. If we were to impute e.g. tv
with norm
the problem arises that there may be negative values among the imputations. Somehow we should be able to
tv
.The mice()
function has an argument called post
that takes a vector of strings of R
commands. These commands are parsed and evaluated after the univariate imputation function returns, and thus provides a way of post-processing the imputed values while using the processed version in the imputation algorithm. In other words; the post-processing allows us to manipulate the imputations for a particular variable that are generated within each iteration. Such manipulations directly affect the imputated values of that variable and the imputations for other variables. Naturally, such a procedure should be handled with care.
4. Post-process the values to constrain them between 1 and 25, use norm
as the imputation method for tv
.
meth <- make.method(boys)
meth["tv"] <- "norm"
post <- make.post(boys)
post["tv"] <- "imp[[j]][, i] <- squeeze(imp[[j]][, i], c(1, 25))"
imp <- mice(boys,
meth = meth,
post = post,
print = FALSE)
In this way the imputed values of tv
are constrained (squeezed by function squeeze()
) between 1 and 25.
5. Compare the imputed values and histograms of tv
for the solution obtained by pmm
to the constrained solution (created with norm
, constrained between 1 and 25).
First, we recreate the default pmm
solution
imp.pmm <- mice(boys, print=FALSE)
and we inspect the imputed values for the norm
solution
table(complete(imp)$tv)
##
## 1 1.01856398578212 1.0305265041997 1.05371730709378
## 274 1 1 1
## 1.07965881459222 1.17266016197262 1.18331918874243 1.19850340482835
## 1 1 1 1
## 1.20751363430595 1.22270349091234 1.24393898381574 1.27416237305213
## 1 1 1 1
## 1.37264069156637 1.37687168658768 1.4038468461377 1.4888922915118
## 1 1 1 1
## 1.502948063712 1.61094129141498 1.63561475531001 1.69154151039902
## 1 1 1 1
## 1.69734085647902 1.75962760046502 1.83703239510597 1.85090932372616
## 1 1 1 1
## 1.85196792529081 1.88497590530004 1.93677698925981 1.98714218266051
## 1 1 1 1
## 2 2.17006035212886 2.19420629366948 2.20709019333567
## 26 1 1 1
## 2.32217300175937 2.33718756021663 2.37925919724656 2.42589950587828
## 1 1 1 1
## 2.42852062650941 2.56258286438301 2.57154776173857 2.69125369721839
## 1 1 1 1
## 2.81422946755381 2.849182607152 2.85721219486192 2.86802164234102
## 1 1 1 1
## 2.9050436581017 2.94455194340074 2.99053395697965 3
## 1 1 1 19
## 3.09186572494736 3.09293831549814 3.12759115294094 3.137769304381
## 1 1 1 1
## 3.29169176719331 3.34327503822682 3.35169753351213 3.36118943119167
## 1 1 1 1
## 3.42894421492264 3.64409914804099 3.7342091243475 3.73449420271883
## 1 1 1 1
## 3.83351487625141 3.94898811919176 4 4.18580253805296
## 1 1 17 1
## 4.50814821532778 4.5899062419111 4.59879652563211 4.80896223356025
## 1 1 1 1
## 4.86693622505292 4.90707701009326 4.92083322380665 4.9919404935006
## 1 1 1 1
## 5 5.11550701472057 5.2675575534524 5.30699364355918
## 5 1 1 1
## 5.48138217837872 5.9411679485539 6 6.13652148511074
## 1 1 10 1
## 6.13894103637924 6.1565094226148 6.25483968121638 6.36617952529365
## 1 1 1 1
## 6.40319326867511 6.42478740352842 6.55030483069993 6.67210208568764
## 1 1 1 1
## 6.71487285512647 6.83897647017751 6.93446621600148 7.08357417630785
## 1 1 1 1
## 7.09178439681213 7.30234534974404 7.68154460182069 7.73437101947725
## 1 1 1 1
## 7.80770238610074 7.89989442429735 7.92324157979704 8
## 1 1 1 13
## 8.17602314803671 8.19348000001015 8.28049913219575 8.32280884879302
## 1 1 1 1
## 8.48903749609463 8.56769013301768 8.83864037628557 8.95992325272691
## 1 1 1 1
## 9 9.04477390963831 9.06982897383724 9.16417756212673
## 1 1 1 1
## 9.22417249742369 9.33358098899267 9.36992148086693 9.3862952342802
## 1 1 1 1
## 9.49848143941098 9.52973463446122 9.64297608115544 9.68522718414533
## 1 1 1 1
## 9.72561963180977 10 10.1015146146315 10.1969672444551
## 1 16 1 1
## 10.2613845378965 10.3639992156231 10.4322244960718 10.8757137316089
## 1 1 1 1
## 11.0985138246362 11.1352224839287 11.3599999270311 11.4990864477122
## 1 1 1 1
## 11.8090031513037 11.8742443878141 12 12.0259908745644
## 1 1 15 1
## 12.0290193781978 12.1423054413162 12.1509343718361 12.1560181173864
## 1 1 1 1
## 12.2007201444469 12.2963796526214 12.4212404510045 12.5885772833468
## 1 1 1 1
## 12.5999948122602 12.8650017360119 12.9359546851943 13
## 1 1 1 1
## 13.0084110316048 13.412960283899 13.4165985714268 13.5794619759477
## 1 1 1 1
## 13.8630567814608 13.8860429760291 13.9683922516302 14
## 1 1 1 1
## 14.3621064783652 14.4509003464646 14.6622055659847 14.7174313478183
## 1 1 1 1
## 14.8430588927294 14.8648650187386 14.8852811047132 14.9654799894298
## 1 1 1 1
## 14.9873182113118 15 15.0142846452428 15.2079680604487
## 1 27 1 1
## 15.5365929876549 15.5510774742241 15.7122378270619 15.7512214068479
## 1 1 1 1
## 15.7602201466366 16 16.0276057729029 16.1422595831028
## 1 1 1 1
## 16.2602970074183 16.3532832122099 16.5894669994605 16.7422615082676
## 1 1 1 1
## 16.7678748258145 16.9875879043095 17 17.1056430929331
## 1 1 1 1
## 17.1208203940252 17.1692258100321 17.3020554780255 17.390371459762
## 1 1 1 1
## 17.5071899885911 17.5632721921984 17.5635369602171 17.7034937492546
## 1 1 1 1
## 17.7362651302672 17.7423061432335 17.8199124977756 17.872015418962
## 1 1 1 1
## 17.9014749677662 17.9439322565708 17.9822469567094 17.9927913403406
## 1 1 1 1
## 18 18.0779873286959 18.1249421079672 18.1295138365517
## 1 1 1 1
## 18.2552503507931 18.3353828727134 18.7178960787053 18.7462985774108
## 1 1 1 1
## 18.9901829270432 19.1962126352428 19.3746688212472 19.4997160572489
## 1 1 1 1
## 19.913785710524 19.9509336126402 19.9611334615165 20
## 1 1 1 38
## 20.0135656758663 20.0255142674244 20.0973750863173 20.1340851087617
## 1 1 1 1
## 20.1344697493209 20.1745129107608 20.2379726133558 20.3995806508845
## 1 1 1 1
## 20.5168922378414 20.5401931227474 20.5513501437093 20.6685882917064
## 1 1 1 1
## 20.7232027963561 20.8210450804746 20.8671659772625 21.0244601710077
## 1 1 1 1
## 21.0392416860597 21.0537147951292 21.2401873005436 21.2948878442176
## 1 1 1 1
## 21.4539529970948 21.4608358275169 21.6328642529837 21.8963277584571
## 1 1 1 1
## 21.9752362407994 22.0466545246583 22.3478714189136 22.3830985461179
## 1 1 1 1
## 22.6548184951425 22.7147244193324 22.7629694736002 22.8549824442405
## 1 1 1 1
## 22.9496558563585 23.6775419376966 23.8326554734164 24.0841386441071
## 1 1 1 1
## 24.8989082843077 24.9661702218352 25
## 1 1 45
and for the pmm
solution
table(complete(imp.pmm)$tv)
##
## 1 2 3 4 5 6 8 9 10 12 13 14 15 16 17 18 20 25
## 67 236 88 28 6 17 22 1 30 28 4 3 66 1 4 2 72 73
It is clear that the norm solution - as expected - does not give us integer data as imputations. Next, we inspect and compare the density of the incomplete and imputed data for the constrained solution.
densityplot(imp, ~tv)
A nice way of plotting the histograms of both datasets simultaneously is by creating first the dataframe (here we named it tvm
) that contains the data in one column and the imputation method in another column.
tv <- c(complete(imp.pmm)$tv, complete(imp)$tv)
used.method <- rep(c("pmm", "norm"), each = nrow(boys))
tvm <- data.frame(tv = tv, method = used.method)
and then plotting a histogram of tv
conditional on method.
histogram( ~tv | method, data = tvm, nint = 25)
Is there still a difference in distribution between the two different imputation methods? Which imputations are more plausible do you think?
6. Make a missing data indicator (name it miss
) for bmi
and check the relation of bmi
, wgt
and hgt
for the boys in the imputed data. To do so, plot the imputed values against their respective calculated values.
miss <- is.na(imp$data$bmi)
xyplot(imp, bmi ~ I (wgt / (hgt / 100)^2),
na.groups = miss, cex = c(0.8, 1.2), pch = c(1, 20),
ylab = "BMI (kg/m2) Imputed", xlab = "BMI (kg/m2) Calculated")
With this plot we show that the relation between hgt
, wgt
and bmi
is not preserved in the imputed values. In order to preserve this relation, we should use passive imputation.
7. Use passive imputation to preserve the relation between imputed bmi
, wgt
and hgt
by setting the imputation method for bmi
to:
meth["bmi"]<- "~ I(wgt / (hgt / 100)^2)"
but do not solve for circularity.
meth <- make.method(boys)
meth["bmi"] <- "~ I(wgt / (hgt / 100)^2)"
imp <- mice(boys,
meth = meth,
print=FALSE)
8. Again, plot the imputed values of bmi
versus the calculated values and check whether there is convergence for bmi
.
To inspect the relation:
xyplot(imp, bmi ~ I(wgt / (hgt / 100)^2), na.groups = miss,
cex = c(1, 1), pch = c(1, 20),
ylab = "BMI (kg/m2) Imputed", xlab = "BMI (kg/m2) Calculated")
To study convergence for bmi
alone:
plot(imp, c("bmi"))
Although the relation of bmi
is preserved now in the imputations we get absurd imputations and on top of that we clearly see there are some problems with the convergence of bmi
. The problem is that we have circularity in the imputations. We used passive imputation for bmi
but bmi
is also automatically used as predictor for wgt
and hgt
. This can be solved by adjusting the predictor matrix.
9. Solve the problem of circularity (if you did not already do so) and plot once again the imputed values of bmi versus the calculated values.
First, we remove bmi
as a predictor for hgt
and wgt
to remove circularity.
pred <- make.predictorMatrix(boys)
pred[c("hgt", "wgt"), "bmi"] <- 0
pred
## age hgt wgt bmi hc gen phb tv reg
## age 0 1 1 1 1 1 1 1 1
## hgt 1 0 1 0 1 1 1 1 1
## wgt 1 1 0 0 1 1 1 1 1
## bmi 1 1 1 0 1 1 1 1 1
## hc 1 1 1 1 0 1 1 1 1
## gen 1 1 1 1 1 0 1 1 1
## phb 1 1 1 1 1 1 0 1 1
## tv 1 1 1 1 1 1 1 0 1
## reg 1 1 1 1 1 1 1 1 0
and we run the mice
algorithm again with the new predictor matrix (we still ‘borrow’ the imputation methods object meth
from before)
imp <-mice(boys,
meth = meth,
pred = pred,
print = FALSE)
Second, we recreate the plots from Assignment 8. We start with the plot to inspect the relations in the observed and imputed data
xyplot(imp, bmi ~ I(wgt / (hgt / 100)^2), na.groups = miss,
cex=c(1, 1), pch=c(1, 20),
ylab="BMI (kg/m2) Imputed", xlab="BMI (kg/m2) Calculated")
and continue with the trace plot to study convergence
plot(imp, c("bmi"))
All is well now!
We have seen that the practical execution of multiple imputation and pooling is straightforward with the R
package mice
. The package is designed to allow you to assess and control the imputations themselves, the convergence of the algorithm and the distributions and multivariate relations of the observed and imputed data.
It is important to ‘gain’ this control as a user. After all, we are imputing values and taking their uncertainty properly into account. Being also uncertain about the process that generated those values is therefor not a valid option.
Never set all relations fixed. You will remain with the starting values and waste your computer’s energy (and your own).
meth<- make.method(boys)
pred <- make.predictorMatrix(boys)
meth["bmi"]<- "~ I(wgt / (hgt / 100)^2)"
meth["wgt"]<- "~ I(bmi * (hgt / 100)^2)"
meth["hgt"]<- "~ I(sqrt(wgt / bmi) * 100)"
pred[c("bmi", "wgt", "hgt"), ] <- 0
imp.path <- mice(boys,
meth=meth,
pred=pred,
seed=123)
##
## iter imp variable
## 1 1 hgt wgt bmi hc gen phb tv reg
## 1 2 hgt wgt bmi hc gen phb tv reg
## 1 3 hgt wgt bmi hc gen phb tv reg
## 1 4 hgt wgt bmi hc gen phb tv reg
## 1 5 hgt wgt bmi hc gen phb tv reg
## 2 1 hgt wgt bmi hc gen phb tv reg
## 2 2 hgt wgt bmi hc gen phb tv reg
## 2 3 hgt wgt bmi hc gen phb tv reg
## 2 4 hgt wgt bmi hc gen phb tv reg
## 2 5 hgt wgt bmi hc gen phb tv reg
## 3 1 hgt wgt bmi hc gen phb tv reg
## 3 2 hgt wgt bmi hc gen phb tv reg
## 3 3 hgt wgt bmi hc gen phb tv reg
## 3 4 hgt wgt bmi hc gen phb tv reg
## 3 5 hgt wgt bmi hc gen phb tv reg
## 4 1 hgt wgt bmi hc gen phb tv reg
## 4 2 hgt wgt bmi hc gen phb tv reg
## 4 3 hgt wgt bmi hc gen phb tv reg
## 4 4 hgt wgt bmi hc gen phb tv reg
## 4 5 hgt wgt bmi hc gen phb tv reg
## 5 1 hgt wgt bmi hc gen phb tv reg
## 5 2 hgt wgt bmi hc gen phb tv reg
## 5 3 hgt wgt bmi hc gen phb tv reg
## 5 4 hgt wgt bmi hc gen phb tv reg
## 5 5 hgt wgt bmi hc gen phb tv reg
plot(imp.path, c("hgt", "wgt", "bmi"))
We named the mids
object imp.path
, because the nonconvergence is pathological in this example!
- End of exercise