[Stata] Multiple Imputation by Chained Equations (MICE)
Handling missing data is an essential part of any data analysis. Multiple imputation is a robust method to address this issue, filling in missing values multiple times to create several complete datasets. Stata provides a streamlined approach to conduct multiple imputation.
What is Multiple Imputation?
Multiple imputation (MI) is a way to deal with missing data in a dataset that may affect the validity and accuracy of statistical analyses. MI involves creating several possible values for the missing data, based on the observed data, and then combining the results from each imputed dataset to obtain a final estimate.
For example, suppose you have a dataset of 100 students’ test scores, but 10 of them are missing. You could use MI to generate 5 different values for each missing score, based on the distribution and correlation of the observed scores. Then you could perform a t-test or ANOVA on each of the 5 imputed datasets, and pool the results to get a single p-value and effect size.
MI can be used for both categorical and continuous variables, as long as the imputation method is appropriate for the type and level of measurement of the variable. For example, you could use logistic regression to impute binary variables or multinomial logistic regression to impute nominal variables.
One of the most popular methods for MI is MICE (Multiple Imputation by Chained Equations), which is an iterative algorithm that imputes each variable in turn, using the other variables as predictors. MICE can handle different types of variables, complex relationships, and interactions among variables. MICE also allows for different imputation models for different variables, such as linear regression, logistic regression, or predictive mean matching.
MICE works by following these steps: (reference here)
- Initialize the missing values with random draws from the observed values of the same variable.
- For each variable with missing values, perform the following steps:
  - Delete the imputed values of that variable, leaving only the observed values.
  - Regress that variable on the other variables in the dataset, using an appropriate model.
  - Draw new imputed values from the posterior predictive distribution of that model.
- Repeat these steps until convergence is reached, or a maximum number of iterations is reached.
- Store the imputed dataset and repeat the whole process to create multiple imputed datasets.
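In Stata this loop runs inside mi impute chained, so you do not code it by hand. As a rough sketch of how the steps map onto options: burnin() sets how many cycles through the variables each chain runs before the imputed values are kept (the default is 10), and add() sets how many imputed datasets are stored. The example assumes the nhanes2 data from the Stata website (internet access required) and that vitaminc and porphyrn have missing values, as in the sample code at the end of this post. The predictors, the burn-in length, and the seed are illustrative choices only.

```stata
* Illustrative sketch: dataset, variables, burn-in length, and seed are assumptions
webuse nhanes2, clear
mi set flong
mi register imputed vitaminc porphyrn

* burnin(20): cycle through the imputed variables 20 times before keeping values
* add(5)    : repeat the whole process to store 5 imputed datasets
mi impute chained (regress) vitaminc porphyrn = bmi age, ///
    burnin(20) add(5) rseed(12345)
```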
In this post, I will show you how to do multiple imputation in Stata using the mi command. I will use a hypothetical data set with four variables: var1 (binary), var2 (ordinal), var3 (categorical), and var4 (continuous). These variables have some missing values, and I want to impute them using multiple imputation methods.
Step 1. Identify variables to impute and their level of measurement
The first step is to identify which variables have missing values and what their level of measurement is. We can use the misstable summarize command to get a summary of the missing values in our data set. For example, to check the missing values for var1, var2, and var3, we can type:
misstable summarize var1 var2 var3 // put the list of variables in your model
This reports, for each variable with missing values, how many observations are missing and how many are observed.
We can also use the codebook command to get more information about the variables, such as their labels, values, and types. For example, to get the codebook for var1, var2, and var3, we can type:
codebook var1 var2 var3 // put the list of variables to impute
From this output, you can work out the level of measurement of each variable. For example, in the nhanes2 sample data used at the end of this post, hlthstat is an ordinal variable (Likert scale), and diabetes is a binary variable (yes/no).
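As a small sketch of this step on real data, the commands below inspect the variables imputed in the sample code at the end of this post. It assumes the nhanes2 data from the Stata website (internet access required). The recode mirrors the sample code, which treats the value 8 of hlthstat as missing.

```stata
* Illustrative sketch using the nhanes2 example data (an assumption, not required)
webuse nhanes2, clear
recode hlthstat (8 = .)          // as in the sample code: treat code 8 as missing

* How many values are missing, and in which combinations?
misstable summarize hlthstat vitaminc porphyrn fhtatk
misstable patterns  hlthstat vitaminc porphyrn fhtatk, frequency

* Check the categories of a variable to judge its level of measurement
tabulate hlthstat, missing
```

misstable patterns shows which variables tend to be missing together, which helps when deciding what to include in the imputation model.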
Step 2. Register variables to impute
The next step is to register the variables that we want to impute using the mi set and mi register commands. The mi set command tells Stata that we are going to use multiple imputation, and the mi register command tells Stata which variables are imputed and which are not.
There are four ways to store the imputed data in Stata: wide, mlong, flong, and flongsep.
For general use with moderately sized data, flong is a commonly used style (see this post).
Then, we can register the variables that we want to impute using the mi register command. We need to specify the keyword imputed before the list of variables to impute. For example, to register var1, var2, var3, and var4 as imputed variables, we can type:
mi set flong
mi register imputed var1 var2 var3 var4 // put the list of variables to impute
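A short sketch of the registration step on the nhanes2 example data (an assumption; internet access required). Registering the complete variables as regular is optional, and mi describe is a quick way to confirm what Stata has recorded. The choice of which variables to register as regular is illustrative.

```stata
* Illustrative sketch: dataset and variable lists are assumptions
webuse nhanes2, clear
recode hlthstat (8 = .)          // as in the sample code: treat code 8 as missing

mi set flong
mi register imputed hlthstat vitaminc porphyrn fhtatk
mi register regular female agegrp bmi race   // optional: variables that are not imputed

* Show the mi style, the registered variables, and the number of imputations so far
mi describe
```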
Step 3. Implement Multiple Imputation Methods
The next step is to implement the multiple imputation methods using the mi impute command. The mi impute command allows us to use different methods for different variables, depending on their level of measurement and distribution. The most common methods are:
- logit for binary variables
- ologit for ordinal variables
- mlogit for categorical variables
- reg for continuous variables
- truncreg for continuous variables with truncation
To impute variables with different levels of measurement in one model, select the chained method. The mvn method only accepts continuous variables, which means that you need to convert all categorical variables into binary (dummy) variables and cannot mix different variable types.
We need to specify each method in parentheses, followed by the name of the variable (or variables) to impute with that method. We can also specify some options, such as the number of imputations (add), the random seed (rseed), and the name of the trace data set (savetrace).
For example, to impute var1 (binary) using logit, var2 (ordinal) using ologit, var3 (categorical) using mlogit, and var4 (continuous) using truncreg with a lower limit of 0 and an upper limit of 6, we can type:
mi impute chained (logit) var1 (ologit) var2 (mlogit) var3 (truncreg, ll(0) ul(6)) var4, add(10) rseed(53421) savetrace(trace1, replace)
This will create 10 imputed data sets, using a random seed of 53421, and save the trace data set as trace1. The trace data set contains the mean and standard deviation of the imputed values of each variable at each iteration of each imputation, which is useful for checking convergence.
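The sketch below shows two things on the nhanes2 example data (an assumption; internet access required): adding complete predictors to the imputation model after an equals sign, and plotting the trace file to eyeball convergence. The predictors (bmi, age, female) are an illustrative choice. The reshape step assumes the trace file holds the variables iter and m plus one _mean and one _sd variable per imputed variable, so run describe on your own trace file to confirm before relying on it.

```stata
* Illustrative sketch: dataset, predictors, and trace-file layout are assumptions
webuse nhanes2, clear
recode hlthstat (8 = .)
mi set flong
mi register imputed hlthstat vitaminc porphyrn

* Complete predictors go to the right of the equals sign
mi impute chained (ologit) hlthstat (regress) vitaminc porphyrn ///
    = bmi age i.female, add(10) rseed(53421) savetrace(trace1, replace)

* Plot the mean of the imputed vitaminc values by iteration, one line per imputation
preserve
use trace1, clear
describe                                   // confirm the variable names first
reshape wide *mean *sd, i(iter) j(m)
tsset iter
tsline vitaminc_mean*, title("Mean of imputed vitaminc") legend(off)
restore
```

Lines that wander without a trend suggest the chains have settled. A clear upward or downward drift suggests a longer burnin() is worth trying.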
Step 4. Regressions with multiply imputed data
One of the most common analyses that we may want to do after imputing the missing values is to run a linear regression using the imputed data. We can use the mi estimate command to apply the regress command to the imputed data, and get the regression coefficients, standard errors, t-statistics, and p-values for each variable.
mi estimate: reg depvar indepvar1 indepvar2
By default, this displays the pooled estimates across the imputations rather than the separate results from each imputed data set. The pooled estimates are obtained by combining the estimates and standard errors from each imputation using Rubin’s rules. The output also reports degrees of freedom that are adjusted to account for the uncertainty due to imputation.
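For reference, Rubin’s rules for a single coefficient can be written as follows. The notation is chosen here for illustration: M is the number of imputations, \hat{Q}_m is the estimate from imputed data set m, and U_m is its squared standard error.

```latex
\bar{Q} = \frac{1}{M}\sum_{m=1}^{M}\hat{Q}_m, \qquad
\bar{U} = \frac{1}{M}\sum_{m=1}^{M} U_m, \qquad
B = \frac{1}{M-1}\sum_{m=1}^{M}\bigl(\hat{Q}_m-\bar{Q}\bigr)^2, \qquad
T = \bar{U} + \Bigl(1+\frac{1}{M}\Bigr)B
```

The pooled estimate is \bar{Q} and its standard error is \sqrt{T}. The between-imputation variance B is the part that reflects the uncertainty caused by the missing data.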
Tip. Estimating VIF for multiply imputed data
To estimate the variance inflation factor (VIF) and check for multicollinearity, you can install the user-written command mivif and then run it right after the mi estimate: reg command.
ssc install mivif
mi estimate: reg depvar indepvar1 indepvar2
mivif
Tip. Estimating the number of imputations needed
You can also calculate the number of imputations needed for your model based on von Hippel’s (2020) approach, using the how_many_imputations command.
🔎 Von Hippel, P. T. (2020). How many imputations do you need? A two-stage calculation using a quadratic rule. Sociological Methods & Research, 49(3), 699-718.
ssc install how_many_imputations // install for the first time
mi estimate: reg depvar indepvar1 indepvar2
how_many_imputations
This will return the number of imputations you have already run and the minimum number of imputations needed. In this case, because I imputed more than needed, no additional imputations are required.
Step 5. Estimating summary statistics
After imputing the missing values, we may want to estimate some summary statistics for the imputed variables, such as the mean, standard deviation, minimum, and maximum. However, the usual commands for summary statistics, such as summarize and tabulate, do not pool results across the imputed data sets.
We can use mi estimate: proportion or mi estimate: mean to get pooled summary statistics for multiply imputed data.
mi estimate: proportion varname // for categorical, binary variables
mi estimate: mean varname // for continuous variables.
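If you want ordinary descriptive statistics within individual data sets (for example, to compare the observed data with one imputed data set), mi xeq runs a command on the imputations you list, where 0 refers to the original data. The sketch below assumes the nhanes2 example data (internet access required). The predictors and the number of imputations are illustrative only. Results from mi xeq are not pooled.

```stata
* Illustrative sketch: dataset, predictors, and seed are assumptions
webuse nhanes2, clear
mi set flong
mi register imputed vitaminc porphyrn
mi impute chained (regress) vitaminc porphyrn = bmi age, add(5) rseed(53421)

mi xeq 0: summarize vitaminc porphyrn      // original data (observed values only)
mi xeq 1 2: summarize vitaminc porphyrn    // first two imputed data sets, one at a time
mi estimate: mean vitaminc porphyrn        // pooled means across all imputations
```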
Tip. Estimate R-squared with beta coefficients for multiple imputation regressions
As shown in Step 4, we can use the mi estimate command to apply regression commands such as regress, logit, or ologit to the imputed data.
However, the mi estimate command does not report the R-squared for the regression models, which is a common measure of goodness of fit. To get the R-squared for multiple imputation regressions, we need to use another user-written command called mibeta, which can be located using the search command. For example, to search for mibeta, we can type:
search mibeta
This will give us a link to the mibeta command, from which it can be downloaded and installed. Then, we can use the mibeta command to get the R-squared for a linear regression model, as well as the standardized beta coefficients. The fisherz option combines the R-squared values across imputations using Fisher’s z transformation. For example, to get the R-squared for a regression of depvar on indepvar1 and indepvar2, we can type:
mibeta depvar indepvar1 indepvar2, fisherz
This will give us the R-squared, adjusted R-squared, and the standardized beta coefficients.
❤️🔥Learn more: How can I estimate R-squared for a model estimated with multiply imputed data? | Stata FAQ (ucla.edu)
Step 6. Saving results with outreg2
You can also save the results of regressions with multiple imputations by using the outreg2 command. To save the results along with the R-squared from mibeta, you can use the following commands.
mibeta depvar indepvar1 indepvar2, fisherz
*Saving R2
local r2mi= e(r2_mi)
*ADJ. R2
local ar2mi= e(r2_adj_mi)
mi estimate, dots post: reg depvar indepvar1 indepvar2
*Output using outreg2
outreg2 using regression.xls, replace excel dec(2) alpha(0.001, 0.01, 0.05) addstat(dfModel, e(df_m_mi), dfError, e(df_r_mi), F, e(F_mi), R-squared, `r2mi', Adjusted R-squared, `ar2mi')
❤️🔥Learn more: mibeta command and statistical output – Statalist
Sample code with sample data
webuse nhanes2, clear
** DV: hlthstat
** IV: female agegrp bmi race vitaminc porphyrn fhtatk
* Step 1. Identify variables to impute and their level of measurement
misstable summarize hlthstat female agegrp bmi race vitaminc porphyrn fhtatk // put the list of variables in your model
codebook hlthstat female agegrp bmi race vitaminc porphyrn fhtatk
recode hlthstat (8=.)
// hlthstat: ordinal (likert)
// vitaminc: continuous
// porphyrn: continuous
// fhtatk: binary
* Step 2. Register variables to impute
mi set flong
mi register imputed hlthstat vitaminc porphyrn fhtatk // put the list of variables to impute
* Step 3. Implement Multiple Imputation Methods
mi impute chained (logit) fhtatk (ologit) hlthstat (reg) vitaminc porphyrn, add(10) rseed(53421)
* Step 4. Regressions with multiply imputed data
mi estimate: reg hlthstat i.female i.agegrp bmi i.race vitaminc porphyrn i.fhtatk
mivif // VIF
mi estimate: proportion hlthstat
mi estimate: proportion fhtatk
mi estimate: mean vitaminc
mi estimate: mean porphyrn
mibeta hlthstat i.female i.agegrp bmi i.race vitaminc porphyrn i.fhtatk, fisherz // for r-squared
Resources
Multiple Imputation in Stata (wisc.edu)
- Deciding to Impute (wisc.edu)
- Creating Imputation Models (wisc.edu)
- Imputing (wisc.edu)
- Managing Multiply Imputed Data (wisc.edu)
- Estimating (wisc.edu)
- Examples (wisc.edu)
- Recommended Readings (wisc.edu)
Stata Guide: Multiple Imputation: Imputation Step (mwn.de)
How can I perform multiple imputation on longitudinal data using ICE? | Stata FAQ (ucla.edu)
Multiple Imputation in Stata (ucla.edu)
https://stats.oarc.ucla.edu/wp-content/uploads/2016/09/Missing-Data-Techniques_UCLA_Stata.pdf (highly recommended)
Missing Values Analysis and Multiple Imputation in Stata
Advanced Handling of Missing Data One-day Workshop
Strategies and Guidelines for Handling Missing Data in Social Work Research
Missing Data and Multiple Imputation Decision Tree
How can I get margins for a multiply imputed survey logit model? | Stata FAQ (ucla.edu)