[Stata] Multiple Imputation by Chained Equations (MICE)

Handling missing data is an essential part of any data analysis. Multiple imputation is a robust method to address this issue, filling in missing values multiple times to create several complete datasets. Stata provides a streamlined approach to conduct multiple imputation.

What is Multiple Imputation?

Multiple imputation (MI) is a way to deal with missing data in a dataset that may affect the validity and accuracy of statistical analyses. MI involves creating several possible values for the missing data, based on the observed data, and then combining the results from each imputed dataset to obtain a final estimate.

For example, suppose you have a dataset of 100 students’ test scores, but 10 of them are missing. You could use MI to generate 5 plausible values for each missing score, based on the distribution and correlation of the observed scores, which gives you 5 completed datasets. Then you could run a t-test or ANOVA on each of the 5 datasets and pool the estimates and their standard errors into a single result, from which the p-value and effect size are obtained.

MI can be used for both categorical and continuous variables, as long as the imputation method is appropriate for the type and level of measurement of the variable. For example, you could use logistic regression to impute binary variables or multinomial logistic regression to impute nominal variables.

One of the most popular methods for MI is MICE (Multiple Imputation by Chained Equations), which is an iterative algorithm that imputes each variable in turn, using the other variables as predictors. MICE can handle different types of variables, complex relationships, and interactions among variables. MICE also allows for different imputation models for different variables, such as linear regression, logistic regression, or predictive mean matching.

MICE works by following these steps: (reference here)

1. Initialize the missing values with random draws from the observed values of the same variable.
2. For each variable with missing values, perform the following steps:
   - Delete the imputed values of that variable, leaving only the observed values.
   - Regress that variable on the other variables in the dataset, using an appropriate model.
   - Draw new imputed values from the posterior predictive distribution of that model.
3. Repeat step 2 until convergence is reached, or a maximum number of iterations is reached.
4. Store the imputed dataset and repeat the whole process to create multiple imputed datasets.

In this post, I will show you how to do multiple imputation in Stata using the mi command. I will use a hypothetical data set with four variables: var1 (binary), var2 (ordinal), var3 (categorical), and var4 (continuous). These variables have some missing values, and I want to impute them using multiple imputation methods.

Step 1. Identify variables to impute and their level of measurement

The first step is to identify which variables have missing values and what their level of measurement is. We can use the misstable summarize command to get a summary of the missing values in our data set. For example, to check the missing values for var1, var2, and var3, we can type:

Stata

misstable summarize var1 var2 var3 // put the list of variables in your model

This will give us the number and percentage of missing values for each variable.
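
It can also help to see which variables tend to be missing together. The sketch below is an addition, not part of the original walkthrough: it assumes the NHANES II sample data used at the end of this post and uses misstable patterns, with the variable list chosen only for illustration.

```stata
* Minimal sketch, assuming the nhanes2 sample data (requires internet access)
webuse nhanes2, clear

* List the distinct missing-data patterns across these variables,
* reporting frequencies rather than percentages
misstable patterns hlthstat vitaminc porphyrn fhtatk, frequency
```

If most incomplete cases share a single pattern, that is worth knowing before choosing the imputation model.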

We can also use the codebook command to get more information about the variables, such as their labels, values, and types. For example, to get the codebook for var1, var2, and var3, we can type:

Stata

codebook var1 var2 var3 // put the list of variables to impute

From this output, you can work out each variable's level of measurement. For example, in the NHANES II sample data used at the end of this post, hlthstat is an ordinal variable (Likert scale) and diabetes is a binary variable (yes/no).

Step 2. Register variables to impute

The next step is to register the variables that we want to impute using the mi set and mi register commands. The mi set command tells Stata that we are going to use multiple imputation, and the mi register command tells Stata which variables are imputed and which are not.

There are four ways to store the imputed data in Stata: wide, mlong, flong, and flongsep.

http://repec.org/usug2009/uk09_marchenko.pdf

For general use with moderately sized data, flong is the most commonly used style (see this post).

Then, we can register the variables that we want to impute using the mi register command. We need to specify the keyword imputed before the list of variables to impute. For example, to register var1, var2, and var3 as imputed variables, we can type:

Stata

mi set flong
mi register imputed var1 var2 var3 // put the list of variables to impute
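
Once the data are mi set, it can be useful to confirm what Stata has recorded. The sketch below is an addition under stated assumptions: it uses the NHANES II sample data from the end of this post, and registering the complete covariates as regular is optional, shown here only for illustration.

```stata
* Minimal sketch, assuming the nhanes2 sample data (requires internet access)
webuse nhanes2, clear
mi set flong
mi register imputed hlthstat vitaminc porphyrn fhtatk

* Optional: register complete variables that will not be imputed as regular
mi register regular female agegrp bmi race

* Report the mi style, the number of imputations, and the registered variables
mi describe
```

Before any imputation has been run, mi describe should report zero imputations; running it again after Step 3 is a quick check that the imputations were added.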

Step 3. Implement Multiple Imputation Methods

The next step is to implement the multiple imputation methods using the mi impute command. The mi impute command allows us to use different methods for different variables, depending on their level of measurement and distribution. The most common methods are:

logit for binary variables
ologit for ordinal variables
mlogit for nominal (unordered categorical) variables
reg for continuous variables
truncreg for continuous variables with truncation

To impute variables with different levels of measurement in one model, use the chained method. The mvn method assumes multivariate normality, so it only suits continuous variables: categorical variables would have to be converted into binary indicators, and you cannot mix variable types.

We specify each method in parentheses, followed by the variable (or variables) to impute with that method. We can also specify some options, such as the number of imputations (add), the random seed (rseed), and the name of the trace data set (savetrace).

For example, to impute var1 (binary) using logit, var2 (ordinal) using ologit, var3 (categorical) using mlogit, and var4 (continuous) using truncreg with a lower limit of 0 and an upper limit of 6, we can type:

Stata

mi impute chained (logit) var1 (ologit) var2 (mlogit) var3 (truncreg, ll(0) ul(6)) var4, add(10) rseed(53421) savetrace(trace1, replace)

This will create 10 imputed datasets, using a random seed of 53421, and save the trace data set as trace1. The trace data set contains the mean and standard deviation of the imputed values of each variable, for each iteration of each imputation.
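
The trace data set is mainly useful for checking convergence. The sketch below is an addition, not the author's workflow: it assumes the NHANES II sample data from the end of this post, and the choices of add(5), burnin(20), and the plotted variable are illustrative only. It also assumes the trace file stores each summary as variablename_mean and variablename_sd alongside iter and m.

```stata
* Minimal sketch, assuming the nhanes2 sample data; run from a do-file (uses ///)
webuse nhanes2, clear
recode hlthstat (8=.)
mi set flong
mi register imputed hlthstat vitaminc porphyrn fhtatk
mi impute chained (logit) fhtatk (ologit) hlthstat (regress) vitaminc porphyrn, ///
    add(5) rseed(53421) burnin(20) savetrace(trace1, replace)

* The trace file has one row per iteration (iter) and imputation (m)
preserve
use trace1, clear
reshape wide *mean *sd, i(iter) j(m)   // one column per imputation
tsset iter
* One line per imputation: the mean of the imputed vitaminc values by iteration
tsline vitaminc_mean*, legend(off) ytitle("Mean of imputed vitaminc")
restore
```

Lines that wander without a trend suggest the chains have settled; a clear drift suggests increasing burnin().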

Step 4. Regressions with multiply imputed data

One of the most common analyses that we may want to do after imputing the missing values is to run a linear regression using the imputed data. We can use the mi estimate command to apply the regress command to the imputed data, and get the regression coefficients, standard errors, t-statistics, and p-values for each variable.

Stata

mi estimate: reg depvar indepvar1 indepvar2

By default, this displays only the pooled estimates across the imputations; adding the noisily option (mi estimate, noisily: ...) also shows the output from each imputation. The pooled estimates are obtained by combining the estimates and standard errors from each imputation using Rubin’s rules. The output also includes a degrees-of-freedom adjustment, which accounts for the uncertainty due to imputation.
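
For a single coefficient, the pooling step can be written out as follows. This is the standard textbook statement of Rubin's rules rather than Stata output, and the notation is introduced here for illustration: M is the number of imputations, and for imputation m the estimate is \hat{Q}_m and its squared standard error is U_m.

```latex
\begin{align*}
\bar{Q} &= \frac{1}{M}\sum_{m=1}^{M}\hat{Q}_m
  && \text{(pooled estimate)}\\
\bar{U} &= \frac{1}{M}\sum_{m=1}^{M}U_m
  && \text{(within-imputation variance)}\\
B &= \frac{1}{M-1}\sum_{m=1}^{M}\bigl(\hat{Q}_m-\bar{Q}\bigr)^{2}
  && \text{(between-imputation variance)}\\
T &= \bar{U} + \Bigl(1+\frac{1}{M}\Bigr)B
  && \text{(total variance)}
\end{align*}
```

The pooled standard error is the square root of T, so it grows when the imputations disagree with one another (large B).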

Tip. Estimating VIF for multiply imputed data

To estimate the VIF and check for multicollinearity, you can install the user-written command mivif and run it right after the mi estimate: reg command.

Stata

ssc install mivif 
mi estimate: reg depvar indepvar1 indepvar2
mivif

Tip. Estimating the number of imputations needed

You can also calculate the number of imputations needed for your model based on von Hippel’s (2020) approach, using the how_many_imputations command.

🔎 Von Hippel, P. T. (2020). How many imputations do you need? A two-stage calculation using a quadratic rule. Sociological Methods & Research, 49(3), 699-718.

Stata

ssc install how_many_imputations // install for the first time 
mi estimate: reg depvar indepvar1 indepvar2
how_many_imputations

This will return the number of imputations you have already run and the minimum number of imputations needed. In this case, as I imputed more than needed, no additional imputations are required.
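
The quadratic rule behind this calculation, as I understand it from von Hippel (2020), ties the number of imputations M to the fraction of missing information (FMI) and to how stable you want the standard error to be, expressed as a coefficient of variation CV(SE). The numbers in the worked line are hypothetical, chosen only to show the arithmetic.

```latex
\[
M \;=\; 1 + \frac{1}{2}\left(\frac{\mathrm{FMI}}{\mathrm{CV}(SE)}\right)^{2}
\]
\[
\mathrm{FMI}=0.30,\quad \mathrm{CV}(SE)=0.05
\;\;\Rightarrow\;\;
M \;=\; 1 + \tfrac{1}{2}\,(6)^{2} \;=\; 19
\]
```

Because the required M grows with the square of FMI, models with a lot of missing information need many more imputations than the traditional 5 to 10.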

Step 5. Estimating summary statistics

After imputing the missing values, we may want to estimate some summary statistics for the imputed variables, such as means and proportions. However, the usual commands for summary statistics, such as summarize and tabulate, do not pool results across imputations: run directly on flong data, they treat the original and imputed copies stacked together as one data set.

We can use mi estimate: proportion or mi estimate: mean to get pooled summary statistics for multiply imputed data.

Stata

mi estimate: proportion varname // for categorical, binary variables
mi estimate: mean varname // for continuous variables.
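
If you do want ordinary descriptive output, for example to compare the observed data with individual imputed datasets, mi xeq runs a command separately on the imputations you name. The sketch below is an addition that assumes the NHANES II sample data from the end of this post; add(5) and the choice of imputations 0, 1, and 2 are illustrative only.

```stata
* Minimal sketch, assuming the nhanes2 sample data (requires internet access)
webuse nhanes2, clear
recode hlthstat (8=.)
mi set flong
mi register imputed hlthstat vitaminc porphyrn fhtatk
mi impute chained (logit) fhtatk (ologit) hlthstat (reg) vitaminc porphyrn, add(5) rseed(53421)

* m = 0 is the original data; m = 1 and m = 2 are the first two imputed datasets
mi xeq 0 1 2: summarize vitaminc

* Tabulate the original data only, showing the missing category
mi xeq 0: tabulate hlthstat, missing
```

These per-imputation results are not pooled, so they are for inspection only; the reported estimates should still come from mi estimate.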

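RVI (Relative Variance Increase) shows how much the variance of an estimate is inflated by missing data, while FMI (Fraction of Missing Information) indicates the share of information about the estimate that is lost to missingness. A higher FMI suggests greater uncertainty from missing data. Both can be displayed for each estimate by adding the vartable option to mi estimate. MI yields one final set of coefficients with appropriate standard errors.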
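
In formula terms, both quantities come from the within- and between-imputation variances. The notation is introduced here for illustration: \bar{U} is the average squared standard error across the M imputations, B is the variance of the estimates across imputations, and \nu is the MI degrees of freedom. The FMI expression is my reading of the formula Stata documents, so treat it as a sketch.

```latex
\begin{align*}
\mathrm{RVI} &= \frac{(1+1/M)\,B}{\bar{U}}\\[4pt]
\mathrm{FMI} &= \frac{\mathrm{RVI} + 2/(\nu+3)}{1+\mathrm{RVI}}
  \;\approx\; \frac{\mathrm{RVI}}{1+\mathrm{RVI}}
  \quad\text{when } \nu \text{ is large}
\end{align*}
```

This is why the two numbers move together: an RVI near 1 corresponds to an FMI of roughly one half.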

In the example above, the FMI of 0.54 indicates that over half of the information about this parameter is missing, which is concerning. The high RVI shows that the variance has increased by about 106% due to missing data. The second estimate, the mean of vitamin C, has far less missing information (an FMI of only 7.7%), and its variance increase due to missing data is modest (about 8%).
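
To produce a table of RVI and FMI values yourself, add the vartable option to mi estimate. The sketch below is an addition that assumes the NHANES II sample data from the end of this post; add(5) is an illustrative choice, so the numbers it prints will not match the ones quoted above.

```stata
* Minimal sketch, assuming the nhanes2 sample data (requires internet access)
webuse nhanes2, clear
recode hlthstat (8=.)
mi set flong
mi register imputed hlthstat vitaminc porphyrn fhtatk
mi impute chained (logit) fhtatk (ologit) hlthstat (reg) vitaminc porphyrn, add(5) rseed(53421)

* vartable adds a variance-information table with RVI and FMI for each estimate
mi estimate, vartable: mean vitaminc porphyrn
```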

Tip. Estimate R-squared with beta coefficients for multiple imputation regressions

Another thing that we may want to do after imputing the missing values is to run a regression analysis using the imputed data. We can use the mi estimate command to apply any regression command to the imputed data, such as regress, logit, or ologit.

However, the mi estimate command does not report the R-squared for regression models, which is a common measure of goodness of fit. To get the R-squared for multiple imputation regressions, we can use a user-written command called mibeta, which can be located with the search command. For example, to search for mibeta, we can type:

Stata

search mibeta

This will give us a link to the mibeta command, which can be downloaded and installed from the Stata website. Then, we can use the mibeta command to get the R-squared for a linear regression model, as well as the standardized beta coefficients; the fisherz option applies Fisher’s z transformation when computing the R-squared. For example, to get the R-squared for a regression of depvar on indepvar1 and indepvar2, we can type:

Stata

mibeta depvar indepvar1 indepvar2, fisherz

This will give us the R-squared, Adjusted R-squared, and the beta coefficients.

❤️‍🔥Learn more: How can I estimate R-squared for a model estimated with multiply imputed data? | Stata FAQ (ucla.edu)

Step 6. Saving results with outreg2

You can also save the results of regressions on multiply imputed data by using the user-written outreg2 command. To save the results together with the R-squared from mibeta, you can use the following commands.

Stata

mibeta depvar indepvar1 indepvar2, fisherz 

*Saving R2
local r2mi= e(r2_mi)
*ADJ. R2
local ar2mi= e(r2_adj_mi)

mi estimate, dots post: reg depvar indepvar1 indepvar2 

*Output using outreg2
outreg2 using regression.xls, replace excel dec(2) alpha(0.001, 0.01, 0.05) addstat(dfModel, e(df_m_mi), dfError, e(df_r_mi), F, e(F_mi), R-squared, `r2mi', Adjusted R-squared, `ar2mi')

❤️‍🔥Learn more: mibeta command and statistical output – Statalist

Sample code with sample data

Stata

webuse nhanes2, clear

** DV: hlthstat
** IV: female agegrp bmi race vitaminc porphyrn fhtatk

* Step 1. Identify variables to impute and their level of measurement
misstable summarize hlthstat female agegrp bmi race vitaminc porphyrn fhtatk // put the list of variables in your model

codebook hlthstat female agegrp bmi race vitaminc porphyrn fhtatk
recode hlthstat (8=.) // treat code 8 as missing; rerun misstable summarize to see the updated count
// hlthstat: ordinal (likert) 
// vitaminc: continuous
// porphyrn: continuous
// fhtatk: binary 

* Step 2. Register variables to impute
mi set flong
mi register imputed hlthstat vitaminc porphyrn fhtatk // put the list of variables to impute 

* Step 3. Implement Multiple Imputation Methods
mi impute chained (logit) fhtatk (ologit) hlthstat (reg) vitaminc porphyrn, add(10) rseed(53421)

* Step 4. Regressions with multiply imputed data
mi estimate: reg hlthstat i.female i.agegrp bmi i.race vitaminc porphyrn i.fhtatk
mivif // VIF 

* Step 5. Estimating summary statistics
mi estimate: proportion hlthstat
mi estimate: proportion fhtatk
mi estimate: mean vitaminc
mi estimate: mean porphyrn

* Tip. Estimate R-squared with beta coefficients
mibeta hlthstat i.female i.agegrp bmi i.race vitaminc porphyrn i.fhtatk, fisherz // for r-squared