# [Stata] Multiple Imputation by Chained Equations (MICE)

Handling missing data is an essential part of any data analysis. Multiple imputation is a robust method to address this issue, filling in missing values **multiple times** to create **several complete datasets**. Stata provides a streamlined approach to conduct multiple imputation.

### What is Multiple Imputation?

Multiple imputation (MI) is a way to deal with missing data in a dataset that may affect the validity and accuracy of statistical analyses. MI involves creating several possible values for the missing data, based on the observed data, and then combining the results from each imputed dataset to obtain a final estimate.

For example, suppose you have a dataset of 100 students’ test scores, but 10 of them are missing. You could use MI to generate 5 different plausible values for each missing score, based on the distribution and correlation of the observed scores, which gives you 5 completed datasets. Then you could run the same t-test or ANOVA on each of the 5 imputed datasets, and pool the estimates and their standard errors to get a single effect estimate and a single test result.

MI can be used for both categorical and continuous variables, as long as the imputation method is appropriate for the type and level of measurement of the variable. For example, you could use logistic regression to impute binary variables or multinomial logistic regression to impute nominal variables.

One of the most popular methods for MI is **MICE (Multiple Imputation by Chained Equations)**, which is an **iterative algorithm** that imputes each variable in turn, using the other variables as predictors. MICE can handle different types of variables, complex relationships, and interactions among variables. MICE also allows for different imputation models for different variables, such as linear regression, logistic regression, or predictive mean matching.

MICE works by following these steps:

- Initialize the missing values with random draws from the observed values of the same variable.
- For each variable with missing values, perform the following steps:
  - Delete the imputed values of that variable, leaving only the observed values.
  - Regress that variable on the other variables in the dataset, using an appropriate model.
  - Draw new imputed values from the posterior predictive distribution of that model.
- Repeat these steps until convergence is reached, or a maximum number of iterations is reached.
- Store the imputed dataset and repeat the whole process to create multiple imputed datasets.
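In Stata, this loop is handled by `mi impute chained`. As a rough sketch (assuming the data have already been `mi set` and the variables registered, as covered in the steps later in this post, and using the hypothetical variables `var1` and `var4`), the number of iterations and the number of stored datasets map onto the `burnin()` and `add()` options:

```stata
* Sketch only: var1 (binary) and var4 (continuous) are hypothetical variables
* that are assumed to be registered as imputed already.
mi impute chained (logit) var1 (regress) var4, ///
    burnin(20)  /// iterations of the chained equations before a dataset is stored
    add(5)      /// number of imputed datasets to create
    chaindots   /// print a dot for each iteration
    rseed(53421)
```

Stata's default is `burnin(10)`; raising it is only worthwhile if the trace of the imputed values suggests the chains have not settled.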

In this post, I will show you how to do multiple imputation in Stata using the `mi` command. I will use a hypothetical data set with four variables: `var1` (binary), `var2` (ordinal), `var3` (categorical), and `var4` (continuous). These variables have some missing values, and I want to impute them using multiple imputation methods.
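If you want to follow along without your own data, here is a minimal sketch that builds a toy data set with these four variables. The distributions, the sample size, and the 10% missingness rate are all made up for illustration:

```stata
* Hypothetical toy data: every number below is an arbitrary choice.
clear
set seed 12345
set obs 500

generate byte var1 = runiform() < 0.4        // binary (0/1)
generate byte var2 = ceil(5 * runiform())    // ordinal, 1 to 5
generate byte var3 = ceil(3 * runiform())    // categorical, 1 to 3
generate var4 = 1 + 4 * runiform()           // continuous, between 1 and 5

* Set roughly 10% of each variable to missing, completely at random
foreach v of varlist var1 var2 var3 var4 {
    replace `v' = . if runiform() < 0.10
}
```

Because the variables are generated independently, the imputation models will have little to work with. The point is only to have something the commands in this post can run on.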

### Step 1. Identify variables to impute and their level of measurement

The first step is to identify which variables have missing values, and what their level of measurement is. We can use the `misstable summarize` command to get a summary of the missing values in our data set. For example, to check the missing values for `var1`, `var2`, `var3`, and `var4`, we can type:

`misstable summarize var1 var2 var3 var4 // put the list of variables in your model`

This will give us the number of missing and non-missing observations for each variable that has missing values, along with the number of unique values and the range of the observed values.
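It can also help to see which variables tend to be missing together. A small sketch, using the same hypothetical variables:

```stata
* Show the combinations of variables that are missing together,
* reported as frequencies instead of percentages
misstable patterns var1 var2 var3 var4, frequency
```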

We can also use the `codebook` command to get more information about the variables, such as their labels, values, and types. For example, to get the codebook for `var1`, `var2`, `var3`, and `var4`, we can type:

`codebook var1 var2 var3 var4 // put the list of variables to impute`

From this output, you can work out each variable's level of measurement. For example, in the sample data used at the end of this post, `hlthstat` is an ordinal variable (Likert scale), and `diabetes` is a binary variable (yes/no).

### Step 2. Register variables to impute

The next step is to register the variables that we want to impute using the `mi set` and `mi register` commands. The `mi set` command tells Stata that we are going to use multiple imputation, and the `mi register` command tells Stata which variables are imputed and which are not.

There are four ways to store the imputed data in Stata: `wide`, `mlong`, `flong`, and `flongsep`.

For general use with a moderate size of data, `flong` is the most commonly used style (see this post).

Then, we can register the variables that we want to impute using the `mi register` command. We need to specify the keyword `imputed` before the list of variables to impute. For example, to register `var1`, `var2`, `var3`, and `var4` as imputed variables, we can type:

```
mi set flong
mi register imputed var1 var2 var3 var4 // put the list of variables to impute
```
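After registering, it is worth checking that Stata recorded what you intended. A short sketch, where `age` and `female` are hypothetical fully observed variables that do not appear elsewhere in this post:

```stata
* Optionally register fully observed variables as regular
* (age and female are made-up names for illustration)
mi register regular age female

* Summarize the mi settings: style, registered variables, number of imputations
mi describe

* Missing-value counts for the mi data
mi misstable summarize
```

Registering regular variables is not required for imputation, but it lets Stata check that those variables stay consistent across the imputed datasets.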

### Step 3. Implement Multiple Imputation Methods

The next step is to implement the multiple imputation methods using the `mi impute` command. The `mi impute` command allows us to use different methods for different variables, depending on their level of measurement and distribution. The most common methods are:

- `logit` for binary variables
- `ologit` for ordinal variables
- `mlogit` for categorical variables
- `regress` for continuous variables
- `truncreg` for continuous variables with truncation

To impute variables with different levels of measurement in the same model, choose the `chained` method. The `mvn` method assumes multivariate normality, so it is only suited to continuous variables: you would need to convert all categorical variables into binary indicators, and you can’t mix different model types.

We specify each method in parentheses, followed by the variable (or variables) to impute with that method. We can also specify some options, such as the number of imputations (`add`), the random seed (`rseed`), and the name of the trace data set (`savetrace`).

For example, to impute `var1` (binary) using `logit`, `var2` (ordinal) using `ologit`, `var3` (categorical) using `mlogit`, and `var4` (continuous) using `truncreg` with a lower limit of 0 and an upper limit of 6, we can type:

`mi impute chained (logit) var1 (ologit) var2 (mlogit) var3 (truncreg, ll(0) ul(6)) var4, add(10) rseed(53421) savetrace(trace1, replace)`

This will create 10 imputed datasets, using a random seed of 53421, and save the trace data set as `trace1`. The trace data set contains the mean and standard deviation of the imputed values of each variable at every iteration, for each imputation.
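The trace file can be plotted to check whether the chains have settled. The sketch below follows the pattern used in the Stata manual and assumes the trace variables are named `iter`, `m`, and `<varname>_mean` / `<varname>_sd`. Check the file with `describe` first if your version names them differently:

```stata
* Assumes trace1.dta was written by savetrace() and includes var4.
preserve
use trace1, clear
reshape wide *mean *sd, i(iter) j(m)   // one column per imputation chain
tsset iter
tsline var4_mean*, name(trace_mean, replace) legend(off) ytitle("Mean of imputed var4")
tsline var4_sd*, name(trace_sd, replace) legend(off) ytitle("SD of imputed var4")
restore
```

Lines that wander without a trend suggest the burn-in is long enough. A visible drift suggests raising `burnin()`.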

### Step 4. Regressions with multiply imputed data

One of the most common analyses that we may want to do after imputing the missing values is to run a linear regression using the imputed data. We can use `mi estimate` as a prefix to the `regress` command, which fits the model on each imputed dataset and reports the pooled regression coefficients, standard errors, t-statistics, and p-values for each variable.

`mi estimate: reg depvar indepvar1 indepvar2`

By default, this displays only the pooled estimates across the imputations, not the output for each individual imputation. The pooled estimate is obtained by combining the estimates and standard errors from each imputation using Rubin’s rules. The output will also include the degrees of freedom adjustment, which accounts for the uncertainty due to imputation.
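For reference, Rubin’s rules for a single coefficient can be written as follows, where $\hat{Q}_m$ is the estimate and $U_m$ its squared standard error from imputation $m$ out of $M$ (this notation is an assumption for illustration, not the labels Stata prints):

```latex
\begin{aligned}
\bar{Q} &= \frac{1}{M}\sum_{m=1}^{M}\hat{Q}_m && \text{(pooled estimate)}\\
\bar{U} &= \frac{1}{M}\sum_{m=1}^{M}U_m && \text{(within-imputation variance)}\\
B &= \frac{1}{M-1}\sum_{m=1}^{M}\bigl(\hat{Q}_m-\bar{Q}\bigr)^2 && \text{(between-imputation variance)}\\
T &= \bar{U} + \Bigl(1+\frac{1}{M}\Bigr)B && \text{(total variance)}
\end{aligned}
```

The pooled standard error is $\sqrt{T}$, so the more the imputations disagree with each other (larger $B$), the wider the confidence interval becomes.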

**Tip. Estimating VIF for multiply imputed data**

To estimate VIF to check multicollinearity, you can install the user-written command `mivif` and then run it right after the `mi estimate: reg` command.

```
ssc install mivif
mi estimate: reg depvar indepvar1 indepvar2
mivif
```

**Tip. Estimating the number of imputations needed**

You can also calculate the number of imputations needed for your model based on Von Hippel (2020)’s approach, using the `how_many_imputations` command.

🔎 Von Hippel, P. T. (2020). How many imputations do you need? A two-stage calculation using a quadratic rule. *Sociological Methods & Research*, *49*(3), 699-718.

```
ssc install how_many_imputations // install for the first time
mi estimate: reg depvar indepvar1 indepvar2
how_many_imputations
```

This will return the number of imputations you have already run and the minimum number of imputations needed. In this case, as I imputed more than needed, no additional imputations are required.
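If the command does ask for more, you do not have to start over: `add()` appends new imputations to the ones already stored. A sketch, assuming (hypothetically) that 15 more imputations were requested and reusing the imputation model from Step 3:

```stata
* Hypothetical: suppose how_many_imputations reported that 15 more are needed.
* add() appends to the existing imputations; the seed here is arbitrary.
mi impute chained (logit) var1 (ologit) var2 (mlogit) var3 (truncreg, ll(0) ul(6)) var4, ///
    add(15) rseed(98765)

* Re-fit the analysis model on all imputations
mi estimate: reg depvar indepvar1 indepvar2
```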

### Step 5. Estimating summary statistics

After imputing the missing values, we may want to estimate some summary statistics for the imputed variables, such as the mean of a continuous variable or the proportion in each category. However, the usual commands for summary statistics, such as `summarize` and `tabulate`, are not aware of the imputations: run directly on `flong` data, they treat the original data and all the stacked imputed copies as one data set, so their results are not properly pooled.

We can use `mi estimate: proportion` or `mi estimate: mean` to get pooled summary statistics for multiply imputed data.

```
mi estimate: proportion varname // for categorical, binary variables
mi estimate: mean varname // for continuous variables.
```
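If you do want `summarize` or `tabulate` output, for example to compare the observed data with a few imputations, one option is `mi xeq`, which runs a command separately on the datasets you list. A sketch with the hypothetical variables from this post:

```stata
* Original data only (m = 0)
mi xeq 0: summarize var4

* Imputations 1 to 3, each summarized separately
mi xeq 1/3: summarize var4

* Frequency table within the first imputation
mi xeq 1: tabulate var1
```

These results are per dataset and are not pooled, so they are for checking the imputations, not for reporting.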

#### Tip. Estimate R-squared with beta coefficients for multiple imputation regressions

Another thing that we may want to do after imputing the missing values is to run a regression analysis using the imputed data. We can use the `mi estimate` command to apply a regression command to the imputed data, such as `regress`, `logit`, or `ologit`.

However, the `mi estimate` command does not report the R-squared for the regression models, which is a common measure of goodness-of-fit. To get the R-squared for the multiple imputation regressions, we need to use another user-written command called `mibeta`, which can be found using the `search` command. For example, to search for `mibeta`, we can type:

`search mibeta`

This will give us a link to the `mibeta` command, from which it can be downloaded and installed. Then, we can use the `mibeta` command to get the R-squared for a linear regression model, as well as the standardized beta coefficients. The `fisherz` option calculates the R-squared measures using Fisher’s z transformation. For example, to get the R-squared for a regression of `var4` on `var1`, `var2`, and `var3`, we can type:

`mibeta var4 i.var1 i.var2 i.var3, fisherz`

This will give us the R-squared, adjusted R-squared, and the standardized beta coefficients.

❤️🔥Learn more: How can I estimate R-squared for a model estimated with multiply imputed data? | Stata FAQ (ucla.edu)

### Step 6. Saving results with outreg2

You can also save the results from regressions with multiply imputed data by using the `outreg2` command. To save the results together with the R-squared from `mibeta`, you can use the following commands.

```
mibeta depvar indepvar1 indepvar2, fisherz
*Saving R2
local r2mi= e(r2_mi)
*ADJ. R2
local ar2mi= e(r2_adj_mi)
mi estimate, dots post: reg depvar indepvar1 indepvar2
*Output using outreg2
outreg2 using regression.xls, replace excel dec(2) alpha(0.001, 0.01, 0.05) addstat(dfModel, e(df_m_mi), dfError, e(df_r_mi), F, e(F_mi), R-squared, `r2mi', Adjusted R-squared, `ar2mi')
```

❤️🔥Learn more: mibeta command and statistical output – Statalist

## Sample code with sample data

```
webuse nhanes2, clear
** DV: hlthstat
** IV: female agegrp bmi race vitaminc porphyrn fhtatk
* Step 1. Identify variables to impute and their level of measurement
misstable summarize hlthstat female agegrp bmi race vitaminc porphyrn fhtatk // put the list of variables in your model
codebook hlthstat female agegrp bmi race vitaminc porphyrn fhtatk
recode hlthstat (8=.)
// hlthstat: ordinal (likert)
// vitaminc: continuous
// porphyrn: continuous
// fhtatk: binary
* Step 2. Register variables to impute
mi set flong
mi register imputed hlthstat vitaminc porphyrn fhtatk // put the list of variables to impute
* Step 3. Implement Multiple Imputation Methods
mi impute chained (logit) fhtatk (ologit) hlthstat (reg) vitaminc porphyrn, add(10) rseed(53421)
* Step 4. Regressions with multiply imputed data
mi estimate: reg hlthstat i.female i.agegrp bmi i.race vitaminc porphyrn i.fhtatk
mivif // VIF
* Step 5. Estimating summary statistics
mi estimate: proportion hlthstat
mi estimate: proportion fhtatk
mi estimate: mean vitaminc
mi estimate: mean porphyrn
* Tip. R-squared with standardized beta coefficients
mibeta hlthstat i.female i.agegrp bmi i.race vitaminc porphyrn i.fhtatk, fisherz // for r-squared
```

### Resources

Multiple Imputation in Stata (wisc.edu)

- Deciding to Impute (wisc.edu)
- Creating Imputation Models (wisc.edu)
- Imputing (wisc.edu)
- Managing Multiply Imputed Data (wisc.edu)
- Estimating (wisc.edu)
- Examples (wisc.edu)
- Recommended Readings (wisc.edu)

Stata Guide: Multiple Imputation: Imputation Step (mwn.de)

How can I perform multiple imputation on longitudinal data using ICE? | Stata FAQ (ucla.edu)

Multiple Imputation in Stata (ucla.edu)

https://stats.oarc.ucla.edu/wp-content/uploads/2016/09/Missing-Data-Techniques_UCLA_Stata.pdf (highly recommended)

Missing Values Analysis and Multiple Imputation in Stata

Advanced Handling of Missing Data One-day Workshop

Strategies and Guidelines for Handling Missing Data in Social Work Research

Missing Data and Multiple Imputation Decision Tree

How can I get margins for a multiply imputed survey logit model? | Stata FAQ (ucla.edu)