# [Stata] Data cleaning 6: Exploring missing data and creating analytic sample (misstable, mdesc, and keep/drop if)

One way to explore missing data and create an analytic sample using Stata is to use the commands `misstable`

, `mdesc`

, and `drop`

.

Here is an example of how to do that using the `nhanes2`

dataset from the Centers for Disease Control and Prevention. First, we can load the dataset from the Stata website using the command `webuse nhanes2`

. This dataset contains information on health and nutrition for 27,801 persons from 6 months to 74 years of age. It has 25 variables, some of which have missing values.

`webuse nhanes2`

`misstable`

command

Next, we can use the command `misstable summarize`

to get a table with counts of missing values for each variable. This command also shows the number of extended missing values (.a, .b, etc.) that are treated as nonmissing by default.

```
misstable summarize // show list of variables with missing values
misstable summarize var1 var2 var3 var4 ... // show missing values for specific set of vars
```

For example, we can specify the list of variables to get a summary of missing values for them. The output will look something like this. You can see only `highlead`

variable has missing values (N=5,403) and non-missing values (N=4,948)).

If we want to see the pattern of missing values across multiple variables, we can use the command `misstable patterns`

. This command shows how many observations have a certain combination of missing and nonmissing values for the specified variables. For example, we can type `misstable pattern`

to get a table of missing-value patterns for this set of variables. The output will look something like this. You can see that 88% of cases are complete (without missing values for three variables) and 11% of cases are missing in `hlthstat`

and `corpusc1`

.

`misstable patterns var1 var2 var3 `

The `generate(_miss)`

part of the command creates new binary variables with the prefix `_miss`

that have the value 1 if the corresponding original variable has a missing value and 0 otherwise. For example, if you type `misstable summarize age sex race income, generate(_miss)`

, you will get four new variables: `_miss_age`

, `_miss_sex`

, `_miss_race`

, and `_miss_income`

. These variables can be useful for identifying observations with missing values or for creating dummy variables for missingness in regression models.

`misstable summarize, generate(_miss)`

`egen rowmiss`

command

egen rowmiss command does the same with misstable, generate(_miss). You can create the variables for **the count of **the variables with missing values.

```
***** Create a variables that counts the number of variables for analysis (iv + dv + covariates) that are missing
egen sample_nmiss=rowmiss(var1 var2 var3 var4 var5)
***** Create a flag to identify those in the analytic sample.
// This is those with no missing observations on any of the variables for analysis (sample_nmiss)
generate flag_sample=1 if sample_nmiss==0
replace flag_sample=0 if sample_nmiss>=1
label variable flag_sample "Analytic sample"
tab flag_sample
```

`mdesc`

command

Alternatively, we can use the command `mdesc`

to get a similar table with counts of missing and nonmissing values for each variable. This command also shows the **percentage of missing values**. For example, we can type `mdesc age sex race`

to get a summary of missing and nonmissing values for these four variables. The output for `mdesc`

for all variables will look something like this.

```
ssc install mdesc
mdesc
mdesc var1 var2 var3 var4
```

`missingplot`

command: visualizing missing values distribution

One way to explore the missingness patterns in your data is to use the `missingplot`

command in Stata. This command, developed by Nicholas J. Cox, allows you to draw a plot showing the incidence of missing values in one or more variables in your dataset. The plot can help you to see the distribution and structure of missing values across observations and variables, and to identify potential outliers or anomalies.

To use the missingplot command, you can just type missigplot or specify one or more variables as arguments.

```
ssc install missingplot
missingplot
missingplot var1 var2 var3
```

### How to deal with missing data

When deciding how to deal with missing data, two important things to consider are 1) whether the missing data are systematic (not at random) or non-systematic (at random) and 2) how much of the data (observations) are missing. Depending on these factors, different methods such as listwise deletion, pairwise deletion, single imputation, or multiple imputation can be applied to deal with the missing data.

- Listwise deletion:
**This method removes any row or observation that contains at least one missing value.**- Pros: It is simple and easy to implement.
- Cons: It can reduce the sample size and introduce bias if the missing data are not completely random.

- Pairwise deletion:
**This method removes only the specific cells that contain missing values, and uses the remaining data to perform the analysis**.- Pros: It preserves more information than listwise deletion (and still simple and easy).
- Cons: It can create inconsistencies and complicate the estimation of covariance and correlation matrices.

- Single imputation:
**This method fills in the missing values with plausible estimates based on other available information**. For example, one can use the**mean, median, mode, or regression prediction**to impute the missing values.- Pros: It is still simple and easy to implement and does not lose data points compared to listwise and pairwise deletion.
- Cons: However, a single imputation can underestimate the uncertainty and variability in the data and introduce bias in the statistical inference.

- Multiple imputation:
**This method generates multiple possible values for each missing value, based on a probability distribution or a model that accounts for the observed data and the missingness mechanism**. Then, it performs the analysis on each of the completed datasets and combines the results into a final estimate and confidence interval.- Pros: Multiple imputation can provide valid and efficient statistical inference.
- Cons: It requires more computational resources and assumptions than single imputation (but we are okay; we have computers!).

**When it is OKAY to use listwise/pairwise deletion**: If the data is **missing at random** (MAR) and the incomplete cases are **less than 10% of observations**, the rule of thumb is that you are okay to use listwise deletion or pairwise deletion. Here are more established guidelines on cutoff based on the % of missing data.

Young et al. (2011) summarized the recommendations of various studies to provide the following guidelines:

- less than 1% missing data: the effect of missing data handling methods is trivial
- 1% to 5% missing data: simple methods such as
**listwise deletion**and**regression imputation**work well - 5% to 15% missing data: sophisticated methods, such as
**multiple imputations**would be selected; - missing data exceeds 15%:
**imputation results are largely meaningless**regardless of the imputation method used because very little can be said about the mechanism through which data are missing.

ALL methods assume that the data are **missing at random (MAR)**, which means that the probability of missing data is unrelated to the value of the missing data itself. If the data are **missing not at random (MNAR),** which means that the probability of missing data is related to the value of the missing data itself, then both methods can produce biased and invalid results. For example, what if a discrimination variable is missing in 50% of cases of females but only 10% of males? If you remove them with listwise deletion, it will lead to the bias.

⭐ Wait. **How can I know if the missing is at random or not at random? Here are more resources.**

You can check this resource (slide) to learn more about missing values in Stata: https://statisticalhorizons.com/wp-content/uploads/MD-Stata-Sample-Materials-1.pdf

Further, Multiple Imputation in Stata: (wisc.edu) webpage provides the method to assess the randomness of missing values using the t-tests and logistic regression with the loop.

**If the proportion of incomplete cases is low (less than 10%),** then both methods may produce similar results and have minimal impact on the statistical inference. However, if the proportion is high (more than 10%), then both methods may lose a lot of information and introduce a lot of error.

⭐ Ref: Langkamp, D. L., Lehman, A., & Lemeshow, S. (2010). Techniques for handling missing data in secondary analyses of large surveys. *Academic pediatrics*, *10*(3), 205-210.

- This study suggests a 10% cutoff for % of cases with missing values.
- There are more strict cutoffs suggesting 5%: Cook, R. M. (2021). Addressing missing data in quantitative counseling research.
*Counseling Outcome Research and Evaluation*,*12*(1), 43-53.

`drop if`

and `keep if`

command: creating analytic sample

Finally, if we want to create an analytic sample that excludes observations with missing values for certain variables (listwise deletion), we can use the command `drop if`

. This command deletes observations or variables from the dataset if the condition is met (after if). For example, if we want to drop observations that have missing values for the variable `vitaminc`

, we can type `drop if vitaminc == .`

. This will delete 378 number of observations. Or, you can use the `keep if`

command instead, to keep specific cases only.

```
drop if vitaminc==.
keep if race==2
```

## More references

Missing Values | Stata Learning Modules (ucla.edu)

Handling missing data in Stata: Imputation and likelihood-based approaches (pdf)

Multiple Imputation in Stata: (wisc.edu)

SAGE Green book on Missing Data | SAGE Publications Ltd