[Stata] Data cleaning 6: Exploring missing data and creating analytic sample (misstable, mdesc, and keep/drop if)

One way to explore missing data and create an analytic sample using Stata is to use the commands misstablemdesc, and drop

Here is an example of how to do that using the nhanes2 dataset from the Centers for Disease Control and Prevention. First, we can load the dataset from the Stata website using the command webuse nhanes2. This dataset contains information on health and nutrition for 27,801 persons from 6 months to 74 years of age. It has 25 variables, some of which have missing values.

webuse nhanes2

misstable command

Next, we can use the command misstable summarize to get a table with counts of missing values for each variable. This command also shows the number of extended missing values (.a, .b, etc.) that are treated as nonmissing by default.

misstable summarize // show list of variables with missing values 
misstable summarize var1 var2 var3 var4 ... // show missing values for specific set of vars

For example, we can specify the list of variables to get a summary of missing values for them. The output will look something like this. You can see only highlead variable has missing values (N=5,403) and non-missing values (N=4,948)).

If we want to see the pattern of missing values across multiple variables, we can use the command misstable patterns. This command shows how many observations have a certain combination of missing and nonmissing values for the specified variables. For example, we can type misstable pattern to get a table of missing-value patterns for this set of variables. The output will look something like this. You can see that 88% of cases are complete (without missing values for three variables) and 11% of cases are missing in hlthstat and corpusc1.

misstable patterns var1 var2 var3 

The generate(_miss) part of the command creates new binary variables with the prefix _miss that have the value 1 if the corresponding original variable has a missing value and 0 otherwise. For example, if you type misstable summarize age sex race income, generate(_miss), you will get four new variables: _miss_age_miss_sex_miss_race, and _miss_income. These variables can be useful for identifying observations with missing values or for creating dummy variables for missingness in regression models.

misstable summarize, generate(_miss)

egen rowmiss command

egen rowmiss command does the same with misstable, generate(_miss). You can create the variables for the count of the variables with missing values.

*****	Create a variables that counts the number of variables for analysis (iv + dv + covariates) that are missing
egen sample_nmiss=rowmiss(var1 var2 var3 var4 var5)

*****	Create a flag to identify those in the analytic sample. 
		// This is those with no missing observations on any of the variables for analysis (sample_nmiss)
generate flag_sample=1 if sample_nmiss==0
	replace flag_sample=0 if sample_nmiss>=1
label variable flag_sample "Analytic sample"

tab flag_sample

mdesc command

Alternatively, we can use the command mdesc to get a similar table with counts of missing and nonmissing values for each variable. This command also shows the percentage of missing values. For example, we can type mdesc age sex race to get a summary of missing and nonmissing values for these four variables. The output for mdesc for all variables will look something like this.

ssc install mdesc 
mdesc var1 var2 var3 var4

missingplot command: visualizing missing values distribution

One way to explore the missingness patterns in your data is to use the missingplot command in Stata. This command, developed by Nicholas J. Cox, allows you to draw a plot showing the incidence of missing values in one or more variables in your dataset. The plot can help you to see the distribution and structure of missing values across observations and variables, and to identify potential outliers or anomalies.

To use the missingplot command, you can just type missigplot or specify one or more variables as arguments.

ssc install missingplot
missingplot var1 var2 var3 

How to deal with missing data

When deciding how to deal with missing data, two important things to consider are 1) whether the missing data are systematic (not at random) or non-systematic (at random) and 2) how much of the data (observations) are missing. Depending on these factors, different methods such as listwise deletion, pairwise deletion, single imputation, or multiple imputation can be applied to deal with the missing data.

  • Listwise deletion: This method removes any row or observation that contains at least one missing value. 
    • Pros: It is simple and easy to implement.
    • Cons: It can reduce the sample size and introduce bias if the missing data are not completely random.
  • Pairwise deletion: This method removes only the specific cells that contain missing values, and uses the remaining data to perform the analysis
    • Pros: It preserves more information than listwise deletion (and still simple and easy).
    • Cons: It can create inconsistencies and complicate the estimation of covariance and correlation matrices.
  • Single imputation: This method fills in the missing values with plausible estimates based on other available information. For example, one can use the mean, median, mode, or regression prediction to impute the missing values. 
    • Pros: It is still simple and easy to implement and does not lose data points compared to listwise and pairwise deletion.
    • Cons: However, a single imputation can underestimate the uncertainty and variability in the data and introduce bias in the statistical inference.
  • Multiple imputation: This method generates multiple possible values for each missing value, based on a probability distribution or a model that accounts for the observed data and the missingness mechanism. Then, it performs the analysis on each of the completed datasets and combines the results into a final estimate and confidence interval. 
    • Pros: Multiple imputation can provide valid and efficient statistical inference.
    • Cons: It requires more computational resources and assumptions than single imputation (but we are okay; we have computers!).

When it is OKAY to use listwise/pairwise deletion: If the data is missing at random (MAR) and the incomplete cases are less than 10% of observations, the rule of thumb is that you are okay to use listwise deletion or pairwise deletion. Here are more established guidelines on cutoff based on the % of missing data.

Young et al. (2011) summarized the recommendations of various studies to provide the following guidelines:

  1. less than 1% missing data: the effect of missing data handling methods is trivial
  2. 1% to 5% missing data: simple methods such as listwise deletion and regression imputation work well
  3. 5% to 15% missing data: sophisticated methods, such as multiple imputations would be selected;
  4. missing data exceeds 15%: imputation results are largely meaningless regardless of the imputation method used because very little can be said about the mechanism through which data are missing.

ALL methods assume that the data are missing at random (MAR), which means that the probability of missing data is unrelated to the value of the missing data itself. If the data are missing not at random (MNAR), which means that the probability of missing data is related to the value of the missing data itself, then both methods can produce biased and invalid results. For example, what if a discrimination variable is missing in 50% of cases of females but only 10% of males? If you remove them with listwise deletion, it will lead to the bias.

⭐ Wait. How can I know if the missing is at random or not at random? Here are more resources.
You can check this resource (slide) to learn more about missing values in Stata: https://statisticalhorizons.com/wp-content/uploads/MD-Stata-Sample-Materials-1.pdf
Further, Multiple Imputation in Stata: (wisc.edu) webpage provides the method to assess the randomness of missing values using the t-tests and logistic regression with the loop.

If the proportion of incomplete cases is low (less than 10%), then both methods may produce similar results and have minimal impact on the statistical inference. However, if the proportion is high (more than 10%), then both methods may lose a lot of information and introduce a lot of error.

⭐ Ref: Langkamp, D. L., Lehman, A., & Lemeshow, S. (2010). Techniques for handling missing data in secondary analyses of large surveys. Academic pediatrics10(3), 205-210.

  • This study suggests a 10% cutoff for % of cases with missing values.
  • There are more strict cutoffs suggesting 5%: Cook, R. M. (2021). Addressing missing data in quantitative counseling research. Counseling Outcome Research and Evaluation12(1), 43-53.

drop if and keep if command: creating analytic sample

Finally, if we want to create an analytic sample that excludes observations with missing values for certain variables (listwise deletion), we can use the command drop if. This command deletes observations or variables from the dataset if the condition is met (after if). For example, if we want to drop observations that have missing values for the variable vitaminc, we can type drop if vitaminc == .. This will delete 378 number of observations. Or, you can use the keep if command instead, to keep specific cases only.

drop if vitaminc==.
keep if race==2 

More references

Missing Values | Stata Learning Modules (ucla.edu)

Handling missing data in Stata: Imputation and likelihood-based approaches (pdf)

How can I see the number of missing values and patterns of missing values in my data file? | Stata FAQ (ucla.edu)

Multiple Imputation in Stata: (wisc.edu)

SAGE Green book on Missing Data | SAGE Publications Ltd

  • June 6, 2023