[Stata] Data cleaning 4: Creating binary/dichotomous, ordinal, and interval variables (tab, egen) 

One of the common tasks in data cleaning is to create new variables from existing ones, such as binary/dichotomous variables, ordinal variables, and interval variables. In this post, we will show how to use Stata commands tab and egen to create these types of variables from the webuse nhanes2 dataset.

The webuse nhanes2 dataset contains data from the second National Health and Nutrition Examination Survey (NHANES II), conducted in the United States from 1976 to 1980. It has information on 10,351 individuals, including their demographic characteristics, health status, and laboratory results. You can load the dataset in Stata by typing:

Stata
webuse nhanes2
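Before recoding anything, it can help to confirm that the source variables used in this post are present and to glance at their ranges. The following is a minimal sketch, assuming you are happy to discard any data currently in memory (that is what the clear option does):

```stata
* Load the example data, replacing anything currently in memory
webuse nhanes2, clear

* Confirm the number of observations and inspect the source variables
count
describe hgb race bmi age
summarize hgb bmi age
```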

Continuous variable -> Binary variable

To create a binary/dichotomous variable, we assign a value of 0 or 1 to each observation based on a condition. For example, suppose we want to create a variable called anemia that indicates whether an individual has anemia. We can use the existing variable hgb, which measures the hemoglobin level in the blood in g/dL. The World Health Organization defines anemia as a hemoglobin level below 12 g/dL for non-pregnant women and below 13 g/dL for men; to keep the example simple, we apply the 12 g/dL threshold to everyone. We can use the following commands to create the anemia variable:

Stata
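gen anemia = 1 if hgb < 12
* Stata treats missing values as larger than any number, so exclude them here
replace anemia = 0 if hgb >= 12 & !missing(hgb)

These commands assign a value of 1 to anemia if hgb is less than 12 g/dL and a value of 0 if it is 12 g/dL or higher; observations with a missing hgb stay missing instead of being coded 0. We can use the tab command (or the user-written fre command, installable with ssc install fre) to check the frequency distribution of the new variable, for example with tab anemia, missing.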
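The same variable can also be built in a single line, because a logical expression in Stata evaluates to 1 when true and 0 when false. This is only a sketch of an equivalent approach, starting from a fresh copy of the data:

```stata
webuse nhanes2, clear

* A logical expression evaluates to 1 (true) or 0 (false);
* the if qualifier leaves anemia missing where hgb is missing
gen byte anemia = hgb < 12 if !missing(hgb)

* Frequency table, with missing values shown as their own row
tab anemia, missing
```

Storing the result as a byte keeps the dataset small, which is a reasonable default for 0/1 indicators.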

Categorical variable -> Binary variable: Dummy coding

Creating binary variables from a categorical variable, which is called dummy coding, is much easier in Stata. Dummy coding represents a categorical variable as a set of binary variables, one per category, each indicating the presence or absence of that category. For example, suppose we want to create dummy variables from the variable race, which has three levels: white, black, and other. The following command creates three dummy variables, one each for white, black, and other, where each variable has a value of 1 if the individual belongs to that race and 0 otherwise.

Stata
tab race, gen(racedum)

Stata then creates three variables named racedum1, racedum2, and racedum3 and adds them to the end of the variable list in the Variables window. You can replace racedum with any prefix you find intuitive.
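To check that the dummies line up with the original categories, a cross-tabulation and a short listing are usually enough. The sketch below also shows factor-variable notation (i.race) as an alternative that needs no new variables; the outcome bpsystol is used purely as an illustration and is not part of the original example:

```stata
webuse nhanes2, clear
tab race, gen(racedum)

* Each dummy should be 1 only within its own race category
tab race racedum1
list race racedum1 racedum2 racedum3 in 1/5

* Factor-variable notation builds the indicators on the fly,
* so no new variables are needed for estimation
regress bpsystol i.race
```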

Continuous variable -> Ordinal Variable

To create an ordinal variable, we need to assign a value that reflects the order or rank of each observation based on a criterion.

Example 1. Based on known cut-off scores

For example, suppose we want to create a variable called bmi_cat that categorizes the body mass index (BMI) of each individual into four groups: underweight, normal weight, overweight, and obese. We can use the existing variable bmi, which measures the BMI in kg/m². According to the World Health Organization, a BMI below 18.5 indicates underweight, a BMI from 18.5 to below 25 indicates normal weight, a BMI from 25 to below 30 indicates overweight, and a BMI of 30 or above indicates obesity. Therefore, we can use the following commands to create the bmi_cat variable:

Stata
gen bmi_cat = .
replace bmi_cat = 1 if bmi < 18.5
replace bmi_cat = 2 if bmi >= 18.5 & bmi < 25
replace bmi_cat = 3 if bmi >= 25 & bmi < 30
* Missing values sort above every number, so exclude them explicitly
replace bmi_cat = 4 if bmi >= 30 & !missing(bmi)

These commands first set bmi_cat to missing for all observations and then replace it with a value of 1, 2, 3, or 4 depending on the BMI range; an observation with a missing bmi keeps a missing bmi_cat. We can use the fre or tab command again to check the frequency distribution of the new variable, for example with tab bmi_cat, missing.
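Numeric codes are easier to read once value labels are attached. The following sketch rebuilds the same four categories in one statement with nested cond() calls and then labels them; the label name bmi_lbl is a hypothetical choice, not something from the original post:

```stata
webuse nhanes2, clear

* Same cut-offs as above, written as one nested cond() call
gen byte bmi_cat = cond(bmi < 18.5, 1, cond(bmi < 25, 2, cond(bmi < 30, 3, 4))) if !missing(bmi)

* Attach value labels (bmi_lbl is a hypothetical label name)
label define bmi_lbl 1 "Underweight" 2 "Normal weight" 3 "Overweight" 4 "Obese"
label values bmi_cat bmi_lbl

tab bmi_cat, missing
```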

We can see that about half of the individuals in the dataset have normal weight, while about one-third are overweight and one-sixth are obese.

Example 2. Using the distribution and egen command

If there is no known (or validated) cut-off score, another option is to build the ordinal variable from the distribution of the original variable (here, bmi) with egen and its cut() function. The group() option sets the number of categories, which are formed to contain roughly equal numbers of observations. I generally then use tabstat to explore the minimum, maximum, and mean of the original variable within each category. Running fre or tab on the new variable shows (almost) equal frequencies across the categories.

Stata
egen bmi_cat2 = cut(bmi), group(4)
tabstat bmi, stats(min max mean) by(bmi_cat2)
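Note that cut() with group(4) codes the categories 0 through 3, not 1 through 4. If you prefer codes starting at 1, xtile is a common alternative; the sketch below compares the two, with bmi_q4 as a hypothetical variable name. The two groupings need not agree exactly for observations sitting on a quartile boundary:

```stata
webuse nhanes2, clear

* cut() with group(4) codes the groups 0, 1, 2, 3
egen bmi_cat2 = cut(bmi), group(4)
tab bmi_cat2

* xtile is an alternative that codes quartiles 1 through 4
xtile bmi_q4 = bmi, nq(4)
tab bmi_cat2 bmi_q4
```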

Continuous variable -> Interval Variable

To create an interval variable, we group a continuous variable into intervals of equal width, so that each value marks a fixed distance along the original scale. For example, suppose we want to create a variable called age_group that groups the age of each individual into five-year intervals. We can use the existing variable age, which measures age in years, and the egen command with the cut() function to create the age_group variable:

Stata
egen age_group = cut(age), at(20(5)80)

This command creates a variable called age_group that takes the values 20, 25, 30, …, 75, each being the lower bound of its interval: [20,25), [25,30), …, [75,80). The at() option specifies the cut points for the intervals, and the notation 20(5)80 means starting at 20 and increasing by 5 up to 80. The final cut point, 80, only closes the last interval, so any age below 20 or at 80 and above would be set to missing. You can use the tab command once more to check the frequency distribution of the new variable. We can see that most of the individuals in the dataset are between the ages of 25 and 60, with fewer above or below that range.
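A quick way to confirm that the intervals behave as described is to assert that every age falls inside its assigned interval. The sketch below also shows the icodes and label options of cut(), which store consecutive integer codes and label them with the lower bounds; age_group2 is a hypothetical variable name:

```stata
webuse nhanes2, clear
egen age_group = cut(age), at(20(5)80)

* Each age should lie in [age_group, age_group + 5)
assert age >= age_group & age < age_group + 5 if !missing(age_group)
tab age_group, missing

* icodes stores 0, 1, 2, ... instead of the lower bounds;
* label attaches the lower bounds as value labels
egen age_group2 = cut(age), at(20(5)80) icodes label
tab age_group2
```

If the assert line runs without an error message, every non-missing age_group value matches its five-year interval.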

  • June 4, 2023