[Stata] Creating M-1SD, M, M+1SD ordinal variables from continuous variables

In this blog post, I will show you how to create three-level ordinal variables (low, medium, and high) from continuous variables in Stata, using the mean minus and plus one standard deviation as cut-off points. This can be useful for various purposes, such as creating dummy variables for regression analysis, grouping observations into levels of a factor, or visualizing the distribution of a variable.

Example data

Let’s use nhanes2, one of Stata’s example datasets that the webuse command downloads, as an example. Suppose we are interested in the variable albumin, which records serum albumin (g/dL) from a blood test. We can use the summarize command (abbreviated sum) to get some descriptive statistics for this variable:

webuse nhanes2 
sum albumin 

The output shows that the mean of albumin is 4.67 and the standard deviation is 0.33. We can also use the histogram command to plot the distribution of albumin:

hist albumin, freq

The histogram shows that the distribution of albumin is roughly symmetric and looks approximately normal, with most values clustered around the mean.
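To see where the cut-offs would fall on this distribution, one option is to overlay a normal curve and draw vertical lines at M-1SD and M+1SD. This is only a sketch: the local macro names lo and hi are my own choice, and it assumes nhanes2 is already loaded.

```stata
* Sketch: histogram with a normal curve and the two cut-offs marked
quietly summarize albumin
* lo and hi are hypothetical macro names for M-1SD and M+1SD
local lo = r(mean) - r(sd)
local hi = r(mean) + r(sd)
histogram albumin, frequency normal xline(`lo' `hi')
```

Run these lines together in one go, because local macros disappear once a do-file (or a selected chunk of one) finishes running.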

Creating ordinal variables

Now, let’s create a categorical variable called albumin_cat that has three levels: low, medium, and high. We want to assign each observation to one of these levels based on its albumin value, using the following rules:

  • If albumin is less than or equal to the mean minus one standard deviation (M-1SD), then albumin_cat is low.
  • If albumin is greater than M-1SD and less than or equal to the mean plus one standard deviation (M+1SD), then albumin_cat is medium.
  • If albumin is greater than M+1SD, then albumin_cat is high.
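As a quick check on these rules, the two cut-offs can be printed from the results that summarize stores. With the rounded values reported above, they come out to roughly 4.67 - 0.33 = 4.34 and 4.67 + 0.33 = 5.00; the unrounded values will differ slightly. A minimal sketch, assuming nhanes2 is loaded:

```stata
* Print the two cut-offs using summarize's stored results
quietly summarize albumin
display "M-1SD = " r(mean) - r(sd)
display "M+1SD = " r(mean) + r(sd)
```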

To implement these rules in Stata, we can use the generate and replace commands. Please note that the summarize command must be run right before them: the code reads the mean and standard deviation from r(mean) and r(sd), which summarize stores, so without it the new variable will not be created correctly. You can adapt the commands by swapping in your own variable name 🪄

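* Run summarize first: it leaves the mean and SD in r(mean) and r(sd)
sum albumin
* Low: at or below M-1SD (missing values fail this test and stay missing)
gen albumin_cat = 1 if albumin <= (r(mean)-r(sd))
* Medium: above M-1SD, up to and including M+1SD
replace albumin_cat = 2 if albumin > (r(mean)-r(sd)) & albumin <= (r(mean)+r(sd))
* High: above M+1SD; Stata treats missing as larger than any number, so exclude it
replace albumin_cat = 3 if albumin > (r(mean)+r(sd)) & !missing(albumin)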
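If you need the same treatment for several variables, the steps above can be wrapped in a loop. The following is a sketch under assumptions rather than a definitive recipe: the variable list (albumin and hgb, which I believe is the hemoglobin variable in nhanes2), the _cat3 suffix, and the temporary scalar names lo and hi are all my own choices for illustration.

```stata
* Sketch: apply the same M-1SD / M+1SD rules to several variables at once.
* The varlist (albumin hgb) is only an example; swap in your own variables.
foreach v of varlist albumin hgb {
    quietly summarize `v'
    * Copy the cut-offs into temporary scalars so that a later
    * r-class command cannot overwrite them
    tempname lo hi
    scalar `lo' = r(mean) - r(sd)
    scalar `hi' = r(mean) + r(sd)
    * The _cat3 suffix is hypothetical; it avoids clashing with albumin_cat
    generate byte `v'_cat3 = 1 if `v' <= `lo'
    replace `v'_cat3 = 2 if `v' > `lo' & `v' <= `hi'
    replace `v'_cat3 = 3 if `v' > `hi' & !missing(`v')
}
```

Storing the cut-offs in scalars means the code no longer depends on r(mean) and r(sd) still being intact when each replace runs, and the byte storage type is enough for a variable that only takes the values 1, 2, and 3.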

Then, you can check that the observations were grouped as intended by cross-tabulating the original and new variables with the tab command; the m option includes missing values.

tab albumin albumin_cat, m
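The cross-tabulation has one row per distinct albumin value, so it can get long. A more compact check, sketched here on the assumption that albumin_cat was created as above, is to look at the range of albumin within each category and to assert that the new variable is missing exactly when albumin is missing:

```stata
* Compact check: number of observations and range of albumin in each category
tabstat albumin, by(albumin_cat) statistics(n min max)
* The category should be missing exactly when albumin is missing
assert missing(albumin_cat) == missing(albumin)
```

The ranges should not overlap, and assert stays silent when the condition holds for every observation.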

Finally, we can use the label define, label values, and label variable commands to assign meaningful labels to the values of albumin_cat and to the variable itself. You can then check the labels with tab, or with the community-contributed fre command (installable with ssc install fre), which produces a univariate frequency table 🙌

label define albumin_cat_lab 1 "low" 2 "medium" 3 "high"
label values albumin_cat albumin_cat_lab 
label variable albumin_cat "serum albumin (g/dL) - 3 categories"
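To confirm that the labels were attached, a short check along these lines may help. It uses only built-in commands and assumes the label name albumin_cat_lab defined above:

```stata
* Frequency table of the new variable, with value labels and missing values shown
tabulate albumin_cat, missing
* Show the variable label and which value label is attached
describe albumin_cat
* List the value-to-text mapping stored in the value label
label list albumin_cat_lab
```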

  • June 18, 2023