[Stata] Creating M-1SD, M, M+1SD ordinal variables from continuous variables
In this blog post, I will show you how to create three-level ordinal variables (low, medium, and high) from continuous variables in Stata, using the mean and standard deviation as cut-off points. This can be useful for various purposes, such as creating dummy variables for regression analysis, grouping observations into levels of a factor, or visualizing the distribution of a variable, as in the example below.
Example data
Let’s use the nhanes2 example dataset, which Stata can load with webuse. Suppose we are interested in the variable albumin, which measures serum albumin (g/dL) from the blood test. We can use the summarize command to get descriptive statistics for this variable:
webuse nhanes2
sum albumin
The output shows that the mean of albumin is 4.67 and the standard deviation is 0.33. We can also use the histogram command to plot the distribution of albumin:
hist albumin, freq
The histogram shows that the distribution of albumin is roughly symmetric and looks approximately normal, with most values clustered around the mean.
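To make the "looks approximately normal" impression easier to judge, one option is to overlay a normal density on the histogram and look at the skewness and kurtosis. This is only a rough visual and numeric check under the assumption that a quick look is enough here, not a formal normality test.

```stata
webuse nhanes2, clear

* Histogram of frequencies with a normal density curve overlaid
histogram albumin, frequency normal

* Detailed summary: skewness near 0 and kurtosis near 3 suggest a normal-like shape
summarize albumin, detail
display "skewness = " %5.2f r(skewness) "  kurtosis = " %5.2f r(kurtosis)
```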
Creating ordinal variables
Now, let’s create a categorical variable called albumin_cat that has three levels: low, medium, and high. We want to assign each observation to one of these levels based on its albumin value, using the following rules:
- If albumin is less than or equal to the mean minus one standard deviation (M-1SD), then albumin_cat is low.
- If albumin is greater than M-1SD and less than or equal to the mean plus one standard deviation (M+1SD), then albumin_cat is medium.
- If albumin is greater than M+1SD, then albumin_cat is high.
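To see what these rules mean in actual units, the two cut-offs can be computed and displayed first. The sketch below is illustrative: the scalar names cut_lo and cut_hi are my own and are not part of the original commands. With the rounded values reported above (mean 4.67, SD 0.33), the cut-offs come out at roughly 4.34 and 5.00.

```stata
webuse nhanes2, clear
quietly summarize albumin

* cut_lo and cut_hi are hypothetical names chosen for this sketch
scalar cut_lo = r(mean) - r(sd)
scalar cut_hi = r(mean) + r(sd)

display "M-1SD = " %5.2f scalar(cut_lo)
display "M+1SD = " %5.2f scalar(cut_hi)
```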
To implement these rules in Stata, we can use the generate and replace commands. Please note that the summarize command must be run on the variable first, with no other r-class command in between, because the commands below read the r(mean) and r(sd) results it stores. If it is not, the new variable will not be created correctly. To use this with your own data, substitute your variable name 🪄
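* r(mean) and r(sd) come from the summarize call directly below
sum albumin
gen albumin_cat = 1 if albumin <= (r(mean)-r(sd))
replace albumin_cat = 2 if albumin > (r(mean)-r(sd)) & albumin <= (r(mean)+r(sd))
* Missing values count as larger than any number, so exclude them explicitly
replace albumin_cat = 3 if albumin > (r(mean)+r(sd)) & albumin!=.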
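As an alternative sketch, the same three-way split can be written in one line with Stata's irecode() function, after saving the cut-offs in scalars so they do not depend on r() staying untouched. The names cut_lo, cut_hi, and albumin_cat2 are hypothetical and chosen only for this example. Note that irecode() puts a value exactly equal to a cut-off in the lower group, which follows the "less than or equal to" wording of the rules listed earlier.

```stata
webuse nhanes2, clear
quietly summarize albumin

* Save the cut-offs so later r-class commands cannot overwrite them
scalar cut_lo = r(mean) - r(sd)
scalar cut_hi = r(mean) + r(sd)

* irecode() returns 0 if x <= cut_lo, 1 if cut_lo < x <= cut_hi,
* 2 if x > cut_hi, and missing if x is missing; adding 1 gives codes 1/2/3
generate byte albumin_cat2 = 1 + irecode(albumin, scalar(cut_lo), scalar(cut_hi))
```

Because missing albumin values stay missing in albumin_cat2, no separate missing-value condition is needed in this variant.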
Then, you can check whether the observations were grouped correctly by cross-tabulating albumin against albumin_cat with the tab command. The m option includes missing values in the table.
tab albumin albumin_cat, m
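The cross-tabulation lists every distinct albumin value, which can get long for other variables. A more compact check, sketched here as one possible approach, is to look at the count, minimum, and maximum of albumin within each category with tabstat: the ranges should not overlap and should sit on the expected side of each cut-off.

```stata
* Rebuild albumin_cat so this sketch runs on its own
webuse nhanes2, clear
quietly summarize albumin
generate byte albumin_cat = 1 + irecode(albumin, r(mean) - r(sd), r(mean) + r(sd))

* Count, minimum, and maximum of albumin within each category
tabstat albumin, by(albumin_cat) statistics(n min max) missing
```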
Finally, we can use the label variable, label define, and label values commands to assign meaningful labels to albumin_cat and its values. Then, you can check whether it is labeled correctly using the tab command, or the fre command (community-contributed; install it with ssc install fre) for a univariate frequency table 🙌
label define albumin_cat_lab 1 "low" 2 "medium" 3 "high"
label values albumin_cat albumin_cat_lab
label variable albumin_cat "serum albumin (g/dL) - 3 categories"
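The labeling check could look something like the sketch below. It rebuilds and labels albumin_cat from scratch so that it runs on its own; if you have already run the commands above, only the last three commands are needed. The fre line is left as a comment because fre has to be installed separately.

```stata
* Rebuild and label albumin_cat so this sketch runs on its own
webuse nhanes2, clear
quietly summarize albumin
generate byte albumin_cat = 1 + irecode(albumin, r(mean) - r(sd), r(mean) + r(sd))
label define albumin_cat_lab 1 "low" 2 "medium" 3 "high"
label values albumin_cat albumin_cat_lab

* Show the value-label mapping, then the frequencies with and without labels
label list albumin_cat_lab
tabulate albumin_cat, missing
tabulate albumin_cat, missing nolabel

* With fre installed (ssc install fre), codes and labels appear side by side:
* fre albumin_cat
```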