# [Stata] Creating M-1SD, M, M+1SD ordinal variables from continuous variables

In this blog post, I will show you how to create three-level ordinal variables (low, medium, and high) from continuous variables in Stata, using the mean and standard deviation as cut-off points. This can be useful for various purposes, such as creating **dummy variables for regression analysis**, **grouping observations into different levels of a factor**, or **visualizing the distribution of a variable**.

## Example data

Let’s use the `nhanes2` dataset that comes with Stata as an example. Suppose we are interested in the variable `albumin`, which records serum albumin (g/dL) from a blood test. We can use the `summarize` command to get some descriptive statistics for this variable:

```
webuse nhanes2
sum albumin
```

The output shows that the mean of `albumin` is 4.67 and the standard deviation is 0.33. We can also use the `histogram` command to plot the distribution of `albumin`:

`hist albumin, freq`

The histogram shows that the distribution of `albumin` is roughly symmetric and looks approximately normal, with most values clustered around the mean.
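
If you want a visual reference for that impression, `histogram` can overlay a normal density curve. This is only a minimal sketch and an eyeball check, not a formal normality test; it reloads the example data so that it runs on its own.

```stata
* Reload the example data so the snippet is self-contained
webuse nhanes2, clear
* The normal option overlays a normal density with the same mean and SD as albumin
histogram albumin, freq normal
```

If the bars follow the curve closely, cut-offs at one standard deviation on either side of the mean will leave most observations in the middle group.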

## Creating ordinal variables

Now, let’s create a categorical variable called `albumin_cat` that has three levels: low, medium, and high. We want to assign each observation to one of these levels based on its `albumin` value, using the following rules:

- If `albumin` is less than or equal to the mean minus one standard deviation (M-1SD), then `albumin_cat` is low.
- If `albumin` is greater than M-1SD and less than or equal to the mean plus one standard deviation (M+1SD), then `albumin_cat` is medium.
- If `albumin` is greater than M+1SD, then `albumin_cat` is high.
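
With the rounded values reported earlier (mean 4.67, SD 0.33), these cut-offs work out to roughly 4.67 - 0.33 = 4.34 and 4.67 + 0.33 = 5.00. As a quick sketch, you can let Stata display the unrounded cut-offs from the results that `summarize` stores in `r(mean)` and `r(sd)`:

```stata
* Sketch: display the two cut-offs implied by the rules above
webuse nhanes2, clear
quietly summarize albumin
display "M-1SD = " %5.2f (r(mean) - r(sd))
display "M+1SD = " %5.2f (r(mean) + r(sd))
```

Looking at the two numbers first makes it easier to judge later whether the groups in the cross-tabulation look right.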

To implement these rules in Stata, we can use the `generate` and `replace` commands. **Please note that `sum albumin` needs to be run before doing this**, because the cut-offs are read from the `r(mean)` and `r(sd)` results that it stores. If not, the new variable will not be created correctly. You can adapt these commands to your own variable name 🪄

```
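* summarize stores the mean and SD in r(mean) and r(sd)
sum albumin
* low: at or below M-1SD (missing values are never <= a number, so they stay missing)
gen albumin_cat = 1 if albumin <= (r(mean)-r(sd))
* medium: above M-1SD, up to and including M+1SD
replace albumin_cat = 2 if albumin > (r(mean)-r(sd)) & albumin <= (r(mean)+r(sd))
* high: above M+1SD, excluding missing values (Stata treats them as larger than any number)
replace albumin_cat = 3 if albumin > (r(mean)+r(sd)) & !missing(albumin)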
```

Then, you can check whether the observations were grouped correctly by cross-tabulating the two variables with the `tab` command, including missing values.

`tab albumin albumin_cat, m`
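
Because `albumin` takes many distinct values, that cross-tabulation can get long. As an alternative sketch, assuming `albumin_cat` has already been created in memory with the commands above, you can summarize the range of `albumin` within each category and let `assert` confirm how missing values were handled:

```stata
* Assumes albumin_cat already exists in memory (created with the commands above)
* Range of albumin within each category: the ranges should not overlap
tabstat albumin, by(albumin_cat) statistics(n min max)
* albumin_cat should be missing exactly when albumin is missing
assert missing(albumin_cat) == missing(albumin)
```

`assert` prints nothing when the condition holds for every observation and stops with an error otherwise, so silence here is the good outcome.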

Finally, we can use the `label define`, `label values`, and `label variable` commands to assign meaningful labels to the values of `albumin_cat` and to the variable itself. Then, you can check that it is labeled correctly using `tab`, or the user-written `fre` command (installable from SSC), for a univariate frequency table 🙌

```
label define albumin_cat_lab 1 "low" 2 "medium" 3 "high"
label values albumin_cat albumin_cat_lab
label variable albumin_cat "serum albumin (g/dL) - 3 categories"
```
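
If you need the same three categories for more than one variable, the steps above can be wrapped in a loop. The following is only a sketch under a few assumptions: `weight` is used as a second example variable from `nhanes2`, and the `_cat3` suffix, the `cat3_lab` label name, and the scalar names `cut_low` and `cut_high` are hypothetical names chosen for illustration. The cut-offs are copied into scalars so that no later command can overwrite `r(mean)` and `r(sd)` before they are used.

```stata
* Sketch: apply the same M-1SD / M+1SD rules to several variables at once
webuse nhanes2, clear
* One shared value label for all new variables (hypothetical name)
label define cat3_lab 1 "low" 2 "medium" 3 "high"
foreach v of varlist albumin weight {
    quietly summarize `v'
    * Copy the cut-offs into scalars right after summarize
    scalar cut_low  = r(mean) - r(sd)
    scalar cut_high = r(mean) + r(sd)
    * Same rules as above: low, medium, high; missing values stay missing
    generate byte `v'_cat3 = 1 if `v' <= scalar(cut_low)
    replace `v'_cat3 = 2 if `v' > scalar(cut_low) & `v' <= scalar(cut_high)
    replace `v'_cat3 = 3 if `v' > scalar(cut_high) & !missing(`v')
    label values `v'_cat3 cat3_lab
    * Univariate frequency table, including missing values
    tabulate `v'_cat3, missing
}
```

Run the block from a do-file in one go; to use it on your own data, replace the `webuse` line and the variable list after `varlist`.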