[Stata] Data cleaning 5: Working with scales and creating summary variables (gen, egen, and alpha)

Scales are composite measures that combine multiple items or questions into a single score. They are often used in social science research to measure latent constructs such as attitudes, beliefs, or preferences. For example, a researcher might use a scale to measure life satisfaction, or depression (with the PHQ-9, for instance).

To create and analyze scales in Stata, we need commands that can manipulate variables and calculate summary statistics. In this post, we will introduce three such commands (gen, egen, and alpha), using the GAD-7 scale as an example.

The GAD-7 scale is a popular tool for assessing anxiety. It consists of seven items that measure the frequency of various symptoms of generalized anxiety disorder in the past two weeks. Each item is rated on a 4-point scale from 0 (not at all) to 3 (nearly every day). The total score ranges from 0 to 21, with higher scores indicating more severe anxiety.
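Before building a composite score, it can help to confirm that every item actually falls in the 0 to 3 range described above. The following is only a minimal sketch: the three respondents are made-up toy data, and the variable names gad1 to gad7 are assumed to match your dataset.

```stata
* Toy data: three hypothetical respondents (values are invented for illustration)
clear
input gad1 gad2 gad3 gad4 gad5 gad6 gad7
0 1 2 3 0 1 2
3 3 2 1 0 0 1
1 . 2 2 1 0 3
end

* Each item should be 0, 1, 2, 3, or missing.
* assert stops with an error if any observation violates the condition.
foreach v of varlist gad1-gad7 {
    assert inlist(`v', 0, 1, 2, 3) | missing(`v')
}
```

If an out-of-range code (for example 9 or 99 used for "no answer") is present, assert reports how many observations contradict the condition, which is a cue to recode those values to missing first.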

gen command

We can use the gen command (short for generate) to create new variables based on existing ones. For example, we can sum the seven items to get the total score, or divide that sum by seven to get the average score. The two variables need different names, because gen will not overwrite an existing variable:

gen anxiety_total = gad1 + gad2 + gad3 + gad4 + gad5 + gad6 + gad7 // total score
gen anxiety_mean = (gad1 + gad2 + gad3 + gad4 + gad5 + gad6 + gad7) / 7 // average score

You can check the distribution of the composite score using the histogram, codebook, or summarize (abbreviated sum) commands.
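As a rough sketch of that check, the commands below build a total score on a small made-up dataset and then inspect it. The data values and the variable name anx_total are hypothetical and used only for illustration.

```stata
* Toy data: four hypothetical respondents with complete answers
clear
input gad1 gad2 gad3 gad4 gad5 gad6 gad7
0 1 2 3 0 1 2
3 3 2 1 0 0 1
1 0 2 2 1 0 3
0 0 0 1 0 0 0
end

* Total score (anx_total is an illustrative name)
gen anx_total = gad1 + gad2 + gad3 + gad4 + gad5 + gad6 + gad7

* Summary statistics, including percentiles
summarize anx_total, detail

* Range, number of unique values, and number of missing values
codebook anx_total

* Histogram with one bar per integer score, showing frequencies
histogram anx_total, discrete frequency
```

For a GAD-7 total, the minimum and maximum reported by summarize should fall between 0 and 21; anything outside that range points to a coding problem in one of the items.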

egen command

We can also use the egen command to create new variables based on some function of existing ones. For example, the rowtotal() function sums the seven items for each respondent, and the rowmean() function averages them:

egen anxiety_rowtotal = rowtotal(gad1-gad7) // total score
egen anxiety_rowmean = rowmean(gad1-gad7) // average score

Tip. Difference between gen and egen

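The main difference between gen and egen when it comes to missing values is that an arithmetic expression in gen evaluates to missing whenever any of its inputs is missing, while egen functions handle missing values in different ways depending on the function and options used. (Stata does treat missing as larger than any number in comparisons such as x > 3, but that is a separate issue from arithmetic.)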

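The rowmean() function of egen calculates the mean of the values in each row, ignoring any missing values. For example, if you have a dataset with three variables x, y, and z, and you want to create a new variable that is the mean of x, y, and z for each observation, the following two commands will not return the same number of missing values in the composite score. The egen version is missing only when x, y, and z are all missing, and otherwise averages whatever values are present. The gen version is missing whenever at least one of the three is missing.

egen meanxyz_egen = rowmean(x y z) // mean of the nonmissing values
gen meanxyz_gen = (x + y + z) / 3 if !missing(x, y, z) // missing if any of x, y, z is missing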
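A small toy example can make the missing-value behavior concrete. This is only a sketch: the three observations are invented, and the variable names mean_egen and mean_gen are illustrative.

```stata
* Toy data: one complete row, one partially missing row, one fully missing row
clear
input x y z
1 2 3
1 . 3
. . .
end

* rowmean() averages the nonmissing values in each row
egen mean_egen = rowmean(x y z)

* Plain arithmetic in gen is missing if any input is missing
gen mean_gen = (x + y + z) / 3

list

count if missing(mean_egen)   // 1 observation: only the fully missing row
count if missing(mean_gen)    // 2 observations: any row with a missing input
```

In the second row, rowmean() returns 2 (the mean of 1 and 3), while the gen version returns missing. Whether the partial-information mean is acceptable depends on how many items you are willing to let a respondent skip.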

The rowtotal() function of egen has a missing option that controls what happens when every value in the row is missing. By default, rowtotal() treats missing values as 0, so a row in which all values are missing gets a total of 0; with the missing option, such a row gets a missing total instead. Rows with only some missing values are summed over the nonmissing values either way. The count() function of egen, by contrast, has no such option: it counts the nonmissing observations of its expression. To count the nonmissing items within each row, use rownonmiss().
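To see how rowtotal() behaves with and without the missing option, here is a minimal sketch on invented data. The variable names total_default, total_missing, and n_answered are illustrative only.

```stata
* Toy data: one complete row, one partially missing row, one fully missing row
clear
input x y z
1 2 3
1 . 3
. . .
end

* Default: missing values are treated as 0, so the last row totals 0
egen total_default = rowtotal(x y z)

* With the missing option: the last row (all missing) gets a missing total
egen total_missing = rowtotal(x y z), missing

* Number of nonmissing values in each row
egen n_answered = rownonmiss(x y z)

list
```

Here total_default is 6, 4, and 0, while total_missing is 6, 4, and missing. Note that in both versions the second row is summed as if the skipped item were 0, so it is worth checking n_answered before interpreting a total from a partially answered scale.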

In general, egen provides more flexibility and control over how to deal with missing values than gen, but the behavior depends on the specific function and options used. I personally prefer the gen command because it is more conservative: any respondent with a missing item gets a missing score, so I can decide how to deal with those cases later. It is advisable to check the documentation of each egen function for details on how it handles missing values.

alpha command

We can use the alpha command to compute Cronbach’s alpha for the scale formed from the seven items. Cronbach’s alpha is a measure of the internal consistency, or reliability, of a scale. It usually falls between 0 and 1 (it can be negative when items are negatively correlated with each other), with higher values indicating a more reliable scale.

alpha gad1-gad7, item label

With the item option, the output shows a table of statistics for each item, and the label option adds the variable labels to that table. In this example, all items have high item-test and item-rest correlations, indicating that they are measuring the same construct. We can also see that dropping any item would lower Cronbach’s alpha, suggesting that all items are contributing to the reliability of the scale.
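If you want to try the command without real data, the sketch below simulates seven correlated items and runs alpha on them. Everything about the simulation (the seed, the 200 observations, the way items are generated from a shared latent variable) is an assumption made for illustration and does not reflect real GAD-7 responses.

```stata
* Simulated data: 200 hypothetical respondents (not real GAD-7 responses)
clear
set seed 12345
set obs 200

* A shared latent trait drives all seven items, so they correlate with each other
gen latent = rnormal()
forvalues i = 1/7 {
    * Round to an integer and clip to the 0-3 response range
    gen gad`i' = max(0, min(3, round(1.5 + latent + rnormal())))
    label variable gad`i' "GAD-7 item `i' (simulated)"
}

* Item-level table with variable labels
alpha gad1-gad7, item label

* alpha stores the scale reliability coefficient in r(alpha)
scalar a = r(alpha)
display "Cronbach's alpha = " %5.3f a
```

Saving r(alpha) in a scalar is handy when you want to report the coefficient in a table or compare it across subsamples. The exact value printed here depends on the simulated data.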

  • June 5, 2023