[Stata] Data cleaning 3: Recoding / reverse-coding variable (clonevar, gen, replace, recode, and labrec)

Steps to Creating Variables

1. First, investigate the existing data with the codebook or fre command. The codebook command shows basic information on variables (e.g., type, range, unique values, number of missing values).

Unlike the tab command, which shows only the labels, the fre command returns a frequency table that shows the numeric values together with their labels.

Stata
ssc install fre 
fre varname 
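
As a quick illustration, here is a minimal sketch of this first look using built-in commands only. It assumes Stata's nhanes2 example dataset (used later in this post) and picks agegrp purely as an example variable.

Stata
webuse nhanes2, clear
codebook agegrp        // type, range, unique values, number of missing values
tab agegrp             // frequency table showing the labels
tab agegrp, nolabel    // same table showing the underlying numeric values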

2. Then, make a plan for how to construct the variable. Do you want to reverse-code it? Or do you want to collapse a 7-point Likert scale into 5 or 3 categories? It helps to write the plan down as a comment in the do-file.

🤔 What is reverse-coding?

Reverse coding is a process that involves transforming the values or categories of a variable in the opposite direction. It is often used to make the data interpretation easier or more consistent. For example, if you have a survey question that asks how much you agree or disagree with a statement, you might assign 1 to “Strongly Agree” and 5 to “Strongly Disagree”. However, if you have another question that is worded in the opposite way, you might want to reverse the values so that both questions measure the same thing. For instance, if you have these two questions:

  • I like to work in a team. (1 = Strongly Agree, 5 = Strongly Disagree)
  • I prefer to work alone. (1 = Strongly Agree, 5 = Strongly Disagree)

You can see that the first question measures extroversion and the second question measures introversion. To make them consistent, you can reverse the values of the second question so that 1 becomes 5, 2 becomes 4, 3 stays the same, 4 becomes 2, and 5 becomes 1. Then both questions measure extroversion and you can compare or combine them more easily.

For example, the reverse-coding will convert from the upper image to the lower image.
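
For a numeric scale, the mapping described above (1 becomes 5, 2 becomes 4, and so on) is the same as subtracting each value from the sum of the scale's minimum and maximum. The sketch below shows this on a tiny made-up dataset; the variable names q2 and q2_rev are hypothetical and only for illustration.

Stata
* Hypothetical 5-point item "I prefer to work alone", stored as q2
clear
input q2
1
2
3
4
5
end
* Reverse the scale: new value = (max + min) - old value, here 5 + 1 = 6
gen q2_rev = 6 - q2
list q2 q2_rev

Missing values stay missing with this approach, since 6 minus a missing value is still missing.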

3. Next, copy the variable into a new one so that you do not lose the original variable.

Stata
// clonevar is built into Stata, so no installation is needed
// Use ONE of the two lines below (both create newvar)
clonevar newvar = originalvar // clonevar copies the variable WITH its labels
gen newvar = originalvar // gen only copies the values, WITHOUT labels
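
To see the difference between the two commands, here is a small sketch assuming the nhanes2 example dataset. The new variable names agegrp_clone and agegrp_gen are made up for this illustration.

Stata
webuse nhanes2, clear
clonevar agegrp_clone = agegrp   // keeps storage type, variable label, and value labels
gen agegrp_gen = agegrp          // keeps the values only
describe agegrp agegrp_clone agegrp_gen   // compare the "value label" column

In the describe output, the cloned variable should show the same value label as the original, while the generated one has none.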

The gen command creates a new variable based on specified conditions.

The replace command assigns new values for an existing variable based on specified conditions.

Remember to always code your variables intuitively. The usual convention for binary variables is 1 = yes and 0 = no. If you create a variable named gender, what would 1 mean? That is unclear. Name the variable male instead, so that male = 1 means the respondent is male.
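
As a sketch of this convention, the lines below build a 0/1 indicator from the sex variable in the nhanes2 example dataset, where sex is coded 1 = Male and 2 = Female. The label name yesno is an arbitrary choice for this example.

Stata
webuse nhanes2, clear
* 1 if male, 0 if female, missing if sex is missing
gen male = (sex == 1) if !missing(sex)
label define yesno 0 "No" 1 "Yes"
label values male yesno
tab sex male   // check the new indicator against the original variable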

4. Finally, recode the values and attach labels with the label define and label values commands. You can follow the examples below using a sample dataset provided by Stata.

Stata
webuse nhanes2

Example 1: Recoding categories (categorical to categorical)

Let’s say we want to create another categorical variable from the agegrp variable. It currently has 6 categories, but we can create a new variable (agegrp2) with three categories: (1) 20-39, (2) 40-59, and (3) 60+. You can write the plan as a table or as a note in the code.

Label    agegrp    agegrp2 (recoded)
20-29    1         1
30-39    2         1
40-49    3         2
50-59    4         2
60-69    5         3
70+      6         3

We can recode it based on the plan table.

Using recode command: recoding categories within one variable

Stata
gen agegrp2 = agegrp // I used gen instead of clonevar since we don't need the original label
recode agegrp2 (2=1)(3=2)(4=2)(5=3)(6=3)
tab agegrp2 agegrp // check if it is well-recoded to the new group 

Okay, it looks like the variable has been recoded into the new 3-category variable! The last step is to assign a variable label and new value labels.

Stata
label variable agegrp2 "Age Group - 3 categories"
label define agegrp2 1 "20-39" 2 "40-59" 3 "60+" // labels match the recode rules above (5 and 6 both become 3)
label values agegrp2 agegrp2 

Now, you can see the new categorical variable is well-defined! Mission complete 🙌
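
As an alternative sketch, recode can create the new variable itself through its generate() option, which leaves the original variable untouched without a separate gen step. The name agegrp3 is hypothetical, chosen here so that it does not clash with agegrp2.

Stata
webuse nhanes2, clear
* Same rules as above, but written to a new variable in one step
recode agegrp (1 2 = 1) (3 4 = 2) (5 6 = 3), generate(agegrp3)
tab agegrp3 agegrp   // check the mapping against the plan table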

Using replace command: creating a new variable with more complicated rules

If you want to create a new variable from multiple variables, you need to use gen and replace instead of recode, since recode only works within one variable. For example, in the nhanes2 sample data, we can create a new variable, race_gender, from the following two variables: race and sex.

We can plan it as follows.

Label           Value
White Male      1
White Female    2
Black Male      3
Black Female    4
Other Male      5
Other Female    6

In this kind of case, create the new variable with gen and then fill in the remaining categories with replace. Use gen only the first time, when the variable is created. Stata will report “___ missing values generated” because the other categories have not been filled in yet. You can ignore this and keep going with replace until every category is covered.

Stata
gen race_gender = 1 if race == 1 & sex == 1 
  replace race_gender = 2 if race == 1 & sex == 2 
  replace race_gender = 3 if race == 2 & sex == 1 
  replace race_gender = 4 if race == 2 & sex == 2 
  replace race_gender = 5 if race == 3 & sex == 1 
  replace race_gender = 6 if race == 3 & sex == 2 

Now, you can check that the race-by-gender variable is coded correctly. The final step is to label the variable and its values according to the plan.

Stata
label variable race_gender "Race by gender category"
label define race_gender 1 "White Male" 2 "White Female" 3 "Black Male" 4 "Black Female" 5 "Other Male" 6 "Other Female"
label values race_gender race_gender 

With the fre command, you can check that the variable is well-coded with labels🙂
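
When the new variable is simply every combination of two existing variables, a shorter route is egen with the group() function, sketched below on the same dataset. The name race_gender2 is hypothetical. Note that group() numbers the combinations in sorted order of race and then sex and builds the labels from the existing value labels, so compare the result against your plan before relying on it.

Stata
webuse nhanes2, clear
* One category per observed combination of race and sex
egen race_gender2 = group(race sex), label
tab race_gender2            // labels built from the race and sex labels
tab race_gender2, nolabel   // the underlying numeric codes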

Tip. Operators in Stata

When creating variables with gen and replace, you will use the following operators. In Stata, == tests whether two values are equal, while a single = assigns a value. Relational operators are especially useful when you work with continuous variables.

Stata
	// These operators will be useful in creating variables. 

	/*                                                         Relational
         Arithmetic              Logical            (numeric and string)
    --------------------     ------------------     ---------------------
     +   addition                &   and               >   greater than
     -   subtraction             |   or                <   less than
     *   multiplication          !   not               >=  > or equal
     /   division                ~   not               <=  < or equal
     ^   power                                         ==  equal
     -   negation                                      !=  not equal
     +   string concatenation                          ~=  not equal          */
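
Here is a small sketch of relational and logical operators at work on a continuous variable, using age from the nhanes2 example dataset. The cutoffs and the name age3 are illustrative assumptions only. One thing to keep in mind is that Stata treats missing values as larger than any number, so a condition like age >= 60 would also match missing values unless they are excluded.

Stata
webuse nhanes2, clear
gen age3 = 1 if age < 40
replace age3 = 2 if age >= 40 & age < 60
replace age3 = 3 if age >= 60 & !missing(age)   // keep missing ages out of group 3
tab age3, missing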

Example 2: Reverse coding

In this example, the variable needs to be reverse-coded. To reverse the order of a variable's categories, first define the value labels in the new order with label define and attach them with label values.

Stata
la def likert5 1 "Poor" 2 "Fair" 3 "Good" 4 "Very Good" 5 "Excellent"
la val varname likert5

The variable now displays the reversed labels, but the underlying values have not changed yet, so they need to be recoded as well.

Stata
recode varname (5=1)(4=2)(3=3)(2=4)(1=5)

Finally, we can see that the variable is correctly reverse-coded, with matching labels and values.
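
To check the whole procedure end to end, here is a self-contained sketch on a tiny made-up dataset. The variable rating, its original label rating_lbl, and the copy rating_rev are all hypothetical. Working on a copy keeps the original available for a cross-tabulation at the end.

Stata
* Hypothetical item where 1 = Excellent and 5 = Poor
clear
input rating
1
2
3
4
5
end
label define rating_lbl 1 "Excellent" 2 "Very Good" 3 "Good" 4 "Fair" 5 "Poor"
label values rating rating_lbl
* Reverse-code a copy so that the original is preserved
clonevar rating_rev = rating
recode rating_rev (5=1)(4=2)(3=3)(2=4)(1=5)
label define likert5 1 "Poor" 2 "Fair" 3 "Good" 4 "Very Good" 5 "Excellent"
label values rating_rev likert5
tab rating rating_rev   // each original label should pair with the same label

If any row pairs with a different label in the cross-tabulation, the values and labels have fallen out of sync.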

Tip. labrec command for reverse-coding

However, doing this in multiple steps while keeping track of the value labels is confusing and error-prone. You can do it in one line with labrec, a user-written command that recodes a variable together with its value labels.

Stata
ssc install labrec // install it the first time you use it
labrec varname (5=1)(4=2)(3=3)(2=4)(1=5)

This command reverse-codes the variable in a single line, without the label define and label values steps! This is really handy 🙂

  • June 3, 2023