[Stata] Data cleaning 3: Recoding / reverse-coding variable (clonevar, gen, replace, recode, and labrec)
Steps to Creating Variables
1. First, investigate the existing data with codebook or fre. The codebook command shows basic information on variables (e.g., type, range, unique values, number of missing values).
Compared to the tab command, the fre command returns a frequency table that shows both the values and their labels.
ssc install fre
fre varname
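As a quick comparison that does not require fre to be installed, here is a minimal sketch that inspects one labelled variable three ways. It uses Stata's bundled auto dataset and its foreign variable, which are my choice for illustration and not part of the original example.

```stata
sysuse auto, clear      // example dataset shipped with Stata
codebook foreign        // type, range, unique values, number of missing values
tab foreign             // frequencies listed by value label only
tab foreign, nolabel    // frequencies listed by numeric value only
```

Running fre foreign afterwards would show the values and the labels side by side in a single table, which is why it is convenient for planning a recode.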
2. Then, make a plan for how to construct the variable. Do you want to reverse-code it? Or do you want to collapse a 7-point Likert scale into 5 or 3 categories? It helps to write the plan down as a comment in the do-file.
3. The next step is to copy the variable or generate a new one, so that the original variable is not lost. Use one of the two options below.
// clonevar ships with Stata, so no installation is needed
clonevar newvar = originalvar // option 1: clonevar copies the variable WITH its labels
gen newvar2 = originalvar // option 2: gen copies only the values, WITHOUT labels
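To see the difference between the two ways of copying, here is a small sketch on Stata's bundled auto dataset. The dataset and the names foreign_clone and foreign_gen are assumptions made for illustration only.

```stata
sysuse auto, clear
clonevar foreign_clone = foreign   // copies values, variable label, and value label
gen foreign_gen = foreign          // copies the values only
describe foreign foreign_clone foreign_gen   // compare the "value label" column
tab foreign_clone                  // rows shown with their labels
tab foreign_gen                    // rows shown as bare numbers
```

The describe output should show a value label attached to foreign and foreign_clone but not to foreign_gen.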
The gen command creates a new variable based on specified conditions.
The replace command assigns new values to an existing variable based on specified conditions.
Remember to always code your variables intuitively. The usual convention for yes/no variables is 1=yes and 0=no. If you create a variable named gender, what would 1 mean? That is unclear. If you name it male instead, male=1 clearly means the respondent is male.
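A minimal sketch of this naming convention, using the nhanes2 sample data. It assumes sex is coded 1 = Male and 2 = Female (worth confirming with codebook sex); the names male and yesno are illustrative choices.

```stata
webuse nhanes2, clear
// Assumption: sex is coded 1 = Male, 2 = Female in nhanes2
gen male = (sex == 1) if !missing(sex)   // 1 = yes, 0 = no, missing stays missing
label variable male "Respondent is male"
label define yesno 0 "No" 1 "Yes"
label values male yesno
tab sex male                              // check the new indicator against the source
```

The if !missing(sex) qualifier keeps respondents with unknown sex from being silently coded as 0.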
The final step is to recode the values and then attach labels with the label define and label values commands. You can follow the examples below using a sample dataset provided by Stata.
webuse nhanes2
Example 1: Recoding categories (categorical to categorical)
Let’s say I want to create another categorical variable from the agegrp variable. It currently has 6 categories, but we can create another variable (agegrp2) with three categories: (1) 20-39, (2) 40-69, and (3) 70+. You can write the plan as a table or as a note in the code.
Label | agegrp | agegrp2 (recoded) |
20-29 | 1 | 1 |
30-39 | 2 | 1 |
40-49 | 3 | 2 |
50-59 | 4 | 2 |
60-69 | 5 | 2 |
70+ | 6 | 3 |
We can recode it based on the plan table.
Using recode command: recoding categories within one variable
gen agegrp2 = agegrp // I used gen instead of clonevar since we don't need the original label
recode agegrp2 (2=1)(3=2)(4=2)(5=2)(6=3)
tab agegrp2 agegrp // check if it is well-recoded to the new group
Okay, it is recoded into the new 3-category variable! The last step is to assign a variable label and new value labels.
label variable agegrp2 "Age Group - 3 categories"
label define agegrp2 1 "20-39" 2 "40-69" 3 "70+"
label values agegrp2 agegrp2
Now, you can see the new categorical variable is well-defined! Mission complete 🙌
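As an alternative sketch, recode can create the new variable and its value labels in one step through its generate() and label() options, which leaves agegrp untouched. The names agegrp3 here are hypothetical, chosen to avoid clashing with agegrp2 above.

```stata
webuse nhanes2, clear
recode agegrp (1 2 = 1 "20-39") (3/5 = 2 "40-69") (6 = 3 "70+"), ///
    generate(agegrp3) label(agegrp3)
label variable agegrp3 "Age Group - 3 categories"
tab agegrp agegrp3   // each original category should fall in exactly one new group
```

Writing the labels next to each rule keeps the plan and the code in one place, so the labels are less likely to drift away from the values they describe.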
Using replace command: creating a new variable with more complicated rules
If you want to create a new variable from multiple variables, you need to use gen and replace instead of recode, since recode only works within one variable. For example, in the nhanes2 sample data, we can create a new categorical variable, race_gender, from the following two variables: race and sex.
We can plan it as follows.
Label | Value |
White Male | 1 |
White Female | 2 |
Black Male | 3 |
Black Female | 4 |
Other Male | 5 |
Other Female | 6 |
In this kind of case, we need to create the new variable with the gen and replace commands. Use gen the first time, to create the variable. It will report “___ missing values generated” since the other categories have not been filled in yet. You can ignore the message and continue with replace until every category is assigned.
gen race_gender = 1 if race == 1 & sex == 1
replace race_gender = 2 if race == 1 & sex == 2
replace race_gender = 3 if race == 2 & sex == 1
replace race_gender = 4 if race == 2 & sex == 2
replace race_gender = 5 if race == 3 & sex == 1
replace race_gender = 6 if race == 3 & sex == 2
Now, you can check that the variable is well-coded as a race-by-gender variable. The final step is to label the variable and values according to the plan.
label variable race_gender "Race by gender category"
label define race_gender 1 "White Male" 2 "White Female" 3 "Black Male" 4 "Black Female" 5 "Other Male" 6 "Other Female"
label values race_gender race_gender
With the fre command, you can check that the variable is well-coded with labels🙂
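For comparison, here is a hedged sketch of a shortcut: egen with the group() function numbers every combination of race and sex, and its label option builds value labels from the existing ones. The name race_gender_alt is hypothetical. The numbering follows the sort order of race and then sex, so it matches the plan table only if all six combinations occur in the data, which I am assuming here.

```stata
webuse nhanes2, clear
egen race_gender_alt = group(race sex), label   // one code per race-by-sex combination
tab race_gender_alt                             // inspect the generated labels
tab race sex                                    // cross-check against the source variables
```

The gen and replace approach is still the better fit when the categories do not follow a simple cross of two variables.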
Tip. Operators in Stata
When creating variables with gen and replace, the following operators are used. In Stata, == tests whether something is equal to a value, while a single = assigns a value. Relational operators are especially useful when you deal with continuous variables.
// These operators will be useful in creating variables.
/*                                               Relational
     Arithmetic              Logical           (numeric and string)
--------------------     ------------------     ---------------------
 +   addition             &   and                >   greater than
 -   subtraction          |   or                 <   less than
 *   multiplication       !   not                >=  > or equal
 /   division             ~   not                <=  < or equal
 ^   power                                       ==  equal
 -   negation                                    !=  not equal
 +   string concatenation                        ~=  not equal */
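Here is a short sketch of relational and logical operators applied to a continuous variable, using age from nhanes2. The cutoffs and the names senior and working_age_f are arbitrary illustrations, and the code assumes sex == 2 means female.

```stata
webuse nhanes2, clear
// Relational operator on a continuous variable
gen senior = (age >= 65) if !missing(age)
// Relational and logical operators combined (assumes sex == 2 is female)
gen working_age_f = (age >= 20 & age < 65 & sex == 2) if !missing(age, sex)
tab senior working_age_f
```

Stata treats numeric missing values as larger than any number, so age >= 65 alone would be true for a missing age. The if !missing() qualifier guards against that.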
Example 2: Reverse coding
In this example, the variable needs to be reverse-coded. To reverse-code a variable, first define the value labels in the new order with label define and attach them with label values.
la def likert5 1 "Poor" 2 "Fair" 3 "Good" 4 "Very Good" 5 "Excellent"
la val varname likert5
The labels now follow the reversed coding, but the underlying values have not changed yet, so they need to be recoded as well.
recode varname (5=1)(4=2)(3=3)(2=4)(1=5)
Finally, we can see the variable is reverse-coded with the correct labels and values.
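For a scale with evenly spaced codes, the same reversal can also be sketched arithmetically as (max + min) - value. The toy data below stands in for a real survey item: rating is a hypothetical 5-point variable assumed to be coded 1 = Excellent through 5 = Poor, and the reversed copy is written to a new variable so the original is kept.

```stata
clear
input byte rating
1
2
3
4
5
end
// Assumption: rating is coded 1 = Excellent ... 5 = Poor
gen rating_rev = 6 - rating   // (5 + 1) - value flips the scale; missing stays missing
label define likert5_rev 1 "Poor" 2 "Fair" 3 "Good" 4 "Very Good" 5 "Excellent"
label values rating_rev likert5_rev
list rating rating_rev
```

Keeping both variables makes it easy to check the result, since the two codes should always sum to 6.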
Tip. labrec command for reverse-coding
However, it is a bit confusing and complicated to do this in multiple lines while juggling the value labels. You can do it in one line with a user-written command, labrec, which recodes a variable together with its assigned value labels.
ssc install labrec // install it the first time you use it
labrec varname (5=1)(4=2)(3=3)(2=4)(1=5)
This command reverse-codes the variable in just one line, without the label define and label values steps! This is really handy 🙂