[Stata] Data cleaning 7: Working with string variables (destring, tostring, encode, and decode)
Data preparation is often said to occupy 80% of the data analysis process. Ensuring that your data is clean, accurate, and in the right format is crucial before performing any statistical analysis. For those using Stata, managing and cleaning string variables (text data) can seem challenging at first, but with a few commands it becomes a smooth process.
This blog post will delve into four Stata commands that simplify the way you handle text data: destring, tostring, encode, and decode. Whether you’re a Stata newbie or need a quick refresher, this guide is here to assist you.
What are string variables?
String variables are sequences of characters. They can contain letters, numbers, spaces, and other special characters. In Stata, string variables are easy to identify when you open the data browser or run the codebook command. Here, we use Stata's example dataset “hbp2.dta”:
webuse hbp2.dta
In the data browser, string variables are shown in red (wine), while numeric variables are shown in black.
The codebook command also reports whether a variable is string or numeric.
codebook varname
The number after str in the type shows the maximum number of characters that the variable can hold (e.g., str6, str12, or str50).
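As a quick alternative to scanning codebook output, the sketch below lists every string variable in one step. It uses a small made-up dataset (the values of id, sex, and age are invented for illustration) so that it runs without downloading anything.

```stata
* Hypothetical toy data: two string variables and one numeric variable
clear
input str6 id str6 sex age
"101" "male"   34
"102" "female" 29
end

describe                    // the storage type column shows str6 for id and sex
ds, has(type string)        // lists only the string variables
display "`r(varlist)'"      // ds leaves the list in r(varlist)
```

The same two commands can be run right after loading your own data to see which variables need attention.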
Why can’t I directly analyze my string data?
- While strings are flexible and can store a variety of data, Stata and many other statistical packages need numeric data for analysis. If you try to run a command such as regress with string variables, Stata returns an error message. Hence, converting or managing string variables becomes important.
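To see the problem first-hand, here is a minimal sketch with invented data in which sex is stored as a string. The capture noisily prefix shows the error message but lets the do-file continue, and _rc holds the return code.

```stata
* Hypothetical toy data: wage is numeric, sex is a string
clear
input wage str6 sex
10 "male"
12 "female"
 9 "female"
14 "male"
end

capture noisily regress wage sex   // fails because sex is a string variable
display "return code: " _rc        // a nonzero code means the command failed
```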
The destring Command
The destring command converts string variables that represent numbers into numeric variables. This is particularly useful when data is imported and numbers are mistakenly read as strings, even though every value in the variable is a number.
For example, here id contains only digits, so it is better to convert it to a numeric variable, like schno.
destring varname, replace // replace the string variable with a numeric variable
destring varname, generate(newvarname) // keep the original and create newvarname separately
If the conversion succeeds, Stata reports that all characters were numeric and shows the storage type of the converted variable.
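The following sketch runs destring on a small invented id variable so you can try the command without touching real data. The variable name id_num is an assumption chosen for this example.

```stata
* Hypothetical toy data: id holds digits but is stored as a string
clear
input str4 id
"101"
"102"
"103"
end

destring id, generate(id_num)   // keep id and create a numeric copy
describe id id_num              // id is str4, id_num is numeric
summarize id_num                // summary statistics now work
```

Using generate() rather than replace keeps the original column, which makes it easy to compare the two before dropping anything.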
Tip. Contains non-numeric characters error
If your string data contains characters that can’t be converted into numbers, destring leaves the variable unchanged and reports that it contains nonnumeric characters.
- Solution 1: Use the ignore() option. For example, use destring income, replace ignore("$,") if the text is formatted like “$10,000”.
- Solution 2: Use the force option. It converts the values that contain non-numeric characters into missing values.
destring varname, replace force
destring varname, replace ignore(" ") // remove the spaces in varname and convert it to numeric, replacing the original string variable; other characters can be listed in ignore()
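The two options can also be combined. In the sketch below, which uses invented income values, ignore(",") strips the thousands separators and force turns the remaining non-numeric entry into a missing value. The name income_num is an assumption for this example.

```stata
* Hypothetical toy data: commas as thousands separators and one "n/a" entry
clear
input str8 income
"10,000"
"7,500"
"n/a"
end

* ignore(",") removes the commas; force sets "n/a" to missing
destring income, generate(income_num) ignore(",") force
list
```

Because force silently discards information, it is worth listing or tabulating the affected rows first to confirm that they really are non-numeric junk.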
You can also verify that the variable was converted to numeric by running the codebook command.
The tostring Command
Converting numerical data to string format: Why and when? Sometimes, numbers are better treated as string values (like phone numbers or zip codes).
tostring varname, replace
When you want to “concatenate” numbers, tostring becomes useful. For example, to create a unique id from statefips (e.g., 10) and countyfips (e.g., 002), you can tostring them and then create the unique id with the generate command (e.g., 10002). If you don’t tostring them, generate returns the sum of the statefips and countyfips values (10 + 2 = 12) instead of 10002. With string variables, the + operator in generate concatenates the text. Note that a numeric countyfips of 2 carries no leading zeros, so the format() option is needed to pad it to three digits.
tostring statefips, replace
tostring countyfips, replace format(%03.0f) // pad to three digits so that 2 becomes "002"
gen fips_id = statefips + countyfips
Here is another example of creating an id with tostring when the values have different numbers of characters.
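When the codes have different lengths, padding each part to a fixed width keeps the combined id unambiguous. The sketch below uses invented state and county codes; the names st and co and the widths of two and three digits are assumptions based on the usual FIPS layout.

```stata
* Hypothetical toy data: numeric codes of different lengths
clear
input statefips countyfips
10   2
 6  37
48 201
end

tostring statefips,  generate(st) format(%02.0f)   // 6 becomes "06"
tostring countyfips, generate(co) format(%03.0f)   // 2 becomes "002"
generate fips_id = st + co                         // string concatenation
list
```

Without format(), the first row would combine "10" and "2" into "102", which could collide with a different state and county pair.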
The encode Command: convert string to categorical variable
Mapping text to numbers is the idea behind encoding. When we have categorical data like “Low”, “Medium”, “High”, it can be useful to convert them into numbers like 1, 2, 3 for easier analysis. The encode command has no replace option, so it always creates a new variable. Here, the variable “sex” is better treated as a numeric (and categorical) variable.
encode varname, generate(newvarname)
After running the encode command, the new variable sex2 is numeric, with automatically assigned values and with the text of the original variable as value labels. You can also check that it is coded correctly in the data browser (browse) by placing the new variable next to the original.
order newvarname, after(varname)
browse
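One detail worth checking is the order of the assigned codes: encode numbers the categories alphabetically, so “High” comes before “Low”. The sketch below, on invented data, shows the default result and one way to control the order by defining the value label first and passing it through the label() option. The names level_lbl, level_default, and level_ordered are assumptions for this example.

```stata
* Hypothetical toy data: an ordered category stored as text
clear
input str6 level
"Low"
"High"
"Medium"
"Low"
end

* Default: alphabetical codes (High = 1, Low = 2, Medium = 3)
encode level, generate(level_default)

* Define the desired order first, then tell encode to use that label
label define level_lbl 1 "Low" 2 "Medium" 3 "High"
encode level, generate(level_ordered) label(level_lbl)

list, nolabel   // show the underlying numbers instead of the labels
```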
If you want to replace the original string variable with an encoded variable, you need to encode first, drop the original variable, and rename the new variable to the original variable name. This is tedious to do for many variables, so you can use a loop.
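// list the variables that you want to encode after varlist
foreach v of varlist var1 var2 var3 {
    replace `v' = "" if `v' == "." // treat "." strings as missing before encoding
    encode `v', generate(new`v') // encode into a temporary new variable
    drop `v' // drop the original string variable
    rename new`v' `v' // give the encoded variable the original name
}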
The decode Command
The reverse of encoding: retrieving the original text data. If you’ve encoded a variable and need to revert to its original string format, decode will convert it back.
decode varname, generate(newvarname)
The decode command creates a new string variable that contains the value labels of the categorical variable.
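A quick way to convince yourself that decode reverses encode is a round trip on a small invented variable. The names sex_num and sex_str below are assumptions for this example.

```stata
* Hypothetical toy data
clear
input str6 sex
"male"
"female"
"female"
end

encode sex, generate(sex_num)       // string -> labeled numeric
decode sex_num, generate(sex_str)   // labeled numeric -> string
assert sex_str == sex               // the round trip restores the original text
list
```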
Other commands
Stata also offers more useful functions for string data cleaning, such as lower, upper, subinstr, substr, and strpos.
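As a rough illustration of what these functions do, the sketch below applies each one to a pair of invented records. All variable names and values are assumptions made up for this example.

```stata
* Hypothetical toy data
clear
input str12 phone str12 name
"555-123-4567" "Ada Lovelace"
"555-987-6543" "alan turing"
end

generate name_lower   = lower(name)                  // all lowercase
generate name_upper   = upper(name)                  // all uppercase
generate phone_digits = subinstr(phone, "-", "", .)  // remove every hyphen
generate area_code    = substr(phone, 1, 3)          // first three characters
generate space_pos    = strpos(name, " ")            // position of the first space
list
```

The final argument of subinstr() is the number of occurrences to replace, and a period there means all of them.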
How to identify which command to use
- If string values are all numbers (e.g., id), use destring.
- If string values are not numbers (e.g., sex), use encode.
- To convert numeric variables to string variables, use tostring.
- To revert categorical variables back to the string values, use decode.
Some other tips
- Always save your data before making changes:
save "filename.dta"
- Check consistency in categorical data after encoding.
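For the consistency check, one possible approach is to tabulate the string variable before encoding, because spelling or capitalization differences would otherwise become separate categories. The sketch below uses invented data in which “Male” and “male” appear as two values; harmonizing with lower() is just one option and assumes case carries no meaning in your data.

```stata
* Hypothetical toy data with inconsistent capitalization
clear
input str6 sex
"male"
"Male"
"female"
"Female"
end

tabulate sex                    // reveals four categories instead of two
replace sex = lower(sex)        // harmonize the spelling first
encode sex, generate(sex_num)
label list sex_num              // encode names the value label after the new variable
tabulate sex_num                // now only two categories remain
```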