[Stata] Data cleaning 7: Working with string variables (destring, tostring, encode, and decode)

Data preparation is often said to occupy 80% of the data analysis process. Ensuring that your data is clean, accurate, and in the right format is crucial before performing any statistical analysis. For those using Stata, managing and cleaning string variables (text data) can initially seem challenging, but with several commands, it becomes a smooth process.

This blog post will delve into four Stata commands that will simplify the way you handle text data: destring, tostring, encode, and decode. Whether you’re a Stata newbie or need a quick refresher, this guide is here to assist you.

What are string variables?

String variables are sequences of characters. They can contain letters, digits, spaces, and other special characters. In Stata, you can identify string variables by opening the data browser or by running the codebook command. Here, we use Stata’s example dataset, hbp2.dta.

Stata
webuse hbp2.dta

In the data browser, string variables are shown in a wine-red color, while numeric variables are shown in black.

When you run the command codebook, it will also show whether the variable is string or numeric.

Stata
codebook varname

The number after str in the type shows the maximum number of characters that the variable can hold (e.g., str6, str12, or str50).
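
To see the storage type of every variable at once, the following sketch may help. It uses the built-in describe and ds commands on the same hbp2 dataset; the clear option is added here only so that the example runs even if other data are already in memory.

Stata
webuse hbp2, clear
describe                 // the storage type column shows str# for string variables
ds, has(type string)     // list only the string variables
ds, has(type numeric)    // list only the numeric variables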

Why can’t I directly analyze my string data?

  • While strings are flexible and can store a variety of data, Stata and most other statistical software expect numeric data for analysis. If you run a command such as regress on a string variable, Stata returns an error message. Hence, converting string variables is an important step.

The destring Command

The destring command converts string variables that represent numbers into numeric variables. This is particularly useful when data is imported and numbers are mistakenly read as strings, even though every value in the variable is a number.

For example, here, id contains only numbers, so it is better to convert it to a numeric variable, like schno.

Stata
destring varname, replace // replace the string variable with a numeric variable
destring varname, generate(newvarname) // keep the original and generate newvarname separately
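
As a minimal sketch of what this looks like in practice, the toy data below stores numbers as text and converts them. The values are made up for illustration, and the name id_num is hypothetical.

Stata
clear
input str6 id
"101"
"102"
"205"
end
destring id, generate(id_num)   // id stays a string; id_num is numeric
describe id id_num              // compare the storage types
summarize id_num                // numeric commands now work on id_num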

If the conversion succeeds, Stata reports that all characters were numeric and states the storage type of the converted variable.

Tip. The “contains nonnumeric characters” message

If any value contains characters that can’t be converted into numbers, destring leaves the variable unchanged and reports that it contains nonnumeric characters.

  • Solution 1: Use the ignore() option to strip specific characters before converting. For example, use destring income, replace ignore("$,") if the values are formatted like “$10,000”.
  • Solution 2: Use the force option. It converts every value that contains non-numeric characters into a missing value, so check what you would lose before relying on it.

Stata
destring varname, replace force // values with non-numeric characters become missing
destring varname, replace ignore(" ") // remove spaces, then convert; put any other characters to strip inside ignore()
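
The sketch below, on made-up values, shows how the two options can be combined. The variable names are hypothetical. Counting the newly missing values afterwards is one way to see what force discarded.

Stata
clear
input str10 income
"10,000"
"2,500"
"n/a"
end
// ignore(",") strips the commas; force turns any remaining non-numeric value into missing
destring income, generate(income_num) ignore(",") force
count if missing(income_num) & income != ""   // non-empty values that force set to missing
list income income_num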

You can also confirm that the variable is now numeric by running codebook.

The tostring Command

Converting numerical data to string format: Why and when? Sometimes, numbers are better treated as string values (like phone numbers or zip codes).

Stata
tostring varname, replace

tostring is also useful when you want to concatenate numbers. For example, to create a unique id from statefips (e.g., 10) and countyfips (e.g., 002), you can tostring both variables and then join them with the generate command to get 10002. If you don’t tostring, generate returns the sum of the statefips and countyfips values (10 + 2 = 12) instead of 10002. With string variables, the + operator concatenates the text. Note that a numeric countyfips stores 002 as 2, so a plain tostring gives "2" and the result would be 102; the format() option restores the leading zeros.

Stata
tostring statefips, replace
tostring countyfips, replace format(%03.0f) // pad to three digits: 2 becomes "002"
gen fips_id = statefips + countyfips
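
If you would rather keep the original variables numeric, one alternative is the string() function with a display format. This is a sketch on made-up FIPS values; the two-digit and three-digit widths are assumptions that match standard state and county FIPS codes.

Stata
clear
input statefips countyfips
10 2
6 37
end
// string(n, "%0#.0f") pads the number with leading zeros to a fixed width
gen fips_id = string(statefips, "%02.0f") + string(countyfips, "%03.0f")
list
assert strlen(fips_id) == 5   // every id should be five characters long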

Here is another example of handling values with different numbers of characters when creating an id with tostring:

https://twitter.com/toddrjones/status/1699897399416955350?s=20

The encode Command: convert string to categorical variable

Mapping text to numbers is the idea behind encoding. When we have categorical data like “Low”, “Medium”, and “High”, it can be useful to convert them into numbers like 1, 2, 3 for easier analysis. Note that encode assigns the codes in alphabetical order of the text by default (High = 1, Low = 2, Medium = 3), not in a meaningful order. encode has no replace option; you must create a new variable with generate(). Here, the variable sex is better treated as a numeric (categorical) variable.

Stata
encode varname, generate(newvarname)

After running encode, the new variable (here, sex2) is numeric, with automatically assigned values and value labels taken from the text of the original variable. You can check the result in the data browser (browse) by placing the two variables next to each other.

Stata
order newvarname, after(varname)
browse
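
Because encode numbers the categories alphabetically by default, ordered categories can end up in the wrong order. One way around this, sketched below on made-up data, is to define the value label first and pass it to encode through its label() option. The names level, level_num, and lvl are hypothetical.

Stata
clear
input str6 level
"Low"
"High"
"Medium"
"Low"
end
label define lvl 1 "Low" 2 "Medium" 3 "High"   // set the order yourself
encode level, generate(level_num) label(lvl)   // use the predefined mapping
tabulate level_num            // shows the labels
tabulate level_num, nolabel   // shows the underlying codes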

If you want to replace the original string variable with the encoded variable, you need to encode first, drop the original variable, and then rename the new variable to the original name. This is tedious for a large number of variables, so you can use a loop.

Stata
// list the variables that you want to encode after varlist
foreach v of varlist var1 var2 var3 {
    replace `v' = "" if `v' == "."   // treat "." as missing before encoding
    encode `v', generate(new`v')
    drop `v'
    rename new`v' `v'
}

The decode Command

decode is the reverse of encode: it retrieves the text. If you’ve encoded a variable and need the original strings back, decode creates a new string variable from the value labels.

Stata
decode varname, generate(newvarname)

decode also works on any other numeric variable that has value labels attached, turning the labels into string values.
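
A short round-trip sketch on made-up data shows the two commands undoing each other. The names sex_num and sex_str are hypothetical.

Stata
clear
input str6 sex
"Male"
"Female"
"Female"
end
encode sex, generate(sex_num)       // string -> labeled numeric
decode sex_num, generate(sex_str)   // labeled numeric -> string
assert sex_str == sex               // the round trip returns the original text
list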

Other commands

Stata also offers string functions that are useful for data cleaning, such as lower(), upper(), subinstr(), substr(), and strpos().
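
As a rough illustration of what these functions do, the sketch below applies each of them to a made-up name variable. All variable names are hypothetical.

Stata
clear
input str20 name
"Smith, John"
"DOE, Jane"
end
gen name_lower = lower(name)                // all lowercase
gen name_upper = upper(name)                // all uppercase
gen comma_pos  = strpos(name, ",")          // position of the first comma (0 if none)
gen last_name  = substr(name, 1, comma_pos - 1)   // text before the comma
gen no_comma   = subinstr(name, ",", "", .)       // remove every comma
list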

How to identify which command to use

  • If string values are all numbers (e.g., id), use destring.
  • If string values are not numbers (e.g., sex), use encode.
  • To convert numeric variables to string variables, use tostring.
  • To revert categorical variables back to the string values, use decode.
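
One quick way to tell which case a string variable falls into is the real() function, which returns missing for text that is not a number. The sketch below uses made-up values in which one id contains a letter.

Stata
clear
input str6 id
"101"
"10a"
"205"
end
// count non-empty values that cannot be read as numbers
count if missing(real(id)) & id != ""
// inspect the offending values before choosing destring, ignore(), or encode
list id if missing(real(id)) & id != ""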

Some other tips

  • Always save your data before making changes: save "filename.dta"
  • Check consistency in categorical data after encoding.
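
The consistency check can be sketched as follows, on made-up data where the same category is spelled in two ways. encode treats each distinct spelling as a separate category, so listing the value label reveals the problem. The variable names are hypothetical, and proper() is just one possible way to harmonize the spelling.

Stata
clear
input str6 sex
"Male"
"male"
"Female"
end
encode sex, generate(sex_num)
label list sex_num                   // "Male" and "male" receive separate codes
gen sex_clean = proper(sex)          // harmonize capitalization first
encode sex_clean, generate(sex_num2)
label list sex_num2                  // now only two categories remain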

  • June 7, 2023