[Stata] Collapse command for aggregating data per variable (mean, median, sd, and count)

In this blog post, I will show you how to use the collapse command in Stata to create a new dataset that contains summary statistics of your original data. This can be useful when you want to aggregate your data by groups or categories, such as regions, years, or types.

What is the collapse command?

The collapse command in Stata is used to aggregate a dataset by collapsing it based on some summary statistics of a variable like mean, sum, median, percentile, standard error, etc. 

The collapse command has the following syntax:

Stata
collapse (stat) newvar=oldvar (stat) newvar=oldvar ... , by(varlist) options

where stat is the statistic you want to calculate, such as mean, sum, median, etc.; newvar is the name of the new variable that will store the result; oldvar is the name of the existing variable that you want to summarize; varlist is the list of variables that define the groups or categories over which you want to aggregate; and options are additional options that modify the behavior of the command.

Example: lifeexp dataset

To understand how a variable can be aggregated, let’s start by loading Stata’s web dataset, lifeexp.

Stata
webuse lifeexp, clear 

For instance, if you want to calculate the mean life expectancy and GNP per capita with their standard deviation by region, you can use the following command. This will create a new dataset with 7 variables, with group variable (region) and 2 for mean, 2 for sd, and 2 for count as required in the syntax.

You should write the variable name to be collapsed on the left side (lexp_mean), and then the variable name to be used for collapse on the right side (lexp). In general, I collapse the mean with standard deviation and count them together to take the number of observations and their variation into account.

Stata
collapse (mean) lexp_mean=lexp gnppc_mean=gnppc (sd) lexp_sd=lexp gnppc_sd=gnppc (count) lexp_count=lexp gnppc_count=gnppc, by(region)

Tips

  • Always save the dataset with save "filename.dta" before running collapse command. The collapse command will replace your dataset loaded.
  • You can put multiple groups in the by() options. For example, you can put by(year region) to see the change by year by region.
  • August 18, 2023