[Stata] Collapse command for aggregating data per variable (mean, median, sd, and count)
In this blog post, I will show you how to use the collapse command in Stata to create a new dataset that contains summary statistics of your original data. This can be useful when you want to aggregate your data by groups or categories, such as regions, years, or types.
What is the collapse command?
The collapse command in Stata is used to aggregate a dataset by collapsing it based on some summary statistics of a variable like mean, sum, median, percentile, standard error, etc.
The collapse command has the following syntax:
collapse (stat) newvar=oldvar (stat) newvar=oldvar ... , by(varlist) options
where stat
is the statistic you want to calculate, such as mean, sum, median, etc.; newvar
is the name of the new variable that will store the result; oldvar
is the name of the existing variable that you want to summarize; varlist
is the list of variables that define the groups or categories over which you want to aggregate; and options
are additional options that modify the behavior of the command.
Example: lifeexp
dataset
To understand how a variable can be aggregated, let’s start by loading Stata’s web dataset, lifeexp
.
webuse lifeexp, clear
For instance, if you want to calculate the mean life expectancy and GNP per capita with their standard deviation by region, you can use the following command. This will create a new dataset with 7 variables, with group variable (region
) and 2 for mean, 2 for sd, and 2 for count as required in the syntax.
You should write the variable name to be collapsed on the left side (lexp_mean
), and then the variable name to be used for collapse on the right side (lexp
). In general, I collapse the mean with standard deviation and count them together to take the number of observations and their variation into account.
collapse (mean) lexp_mean=lexp gnppc_mean=gnppc (sd) lexp_sd=lexp gnppc_sd=gnppc (count) lexp_count=lexp gnppc_count=gnppc, by(region)
Tips
- Always save the dataset with
save "filename.dta"
before runningcollapse
command. The collapse command will replace your dataset loaded. - You can put multiple groups in the
by()
options. For example, you can putby(year region)
to see the change byyear
byregion
.