[Stata] Univariate Statistics: Frequency, Central Tendency, and Variability (tab, tabstat, sum, graph bar, hist, graph box)

Summary

| Statistical Method | Stata Code |
|---|---|
| Frequency analysis (%) | tab variable OR fre variable |
| Measures of Central Tendency (Mean and Median) | sum variable, detail OR tabstat variable, stats(mean median) |
| Measures of Central Tendency (Mode) | tab variable, sort OR fre variable, descending |
| Distribution (Skewness and Kurtosis) | sum variable, detail OR tabstat variable, stats(skewness kurtosis) |
| Measures of Dispersion (Standard Deviation, Variance, Range) | sum variable, detail OR tabstat variable, stats(variance sd range) |

Discrete variables

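Unlike the tab command, which shows only the value labels when they are defined, the fre command returns a frequency table that shows both the numeric values and their labels. fre is a user-written command; if it is not installed yet, run ssc install fre first.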

Stata
tab varname 
fre varname 
Stata
tab varname, sort
fre varname, ascending
fre varname, descending

With the sort option, tab varname lists the categories in descending order of frequency. With fre varname, you can choose the order explicitly by adding the ascending or descending option.

The mode is the first row of the frequency table when it is sorted in descending order of frequency! So in this example, the mode is 5 (60-69).
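
As a quick check of the steps above, here is a minimal sketch using Stata's built-in auto dataset. The dataset and the variable rep78 are assumptions for illustration and stand in for varname. The sketch also uses egen's mode() function, which is not covered above, to confirm the mode read off the sorted table.

```stata
* Illustrative only: the auto dataset and rep78 stand in for your own data
sysuse auto, clear

* Frequency table sorted in descending order of frequency;
* the first row is the mode
tabulate rep78, sort

* Cross-check: store the mode in a new variable (the name is hypothetical)
egen rep78_mode = mode(rep78)
summarize rep78_mode

* fre is user-written; install it only if it is missing
capture which fre
if _rc ssc install fre
fre rep78, descending
```

If a variable has several equally frequent values, egen's mode() returns missing unless you add an option such as minmode or maxmode, so the sorted table remains the safer way to inspect ties.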

Continuous variables

For continuous variables, it’s better to describe the data with measures of central tendency and variability. The sum varname, detail command reports the mean, median (shown as the 50% percentile), standard deviation, variance, skewness, and kurtosis.

Stata
sum varname, detail 
tabstat varname, stat(mean median sd variance range skewness kurtosis)

Further, tabstat accepts multiple variables at once and reports only the statistics you list in its statistics option.
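
The commands above can be tried on Stata's built-in auto dataset. This is a sketch under that assumption: price, mpg, weight, and foreign are that dataset's variables, not the ones from this post. The by() and columns() options of tabstat are additions not shown above.

```stata
* Illustrative only: the auto dataset stands in for your own data
sysuse auto, clear

* Full set of descriptive statistics for one variable
summarize price, detail

* summarize leaves its results in r(); the range is max minus min
display "Mean   = " r(mean)
display "Median = " r(p50)
display "Range  = " (r(max) - r(min))

* Several variables at once, one row per variable
tabstat price mpg weight, ///
    statistics(mean median sd variance range skewness kurtosis) ///
    columns(statistics)

* Selected statistics split by a grouping variable
tabstat price mpg, statistics(mean median sd) by(foreign)
```

Putting the statistics in columns keeps the table readable when you request many statistics for several variables.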

Plots

Pie Chart: graph pie

Stata
graph pie, over(varname) plabel(_all name) 
graph pie, over(varname) by(groupname) plabel(_all name) 
graph pie, over(varname) by(groupname) plabel(_all name) scheme(white_tableau)

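With the by(groupname) option, you can also plot a separate pie chart for each subgroup.

With scheme(schemename), you can also specify the color scheme of the chart. Note that white_tableau is not a built-in scheme; it appears to come from the user-written schemepack package (ssc install schemepack), so install it before running the last command. You can find the list of schemes and how to use them in this post.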
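
A minimal sketch of the same pie charts on the built-in auto dataset (an assumption for illustration; rep78 and foreign stand in for varname and groupname). It labels slices with percentages instead of names and uses the built-in s1color scheme so that nothing needs to be installed.

```stata
* Illustrative only: the auto dataset stands in for your own data
sysuse auto, clear

* One slice per category, labeled with its percentage
graph pie, over(rep78) plabel(_all percent, format(%4.1f)) name(pie_all, replace)

* Separate pies for each group
graph pie, over(rep78) plabel(_all percent, format(%4.1f)) by(foreign) name(pie_by, replace)

* s1color ships with Stata, so no installation is needed
graph pie, over(rep78) plabel(_all name) scheme(s1color) name(pie_scheme, replace)
```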

Bar Graph: graph bar

| Bar Graph | Histogram |
|---|---|
| Bar graph represents categorical data. | Histogram represents numerical data (discrete or continuous data). |
| Equal space between every two consecutive bars. | No space between two consecutive bars. They should be attached to each other. |
| Data can be arranged in any order. | Data is arranged in the order of range. |
| The x-axis can represent anything. | The x-axis should represent only continuous data that is in terms of numbers. |

You can draw the graphs for the entire sample or separately for each group of a categorical variable using the by(groupname) option. To draw a graph for the entire sample only, omit by(groupname).

Stata
graph bar (count), over(varname) by(groupname) ytitle(frequency)
graph hbar (count), over(varname) by(groupname) ytitle(frequency) // hbar for horizontal bar graph
Stata
graph bar (percent), over(varname) by(groupname) ytitle(percent)
graph hbar (percent), over(varname) by(groupname) ytitle(percent) // hbar for horizontal bar graph

With (percent) in place of (count), the bars show percentages rather than frequencies. Combined with by(groupname), the percentages are computed within each group, which makes it easier to compare distributions across groups of different sizes.
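
The two variants can be compared side by side on the built-in auto dataset. This is a sketch under that assumption: rep78 and foreign stand in for varname and groupname, and the blabel() option that prints the value on each bar is an addition not shown above.

```stata
* Illustrative only: the auto dataset stands in for your own data
sysuse auto, clear

* Frequencies of each category, one panel per group
graph bar (count), over(rep78) by(foreign) ytitle("Frequency") ///
    name(bar_count, replace)

* Percentages instead of counts, with the value printed on each bar
graph bar (percent), over(rep78) by(foreign) ytitle("Percent") ///
    blabel(bar, format(%4.1f)) name(bar_pct, replace)
```

Naming the graphs with name() keeps both windows open so you can compare them.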

Histogram: hist

Stata
hist varname, by(groupname)
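
A few commonly used histogram options, sketched on the built-in auto dataset (an assumption for illustration). The percent, bin(), normal, discrete, and frequency options are not covered above; they are standard options of histogram.

```stata
* Illustrative only: the auto dataset stands in for your own data
sysuse auto, clear

* One panel per group, with the y-axis in percent instead of density
histogram mpg, by(foreign) percent name(hist_by, replace)

* Control the number of bins and overlay a normal density for reference
histogram price, bin(10) normal name(hist_bins, replace)

* For a variable with few integer values, give each value its own bar
histogram rep78, discrete frequency name(hist_disc, replace)
```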

Box Plot: graph box

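A box plot is a type of plot that visualizes the distribution of a variable with five summary values. These resemble the five-number summary of a dataset, except that the whiskers end at the fences below rather than at the minimum and maximum:

  • Lower fence (Stata calls this the lower adjacent value): smallest observed data value that is ≥ P25 – 1.5*(P75 – P25).
  • The first quartile (P25)
  • The median (P50)
  • The third quartile (P75)
  • Upper fence (Stata calls this the upper adjacent value): largest observed data value that is ≤ P75 + 1.5*(P75 – P25).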
Ref: Mathspace
Stata
graph box varname
graph box varname, over(groupname)
graph box varname1 varname2, over(groupname) nooutside // nooutside: excludes outliers 
graph box varname1 varname2, over(groupname) horizontal nooutside 

With the graph box varname command, dots sometimes appear beyond the upper/lower fences. These are outside values (potential outliers). You can hide them in the graph, without removing them from the data, by adding the nooutside option.
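
To see where the whiskers and the dots come from, the limits can be computed by hand. The sketch below assumes the built-in auto dataset and its price variable; the scalar names are hypothetical. It uses the percentiles from summarize, detail, which should match what graph box draws.

```stata
* Illustrative only: the auto dataset and price stand in for your own data
sysuse auto, clear

* Quartiles and the 1.5 * IQR limits
quietly summarize price, detail
scalar box_iqr   = r(p75) - r(p25)
scalar box_lower = r(p25) - 1.5*box_iqr
scalar box_upper = r(p75) + 1.5*box_iqr

* Whisker ends: the most extreme observations still inside the limits
quietly summarize price if price >= box_lower & price <= box_upper
display "Lower whisker ends at " r(min)
display "Upper whisker ends at " r(max)

* Observations that graph box plots as dots (outside values)
count if !missing(price) & (price < box_lower | price > box_upper)

* Compare with the plot itself
graph box price, name(box_check, replace)
```

Adding nooutside to the last command hides the dots but leaves the count above unchanged, since the data are not modified.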

Tip. catplot

The catplot command is a “wrapper” for graph hbar that lets us compare the distribution of a variable by group intuitively. Its percent() option lets you specify the group(s) within which the percentages are calculated.

Stata
ssc install catplot
catplot varname, by(groupname) percent(groupname)

You can also cross-classify the bars by putting two variables together in the command.

Stata
catplot varname groupname, by(groupname)

Using the following command, you can also change the color of the bars and assign the legend separately 🫡

Stata
catplot varname groupname, percent(groupname)  ///
legend(label(1 "White") label(2 "Black") label(3 "Other")) ///
ysize(3) blabel(bar, format(%9.1f)) /// 1 decimal place
asyvars bar(1, color(purple)) bar(2, color(yellow)) name(g3, replace)
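
For a version you can run as is, here is a minimal sketch on the built-in auto dataset (an assumption for illustration; rep78 and foreign stand in for varname and groupname, and the graph name is hypothetical).

```stata
* catplot is user-written; install it only if it is missing
capture which catplot
if _rc ssc install catplot

* Illustrative only: the auto dataset stands in for your own data
sysuse auto, clear

* Categories of rep78 within each origin;
* percentages sum to 100 within each value of foreign
catplot rep78 foreign, percent(foreign) asyvars ///
    blabel(bar, format(%9.1f)) ytitle("Percent within origin") ///
    name(cat_example, replace)
```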

You can learn more about catplot in the following post: https://sscc.wisc.edu/sscc/pubs/stata_bar_graphs.htm

  • August 10, 2023