[Stata] Univariate Statistics: Frequency, Central Tendency, and Variability (tab, tabstat, sum, graph bar, hist, graph box)
Summary
Statistical Method | Stata Code |
---|---|
Frequency analysis (%) | tab variable OR fre variable |
Measures of Central Tendency (Mean and Median) | sum variable, detail OR tabstat variable, stats(mean median) |
Measures of Central Tendency (Mode) | tab variable, sort OR fre variable, descending |
Distribution (Skewness and Kurtosis) | sum variable, detail OR tabstat variable, stats(skewness kurtosis) |
Measures of dispersion (Standard Deviation, Variance, Range) | sum variable, detail OR tabstat variable, stats(variance sd range) |
Discrete variables
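Compared to the `tab` command, the `fre` command returns a frequency table that shows both the values and their labels. `fre` is a community-contributed command, so install it once with `ssc install fre` before using it.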
tab varname
fre varname
tab varname, sort
fre varname, ascending
fre varname, descending
With the `sort` option (`tab varname, sort`), the table is ordered from the most frequent category to the least frequent. `fre` gives the same ordering with `fre varname, descending`, or the reverse with `fre varname, ascending`.
The mode is the first value in a frequency table sorted in descending order. So here, the mode is 5 (60-69).
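As a quick way to try this, the sketch below uses Stata's bundled `auto` dataset (an assumption for illustration; `rep78` stands in for `varname`). It compares the top row of the sorted table with the value returned by `egen`'s `mode()` function.

```stata
sysuse auto, clear

* Frequency table sorted from the most frequent to the least frequent category
tabulate rep78, sort

* Cross-check: egen's mode() stores the most frequent value in a new variable
egen rep78_mode = mode(rep78)
display "Mode of rep78: " rep78_mode[1]
```

If two categories tie for the highest frequency, `mode()` returns missing unless you add an option such as `minmode` or `maxmode`, so the sorted table is still worth a look.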
Continuous variables
For continuous variables, measures of central tendency and variability are more useful descriptive statistics than a frequency table. The `sum varname, detail` command reports the mean, median (shown as the 50th percentile), standard deviation, variance, skewness, and kurtosis.
sum varname, detail
tabstat varname, stat(mean median sd variance range skewness kurtosis)
Further, `tabstat` accepts multiple variables at once and reports only the statistics you list in its `stat()` option.
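For instance, here is a minimal sketch on Stata's bundled `auto` dataset (the dataset and the variables `price`, `mpg`, `weight`, and `foreign` are assumptions for illustration). It also shows that `summarize, detail` leaves its results in `r()`, which is handy when you need a statistic in a later calculation.

```stata
sysuse auto, clear

* Several variables at once; columns(statistics) prints one variable per row
tabstat price mpg weight, statistics(mean median sd variance range skewness kurtosis) columns(statistics)

* The same idea, split by a grouping variable
tabstat price mpg, statistics(mean median sd) by(foreign)

* summarize, detail stores its results in r() for later use
quietly summarize price, detail
display "mean = " r(mean) "  median = " r(p50) "  sd = " r(sd)
```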
Plots
Pie Chart: graph pie
graph pie, over(varname) plabel(_all name)
graph pie, over(varname) by(groupname) plabel(_all name)
graph pie, over(varname) by(groupname) plabel(_all name) scheme(white_tableau)
With the `by(groupname)` option, you can also draw a separate pie chart for each subgroup.
With `scheme(schemename)`, you can also specify the color scheme of the chart. You can find the list of schemes and how to use them in this post.
Bar Graph: graph bar
Bar Graph | Histogram |
---|---|
Bar graph represents categorical data. | Histogram represents numerical data (discrete or continuous data). |
Equal space between every two consecutive bars. | No space between two consecutive bars. They should be attached to each other. |
Data can be arranged in any order. | Data is arranged in the order of range. |
The x-axis can represent anything. | The x-axis should represent only continuous data that is in terms of numbers. |
You can draw the graphs for the entire sample or separately by group (a categorical variable) using the `by(groupname)` option. To draw a graph for the entire sample only, run the command without `by(groupname)`.
graph bar (count), over(varname) by(groupname) ytitle(frequency)
graph hbar (count), over(varname) by(groupname) ytitle(frequency) // hbar for horizontal bar graph
graph bar (percent), over(varname) by(groupname) ytitle(percent)
graph hbar (percent), over(varname) by(groupname) ytitle(percent) // hbar for horizontal bar graph
With the `(percent)` statistic, the bars show percentages rather than frequencies. Percentages make it easier to compare distributions across groups, especially when the groups differ in size.
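To try the percent version on real data, here is a minimal sketch using Stata's bundled `auto` dataset (an assumption for illustration: `rep78` plays the role of `varname` and `foreign` the role of `groupname`). The two-way table at the end is a numeric companion to the chart. It is meant as a sanity check, not as a statement about how `graph bar` computes its bars internally.

```stata
sysuse auto, clear

* Percent of cars in each repair-record category, one panel per origin
graph bar (percent), over(rep78) by(foreign) ytitle("Percent") name(bar_pct, replace)

* Numeric companion: column percentages of rep78 within each origin
tabulate rep78 foreign, column nofreq
```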
Histogram: hist
hist varname, by(groupname)
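A few commonly used `histogram` options, sketched on the bundled `auto` dataset (the dataset and variable names are assumptions for illustration). `percent` rescales the y-axis, `width()` sets the bin width, and `discrete` gives each distinct value its own bar.

```stata
sysuse auto, clear

* Continuous variable: bins 5 mpg wide, y-axis in percent, one panel per origin
histogram mpg, width(5) percent by(foreign) name(hist_mpg, replace)

* Integer-valued variable: one bar per distinct value, y-axis in counts
histogram rep78, discrete frequency name(hist_rep78, replace)
```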
Box Plot: graph box
A box plot visualizes a five-number summary of a dataset. As drawn by `graph box`, the five values are:
- Lower fence (lower adjacent value): the smallest observed data value that is ≥ P25 – 1.5*(P75 – P25).
- The first quartile (P25)
- The median (P50)
- The third quartile (P75)
- Upper fence (upper adjacent value): the largest observed data value that is ≤ P75 + 1.5*(P75 – P25).
graph box varname
graph box varname, over(groupname)
graph box varname1 varname2, over(groupname) nooutside // nooutside: excludes outliers
graph box varname1 varname2, over(groupname) horizontal nooutside
With the `graph box varname` command, dots sometimes appear beyond the upper/lower fences. These are extreme values (outliers). Adding the `nooutside` option hides them in the graph without removing them from the data.
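The fence arithmetic can be reproduced by hand. The sketch below uses the bundled `auto` dataset and the variable `price` (both assumptions for illustration), and the scalar names `fence_lo` and `fence_hi` are hypothetical. It assumes `graph box` uses the same quartiles that `summarize, detail` reports.

```stata
sysuse auto, clear

* Quartiles and the 1.5 * IQR thresholds
quietly summarize price, detail
scalar iqr_price = r(p75) - r(p25)
scalar fence_lo  = r(p25) - 1.5 * scalar(iqr_price)
scalar fence_hi  = r(p75) + 1.5 * scalar(iqr_price)

* Whisker ends: the most extreme observed values inside the thresholds
quietly summarize price if price >= scalar(fence_lo) & price <= scalar(fence_hi)
display "Whiskers run from " r(min) " to " r(max)

* Observations drawn as dots (the ones nooutside hides)
count if (price < scalar(fence_lo) | price > scalar(fence_hi)) & !missing(price)
```

The final `count` tells you how many points disappear from the plot when you add `nooutside`.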
Tip. catplot
The `catplot` command is a "wrapper" for `graph hbar` that makes it easy to compare the distribution of a variable across groups. Its `percent()` option lets you specify the group within which percentages are calculated.
ssc install catplot
catplot varname, by(groupname) percent(groupname)
You can also list two categorical variables in the command to show one broken down by the other.
catplot varname groupname, by(groupname)
Using the following command, you can also change the color of the bars and assign the legend separately 🫡
catplot varname groupname, percent(groupname) ///
legend(label(1 "White") label(2 "Black") label(3 "Other")) ///
ysize(3) blabel(bar, format(%9.1f)) /// 1 decimal place
asyvars bar(1, color(purple)) bar(2, color(yellow)) name(g3, replace)
You can learn more about `catplot` in the following post: https://sscc.wisc.edu/sscc/pubs/stata_bar_graphs.htm